Suppose I type line = line.decode('gb18030')
and get the error
UnicodeDecodeError: 'gb18030' codec can't decode bytes in position 142-143: illegal multibyte sequence
Is there a nice way to automatically get the error bytes? That is, is there a way to get 142 & 143 or line[142:144]
from a built-in command or module? Since I’m fairly confident that there will be only one such error, at most, per line, my first thought was along the lines of:
for i in range(len(line)):
    try:
        line[i].decode('gb18030')
    except UnicodeDecodeError:
        error = i
The problem is that gb18030 is a variable-width encoding (a character takes 1, 2, or 4 bytes), so decoding one byte at a time fails as soon as it reaches a multibyte character, such as a 2-byte Chinese character.
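To illustrate the failure described above, here is a minimal Python 3 sketch. The sample character and the helper name `decodes_alone` are arbitrary choices for illustration, not part of any library.

```python
# Python 3 sketch: why decoding gb18030 one byte at a time goes wrong.
text = '\u4e2d'  # an arbitrary Chinese character, chosen for illustration
encoded = text.encode('gb18030')  # this character takes two bytes in gb18030


def decodes_alone(chunk):
    """Return True if chunk is valid gb18030 on its own (hypothetical helper)."""
    try:
        chunk.decode('gb18030')
        return True
    except UnicodeDecodeError:
        return False


# Slice rather than index, so each piece is still a bytes object in Python 3.
per_byte = [decodes_alone(encoded[i:i + 1]) for i in range(len(encoded))]

print(len(encoded))            # number of bytes in the encoded character
print(per_byte)                # result of decoding each byte separately
print(decodes_alone(encoded))  # result of decoding the bytes together
```

Each byte is rejected when decoded by itself even though the pair is valid, so a per-byte loop reports errors on perfectly good characters.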
,
try:
    line = line.decode('gb18030')
except UnicodeDecodeError as e:
    # e.end is exclusive, so the last bad byte is at e.end - 1
    print("Error in bytes %d through %d" % (e.start, e.end - 1))
,
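Access the start and end attributes of the caught exception object. Note that end is the index one past the last bad byte, so the offending bytes are the slice from start up to (but not including) end.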
u = u'áiuê©'
try:
    l = u.encode('latin-1')
    print(repr(l))
    l.decode('utf-8')  # these latin-1 bytes are not valid UTF-8
except UnicodeDecodeError as e:
    print(e)
    print("%d %d" % (e.start, e.end))
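Building on the start and end attributes, the following Python 3 sketch wraps the lookup in a small function that also returns the offending bytes themselves. The function name `find_bad_bytes` and the sample input are hypothetical, made up for this example; the exception attributes `start`, `end`, and `object` are standard on UnicodeDecodeError.

```python
def find_bad_bytes(data, encoding='gb18030'):
    """Return (start, end, bad_bytes) for the first undecodable span, or None.

    Hypothetical helper: it only reports the first error, which matches the
    assumption in the question of at most one error per line.
    """
    try:
        data.decode(encoding)
    except UnicodeDecodeError as e:
        # e.object is the input that was being decoded; e.end is exclusive,
        # so this slice is exactly the span the error message refers to.
        return e.start, e.end, e.object[e.start:e.end]
    return None


# Made-up sample: ASCII text with one stray byte in the middle.
sample = b'abc\xffdef'
print(find_bad_bytes(sample))
print(find_bad_bytes(b'plain ascii'))  # nothing to report for valid input
```

Returning None for clean input keeps the caller's check simple: test the result before unpacking it.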