Reference encoding error byte in Python

Suppose I type line = line.decode('gb18030') and get the error

UnicodeDecodeError: 'gb18030' codec can't decode bytes in position 142-143: illegal multibyte sequence

Is there a nice way to automatically get the error bytes? That is, is there a way to get 142 and 143, or line[142:144], from a built-in command or module? Since I’m fairly confident that there will be at most one such error per line, my first thought was along the lines of:

for i in range(len(line)):
    try:
        line[i].decode('gb18030')
    except UnicodeDecodeError:
        error = i

I don’t know how to say this correctly, but gb18030 is a variable-length encoding, so this method fails as soon as it reaches a Chinese character (2 bytes), because the first byte of a multi-byte character does not decode on its own.

,

try:
    line = line.decode('gb18030')
except UnicodeDecodeError as e:
    # e.end is exclusive, so the last bad byte is at e.end - 1
    print("Error in bytes %d through %d" % (e.start, e.end - 1))

,

Access the start and end attributes of the caught exception object.

u = u'áiuê©'
try:
    l = u.encode('latin-1')
    print(repr(l))
    # Latin-1 bytes are not valid UTF-8, so this raises UnicodeDecodeError
    l.decode('utf-8')
except UnicodeDecodeError as e:
    print(e)
    print("%d %d" % (e.start, e.end))
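
To get the offending bytes themselves rather than just their positions, the exception also keeps the original input in its object attribute, so slicing that with start and end should give what the question calls line[142:144]. A minimal sketch in Python 3 syntax; find_bad_bytes is a hypothetical helper name (not a built-in), and the sample input is made up and uses UTF-8 only because its error position is easy to verify:

```python
def find_bad_bytes(data, encoding="gb18030"):
    """Return (start, end, bad_bytes) for the first decode error, or None.

    Hypothetical helper for illustration; not part of the standard library.
    """
    try:
        data.decode(encoding)
    except UnicodeDecodeError as e:
        # e.end is exclusive, so this slice is exactly the byte range
        # that the codec rejected.
        return e.start, e.end, e.object[e.start:e.end]
    return None


line = b"abc\xffdef"  # made-up sample; 0xFF is never a valid byte in UTF-8
print(find_bad_bytes(line, "utf-8"))
```

The codec stops at the first error, so this only reports one bad span per call, which matches the assumption of at most one error per line.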
