python and string encodings

I’ve recently finished the User accessible login reports project. After the initial roll-out to users I had a few reports of people getting server errors when certain sets of data were viewed. This website is written in Python and uses the Django framework. During the template processing stage we were getting error messages like the following:

DjangoUnicodeDecodeError: 'utf8' codec can't decode byte 0xe0 in position 30: invalid continuation byte.

It appears that not all data coming from the whois service is encoded in the same way (see RFC 3912 for a discussion of the issue). In this case it was using a latin1 encoding but whois is quite an old service which has no support for declaring the content encoding used so we can never know what we are going to have to handle in advance.

A bit of searching around revealed the chardet module which can be used to automatically detect the encoding used in a string. So, I just added the following code and the problem was solved.

import chardet
enc = chardet.detect(val)['encoding']
if enc != 'utf-8':
    val = val.decode(enc)
val = val.encode('ascii','replace')

The final result is that I am guaranteed to have the string from whois as an ascii string with any unsupported characters replaced by a question mark (?). It’s not a perfect representation but it is web safe and is good enough for my needs.

Comments are closed.