Python and string encodings

May 21, 2013

I’ve recently finished the User accessible login reports project. After the initial roll-out to users I had a few reports of people getting server errors when certain sets of data were viewed. This website is written in Python and uses the Django framework. During the template processing stage we were getting error messages like the following:

DjangoUnicodeDecodeError: 'utf8' codec can't decode byte 0xe0 in position 30: invalid continuation byte.

It appears that not all data coming from the whois service is encoded in the same way (see RFC 3912 for a discussion of the issue). In this case the data was Latin-1 encoded, but whois is quite an old protocol with no way of declaring the content encoding, so we can never know in advance what we are going to have to handle.
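
To make the failure concrete, here is a minimal sketch of how a Latin-1 byte trips up a UTF-8 decoder. The sample string is made up for illustration and is not real whois output; the variable names are likewise my own.

```python
# Hypothetical sample: "à" encoded as Latin-1 is the single byte 0xE0.
raw = "Societ\u00e0 Example".encode("latin-1")

try:
    # 0xE0 starts a three-byte sequence in UTF-8, but the next byte is a
    # plain space, so strict decoding fails.
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError as exc:
    utf8_ok = False
    print(exc)

# Decoding with the encoding that was really used works fine.
text = raw.decode("latin-1")
print(text)
```

Note that Latin-1 maps every possible byte to a character, so decoding as Latin-1 never raises an error, even when it is the wrong guess.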

A bit of searching around revealed the chardet module, which makes a statistical guess at the encoding of a byte string. So, I just added the following code and the problem was solved.

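import chardet

# val is the raw byte string returned by the whois service.
# detect() can return None for the encoding, so fall back to latin-1.
enc = chardet.detect(val)['encoding'] or 'latin-1'
# Always decode first, even when the data is UTF-8; otherwise encode()
# on a byte string can fail on non-ASCII bytes.
val = val.decode(enc, 'replace')
# Anything that does not fit in ASCII becomes '?'.
val = val.encode('ascii', 'replace')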

The final result is that I am guaranteed to have the string from whois as an ASCII string, with any unsupported characters replaced by a question mark (?). It’s not a perfect representation but it is web safe and is good enough for my needs.
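
For anyone who would rather not depend on chardet, a rough alternative is to try a short list of likely encodings in order. This is a different technique from the one I used above, and the function name and candidate list are my own assumptions, not part of any library.

```python
def to_web_safe_ascii(raw, candidates=("utf-8", "latin-1")):
    """Decode bytes using the first candidate encoding that works,
    then squash the result down to ASCII with '?' replacements.

    Hypothetical helper: the candidate list is a guess, not a standard.
    """
    for enc in candidates:
        try:
            text = raw.decode(enc)
            break
        except UnicodeDecodeError:
            continue
    else:
        # No candidate worked; keep the ASCII part of the data.
        text = raw.decode("ascii", "replace")
    return text.encode("ascii", "replace").decode("ascii")


print(to_web_safe_ascii(b"Societ\xe0 Example"))
print(to_web_safe_ascii("Societ\u00e0 Example".encode("utf-8")))
```

Trying UTF-8 first matters: strict UTF-8 decoding rejects most non-UTF-8 input, whereas Latin-1 accepts anything, so it only makes sense as the last resort.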


SQL, ipython and pandas

March 30, 2013

I recently came across a really handy module which makes it easy to access data stored in an SQL DB from the ipython shell. It turns out that the next step, moving the query results into pandas, is also very easy. All very cool. I love how easy it is to hack out code for quick data inspection and manipulation using ipython.
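
As a rough sketch of the SQL-to-pandas step, the following uses the standard library's sqlite3 module together with pandas.read_sql_query. This is not the ipython module mentioned above, just one plain way of getting query results into a DataFrame. The table, column names and rows are invented for the example.

```python
import sqlite3

import pandas as pd

# In-memory database with a made-up table, purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE logins (username TEXT, host TEXT, attempts INTEGER)")
conn.executemany(
    "INSERT INTO logins VALUES (?, ?, ?)",
    [("alice", "hosta", 3), ("bob", "hostb", 1), ("alice", "hostb", 2)],
)
conn.commit()

# Run the query and load the result set straight into a DataFrame.
df = pd.read_sql_query("SELECT username, host, attempts FROM logins", conn)
conn.close()

print(df)
```

Once the data is in a DataFrame, all the usual pandas inspection tools (head, describe, sorting, filtering) are available from the shell.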


Using Python and Pandas to process data

March 16, 2013

I’ve recently been doing some data analysis for a presentation I will be giving at the FLOSS UK Spring Conference in Newcastle next week. This involved processing a lot of data gathered from our syslogs related to SSH authentications. As part of my ongoing effort to learn Python properly I decided to do all the work in that language. Whilst hunting around for useful modules for processing data and calculating various statistics I came across the very clever Pandas library, which provides some impressive tools for processing tabulated data (such as that in CSV style files). It has a bit of a steep learning curve, but I’ve just come across a neat blog article which summarises the main functionality quite well. I’ve only used a few of the features so far and found the groupby functionality particularly handy. I shall definitely be exploring this library further in the future.
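
To give a flavour of groupby, here is a small sketch on CSV data. The column names and rows are invented for illustration and are not taken from our real syslog data.

```python
import io

import pandas as pd

# Made-up SSH authentication events in CSV form.
csv_data = io.StringIO(
    "timestamp,username,method,result\n"
    "2013-03-01 09:00:00,alice,publickey,accepted\n"
    "2013-03-01 09:05:00,bob,password,failed\n"
    "2013-03-01 09:06:00,bob,password,accepted\n"
    "2013-03-02 10:00:00,alice,password,failed\n"
    "2013-03-02 10:01:00,alice,publickey,accepted\n"
)
events = pd.read_csv(csv_data, parse_dates=["timestamp"])

# Number of authentication events per user.
per_user = events.groupby("username").size()

# Accepted/failed counts per user, one column per result.
by_result = events.groupby(["username", "result"]).size().unstack(fill_value=0)

print(per_user)
print(by_result)
```

The pattern is always the same: split the rows into groups by one or more columns, apply an aggregate such as size, sum or mean to each group, and get back a new table indexed by the group keys.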