Django Tips: UTF-8, ASCII Encoding Errors, Urllib2, and MySQL

by Paul Kenjora on September 24th, 2008

Having completed many Django projects over the past two years, I’ve started to take some seemingly trivial things for granted. Cross-referencing a project for one of these solutions a few days back made me realize that these things are not trivial to others. I remember looking through countless pages and forums for the answers, only to find questions. The three main bits that were Django stumbling blocks for me are:

UTF-8

Trying to get UTF-8 working in my Django projects was always a nightmare. For some reason, getting all the pieces to work together was a black art. I spent days twiddling settings and using "encode" and "decode" every way imaginable. The problem is that UTF-8 encoding has to be handled correctly all the way through the project: at the entry point, when storing to the DB, and when displaying to the user. Keep in mind that UTF-8 is the final encoding you want in your DB; the source and output can and will be different. For encoding issues, be methodical: do NOT take shortcuts, and make sure your entry points convert correctly.

For any charset to UTF-8 conversion all you need is: data.decode("input_charset_here").encode('utf-8')
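As a small illustration of that one-liner, here is a sketch written for Python 3, where the input is a bytes object. The function name transcode_to_utf8 and the sample word are made up for the example.

```python
def transcode_to_utf8(data, input_charset):
    """Decode bytes from input_charset and re-encode them as UTF-8."""
    text = data.decode(input_charset)  # bytes -> Unicode text
    return text.encode("utf-8")        # Unicode text -> UTF-8 bytes


# "café" stored as Latin-1 uses one byte for the accented letter.
latin1_bytes = "café".encode("latin-1")
utf8_bytes = transcode_to_utf8(latin1_bytes, "latin-1")
print(utf8_bytes)
```

If input_charset does not match the real encoding of the data, the decode step either raises UnicodeDecodeError or silently produces the wrong characters, which is why knowing the source charset matters.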

Also make sure your MySQL is set up correctly, as covered below.

Urllib2

The urllib2 library in Python is an excellent example of an input where encoding will be an issue. For the Arkayne project I pull in pages and analyze their contents. Foreign pages with non-ASCII encodings threw ASCII encode exceptions. I was exploiting the fact that ASCII is a subset of UTF-8, but that doesn’t work for long; it’s a hack. I tried passing "ignore" to the encode function, but I still got errors or missed entire pages. By design, urllib2 (and urllib, for that matter) does not handle character set encodings. It is built to connect to sockets and fetch data over HTTP. Unfortunately, there is no consistently used standard for declaring the encoding over HTTP, so it’s not handled by the library.

There is a silver lining, however: most servers do provide a charset value in the Content-Type header. The HTTP standards do specify charset but, like I said, not everyone uses it consistently. Still, it is the best thing to go on if you are importing pages from the web. So to pull in a page from the web with urllib2 and convert it to UTF-8, read the charset from the Content-Type header, decode the page with it, and encode the result as UTF-8.
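A minimal sketch of that approach, written for Python 3, where urllib2's functionality lives in urllib.request. The helper names (charset_from_content_type, to_utf8, fetch_as_utf8) and the fallback to UTF-8 with replacement characters are assumptions made for illustration, not code from the Arkayne project.

```python
from email.message import Message
from urllib.request import urlopen


def charset_from_content_type(content_type, default="utf-8"):
    """Pull the charset parameter out of a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset() or default


def to_utf8(raw, content_type, default="utf-8"):
    """Decode raw bytes with the declared charset and return UTF-8 bytes."""
    charset = charset_from_content_type(content_type, default)
    try:
        text = raw.decode(charset)
    except (LookupError, UnicodeDecodeError):
        # Unknown or wrongly declared charset: fall back and mark bad bytes.
        text = raw.decode(default, errors="replace")
    return text.encode("utf-8")


def fetch_as_utf8(url):
    """Fetch a page and return its body converted to UTF-8 bytes."""
    with urlopen(url) as response:
        content_type = response.headers.get("Content-Type", "")
        return to_utf8(response.read(), content_type)
```

Keeping the conversion in to_utf8, separate from the network call, makes it possible to check the charset handling without fetching anything.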

MySQL

This is a pesky one. MySQL has a serious problem in that the default charset is determined at compile time, and it defaults to "latin1", also known as ISO-8859-1. Why the default is not UTF-8, I do not know. So if you’ve already compiled and installed MySQL, then every time you set up a DB there is a bit of extra work. There are ways to change the MySQL charset, but they require stopping and starting the server with special options. If you have an existing database, it is already set to "latin1", so those steps won’t help anyway. You can also control MySQL charset encoding from within the client. Here is what I usually do to convert or set up a project correctly…

Set Up Project

Make sure to add this to your Django settings file: DEFAULT_CHARSET = 'utf-8'

In mysql type: SET NAMES utf8;

After that type: create database db_name character set utf8;

Run the Django syncdb utility.
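As a rough sketch, the settings side might look like the fragment below. DEFAULT_CHARSET comes from the first step above. The DATABASES dictionary with its OPTIONS and TEST keys is an assumption based on later Django releases and does not exist in the Django version this post was written against; db_name is the placeholder used in the steps.

```python
# settings.py (fragment)

# Charset Django uses for the responses it sends to the browser.
DEFAULT_CHARSET = "utf-8"

# Assumption: newer Django releases configure the database like this.
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.mysql",
        "NAME": "db_name",
        # Passed through to the MySQL driver when the connection is opened.
        "OPTIONS": {"charset": "utf8"},
        # Charset and collation for the database the test runner creates.
        "TEST": {"CHARSET": "utf8", "COLLATION": "utf8_general_ci"},
    }
}
```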

Convert Existing Project

Dump the project database using: mysqldump db_name > db_name.mysql

Edit the dumped text file replacing: ENGINE=MyISAM DEFAULT CHARSET=latin1; with ENGINE=MyISAM DEFAULT CHARSET=utf8;

In MySQL drop the database you just dumped: drop database db_name;

Then type: SET NAMES utf8;

After that type: create database db_name character set utf8;

Exit MySQL and import the edited dump file: mysql -D db_name < db_name.mysql
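The dump-editing step can also be scripted instead of done by hand. The sketch below is one possible way to do it; the function name convert_dump_charset and the sample line are made up for the example, and it only rewrites the charset declaration without touching the row data.

```python
import re

# Matches the table-level default charset clause found in the dump file.
_LATIN1_DEFAULT = re.compile(r"DEFAULT CHARSET=latin1\b")


def convert_dump_charset(dump_text):
    """Return the dump text with latin1 table defaults switched to utf8."""
    return _LATIN1_DEFAULT.sub("DEFAULT CHARSET=utf8", dump_text)


sample = ") ENGINE=MyISAM DEFAULT CHARSET=latin1;"
print(convert_dump_charset(sample))
```

Matching only the DEFAULT CHARSET clause, rather than the whole ENGINE=MyISAM line, means tables using a different storage engine are converted too.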

Final Thoughts

The world is becoming increasingly global and so is development. Most web applications today should expect non-ASCII characters. If you have not done so already, master the art of charset conversion. From my years at PayPal and my time working on Arkayne, I can guarantee it’s a skill every developer will need.

  1. oggy

    Nice article, but I feel that the “UTF-8” section is a bit of a misnomer. UTF-8 is just one of the encodings that fully support Unicode (i.e. you can be sure that .encode('utf-8') on a Unicode string will not result in a UnicodeEncodeError). I know it seems like nitpicking, but people often don’t get the relationship between Unicode strings, bytestrings and encodings, so I think it needs to be made clear that Unicode != UTF-8.

  2. Hi,

    A few misconceptions that might be worth clearing:

    > The problem is that UTF-8 encoding has to be
    > handled correctly all the way through
    > the project.

    This is, for the most part, incorrect. UTF-8 is a great encoding to exchange data over files or sockets, *but* internally you should ONLY use Unicode types. The Unicode type is non-ambiguous, and that’s a very important property.

    See, the difficulty there is that we humans really want /characters/, but the only thing computers know is /bytes/. The problem with bytes is that they can map to characters in one freaking hundred different ways, and so when you get the bytes 0xC3 0xA9, they can mean the character ‘é’ or the characters ‘Ã©’ or something else entirely, and you can’t a priori be certain which.

    Whereas when you get a Unicode string containing ‘é’, then there is no ambiguity whatsoever that this truly is the latin small letter ‘e’ with an acute accent.

    In truth, the name of the Python type ‘string’ is a misnomer, because it makes you think you’re dealing with a string of characters, when in truth what you have is a string of bytes, and understanding the difference between the two is crucial to mastering the whole Unicode/encoding/i18n thing. (That’s why the type ‘string’ will be renamed to ‘bytes’ in Python 3, and ‘unicode’ to ‘string’, by the way.)

    This is why you should decode your byte data to Unicode as early as possible, and encode it back to bytes as late as possible.

    Meaning that instead of doing…

    > data.decode("input_charset_here").encode('utf-8')

    … you should only ever do ‘data.decode()’ when reading data from a file or a socket, and ‘data.encode()’ when sending the data to a file or a socket.

    As of a good while ago, Django explicitly expects you to work with Unicode internally. It’ll do the encoding for you wherever needed.

    > … "latin1", also known as ASCII …

    This is incorrect. Latin1 is also known as ISO-8859-1, but is NOT the same thing as ASCII. Don’t mix up the two or you are headed straight for encoding/decoding exceptions. :)

    > You can also control MySQL charset encoding
    > from within the client.

    Warning, the setting you’re linking to there controls the CLIENT encoding, which is NOT the same thing as the database encoding.

    Let me explain.

    We humans really, truly want to work with characters. When working with databases, we’re generally storing and retrieving intelligible words, not binary numbers. However, the database, being a computer thing, is adamant about wanting binary numbers. So we have to /encode/ our characters so the database can deal with the resulting /bytes/. Problem is, there are two places where the conversion is relevant.

    When you’re storing data in a database, FIRST you need to get your data to the database engine (using a local socket or a network connection), and THEN the database needs to store the data on the disk. Two steps, two places where encodings are involved.

    The database encoding, the one you set using ‘DEFAULT CHARSET’ when creating the database, determines: 1) which characters you’ll be able to store in the database (because not all characters can be represented in every encoding), and 2) the size of the database (because in some encodings, a character can map to several bytes). That’s about it.

    The client encoding determines what encoding the bytes transiting over the SQL socket are using.

    These two encodings can be different; it doesn’t matter, so long as the characters you really want stored can be represented in both.

    What really matters is that when you receive bytes from or send bytes to the SQL socket, you use the same encoding as declared in the client encoding to transform between those bytes and actual characters.

    In practice, UTF-8 is one of those encodings that can map any character in existence at only a small cost in space, so go with it.

    As a rule of thumb, the whole Unicode thing is not as complicated as people generally think it is, but there are a few things you absolutely need to know or else you’ll be spending the rest of your career encoding and decoding things in any way imaginable, never being really sure what’s going on.

    Joel Spolsky wrote an *excellent* article about Unicode. Go read it. Seriously. This article is not very long, yet acts like one of those magic books in the game Oblivion: read it and instantly gain a level. I’m (almost) not kidding. :)

    http://www.joelonsoftware.com/articles/Unicode.html

    You’re lucky enough to be a Python programmer. Python is one of the few popular languages that doesn’t do stupid things with Unicode, but instead helps you do the right thing. Don’t spoil that chance. :)

  3. I reported the ASCII encoding bug to Python:

    http://bugs.python.org/issue3648

    somehow they could not understand it.

  4. For the MySQL issue, you can also use ALTER to set the character set and collation on the database, table and then columns. If you only have a few columns or a lot of data, this may be quicker/easier than a dump and import.

  5. Has anyone seen the problem where the test database Django creates is not utf8?
    Does anyone have any idea how to get around this?

    I have the DEFAULT_CHARSET set to utf-8.

  7. Try setting it to 'utf8'; I remember something about MySQL issues where it doesn't recognize the charset with a dash.

    If that doesn't work try logging into the DB after startup and running:

    SET NAMES 'charset_name'
    SET CHARACTER SET charset_name

    http://dev.mysql.com/doc/refman/5.1/en/charset-...

    It's a hack, but given the alternative of recompiling your DB to default to UTF8, it may be well worth it.

  8. vbabiy86

    So I tried removing the dash and I get this error: (1267, "Illegal mix of collations (latin1_swedish_ci,IMPLICIT) and (utf8_general_ci,COERCIBLE) for operation '='")

    And when the test database is being created it does kick off this command
    SET NAMES utf8

    But no SET CHARACTER SET charset_name

  9. Hmm, sounds like it's working; the error above looks downstream.

    Also, you could always try to dump your DB using mysqldump and then go through each table in the output and ensure UTF8 is the charset.

    Maybe your Django code is running a query attempting to compare tables with different charsets?

  11. Dwayne

    I was looking for this everywhere. Worked wonders. Thanks.
