Django Tips: UTF-8, ASCII Encoding Errors, Urllib2, and MySQL
By Paul Kenjora | September 24, 2008
Having completed many Django projects over the past two years, I’ve started to take some seemingly trivial things for granted. Cross-referencing an old project for one of these solutions a few days back made me realize that these things are not trivial to others. I remember looking through countless pages and forums for answers, only to find more questions. The three main Django stumbling blocks for me were:
UTF-8
Getting UTF-8 working in my Django projects was always a nightmare. For some reason, getting all the pieces to work together felt like a black art. I spent days twiddling settings and calling "encode" and "decode" in every way imaginable. The problem is that encoding has to be handled correctly all the way through the project: at the entry point, when storing to the DB, and when displaying to the user. Keep in mind that UTF-8 is the final encoding you want in your DB; the source and output encodings can and will be different. For encoding issues be methodical, do NOT take shortcuts, and make sure your entry points convert correctly.
For any charset to UTF-8 conversion all you need is: data.decode("input_charset_here").encode('utf-8')
Also make sure your MySQL is set up correctly, as covered in the MySQL section below.
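As a minimal sketch of the one-line conversion above, written in Python 3 syntax, where raw input arrives as bytes (this post predates Python 3). The helper name `to_utf8` and the sample string are made up for illustration:

```python
def to_utf8(data, input_charset):
    """Decode bytes from input_charset, then re-encode them as UTF-8."""
    text = data.decode(input_charset)   # bytes -> Unicode text
    return text.encode("utf-8")         # Unicode text -> UTF-8 bytes


# "café" as it would arrive from a Latin-1 source: é is the single byte 0xE9.
latin1_bytes = "café".encode("latin-1")
utf8_bytes = to_utf8(latin1_bytes, "latin-1")

print(latin1_bytes)  # b'caf\xe9'
print(utf8_bytes)    # b'caf\xc3\xa9'
```

The important part is that the input charset is named explicitly; guessing it is where most of the trouble starts.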
Urllib2
The urllib2 library in Python is an excellent example of an input where encoding will be an issue. For the Arkayne project I pull in pages and analyze their contents. Foreign pages with non-ASCII encodings threw ASCII encode exceptions. I was exploiting the fact that ASCII is a subset of UTF-8, but that doesn’t work for long; it’s a hack. I tried passing "ignore" to the encode function, but I still got errors or missed entire pages. The urllib2 library, and urllib for that matter, does not handle character set encodings, by design. Urllib2 is built to connect to sockets and fetch data using the HTTP protocol. Unfortunately, servers do not declare their encoding consistently, so the library leaves decoding to you.
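To illustrate why both the ASCII assumption and "ignore" fall short, here is a small sketch in Python 3 syntax. The sample text is made up, and in the Python 2 of this post the same failure shows up as an ASCII encode or decode exception:

```python
# A made-up page fragment encoded as Latin-1, as a foreign server might send it.
page = "Señor Müller".encode("latin-1")

# Treating the bytes as ASCII fails on the first non-ASCII byte.
try:
    page.decode("ascii")
    failed = False
except UnicodeDecodeError:
    failed = True

# "ignore" avoids the exception but silently drops every non-ASCII character.
lossy = page.decode("ascii", "ignore")

# Decoding with the real source charset keeps the text intact.
correct = page.decode("latin-1")

print(failed)   # True
print(lossy)    # Seor Mller
print(correct)  # Señor Müller
```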
There is a silver lining, however: most servers do provide a charset value in the Content-Type header. The HTTP standard does specify charset but, like I said, not everyone uses it consistently. Still, it is the best thing to go on if you are importing pages from the web. So to pull in a page with urllib2 and convert it to UTF-8, read the charset from the Content-Type header, decode the page with it, and encode the result as UTF-8.
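The original listing for this step is not available here, so what follows is only a sketch of the approach under some assumptions. It uses Python 3's `urllib.request` (the successor to urllib2), the function names are hypothetical, and the Latin-1 fallback for pages that declare no charset is my choice, not something the HTTP response guarantees:

```python
import urllib.request
from email.message import Message


def charset_from_content_type(content_type, default="latin-1"):
    """Return the charset named in a Content-Type value, or the default."""
    if not content_type:
        return default
    header = Message()
    header["Content-Type"] = content_type
    return header.get_content_charset() or default


def page_to_utf8(raw, content_type, default="latin-1"):
    """Decode raw page bytes with the declared charset, return UTF-8 bytes."""
    charset = charset_from_content_type(content_type, default)
    try:
        text = raw.decode(charset)
    except (LookupError, UnicodeDecodeError):
        # Unknown or wrong charset: fall back rather than lose the page.
        text = raw.decode(default, "replace")
    return text.encode("utf-8")


def fetch_as_utf8(url):
    """Fetch url and return its body converted to UTF-8 bytes."""
    with urllib.request.urlopen(url) as response:
        raw = response.read()
        content_type = response.headers.get("Content-Type")
    return page_to_utf8(raw, content_type)
```

Keeping the header parsing and the conversion separate from the network call means the conversion can be checked against known byte strings without fetching anything.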
MySQL
This is a pesky one. MySQL has a serious problem in that the default charset is determined at compile time and it defaults to "latin1", a single-byte Western European charset that extends ASCII. Why the default is not UTF-8, I do not know. So if you’ve already compiled and installed MySQL, every time you set up a DB there is a bit of extra work. There are ways to change the MySQL server charset, but they require stopping and starting the server with special options. If you have an existing database, it is already set to "latin1", so changing the server default won’t help anyway. You can also control the MySQL charset from within the client. Here is what I usually do to set up or convert a project correctly:
Set Up Project
Make sure to add this to your Django settings file: DEFAULT_CHARSET = 'utf-8'
In mysql type: SET NAMES utf8;
After that type: create database db_name character set utf8;
Run the Django syncdb utility.
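For reference, a sketch of how the settings side of this setup might look. It assumes a later Django version that configures databases through the `DATABASES` dictionary, which postdates this post. The user name and password are placeholders, and the `charset` option is simply handed to the MySQL driver:

```python
# settings.py (fragment)

# Charset Django uses for the HTTP responses it generates.
DEFAULT_CHARSET = "utf-8"

DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.mysql",
        "NAME": "db_name",
        "USER": "db_user",          # hypothetical placeholder
        "PASSWORD": "db_password",  # hypothetical placeholder
        # Passed through to the MySQL driver to request a UTF-8 connection,
        # roughly what SET NAMES utf8 does in the mysql client.
        "OPTIONS": {"charset": "utf8"},
    }
}
```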
Convert Existing Project
Dump the project database using: mysqldump db_name > db_name.mysql
Edit the dumped text file replacing: ENGINE=MyISAM DEFAULT CHARSET=latin1; with ENGINE=MyISAM DEFAULT CHARSET=utf8;
In MySQL drop the database you just dumped: drop database db_name;
Then type: SET NAMES utf8;
After that type: create database db_name character set utf8;
Exit MySQL and import the edited dump file: mysql -D db_name < db_name.mysql
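The search-and-replace step on the dump file can also be scripted instead of done in an editor. This is a minimal sketch with hypothetical function names. It works on raw bytes so the rest of the dump is left untouched, and it is a blind textual replace, so it would also change the same string if it happened to appear inside your data:

```python
OLD = b"DEFAULT CHARSET=latin1"
NEW = b"DEFAULT CHARSET=utf8"


def convert_dump_line(line):
    """Swap the table default charset in one line of a mysqldump file."""
    return line.replace(OLD, NEW)


def convert_dump_file(src_path, dst_path):
    """Copy a dump file line by line, rewriting the default charset."""
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        for line in src:
            dst.write(convert_dump_line(line))


sample = b") ENGINE=MyISAM DEFAULT CHARSET=latin1;\n"
print(convert_dump_line(sample))  # b') ENGINE=MyISAM DEFAULT CHARSET=utf8;\n'
```

Writing to a second file rather than editing in place keeps the original dump around in case the import goes wrong.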
Final Thoughts
The world is becoming increasingly global and so is development. Most web applications today should expect non-ASCII characters. If you have not done so already, master the art of charset conversion. From my years at PayPal and my time working on Arkayne, I can guarantee it’s a skill every developer will need.