Unicode/UTF-8 Notes: I18N Gotchas

Upcoming.org was originally launched on PHP4 and, I believe, MySQL 3.23. As you might imagine, internationalization (i18n) wasn’t at the top of the priority list. As time has gone on, more and more international users have started to use Upcoming. With global geocoding and time zones off our plate, it was time to tackle character encoding.

Now, everyone knows that i18n is easy: just use UTF-8 across the board and you’re done. The reality, of course, is more complicated. While I don’t think there’s any disagreement that Unicode is a Good Thing™, even a decade and a half in, support is still wildly uneven across environments and programming languages. No doubt these notes will be out of date soon, but I’m writing this down so I know where I can pick up next time.

Development Environment

The first thing to do is to make sure that you’re working in a UTF-8 aware environment. Browsers are good with encodings (a little too good with character-set autodetection, but you can force them into whatever display you want; more on that later). OS X is Unicode friendly and Terminal.app is a good place to start. Linux is also pretty good; you can check what your locale settings are w/ “locale” – RHEL defaults to en_US.UTF-8, so display in xterms actually works out of the box (although the Unicode fonts seem to be missing lots of codepoints).
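If you'd rather check the locale from a script than eyeball the output of “locale”, something like the following could work. It's only a rough sketch: the function name locale_is_utf8 is made up for illustration, and it assumes the usual POSIX precedence of LC_ALL, then LC_CTYPE, then LANG.

```python
import os


def locale_is_utf8(env=os.environ):
    """Guess whether the effective character locale uses UTF-8.

    Hypothetical helper: looks at LC_ALL, LC_CTYPE, then LANG and
    inspects the codeset part of the first one that is set.
    """
    for var in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(var)
        if value:
            # "en_US.UTF-8@euro" -> codeset "UTF-8"
            codeset = value.partition(".")[2].partition("@")[0]
            return codeset.lower().replace("-", "") == "utf8"
    return False


if __name__ == "__main__":
    print("UTF-8 locale:", locale_is_utf8())
```

Passing a plain dict instead of os.environ makes it easy to try out different settings without touching your shell.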

Beyond your locale, you’ll also want to make sure that your pager can display UTF-8. You can add export LESSCHARSET=utf-8 to your .bashrc.

Lastly, and for me, the most important tool is Vim. Version 7 has improved Unicode support, but for me, v6.1 worked fine in binary mode. Being able to edit a file (vim -b), move over to the character, and type g8 to get the hex values of the bytes in its UTF-8 sequence was the easiest way for me to verify transcoding/character set issues.
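For reference, here is a small Python sketch that prints the same kind of information: the UTF-8 bytes of a character, in hex. The helper name utf8_hex is just something I made up for illustration; it is not part of Vim or any library.

```python
def utf8_hex(text: str) -> str:
    """Return the UTF-8 bytes of text as space-separated hex pairs."""
    return " ".join(f"{byte:02x}" for byte in text.encode("utf-8"))


if __name__ == "__main__":
    print(utf8_hex("é"))  # c3 a9
    print(utf8_hex("™"))  # e2 84 a2
```

If a character that should be two bytes shows up as four, that's a good hint it has been encoded twice somewhere along the way.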

MySQL

We’re now using MySQL 4.1.x, the later builds of which have pretty good character set support. There are still outstanding collation issues, and things like UTF-8 corruption in InnoDB for <4.1.16 and character_set_name problems in <4.1.19 are troubling, but overall, once you set the bajillion character_set variables to "utf8", things seem to work alright. Since we remained in latin1 when we went from 4.0 to 4.1, we avoided some big headaches, especially since we actually were storing many different character encodings (MySQL 4.0 and before simply treated values as binary, so this wasn’t a problem). A couple of ALTERs later and we’re now in utf8/utf8_general_ci with a mixed bag of binary data (untouched by the ALTERs).

Python

Of the languages that we use, Python has the best Unicode support (just watch out for the wacky print behavior; another howto). That and Mark Pilgrim’s chardet package made it a no-brainer to do character set conversions in Python. Unfortunately, I ran into some hairiness with trying to get MySQLdb to write UTF-8. That eventually got solved (if you’re running a setup w/ a broken set_character_set(), force it through the my.cnf); in the meantime, I worked around the problem by sending the UTF-8 through the mysqlclient.
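To give a flavor of what the conversion step looks like, here is a minimal sketch. It is not the script I actually ran, and it deliberately swaps chardet's detection for a much cruder stand-in: try strict UTF-8 first and, if that fails, assume a fallback encoding (latin-1 here, which is purely an assumption). The name to_utf8 is hypothetical.

```python
def to_utf8(raw: bytes, fallback: str = "latin-1") -> bytes:
    """Return raw re-encoded as UTF-8.

    Crude stand-in for real charset detection: bytes that already
    decode as strict UTF-8 pass through unchanged; anything else is
    assumed to be in the fallback encoding.
    """
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        text = raw.decode(fallback)
    return text.encode("utf-8")


if __name__ == "__main__":
    print(to_utf8(b"caf\xe9"))      # latin-1 input gets converted
    print(to_utf8(b"caf\xc3\xa9"))  # valid UTF-8 is left alone
```

With genuinely mixed data a single fallback will guess wrong some of the time, which is exactly why a detector like chardet is worth having.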

Perl

Perl has had, um, issues with Unicode. 5.8 apparently fixes lots of things – too bad I’m running 5.6.1. Still, I didn’t have too many problems with it. The locally scoped utf8 pragmas are sort of weird, but seem to work as advertised (so glad I coded all my string handling in a modular manner). The one problem I had with DBI not returning UTF-8 was solved with a quick search – interestingly, SET NAMES worked perfectly for DBI even when it did bupkis for MySQLdb.

PHP

PHP had potentially the worst Unicode handling, but PHP5 improves things a lot, with native UTF-8 support (just remember to set it as the default character set). Since the stuff I was working on didn’t require string manipulations at all, I completely lucked out here. The only thing I needed to fix was a stray utf8_encode() call that was crushing the UTF-8.
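In case it's not obvious why a stray utf8_encode() is destructive: the function treats its input as ISO-8859-1 and converts it to UTF-8, so feeding it bytes that are already UTF-8 encodes them a second time. Here is a sketch of the effect in Python; php_style_utf8_encode is a made-up name that only imitates that behavior.

```python
def php_style_utf8_encode(data: bytes) -> bytes:
    """Imitate the effect of PHP's utf8_encode(): read the bytes as
    ISO-8859-1 and write them back out as UTF-8."""
    return data.decode("latin-1").encode("utf-8")


if __name__ == "__main__":
    original = "é".encode("utf-8")             # b'\xc3\xa9'
    crushed = php_style_utf8_encode(original)  # b'\xc3\x83\xc2\xa9'
    print(original, crushed)
    print(crushed.decode("utf-8"))             # Ã©
```

Two bytes become four, and the browser dutifully renders them as two garbage characters instead of one accented letter.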

Summary

So, ironically, I ended up having the most problems with the language with the most complete Unicode support. This whole deal was quite the PITA, but on the bright side, it’s done, and everything seems to have squeaked by with at least a passable level of workingness. (I’m not sure this would have worked so “smoothly” even as late as last year.)