Friday, June 22, 2012

The right way to internationalize your Python app

Recently, as part of our push to ship only Python 3 on the Ubuntu 12.10 desktop, I've helped several projects update their internationalization (i18n) support.  I've seen lots of instances of suboptimal Python 2 i18n code, which leads to liberal sprinkling of cargo culted .decode() and .encode() calls simply to avoid the dreaded UnicodeErrors.  These get worse when the application or library is ported to Python 3 because then even the workarounds aren't enough to prevent nasty failures in non-ASCII environments (i.e. the non-English speaking world majority :).

Let's be honest though, the problem is not because these developers are crappy coders! In fact, far from it, the folks I've talked with are really really smart, experienced Pythonistas.  The fundamental problem is Python 2's 8-bit string type which doubles as a bytes type, and the terrible API of the built-in Python 2 gettext module, which does its utmost to sabotage your Python 2 i18n programs.  I take considerable blame for the latter, since I wrote the original version of that module.  At the time, I really didn't understand unicodes (this is probably also evident in the mess I made of the email package).  Oh, to really have access to Guido's time machine.

The good news is that we now know how to do i18n right, especially in a bilingual Python 2/3 world, and the Python 3 gettext module fixes the most egregious problems in the Python 2 version.  Hopefully this article does some measure of making up for my past sins.

Stop right here and go watch Ned Batchelder's talk from PyCon 2012 entitled Pragmatic Unicode, or How Do I Stop the Pain?  It's the single best description of the background and effective use of Unicode in Python you'll ever see.  Ned does a brilliant job of resolving all the FUD.

...

Welcome back.  Your Python application is multi-language friendly, right?  I mean, I'm as functionally monolinguistic as most Americans, but I love the diversity of languages we have in the world, and appreciate that people really want to use their desktop and applications in their native language.  Fortunately, once you know the tricks it's not that hard to write good i18n'd Python code, and there are many good FLOSS tools available for helping volunteers translate your application, such as Pootle, Launchpad translations, Translatewiki, Transifex, and Zanata.

So there really is no excuse not to i18n your Python application.  In fact, GNU Mailman has been i18n'd for many years, and pioneered the supporting code in Python's standard library, namely the gettext module.  As part of the Mailman 3 effort, I've also written a higher level library called flufl.i18n which makes it even easier to i18n your application, even in tricky multi-language contexts such as server programs, where you might need to get a German translation and a French translation in one operation, then turn around and get Japanese, Italian, and English for the next operation.

In one recent case, my colleague was having a problem with a simple command line program.  What's common about these types of applications is that you fire them up once, they run to completion then exit, and they only have to deal with one language during the entire execution of the program, specifically the language defined in the user's locale.  If you read the gettext module's documentation, you'd be inclined to do this at the very start of your application:

from gettext import gettext as _
gettext.textdomain(my_program_name)

then, you'd wrap translatable strings in code like this:

print _('Here is something I want to tell you')

What gettext does is look up the source string (i.e. the argument to the underscore function) in a translation catalog, returning the text in the appropriate language, which will then be printed.  There are some additional details regarding i18n that I won't go into here.  If you're curious, ask in the comments, and I'll try to fill things in.

Anyway, if you do write the above code, you'll be in for a heap of trouble, as my colleague soon found out.  Just running his program with --help in a French locale, he was getting the dreaded UnicodeEncodeError:

"UnicodeEncodeError: 'ascii' codec can't encode character"

I've also seen reports of such errors when trying to send translated strings to a log file (a practice which I generally discourage, since I think log messages usually shouldn't be translated).  In any case, I'm here to tell you why the above "obvious" code is wrong, and what you should do instead.

First, why is that code wrong, and why does it lead to the UnicodeEncodeErrors?  What might not be obvious from the Python 2 gettext documentation is that gettext.gettext() always returns 8-bit strings (a.k.a. byte strings in Python 3 terminology), and these 8-bit strings are encoded with the charset defined in the language's catalog file.

It's always best practice in Python to deal with human readable text using unicodes.  This is traditionally more problematic in Python 2, where English programs can cheat and use 8-bit strings and usually not crash, since their character range is compatible with ASCII and you only ever print to English locales.  As soon as your French friend uses your program though, you're probably going to run into trouble.  By using unicodes everywhere, you can generally avoid such problems, and in fact it will make your life much easier when you eventually switch to Python 3.

So the 8-bit strings that gettext.gettext() hands you have already sunk you, and to avoid the pain, you'd want to convert them back to unicodes before you use them in any way.  However, converting to unicodes makes the i18n APIs much less convenient, so no one does it until there's way too much broken code to fix.

What you really want in Python 2 is something like this:

from gettext import ugettext as _

which you'd think you should be able to do, the "u" prefix meaning "give me unicode".  But for reasons I can only describe as based on our misunderstandings of unicode and i18n at the time, you can't actually do that, because ugettext() is not exposed as a module-level function.  It is available in the class-based API, but that's a more advanced API that again almost no one uses.  Sadly, it's too late to fix this in Python 2.  The good news is that in Python 3 it is fixed, not by exposing ugettext(), but by changing the most commonly used gettext module APIs to return unicode strings directly, as it always should have done.  In Python 3, the obvious code just works:

from gettext import gettext as _

What can you do in Python 2 then?  Here's what you should use instead of the two lines of code at the beginning of this article:

_ = gettext.translation(my_program_name).ugettext

and now you can wrap all your translatable strings in _('Foo') and it should Just Work.

Perhaps more usefully, you can use the gettext.install() function to put _() into the built-in namespace, so that all your other code can just use that function without doing anything special.  Again, though we have to work around the boneheaded Python 2 API.  Here's how to write code which works correctly in both Python 2 and Python 3.

import sys, gettext
kwargs = {}
if sys.version_info[0] < 3:
    # In Python 2, ensure that the _() that gets installed into built-ins
    # always returns unicodes.  This matches the default behavior under Python
    # 3, although that keyword argument is not present in the Python 3 API.
    kwargs['unicode'] = True
gettext.install(my_program_name, **kwargs)

Or you can use the flufl.i18n API, which always uses returns unicode strings in both Python 2 and Python 3.

Also interesting was that I could never reproduce the crash when ssh'd into the French locale VM. It would only crash for me when I was logged into a terminal on the VM's graphical desktop.  The only difference between the two that I could tell was that in the desktop's terminal, locale(8) returned French values (e.g. fr_FR.UTF-8) for everything, but in the ssh console, it returned the French values for everything except the LC_CTYPE environment variable.  For the life of me, I could not get LC_CTYPE set to anything other than en_US.UTF-8 in the ssh context, so the reproducible test case would just return the English text, and not crash.  This happened even if I explicitly set that environment variable either as a separate export command in the shell, or as a prefix to the normally crashing command.  Maybe there's something in ssh that causes this, but I couldn't find it.

One last thing.  It's important to understand that Python's gettext module only handles Python strings, and other subsystems may be involved.  The classic example is GObject Introspection, the newest and recommended interface to the GNOME Object system.  If your Python-GI based project needs to translate strings too (e.g. in menus or other UI elements), you'll have to use both the gettext API for your Python strings, and set the locale for the C-based bits using locale.setlocale().  This is because Python's API does not set the locale automatically, and Python-GI exposes no other way to control the language it uses for translations.