Friday, June 22, 2012

The right way to internationalize your Python app

Recently, as part of our push to ship only Python 3 on the Ubuntu 12.10 desktop, I've helped several projects update their internationalization (i18n) support.  I've seen lots of instances of suboptimal Python 2 i18n code, which leads to a liberal sprinkling of cargo-culted .decode() and .encode() calls simply to avoid the dreaded UnicodeErrors.  These get worse when the application or library is ported to Python 3, because then even the workarounds aren't enough to prevent nasty failures in non-ASCII environments (i.e. the non-English-speaking majority of the world :).

Let's be honest though, the problem is not because these developers are crappy coders! In fact, far from it, the folks I've talked with are really really smart, experienced Pythonistas.  The fundamental problem is Python 2's 8-bit string type which doubles as a bytes type, and the terrible API of the built-in Python 2 gettext module, which does its utmost to sabotage your Python 2 i18n programs.  I take considerable blame for the latter, since I wrote the original version of that module.  At the time, I really didn't understand unicodes (this is probably also evident in the mess I made of the email package).  Oh, to really have access to Guido's time machine.

The good news is that we now know how to do i18n right, especially in a bilingual Python 2/3 world, and the Python 3 gettext module fixes the most egregious problems in the Python 2 version.  Hopefully this article does some measure of making up for my past sins.

Stop right here and go watch Ned Batchelder's talk from PyCon 2012 entitled Pragmatic Unicode, or How Do I Stop the Pain?  It's the single best description of the background and effective use of Unicode in Python you'll ever see.  Ned does a brilliant job of resolving all the FUD.

...

Welcome back.  Your Python application is multi-language friendly, right?  I mean, I'm as functionally monolinguistic as most Americans, but I love the diversity of languages we have in the world, and appreciate that people really want to use their desktop and applications in their native language.  Fortunately, once you know the tricks it's not that hard to write good i18n'd Python code, and there are many good FLOSS tools available for helping volunteers translate your application, such as Pootle, Launchpad translations, Translatewiki, Transifex, and Zanata.

So there really is no excuse not to i18n your Python application.  In fact, GNU Mailman has been i18n'd for many years, and pioneered the supporting code in Python's standard library, namely the gettext module.  As part of the Mailman 3 effort, I've also written a higher level library called flufl.i18n which makes it even easier to i18n your application, even in tricky multi-language contexts such as server programs, where you might need to get a German translation and a French translation in one operation, then turn around and get Japanese, Italian, and English for the next operation.

In one recent case, my colleague was having a problem with a simple command line program.  What's common about these types of applications is that you fire them up once, they run to completion then exit, and they only have to deal with one language during the entire execution of the program, specifically the language defined in the user's locale.  If you read the gettext module's documentation, you'd be inclined to do this at the very start of your application:

import gettext
from gettext import gettext as _
gettext.textdomain(my_program_name)

then, you'd wrap translatable strings in code like this:

print _('Here is something I want to tell you')

What gettext does is look up the source string (i.e. the argument to the underscore function) in a translation catalog, returning the text in the appropriate language, which will then be printed.  There are some additional details regarding i18n that I won't go into here.  If you're curious, ask in the comments, and I'll try to fill things in.

Anyway, if you do write the above code, you'll be in for a heap of trouble, as my colleague soon found out.  Just running his program with --help in a French locale, he was getting the dreaded UnicodeEncodeError:

"UnicodeEncodeError: 'ascii' codec can't encode character"

I've also seen reports of such errors when trying to send translated strings to a log file (a practice which I generally discourage, since I think log messages usually shouldn't be translated).  In any case, I'm here to tell you why the above "obvious" code is wrong, and what you should do instead.

First, why is that code wrong, and why does it lead to the UnicodeEncodeErrors?  What might not be obvious from the Python 2 gettext documentation is that gettext.gettext() always returns 8-bit strings (a.k.a. byte strings in Python 3 terminology), and these 8-bit strings are encoded with the charset defined in the language's catalog file.
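
To make that concrete, here's what you'd see at a Python 2 prompt, assuming a hypothetical 'myapp' domain whose French catalog is encoded in UTF-8 (the domain and catalog are stand-ins for illustration):

import gettext
gettext.textdomain('myapp')

s = gettext.gettext('Here is something I want to tell you')
print type(s)    # <type 'str'> -- an 8-bit string, not a unicode
print repr(s)    # the translated text as raw UTF-8 bytes, e.g. containing '\xc3\xa9'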

It's always best practice in Python to deal with human readable text using unicodes.  This is traditionally more problematic in Python 2, where English programs can cheat and use 8-bit strings and usually not crash, since their character range is compatible with ASCII and you only ever print to English locales.  As soon as your French friend uses your program though, you're probably going to run into trouble.  By using unicodes everywhere, you can generally avoid such problems, and in fact it will make your life much easier when you eventually switch to Python 3.
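
Here's what both failure modes look like at a Python 2 prompt (the French word is just an illustration): the first line forces an implicit ASCII decode of the byte string, and the second forces an implicit ASCII encode when stdout is ASCII-only, e.g. output piped in a C locale:

u'Erreur : ' + '\xc3\xa9chec'
# UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0 ...

print u'\xe9chec'
# UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 0 ...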

So the 8-bit strings that gettext.gettext() hands you have already sunk you, and to avoid the pain, you'd want to convert them back to unicodes before you use them in any way.  However, converting to unicodes makes the i18n APIs much less convenient, so no one does it until there's way too much broken code to fix.
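
For the record, here's roughly what that tedious workaround looks like in Python 2; 'myapp' is a stand-in domain, and fallback=True just keeps the sketch from raising IOError when no catalog is installed:

import gettext

# The decode-it-yourself workaround (a sketch, not a recommendation).
translation = gettext.translation('myapp', fallback=True)

def _(message):
    translated = translation.gettext(message)
    if isinstance(translated, str):
        # Python 2: gettext() returned an 8-bit string encoded with the
        # catalog's charset; turn it back into a unicode.
        translated = translated.decode(translation.charset() or 'utf-8')
    return translated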

What you really want in Python 2 is something like this:

from gettext import ugettext as _

which you'd think you should be able to do, the "u" prefix meaning "give me unicode".  But for reasons I can only describe as based on our misunderstandings of unicode and i18n at the time, you can't actually do that, because ugettext() is not exposed as a module-level function.  It is available in the class-based API, but that's a more advanced API that again almost no one uses.  Sadly, it's too late to fix this in Python 2.  The good news is that in Python 3 it is fixed, not by exposing ugettext(), but by changing the most commonly used gettext module APIs to return unicode strings directly, as it always should have done.  In Python 3, the obvious code just works:

from gettext import gettext as _

What can you do in Python 2 then?  Here's what you should use instead of the setup code at the beginning of this article:

_ = gettext.translation(my_program_name).ugettext

and now you can wrap all your translatable strings in _('Foo') and it should Just Work.
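
Putting it all together, a minimal Python 2 command line program looks something like this ('myapp' is a stand-in for your translation domain, and fallback=True is an addition so the sketch still runs when no catalog is installed):

import gettext

# ugettext() always returns unicodes, regardless of the catalog's charset.
_ = gettext.translation('myapp', fallback=True).ugettext

print _('Here is something I want to tell you')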

Perhaps more usefully, you can use the gettext.install() function to put _() into the built-in namespace, so that all your other code can just use that function without doing anything special.  Again, though, we have to work around the boneheaded Python 2 API.  Here's how to write code which works correctly in both Python 2 and Python 3.

import sys, gettext
kwargs = {}
if sys.version_info[0] < 3:
    # In Python 2, ensure that the _() that gets installed into built-ins
    # always returns unicodes.  This matches the default behavior under Python
    # 3, although that keyword argument is not present in the Python 3 API.
    kwargs['unicode'] = True
gettext.install(my_program_name, **kwargs)
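
With that in place, every other module in your program can call _() without importing anything, because install() puts it into the built-in namespace (the module and function names here are just for illustration):

# some_other_module.py -- note: no gettext import and no _ import needed.
def describe_something():
    return _('Here is something I want to tell you')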

Or you can use the flufl.i18n API, which always returns unicode strings in both Python 2 and Python 3.
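
For the simple single-language case, the flufl.i18n setup is roughly this much code, using its initialize() convenience function ('myapp' is again a stand-in for your real translation domain):

from flufl.i18n import initialize

# One call sets up the default translation context for the application;
# the returned _() always hands back unicodes.
_ = initialize('myapp')

print _('Here is something I want to tell you')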

Also interesting was that I could never reproduce the crash when ssh'd into the French locale VM. It would only crash for me when I was logged into a terminal on the VM's graphical desktop.  The only difference between the two that I could tell was that in the desktop's terminal, locale(8) returned French values (e.g. fr_FR.UTF-8) for everything, but in the ssh console, it returned the French values for everything except the LC_CTYPE environment variable.  For the life of me, I could not get LC_CTYPE set to anything other than en_US.UTF-8 in the ssh context, so the reproducible test case would just return the English text, and not crash.  This happened even if I explicitly set that environment variable either as a separate export command in the shell, or as a prefix to the normally crashing command.  Maybe there's something in ssh that causes this, but I couldn't find it.

One last thing.  It's important to understand that Python's gettext module only handles Python strings, and other subsystems may be involved.  The classic example is GObject Introspection, the newest and recommended interface to the GNOME Object system.  If your Python-GI based project needs to translate strings too (e.g. in menus or other UI elements), you'll have to use both the gettext API for your Python strings, and set the locale for the C-based bits using locale.setlocale().  This is because Python's API does not set the locale automatically, and Python-GI exposes no other way to control the language it uses for translations.
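
Here's a sketch of that dual setup for a Python 3 GI application; 'myapp' and the /usr/share/locale directory are stand-ins, and locale.bindtextdomain() is only available on Unix-like systems:

import locale
import gettext

# Let the C side (GTK+ et al., via GObject Introspection) see the user's
# locale and find the compiled message catalogs.
locale.setlocale(locale.LC_ALL, '')
locale.bindtextdomain('myapp', '/usr/share/locale')

# Set up gettext for the Python side; install() puts _() into builtins.
gettext.install('myapp', '/usr/share/locale')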

11 comments:

  1. I haven't used it yet, but I just read the documentation of flufl.i18n, and also this blog post, of course.

    I notice the use of '_' as a "quick-and-dirty" function name for access to the string translation service in both cases. This concerns me a bit, and I think I would advocate for recommending a different convention. Specifically, I quite often use '_' already in my code as a placeholder variable, and this seems to be a common convention lots of places. E.g.:

    important,_,_,good_stuff,_ = struct_of_some_use(foo)

    Lots of times I'm interested in some but not all of the return values (i.e. a returned tuple) from some call, and the '_' convention to indicate which ones I don't care about in this context is used a lot.

    Anyway, it's obvious enough that 'from gettext import gettext as _' is going to interact badly with that existing code that already uses the '_' convention.

    Perhaps the i18n stuff would be better served by picking a different single letter (or pair at worst). That doesn't make conflict impossible, but at least it doesn't step on a widespread use.

  2. SSH can set some env variables from the client machine. If you do 'man ssh' and then search for "Environment". I don't see LC_CTYPE there, but I'm pretty sure I've seen ssh set LANG or something like that before.

  3. "I could not get LC_CTYPE set to anything other than en_US.UTF-8 in the ssh context, so the reproducible test case would just return the English text"

    LC_CTYPE only affects isalnum/toupper etc, and shouldn't affect messaging at all, so I'm not sure how that could happen.

  4. (OPENID Y U NO USE THE CORRECT NAME?)

  5. @David: I've seen that idiom occasionally, but IME it isn't that common. _() was chosen because of the long history in GNU gettext convention. It's used all over in C code, and the Python API was modeled on the GNU gettext API. Fortunately, in Python it's easy enough to change your import statement if the name clashes.

    @jam: Yep, I did check into ssh and couldn't find any mention of LC_CTYPE there either.

    @effbot: Right, I have no idea whether the LC_CTYPE difference is what caused it to not crash under ssh (TBH, I didn't spend a lot of time digging into it), but it was the only locale(8) output that differed.

  6. One other little unicode happiness hint I should mention. I've taken to always adding this to the top of my Python 2 code:

    from __future__ import unicode_literals

    (I also import absolute_import but that's a different matter). What's nice is that I now ensure that all my literals are unicode without the pesky u'' prefix. Only occasionally does this "byte" me, but a b'' prefix takes care of that (yep, I generally try to limit my support to >= 2.6).

    Replies
    1. Did not know about the ``unicode_literals`` trick. This is awesome.

  7. Instead of

    _ = gettext.translation(my_program_name).ugettext

    you can also do

    gettext.install(my_program_name, unicode=True)

    for application-wide installation of the gettext function(s).

  8. A good reason for sticking with _ for gettext is that tools used to extract message IDs from source code usually expect this notation.

    Sometimes it conflicts with _ used as a do-not-care variable name. Sometimes it conflicts with the use of _ as the result of the last expression in an interactive Python prompt or doctests (that was a fun debugging session). Those conflicts can be worked around.

    LC_CTYPE is used to specify the locale encoding, so it's pretty important. Try this in a shell:
    LC_CTYPE=C python -c 'import sys; print sys.stdout.encoding'. That might explain the (lack of) Unicode errors when printing non-ASCII strings when LC_CTYPE is different, but you said you were getting different translations as well? Those are determined by LC_ALL/LC_MESSAGES/LANG or LANGUAGE, and I'm not quite sure what the priorities of the respective environment variables are.

  9. Weblate is the best web based tool that helps with translation IMHO (not only due to awesome Git repo integration)

  10. gettext_lazy works better for me.
    _("Foreign language ....").title() should be done if returned as JSON
