Friday, May 10, 2013

Resource management in Python 3.3, or contextlib.ExitStack FTW!

I'm writing a bunch of new code these days for Ubuntu Touch's Image Based Upgrade system.  Think of it essentially as Ubuntu Touch's version of upgrading the phone/tablet (affectionately called phablet) operating system in a bulk way rather than piecemeal apt-gets the way you do it on a traditional Ubuntu desktop or server.  One of the key differences is that a phone has to detour through a reboot in order to apply an upgrade since its Ubuntu root file system is mounted read-only during the user session.

Anyway, those details aren't the focus of this article.  Instead, just realize that because it's a pile of new code, and because we want to rid ourselves of Python 2, at least on the phablet image if not everywhere else in Ubuntu, I am prototyping all this in Python 3, and specifically 3.3.  This means that I can use all the latest and greatest cool stuff in the most recent stable Python release.  And man, is there a lot of cool stuff!

One module in particular that I'm especially fond of is contextlib.  Context managers are objects implementing the protocol behind the with statement, and they are typically used to guarantee that some resource is cleaned up properly, even in the event of error conditions.  When you see code like this:

with open(somefile) as fp:
    data = fp.read()

you are invoking a context manager.  Python was clever enough to make file objects support the context manager protocol so that you never have to explicitly close the file; that happens automatically when the with statement completes, regardless of whether the code inside the with statement succeeds or raises an exception.
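Under the hood, that with statement behaves roughly like this manual version (a sketch; somefile stands in for any path):

fp = open(somefile)
try:
    data = fp.read()
finally:
    # Runs whether read() succeeded or raised an exception.
    fp.close()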

It's also very easy to define your own context managers to properly handle other kinds of resources.  I won't go into too much detail here, because this is all well-established; the with statement has been, er, with us since Python 2.5.

You may be familiar with the contextlib module because of the @contextmanager decorator it provides.  This makes it trivial to define a new context manager without having to deal with all the intricacies of the protocol.  For example, here's how you would implement a context manager that temporarily changes the current working directory:

import os
from contextlib import contextmanager

@contextmanager
def chdir(dir):
    cwd = os.getcwd()
    try:
        os.chdir(dir)
        yield
    finally:
        os.chdir(cwd)

In this example, the yield cedes control back to the body of the with statement, and when that completes, the code after the yield is executed.  Because the yield is wrapped inside a try/finally, it is guaranteed that the original working directory is restored.  You would use this code like so:

with chdir('/tmp'):
    print(os.getcwd())

So far, so good, but this is nothing revolutionary.  Python 3.3 brings additional awesomeness to contextlib by way of the new ExitStack class.

The documentation for ExitStack is a bit dense, and even the examples didn't originally make it clear to me how amazing this new API is.  In my opinion, it is so powerful that it completely changes the way you think about writing safe code.

So what is an ExitStack?  One way to think about it is as an extensible context manager.  It's used in with statements just like any other context manager:

from contextlib import ExitStack
with ExitStack() as stack:
    pass  # do some magical stuff

Just like any other context manager, the ExitStack's "exit" code is guaranteed to run at the end of the with statement.  The cool stuff comes from the ExitStack's programmable extensibility.

The first interesting method of an ExitStack you might use is the callback() method.  Let's say for example that in your with statement, you are creating a temporary directory and you want to make sure that temporary directory gets deleted when the with statement exits.  You could do something like this:

import shutil, tempfile
with ExitStack() as stack:
    tempdir = tempfile.mkdtemp()
    stack.callback(shutil.rmtree, tempdir)

Now, when the with statement completes, it calls all of its callbacks, which includes removing the temporary directory.
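One detail worth knowing: an ExitStack invokes its callbacks in last-in, first-out order, just as a series of nested with statements would unwind.  A trivial illustration:

from contextlib import ExitStack

with ExitStack() as stack:
    stack.callback(print, 'registered first')
    stack.callback(print, 'registered second')

# Prints 'registered second', then 'registered first'.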

So, what's the big deal?  Let's say you're actually creating three temporary directories and any of those calls could fail.  To guarantee that all successfully created directories are deleted at the end of the with statement, regardless of whether an exception occurred in the middle, you could do this:

with ExitStack() as stack:
    tempdirs = []
    for i in range(3):
        tempdir = tempfile.mkdtemp()
        stack.callback(shutil.rmtree, tempdir)
        tempdirs.append(tempdir)
    # Do something with the tempdirs

If you knew statically that you wanted three temporary directories, you could set this up with nested with statements, or a single with statement containing multiple comma-separated context managers (continued across lines with backslashes), but that gets unwieldy very quickly, as the sketch below shows.  And besides, that approach is impossible if you only learn the number of directories you need dynamically, at run time.  On the other hand, the ExitStack makes it easy to guarantee everything gets cleaned up and there are no leaks.
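For contrast, here's a sketch of the static version, using Python 3.2's tempfile.TemporaryDirectory context manager.  It's already clunky at three, and it simply can't adapt to a count known only at run time:

import tempfile

with tempfile.TemporaryDirectory() as dir1, \
     tempfile.TemporaryDirectory() as dir2, \
     tempfile.TemporaryDirectory() as dir3:
    # Do something with the three directories; they are all removed
    # automatically when the with statement exits.
    print(dir1, dir2, dir3)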

That's powerful enough, but it's not all you can do!  Another very useful method is enter_context().

Let's say that you are opening a bunch of files and you want the following behavior: if all of the files open successfully, you want to do something with them, but if any of them fail to open, you want to make sure that the ones that did get open are guaranteed to get closed.  Using ExitStack.enter_context() you can write code like this:

files = []
with ExitStack() as stack:
    for filename in filenames:
        # Open the file and automatically add its context manager to the stack.
        # enter_context() returns the passed in context manager, i.e. the 
        # file object.
        fp = stack.enter_context(open(filename))
        files.append(fp)
    # Capture the close method, but do not call it yet.
    close_all_files = stack.pop_all().close

(Note that the contextlib documentation contains a more efficient, but denser way of writing the same thing.)

So what's going on here?  First, open(filename) does what it always does, of course: it opens the file and returns a file object, which is also a context manager.  However, instead of using that file object in a with statement, we add it to the ExitStack by passing it to the enter_context() method.  For convenience, this method returns the passed-in object.

So what happens if one of the open() calls fails before the loop completes?  The with statement will exit as normal and the ExitStack will exit all the context managers it knows about.  In other words, all the files that were successfully opened will get closed.  Thus, in an error condition, you will be left with no open files and no leaked file descriptors, etc.

What happens if the loop completes and all files got opened successfully?  Ah, that's where the next bit of goodness comes into play: the ExitStack's pop_all() method.

pop_all() creates a new ExitStack, and populates it from the original ExitStack, removing all the context managers from the original ExitStack.  So, after stack.pop_all() completes, the original ExitStack, i.e. the one used in the with statement, is now empty.  When the with statement exits, the original ExitStack contains no context managers so none of the files are closed.

Well, then, how do you close all the files once you're done with them?  That's the last bit of magic.  ExitStacks have a .close() method which unwinds all the registered context managers and callbacks and invokes their exit functionality.  So, after you're finally done with all the files and you want to clean everything up, you would just do:

close_all_files()

And that's it.
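Putting pop_all() and close() together in miniature (a sketch):

from contextlib import ExitStack

with ExitStack() as stack:
    stack.callback(print, 'cleaning up')
    saved = stack.pop_all()
    # stack is now empty, so nothing prints when the with statement exits.

saved.close()
# Only now does 'cleaning up' print.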

Hopefully that all makes sense.  I know it took a while to sink in for me, but now that it has, the enormous power this gives you is clear.  You can write much safer code, because it's far easier to guarantee that your resources are cleaned up at the right time.

The real power comes when you have many different disparate resources to clean up for a particular operation.  For example, in the test suite for the Image Based Upgrader, I have a test where I need to create a temporary directory and start an HTTP server in a thread.  Roughly, my code looks like this:

@classmethod
def setUpClass(cls):
    cls._cleaner = ExitStack()
    try:
        cls._serverdir = tempfile.mkdtemp()
        cls._cleaner.callback(shutil.rmtree, cls._serverdir)
        # ...
        cls._stop = make_http_server(cls._serverdir)
        cls._cleaner.callback(cls._stop)
    except:
        cls._cleaner.pop_all().close()
        raise

@classmethod
def tearDownClass(cls):
    cls._cleaner.close()

Notice there's no with statement there at all. :)   This is because the resources must remain open until tearDownClass() is called, unless some exception occurs during setUpClass().  If that happens, the bare except ensures that all the context managers are properly closed, leaving the original ExitStack empty.  (The bare except is acceptable here because the exception is re-raised after the resources are cleaned up.)  An exception will prevent tearDownClass() from being called, but even if it does get called for some odd reason, that's still safe, because the original ExitStack is empty.

But if no exception occurs, the original ExitStack will contain all the context managers that need to be closed, and calling .close() on it in the tearDownClass() does exactly that.

I have one more example from my recent code.  Here, I need to create a GPG context (the details are unimportant), and then use that context to verify the detached signature of a file.  If the signature matches, then everything's good, but if not, then I want to raise an exception and throw away both the data file and the signature (i.e. .asc) file.  Here's the code:

with ExitStack() as stack:
    ctx = stack.enter_context(Context(pubkey_path))
    if not ctx.verify(asc_path, channels_path):
        # The signature did not verify, so arrange for the .json and .asc
        # files to be removed before we raise the exception.
        stack.callback(os.remove, channels_path)
        stack.callback(os.remove, asc_path)
        raise FileNotFoundError


Here we create the GPG context, which itself is a context manager, but instead of using it in a with statement, we add it to the ExitStack.  Then we verify the detached signature (asc_path) of a data file (channels_path), and only arrange to remove those files if the verification fails.  When the FileNotFoundError is raised, the ExitStack in the with statement unwinds, removing both files and closing the GPG context.  Of course, if the signature matches, only the GPG context is closed -- the channels_path and asc_path files are not removed.

You can see how an ExitStack actually functions as a fairly generic resource manager!

To me, this revolutionizes the management of external resources.  The new ExitStack object, and the methods and semantics it exposes, make it so much easier to manage those resources, guaranteeing that they get cleaned up at the right time, once and only once, regardless of whether errors occur or not.

ExitStack takes the already powerful concept of context managers and turns it up to 11.  There's more you can do, and it's worth spending some time reading the contextlib documentation in Python 3.3, especially the examples and recipes.

As I mentioned on Twitter, it's features like this that make using Python 2 seem downright barbaric.

Thursday, April 18, 2013

Python 3 Language Gotcha - and a short reminiscence

There's a lot of Python nostalgia going around today, from Brett Cannon's 10 year anniversary of becoming a core developer, to Guido reminding us that he came to the USA 18 years ago.  Despite my stolen time machine keys, I don't want to dwell in the past, except to say that I echo much of what Brett says.  I had no idea how life changing it would be -- on both a personal and professional level -- when Roger Masse and I met Guido at NIST at the first Python workshop back in November 1994.  The lyric goes: what a long strange trip it's been, and that's for sure.  There were about 20 people at that first workshop, and 2500 at Pycon 2013.

And Python continues to hold little surprises.  Just today, I solved a bug in an Ubuntu package that's been perplexing us for weeks.  I'd looked at the code dozens of times and saw nothing wrong.  I even knew about the underlying corner of the language, but didn't put the two together until just now.  Here's a boiled down example; see if you can spot the bug!

import sys

def bar(i):
    if i == 1:
        raise KeyError(1)
    if i == 2:
        raise ValueError(2)


def bad():
    e = None
    try:
        bar(int(sys.argv[1]))
    except KeyError as e:
        print('ke')
    except ValueError as e:
        print('ve')
    print(e)

bad()


Here's a hint: this works under Python 2, but gives you an UnboundLocalError on the `e` variable under Python 3.

Why?

The reason is that in Python 3, the targets of except clauses are `del`d from the current namespace after the try...except clause executes.  This is to prevent the circular references that occur when the exception is bound to the target.  What is surprising and non-obvious is that the name is deleted from the namespace even if it was bound before the exception handler!  So really, setting `e = None` did nothing useful!
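Concretely, the Python 3 language reference defines each except E as N: block as equivalent to the following expansion, which is where the unconditional deletion comes from:

except KeyError as e:
    try:
        print('ke')
    finally:
        del e    # unconditionally unbinds e, clobbering the earlier e = None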

Python 2 doesn't have this behavior, so in some sense it's less surprising, but at the expense of creating circular references.

The solution is simple.  Just use a different name to capture and use the exception outside of the try...except clause.  Here's a fixed example:

def good():
    exception = None
    try:
        bar(int(sys.argv[1]))
    except KeyError as e:
        exception = e
        print('ke')
    except ValueError as e:
        exception = e
        print('ve')
    print(exception)


So even after almost 20 years of hacking Python, you can still experience the thrill of discovering something new.

Wednesday, November 21, 2012

UDS Update #1 - OAuth

For UDS-R for Raring (i.e. Ubuntu 13.04) in Copenhagen, I sponsored three blueprints.  These blueprints represent most of the work I will be doing for the next 6 months, as we're well on our way to the next LTS, Ubuntu 14.04.

I'll provide some updates to the other blueprints later, but for now, I want to talk about OAuth and Python 3.  OAuth is a protocol which allows you to programmatically interact with certain website APIs, in an authenticated manner, without having to provide your website password.  Essentially, it allows you to generate an authorization token which you can use instead, and it allows you to manage and share these tokens with applications, so that you can revoke them if you want, or decide how and which applications to trust to act on your behalf.

A good example of a site that uses OAuth is Launchpad, but many other sites also support OAuth, such as Twitter and Facebook.

There are actually two versions of OAuth out there.  OAuth version 1 is definitely the more prevalent, since it has been around for years, is relatively simple (at least on the client side), and enshrined in RFC 5849.  There are tons of libraries available that support OAuth v1, in a multitude of languages, with Python being no exception.

OAuth v2 is much less common, since it is currently only a draft specification, and has had its share of design-by-committee controversy.  Still, some sites such as Facebook do require OAuth v2.

One of the very earliest Python libraries to support OAuth v1, on both the client and server side, was python-oauth (I'll use the Debian package names in this post), and on the Ubuntu desktop, you'll find lots of scripts and libraries that use python-oauth.  There are major problems with this library though, and I highly recommend not using it.  The biggest problems are that the code has been abandoned by its upstream maintainer (it hasn't been updated on PyPI since 2009), and it is not Python 3 compatible.  Because the OAuth v2 draft came after this library was abandoned, it provides no support for the successor specification.

For this reason, one of the blueprints I sponsored was specifically to survey the alternatives available for Python programmers, and make a decision about which one we would officially endorse for Ubuntu.  By "official endorsement" I mean promote the library to other Python programmers (hence this post!) and to port all of our desktop scripts from python-oauth to the agreed upon library.

After some discussion, the attendees of the UDS session (both in person and remote) unanimously chose python-oauthlib as our preferred library.

python-oauthlib has a lot going for it.  It's Python 3 compatible, has an active upstream maintainer, supports RFC 5849 for v1, and closely follows the draft for v2.  It's a well-tested, solid library, and it is available in Ubuntu for both Python 2 and Python 3.  Probably the only negative is that the library does not provide any support for the server side.  This is not a major problem for our immediate plans, since there aren't any server applications on the Ubuntu desktop requiring OAuth.  Eventually, yes, we'll need server side support, but we can punt on that recommendation for now.

Another cool thing about python-oauthlib is that it has been adopted by the python-requests library, meaning that if you want a modern replacement for the urllib2/httplib2 circus that supports OAuth out of the box, you can just use python-requests, provide the appropriate parameters, and get request signing for free.

So, as you'll see from the blueprint, there are several bugs linked to packages which need porting to python-oauthlib for Ubuntu 13.04, and I am actively working on them, though contributions, as always, are welcome!  I thought I'd include a little bit of code to show you how you might port from python-oauth to python-oauthlib.  We'll stick with OAuth v1 in this discussion.

The first thing to recognize is that python-oauth uses different, older terminology that predates the RFC.  Thus, you'll see references to a token key and token secret, as well as a consumer key and consumer secret.  In the RFC, and in python-oauthlib, these are called the resource owner key, resource owner secret, client key, and client secret, respectively.  After you get over that hump, the rest pretty much falls into place.  As an example, here is a code snippet from the piston-mini-client library which used the old python-oauth library:

class OAuthAuthorizer(object):
    """Authenticate to OAuth protected APIs."""
    def __init__(self, token_key, token_secret, consumer_key, consumer_secret,
                 oauth_realm="OAuth"):
        """Initialize a ``OAuthAuthorizer``.

        ``token_key``, ``token_secret``, ``consumer_key`` and
        ``consumer_secret`` are required for signing OAuth requests.  The
        ``oauth_realm`` to use is optional.
        """
        self.token_key = token_key
        self.token_secret = token_secret
        self.consumer_key = consumer_key
        self.consumer_secret = consumer_secret
        self.oauth_realm = oauth_realm

    def sign_request(self, url, method, body, headers):
        """Sign a request with OAuth credentials."""
        # Import oauth here so that you don't need it if you're not going
        # to use it.  Plan B: move this out into a separate oauth module.
        from oauth.oauth import (OAuthRequest, OAuthConsumer, OAuthToken,
                                 OAuthSignatureMethod_PLAINTEXT)
        consumer = OAuthConsumer(self.consumer_key, self.consumer_secret)
        token = OAuthToken(self.token_key, self.token_secret)
        oauth_request = OAuthRequest.from_consumer_and_token(
            consumer, token, http_url=url)
        oauth_request.sign_request(OAuthSignatureMethod_PLAINTEXT(),
                                   consumer, token)
        headers.update(oauth_request.to_header(self.oauth_realm))


The constructor is pretty simple, and it uses the old OAuth terminology.  The key thing to notice is the way the old API required you to create a consumer, a token, and then a request object, then ask the request object to sign the request.  On top of all the other disadvantages, this isn't a very convenient API.  Let's look at the snippet after conversion to python-oauthlib.

class OAuthAuthorizer(object):
    """Authenticate to OAuth protected APIs."""
    def __init__(self, token_key, token_secret, consumer_key, consumer_secret,
                 oauth_realm="OAuth"):
        """Initialize a ``OAuthAuthorizer``.

        ``token_key``, ``token_secret``, ``consumer_key`` and
        ``consumer_secret`` are required for signing OAuth requests.  The
        ``oauth_realm`` to use is optional.
        """
        # 2012-11-19 BAW: python-oauthlib requires unicodes for its tokens and
        # secrets.  Assume utf-8 values.
        # https://github.com/idan/oauthlib/issues/68
        self.token_key = _unicodeify(token_key)
        self.token_secret = _unicodeify(token_secret)
        self.consumer_key = _unicodeify(consumer_key)
        self.consumer_secret = _unicodeify(consumer_secret)
        self.oauth_realm = oauth_realm

    def sign_request(self, url, method, body, headers):
        """Sign a request with OAuth credentials."""
        # 2012-11-19 BAW: In order to preserve API backward compatibility,
        # convert empty string body to None.  The old python-oauth library
        # would treat the empty string as "no body", but python-oauthlib
        # requires None.
        if not body:
            body = None
        # Import oauthlib here so that you don't need it if you're not going
        # to use it.  Plan B: move this out into a separate oauth module.
        from oauthlib.oauth1 import Client, SIGNATURE_PLAINTEXT
        oauth_client = Client(self.consumer_key, self.consumer_secret,
                              self.token_key, self.token_secret,
                              signature_method=SIGNATURE_PLAINTEXT,
                              realm=self.oauth_realm)
        uri, signed_headers, body = oauth_client.sign(
            url, method, body, headers)
        headers.update(signed_headers)


See how much nicer this is?  You need only create a client object, essentially using all the same bits of information.  Then you ask the client to sign the request, and update the request headers with the signature.  Much easier.

Two important things to note.  First, if you are doing an HTTP GET, there is no request body, and thus no request content which needs to contribute to the signature.  In python-oauth, you could specify an empty body by using either None or the empty string.  piston-mini-client uses the latter, and this is embodied in its public API.  python-oauthlib, however, treats the empty string as a body being present, so it would require the Content-Type header to be set even for an HTTP GET, which has no content (i.e. no body).  This is why the replacement code checks for an empty string being passed in (actually, any false-ish value), and coerces that to None.

The second issue is that python-oauthlib requires the keys and secrets to be Unicode objects; they cannot be bytes objects.  In code ported straight from Python 2 however, these values are usually 8-bit strings, and so become bytes objects in Python 3.  python-oauthlib will raise a ValueError during signing if any of these are bytes objects.  Thus the use of the _unicodeify() function to decode these values to unicodes.

def _unicodeify(s):
    if isinstance(s, bytes):
        return s.decode('utf-8')
    return s


The above works in both Python 2 and Python 3.  Of course, we don't know for sure that the bytes values are UTF-8, but it's the only sane encoding to expect, and if a client of piston-mini-client were to be so insane as to use an incompatible encoding (US-ASCII is fine because it's compatible with UTF-8), it would be up to the client to just pass in unicodes in the first place.  At the time of this writing, this is under active discussion with upstream, but for now, it's not too difficult to work around.

Anyway, I hope this helps, and I encourage you to help increase the popularity of python-oauthlib on the Cheeseshop, so that we can one day finally kill off the long defunct python-oauth library.

Friday, June 22, 2012

The right way to internationalize your Python app

Recently, as part of our push to ship only Python 3 on the Ubuntu 12.10 desktop, I've helped several projects update their internationalization (i18n) support.  I've seen lots of instances of suboptimal Python 2 i18n code, which leads to liberal sprinkling of cargo culted .decode() and .encode() calls simply to avoid the dreaded UnicodeErrors.  These get worse when the application or library is ported to Python 3 because then even the workarounds aren't enough to prevent nasty failures in non-ASCII environments (i.e. the non-English speaking world majority :).

Let's be honest though, the problem is not because these developers are crappy coders! In fact, far from it, the folks I've talked with are really really smart, experienced Pythonistas.  The fundamental problem is Python 2's 8-bit string type which doubles as a bytes type, and the terrible API of the built-in Python 2 gettext module, which does its utmost to sabotage your Python 2 i18n programs.  I take considerable blame for the latter, since I wrote the original version of that module.  At the time, I really didn't understand unicodes (this is probably also evident in the mess I made of the email package).  Oh, to really have access to Guido's time machine.

The good news is that we now know how to do i18n right, especially in a bilingual Python 2/3 world, and the Python 3 gettext module fixes the most egregious problems in the Python 2 version.  Hopefully this article does some measure of making up for my past sins.

Stop right here and go watch Ned Batchelder's talk from PyCon 2012 entitled Pragmatic Unicode, or How Do I Stop the Pain?  It's the single best description of the background and effective use of Unicode in Python you'll ever see.  Ned does a brilliant job of resolving all the FUD.

...

Welcome back.  Your Python application is multi-language friendly, right?  I mean, I'm as functionally monolinguistic as most Americans, but I love the diversity of languages we have in the world, and appreciate that people really want to use their desktop and applications in their native language.  Fortunately, once you know the tricks it's not that hard to write good i18n'd Python code, and there are many good FLOSS tools available for helping volunteers translate your application, such as Pootle, Launchpad translations, Translatewiki, Transifex, and Zanata.

So there really is no excuse not to i18n your Python application.  In fact, GNU Mailman has been i18n'd for many years, and pioneered the supporting code in Python's standard library, namely the gettext module.  As part of the Mailman 3 effort, I've also written a higher level library called flufl.i18n which makes it even easier to i18n your application, even in tricky multi-language contexts such as server programs, where you might need to get a German translation and a French translation in one operation, then turn around and get Japanese, Italian, and English for the next operation.

In one recent case, my colleague was having a problem with a simple command line program.  What's common about these types of applications is that you fire them up once, they run to completion then exit, and they only have to deal with one language during the entire execution of the program, specifically the language defined in the user's locale.  If you read the gettext module's documentation, you'd be inclined to do this at the very start of your application:

import gettext
from gettext import gettext as _

gettext.textdomain(my_program_name)

then, you'd wrap translatable strings in code like this:

print _('Here is something I want to tell you')

What gettext does is look up the source string (i.e. the argument to the underscore function) in a translation catalog, returning the text in the appropriate language, which will then be printed.  There are some additional details regarding i18n that I won't go into here.  If you're curious, ask in the comments, and I'll try to fill things in.

Anyway, if you do write the above code, you'll be in for a heap of trouble, as my colleague soon found out.  Just running his program with --help in a French locale, he was getting the dreaded UnicodeEncodeError:

"UnicodeEncodeError: 'ascii' codec can't encode character"

I've also seen reports of such errors when trying to send translated strings to a log file (a practice which I generally discourage, since I think log messages usually shouldn't be translated).  In any case, I'm here to tell you why the above "obvious" code is wrong, and what you should do instead.

First, why is that code wrong, and why does it lead to the UnicodeEncodeErrors?  What might not be obvious from the Python 2 gettext documentation is that gettext.gettext() always returns 8-bit strings (a.k.a. byte strings in Python 3 terminology), and these 8-bit strings are encoded with the charset defined in the language's catalog file.

It's always best practice in Python to deal with human readable text using unicodes.  This is traditionally more problematic in Python 2, where English programs can cheat and use 8-bit strings and usually not crash, since their character range is compatible with ASCII and you only ever print to English locales.  As soon as your French friend uses your program though, you're probably going to run into trouble.  By using unicodes everywhere, you can generally avoid such problems, and in fact it will make your life much easier when you eventually switch to Python 3.

So the 8-bit strings that gettext.gettext() hands you have already sunk you, and to avoid the pain, you'd want to convert them back to unicodes before you use them in any way.  However, converting to unicodes makes the i18n APIs much less convenient, so no one does it until there's way too much broken code to fix.

What you really want in Python 2 is something like this:

from gettext import ugettext as _

which you'd think you should be able to do, the "u" prefix meaning "give me unicode".  But for reasons I can only describe as based on our misunderstandings of unicode and i18n at the time, you can't actually do that, because ugettext() is not exposed as a module-level function.  It is available in the class-based API, but that's a more advanced API that again almost no one uses.  Sadly, it's too late to fix this in Python 2.  The good news is that in Python 3 it is fixed, not by exposing ugettext(), but by changing the most commonly used gettext module APIs to return unicode strings directly, as it always should have done.  In Python 3, the obvious code just works:

from gettext import gettext as _

What can you do in Python 2 then?  Here's what you should use instead of the code at the beginning of this article:

_ = gettext.translation(my_program_name).ugettext

and now you can wrap all your translatable strings in _('Foo') and it should Just Work.

Perhaps more usefully, you can use the gettext.install() function to put _() into the built-in namespace, so that all your other code can just use that function without doing anything special.  Again, though, we have to work around the boneheaded Python 2 API.  Here's how to write code which works correctly in both Python 2 and Python 3:

import sys, gettext
kwargs = {}
if sys.version_info[0] < 3:
    # In Python 2, ensure that the _() that gets installed into built-ins
    # always returns unicodes.  This matches the default behavior under Python
    # 3, although that keyword argument is not present in the Python 3 API.
    kwargs['unicode'] = True
gettext.install(my_program_name, **kwargs)

Or you can use the flufl.i18n API, which always returns unicode strings in both Python 2 and Python 3.

Also interesting was that I could never reproduce the crash when ssh'd into the French locale VM.  It would only crash for me when I was logged into a terminal on the VM's graphical desktop.  The only difference between the two that I could tell was that in the desktop's terminal, locale(1) returned French values (e.g. fr_FR.UTF-8) for everything, but in the ssh console, it returned the French values for everything except the LC_CTYPE environment variable.  For the life of me, I could not get LC_CTYPE set to anything other than en_US.UTF-8 in the ssh context, so the reproducible test case would just return the English text, and not crash.  This happened even if I explicitly set that environment variable either as a separate export command in the shell, or as a prefix to the normally crashing command.  Maybe there's something in ssh that causes this, but I couldn't find it.

One last thing.  It's important to understand that Python's gettext module only handles Python strings, and other subsystems may be involved.  The classic example is GObject Introspection, the newest and recommended interface to the GNOME Object system.  If your Python-GI based project needs to translate strings too (e.g. in menus or other UI elements), you'll have to use both the gettext API for your Python strings, and set the locale for the C-based bits using locale.setlocale().  This is because Python's API does not set the locale automatically, and Python-GI exposes no other way to control the language it uses for translations.
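A minimal sketch of the combined setup, assuming a gettext domain of 'my-program' with catalogs installed in the standard location:

import locale
import gettext

# Set the C library's locale from the user's environment, so that
# strings rendered by the C-based bits (e.g. GTK via GI) are translated...
locale.setlocale(locale.LC_ALL, '')
# ...and install _() into builtins for the Python-level strings.
gettext.install('my-program')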

Tuesday, April 24, 2012

Python 3 on the desktop for Quantal Quetzal

So, now all the world knows that my suggested code name for Ubuntu 12.10, Qwazy Quahog, was not chosen by Mark.  Oh well, maybe I'll have more luck with Racy Roadrunner.

In any case, Ubuntu 12.04 LTS is to be released any day now so it's time for my semi-annual report on Python plans for Ubuntu.  I seem to write about this every cycle, so 12.10 is no exception.  We've made some fantastic progress, but now it's time to get serious.

For Ubuntu 12.10, we've made it a release goal to have Python 3 only on the desktop CD images.  The usual caveats apply: Python 2.7 isn't going away; it will still probably always be available in the main archive.  This release goal also doesn't affect other installation CD images, such as server, or other Ubuntu flavors.  The relatively modest goal then only affects packages for the standard desktop CD images, i.e. the alternative installation CD and the live CD.

Update 20120425: To be crystal clear,  if you depend on Python 2.7, the only thing that changes for you is that after a fresh install from the desktop CD on a new machine, you'll have to explicitly apt-get install python2.7.  After that, everything else will be the same.

This is ostensibly an effort to port a significant chunk of Ubuntu to Python 3, but it really is a much wider, Python-community driven effort.  Ubuntu has its priorities, but I personally want to see a world where Python 3 rules the day, and we can finally start scoffing at Python 2 :).

Still, that leaves us with about 145 binary packages (and many fewer source packages) to port.  There are a few categories of packages to consider:

  • Already ported and available.  This is the good news, and covers packages such as dbus-python.  Unfortunately, there aren't too many others, but we need to check with Debian and make sure we're in sync with any packages there that already support Python 3 (python3-dateutil comes to mind).
  • Upstream supports Python 3, but it is not yet available in Debian or Ubuntu.  These packages should be fairly easy to port, since we have pretty good packaging guidelines for supporting both Python 2 and Python 3.
  • Packages with better replacements for Python 3.  A good example is the python-simplejson package.  Here, we might not care as much because Python 3 already comes with a json module in its standard library, so code which depends on python-simplejson and is required for the desktop CD, should be ported to use the stdlib json module.  python-gobject is another case where porting is a better option, since pygi (gobject-introspection) already supports Python 3.
  • Canonical is the upstream.  Many packages in the archive, such as python-launchpadlib and python-lazr.restfulclient are developed upstream by Canonical.  This doesn't mean you can't or shouldn't help out with the porting of those modules, it's just that we know who to lean on as a last resort.  By all means, feel free to contribute to these too!
  • Orphaned by upstream.  These are the most problematic, since there's essentially no upstream maintainer to contribute patches to.  An example is python-oauth.  In these cases, we need to look for alternatives that are maintained upstream, and open to porting to Python 3.  In the case of python-oauth, we need to investigate oauth2, and see if there are features we're using from the abandoned package that may not be available in the supported one.
  • Unknowns.  Well, this one's the big risky part because we don't know what we don't know.

We need your help!  First of all, there's no way I can personally port everything on our list, including both libraries and applications.  We may have to make some hard choices to drop some functionality from Ubuntu if we can't get it ported, and we don't want to have to do that.  So here are some ways you can contribute:
  • Fill in the spreadsheet with more information.  If you're aware of an upstream or Debian port to Python 3, let us know.  It may make it easier for someone else to enable the Python 3 version in Debian, or to shepherd the upstream patch to landing on their trunk.
  • Help upstream make a Python 3 port available.  There are lots of resources available to help you port some code, from quick references to in-depth guides.  There's also a mailing list (and Gmane newsgroup mirror) you can join to get help, report status, and have other related discussions.  Some people have asked Python 3 porting questions on StackOverflow, using the tags #python, #python-3.x, and #porting.
  • Join us on the #python3 IRC channel on Freenode.
  • Subscribe to the python-porting mailing list.
  • Get packages ported in Debian.  Once upstream supports Python 3, you can extend the existing Debian package to expose this support into Debian.  From there, you or we can make sure that gets sync'd into Ubuntu.
  • Spread the word!  Even if you don't have time to do any ports yourself, you can help publicize this effort through social media, mailing lists, and your local Python community.  This really is a Python-wide effort!

Python 3.3 is scheduled to be released later this year.  Please help make 2012 the year that Python 3 reached critical mass!

 -----------------------------

On a more personal note, I am also committed to making Mailman 3 a Python 3 application, but right now I'm blocked on a number of dependencies.  Here is the list of dependencies from the setup.py file, and their statuses.  I would love it if you help get these ported too!
Of course, these are only the direct dependencies.  Others that get pulled in include:


Wednesday, January 18, 2012

Debian packaging for Python 2 and 3

Time for another installment of my ongoing mission to convert the world to Python 3!  This time, a little Debian packaging-fu for modifying an existing Python 2 package to include support for Python 3 from the same source package.

Today, I added a python3-feedparser package to Ubuntu Precise.  What's interesting about this is that, despite various reported problems, upstream feedparser 5.1 claims to support Python 3, via 2to3 conversion.  And indeed it does (although the test suite does not).

Before today, Ubuntu had feedparser 5.0.1 in its archive, and while some work has been done to update the Debian package to 5.1, this has not been released.  The uninteresting precursor to Python 3 packaging was to upgrade the Ubuntu version of the python-feedparser source package to 5.1.  I'll spare you the boring details about missing data files in the upstream tarball, and other problems, since they don't really relate to the Python 3 effort.

The first step was to verify that feedparser 5.1 works with Python 3.2 in a virtualenv, and indeed it does.  This is good news because it means that the setup.py does the right thing, which is always the best way to start supporting Python 3.  I've found that it's much easier to build a solid Debian package if you have a solid setup.py in upstream to begin with.

Now, what I'd like to do is to give you a recipe for modifying your existing debian/ directory files to add Python 3 support to a package that already exists for Python 2.  This is a little trickier for feedparser because it used an older debhelper standard, and carried some crufty old stuff in its rules file.  My first step was to update this to debhelper compatibility level 8 and greatly simplify the debian/rules file.  Here's what it might have looked like with just Python 2 support, so let's start there.


#!/usr/bin/make -f
export DH_VERBOSE=1

%:
    dh $@ --with python2

override_dh_auto_clean:
    dh_auto_clean
    rm -rf build .*egg-info

override_dh_auto_test:
ifeq (,$(filter nocheck,$(DEB_BUILD_OPTIONS)))
    cd feedparser && python ./feedparsertest.py
else
    @echo "nocheck set, not running tests"
endif

override_dh_installdocs:
    dh_installdocs -Xtests

This is all pretty standard stuff.  dh_python2 is used (the --with python2 option to dh), and we just provide a couple of overrides for idiosyncrasies in the feedparser package.  We clean a couple of extra things that aren't cleaned automatically, and we run the test suite in the slightly non-standard way that upstream requires.  Also, we override the installation of a huge amount of test files that would otherwise get installed as documentation (they aren't docs).

So far so good.  What do we have to do to add support for Python 3?

First, we need to make a few modifications to the debian/control file.  The current convention with dh_python2 is to use an X-Python-Version header in the source package stanza, so we just need to add the analogous header for Python 3 to the same stanza:

X-Python3-Version: >= 3.2

This just says we support any Python 3 version from 3.2 onwards.  You also need to add a few additional packages to the Build-Depends.  In the feedparser case, I added the following build dependencies: python3, python3-chardet, and python3-setuptools.  Python 2 has a couple of other build dependencies (e.g. python-libxml2 and python-utidylib) that aren't available for Python 3, but lucky for us, they are optional anyway.

Next, you need to add a new binary package stanza.  There was already a python-feedparser binary package stanza for Python 2 support.  In Debian, Python 3 is provided as a separate stack, meaning packages for Python 3 will always start with the python3- prefix.  Thus, it is pretty easy to just copy the python-feedparser stanza and paste it to the bottom of debian/control, changing the package name to python3-feedparser.  You have to update the Depends line to use ${python3:Depends}, and I updated the Recommends line to name python3-chardet, and that was about it.  Here's what the new stanza looks like:


Package: python3-feedparser
Architecture: all
Depends: ${misc:Depends}, ${python3:Depends}
Recommends: python3-chardet
Description: Universal Feed Parser for Python
 Python module for downloading and parsing syndicated feeds. It can
 handle RSS 0.90, Netscape RSS 0.91, Userland RSS 0.91, RSS 0.92, RSS
 0.93, RSS 0.94, RSS 1.0, RSS 2.0, Atom, and CDF feeds.
 .
 It provides the same API to all formats, and sanitizes URIs and HTML.
 .
 This is the Python 3 version of the package.
Again, so far so good.  Now let's look at the debian/rules file.

The first thing to do is to add support for dh_python3, which is analogous to dh_python2, and is the only accepted helper for Python 3.  The rules line then becomes:


%:
    dh $@ --with python2,python3
Now, one problem with debhelper is that it doesn't have any built-in support for Python 3 like it does for Python 2.  This means dh will not automatically build or install any Python 3 packages, so you have to do this manually.  Eventually this will be fixed, and fortunately, with a solid setup.py file, you don't have to do too much, but it's something to be aware of.  In the feedparser case, we need to add overrides for dh_auto_build and dh_auto_install.  Here's what these rules look like:


override_dh_auto_build:
    dh_auto_build
    set -ex; for python in $(shell py3versions -r); do \
        $$python setup.py build; \
    done;

override_dh_auto_install:
    dh_auto_install
    set -ex; for python in $(shell py3versions -r); do \
        $$python setup.py install --root=$(CURDIR)/debian/tmp --install-layout=deb; \
    done;
    cp feedparser/sgmllib3.py $(CURDIR)/debian/tmp/usr/lib/python3/dist-packages/feedparser_sgmllib3.py
Not too bad, eh?  You'll notice that the first thing these rules do is call the standard dh_auto_build and dh_auto_install respectively.  This preserves the Python 2 support.  Then we just loop over all the available Python 3 versions, doing a fairly normal equivalent of setup.py install (split into a build step and an install step).  The install rule looks a little odd, but should be familiar to Debian Python hackers.  It just installs the package into the proper Debian locations, and will pretty much be the same for any Python 3 package you build.

The one odd bit is the last line in the override_dh_auto_install rule.  This works around a peculiarity in the feedparser 5.1 upstream package: it depends on sgmllib.py, which is no longer in the standard library in Python 3.  Upstream provides an already 2to3-converted version of it, and recommends you install the module as sgmllib.py somewhere on your Python 3 sys.path.  Well, I don't like the namespace pollution that would cause, so I install the file as feedparser_sgmllib3.py and add a quilt patch to the package that tries to import that module if importing sgmllib fails (as it will on Python 3).
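The fallback the quilt patch adds looks roughly like this (a sketch; feedparser_sgmllib3 is the module name chosen above):

try:
    import sgmllib
except ImportError:
    # Python 3: fall back to the 2to3-converted copy installed alongside
    # feedparser.
    import feedparser_sgmllib3 as sgmllib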

An aside: If you look in the debian/rules file for what I actually uploaded, you'll see some additional modifications to override_dh_auto_test.  This just works around the upstream bug where some test suite data files were accidentally omitted from the release tarball.  You can pretty much ignore those lines for the purposes of this article.

We're almost done.  The last thing we need to do is make sure that debhelper installs the right files into the right binary packages.  We want the python-feedparser binary package to include only the Python 2 files, and the python3-feedparser binary package to only include the Python 3 files.  Keep in mind that when a source package builds only a single binary package (as was the case before I added Python 3 support), debhelper will include everything under the build directory's debian/tmp subdirectory in the single binary package.  That's why you see things get installed into $(CURDIR)/debian/tmp.  But when a source package builds multiple binary packages, as is now the case here, we have to tell debhelper which files go into which binary packages.  We do this by adding two new files to the debian directory: python-feedparser.install and python3-feedparser.install

Reading the manpage for dh_install will explain the reasons for this, and describe the format of the file contents.  In our case, we're really lucky, because for Python 2, everything gets installed under usr/lib/python2.* and in Python 3, everything gets installed under usr/lib/python3 (relative to $(CURDIR)/debian/tmp).  You'll notice a few things here.  Because we could be building for multiple versions of Python 2, we have to wildcard the actual directory under usr/lib, e.g. it might be python2.6 or python2.7.  But because we have PEP 3147 and PEP 3149 in Python 3.2, there's only one directory for all supported versions of Python 3, so we don't need to wildcard the subdirectory.  Also, if you look at the actual .install files in the package, you'll see a few other trailing path components, so the actual contents of the files are:


usr/lib/python2.*/*-packages/*
and


usr/lib/python3/*-packages/*
for the python-feedparser.install and python3-feedparser.install files respectively.  The trailing bits just wildcard what on a Debian system will always be dist-packages, just for safety (cargo culting FTW!).

And that really is it!  Of course, things could be a little more complicated if you have extension modules, but maybe not that much more so, and if the package you're adding Python 3 support to isn't setuptools-based, you may have more work to do even still.  The feedparser package has a few other oddities that are really unrelated to adding Python 3 support, so I'm ignoring them here, but feel free to ask for additional details in the comments, in IRC, or in email.

Hopefully this gives you some insight into how to extend an existing Python 2 Debian package into including Python 3 support, given that your upstream already supports Python 3.  Now, go forth and hack!

Addendum: my colleague Colin Watson just today packaged up Benjamin Peterson's very fine Python package called six.  This is a nice package that provides some excellent Python 2 and 3 compatibility utilities.  You may find this helpful if you're trying to support both Python 2 and Python 3 in a single code base, especially if you have to support back to Python 2.4 (poor you :).  This will be available in Ubuntu Precise, although if you're submitting patches back upstream, you may have to convince the upstream author to accept the additional dependency.  It's worth it to add a little more Python 3 love to the world.

Friday, January 6, 2012

Python 3 Porting Fun Redux

My last post on Python 3 porting got some really great responses, and I've learned a lot from the feedback I've seen.  I'm here to rather briefly outline a few additional tips and tricks that folks have sent me and that I've learned by doing other ports since then.  Please keep them coming, either in the blog comments or to me via email.  Or better yet, blog about your experiences yourself and I'll link to them from here.

One of the big lessons I'm trying to adopt is to support Python 3 in pure-Python code with a single code base.  Specifically, I'm trying to avoid using 2to3 as much as possible.  While I think 2to3 is an excellent tool that can make it easier to get started supporting both Python 2 and Python 3 from a single branch of code, it does have some disadvantages.  The biggest problem with 2to3 is that it's slow; it can take a long time to slog through your Python code, which can be a significant impediment to your development velocity.  Another 2to3 problem is that it doesn't always play nicely with other development tools, such as python setup.py test and virtualenv, and you occasionally have to write additional custom fixers for conversion that 2to3 doesn't handle.

Given that almost all the code I'm writing these days targets Python 2.6 as the minimal supported Python 2 version, 2to3 may just be unnecessary.  With my dbus-python port to Python 3, and with my own flufl packages, I'm experimenting with ignoring 2to3 and trying to write one code base for all of Python 2.6, 2.7, and 3.2.  My colleague Michael Foord has been pretty successful with this approach going back all the way to Python 2.4, so 2.6 as a minimum should be no problem!  C extensions are pretty easy because you have the C preprocessor to help you.  But it turns out that it's usually not too difficult in pure-Python either.  I've done this in my latest release of the flufl.bounce package, and intend to eliminate 2to3 in my other flufl packages soon too.

The first thing I've done is add print_function to the __future__ import in all my modules.  Previously, I was only importing unicode_literals and absolute_import.  But doctests tend to use a lot of print statements, so switching to the print() function explicitly removes one big 2to3 conversion.  Aside from having to unlearn decades of print statement muscle memory, the print() function is actually rather nice.  So my module template now looks like this (with the copyright comment block omitted):

from __future__ import absolute_import, print_function, unicode_literals

__metaclass__ = type
__all__ = [
    ]

Speaking of doctests, you really want them to have the same set of future imports as all your other code.  I'll talk more about how my own packages set up doctests later, but for now, it's useful to know that I create a doctest.DocFileSuite for every doctest in my package.  These suites all have a setup() function and Python's testing framework will call these at the appropriate time, passing in a testobj parameter.  This argument has a globs attribute which serves as the module globals for the doctest.  All you need to do to enable the future imports in your doctests is to do something like this:

def setup(testobj):
    try:
        testobj.globs['absolute_import'] = absolute_import
        testobj.globs['print_function'] = print_function
        testobj.globs['unicode_literals'] = unicode_literals
    except NameError:
        pass

The try-except really is only necessary if you keep using 2to3, since that tool will remove the future imports from all the modules it processes.  The future imports still exist in Python 3 of course, since future imports are never, ever removed.  So if you ditch 2to3, you can get rid of the try-except too.

In the latest release of flufl.bounce, I changed the API so that the detected email addresses are all explicitly bytes objects in Python 3 (and 8-bit strings in Python 2).  This caused some problems with my doctests because the repr of Python 3 bytes objects is different than the repr of 8-bit strings in Python 2.  When you print the object in Python 2, you get just the contents of the string, but when you print them in Python 3, you get the b''-prefix.

% python
Python 2.7.2+ (default, Dec 18 2011, 17:30:39)
[GCC 4.6.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> print b'foo'
foo
>>>
% python3
Python 3.2.2+ (default, Dec 19 2011, 12:03:32)
[GCC 4.6.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> print(b'foo')
b'foo'


This means your doctest cannot be written to easily support both versions of Python when bytes/8-bit strings are used.  I use the following helper to get around this:


def print_bytes(obj):
    if bytes is not str:
        obj = repr(obj)[2:-1]
    print(obj)


Remember that in Python 2, bytes is just an alias for str so this code only gets invoked in Python 3.
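Used in a doctest, the helper prints identically under both versions:

>>> print_bytes(b'foo')
foo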

Another fun bytes/8-bit-string issue is that in Python 3, bytes objects have no .format() method.  So if you're doing something like b'foo {0}'.format(obj), this will work in Python 2, but fail with an AttributeError in Python 3.  The best I've come up with for this is to use concatenation instead, or to do the format using unicodes and then encode the result to a bytes object (but then you have the additional fun of choosing an appropriate encoding!).
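A minimal sketch of the second approach, assuming UTF-8 is an acceptable encoding:

template = 'foo {0}'                           # format with a str template...
result = template.format(42).encode('utf-8')   # ...then encode to bytes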

Did you know that the re module can scan either unicodes or bytes in Python 3?  The switch is made by passing in either a bytes pattern or a str pattern, and then passing in the appropriate type of object to parse.  But if you use the r''-prefix (i.e. raw strings) for saner handling of backslashes, there's a wrinkle when you want raw bytes literals: Python (before 3.3) does not accept the rb''-spelling, although the br''-spelling does work in both Python 2.6 and Python 3.  If you'd rather not rely on that, you have to forgo the raw strings and suffer the pain of backslash proliferation, which is the side I usually come down on.
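For example, here's the doubled-backslash style with both pattern types; this sketch works in Python 2.6+ and Python 3:

import re

# A bytes pattern scans bytes; a str pattern scans str.  Mixing the two
# raises a TypeError in Python 3.
assert re.match(b'\\d+', b'42') is not None
assert re.match('\\d+', '42') is not None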

Some of the code I was porting was using itertools.izip_longest(), but this doesn't exist in Python 3.  Instead you have itertools.zip_longest().  You'll have to do a conditional import (i.e. try-except) around this to get the right version.
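The conditional import is the usual try-except dance; something like this sketch:

try:
    from itertools import zip_longest                    # Python 3
except ImportError:
    from itertools import izip_longest as zip_longest   # Python 2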

Do you use zope.interface?  You'll be interested to know that the syntax we've long been accustomed to for declaring that a class implements an interface does not work in Python 3.  For example:

from zope.interface import Interface, implements
class MyInterface(Interface):
    pass
class MyClass:
    implements(MyInterface)

This is because the stack hacking that implements() uses doesn't work in Python 3.  Fortunately, the latest version of zope.interface has a new class decorator that you can use instead.  This works in Python 2.6 and 2.7 too, so change your code to use this:

from zope.interface import Interface, implementer
class MyInterface(Interface):
    pass
@implementer(MyInterface)
class MyClass:
    pass

I kind of like the use of class decorators better anyway.

Here's a tricky one.  Did you know that Python 2 provides some codecs for doing interesting conversions such as Caesar rotation (i.e. rot13)?  Thus, you can do things like:

>>> 'foo'.encode('rot-13')
'sbb'

This doesn't work in Python 3 though, because even though certain str-to-str codecs like rot-13 still exist, the str.encode() interface requires that the codec return a bytes object. In order to use str-to-str codecs in both Python 2 and Python 3, you'll have to pop the hood and use a lower-level API, getting and calling the codec directly:

>>> from codecs import getencoder
>>> encoder = getencoder('rot-13')
>>> mystring = 'foo'
>>> rot13string = encoder(mystring)[0]
>>> rot13string
'sbb'

You have to get the zeroth-element from the return value of the encoder because of the codecs API.  A bit ugly, but it works in both versions of Python.

That's all for now.  Happy porting!