Wednesday, December 7, 2011

Lessons in porting to Python 3

Yesterday, I completed my port of dbus-python to Python 3, and submitted my patch upstream.  While I've yet to hear any feedback from Simon about my patch, I'm fairly confident that it's going in the right direction.  This version should allow existing Python 2 applications to run largely unchanged, and minimizes the differences that clients will have to make to use the Python 3 version.

Some of the changes are specific to the dbus-python project, and I included a detailed summary of those changes and my rationale behind them.  There are lots of good lessons learned during this porting exercise that I want to share with you, have a discussion about, and see if there aren't things we core Python developers can do in Python 3.3 to make it even easier to migrate to Python 3.

First, some background.  D-Bus is a freedesktop.org project for same-system interprocess communication, and it's an essential component of any Linux desktop.  The D-Bus system and C API are mature and well-defined, and there are bindings available for many programming languages, Python included of course.  The existing dbus-python package is only compatible with Python 2, and most recommendations are to use the Gnome version of Python bindings should you want to use D-Bus with Python 3.  For us in Ubuntu, this isn't acceptable though because we must have a solution that supports KDE and potentially even non-UI based D-Bus Python servers.  Several ports of dbus-python to Python 3 have been attempted in the past, but none have been accepted upstream, so naturally I took it as a challenge to work on a new version of the port.  After some discussion with the upstream maintainer Simon McVittie, I had a few requirements in mind:
  • One code base for both Python 2 and Python 3.  It's simply too difficult to support multiple development branches, so one branch must be compilable in both versions of Python.  Because dbus-python is not setuptools-based, I chose not to rely on 2to3 to auto-convert the Python layer.  This is more difficult, but given the next requirement, entirely possible.
  • Minimum Python versions to support are 2.6 and 3.2 (Python 2.7 is also supported).  Python 2.6 contains almost everything you need to do a high quality port of both the Python layer and the C extension layer with a single code base.  Python 2.7 has one or two additional helpers, but they aren't important enough to count Python 2.6 out.  For dbus-python, this specifically means dropping support for Python 2.5, which is more than 5 years old at the time of this writing.  Also, it makes no sense to support Python 3.0 or 3.1 as neither of those is in widespread use.
  • Minimize any API changes seen by Python 2 code, and minimize the changes needed to port clients to Python 3.  For the former, this means everything from keeping Python APIs unchanged to keeping the inheritance hierarchy the same.  Python 2 programs will see a few small changes after the application of my patches; I'll describe them below but they should be inconsequential for the vast majority of Python 2 applications.  While it's unavoidable that Python 3 applications will see a different API, these differences have been minimized.
There are two main issues that had to be sorted out for this port, and in general for most ports to Python 3: bytes vs. strings, and ints vs. longs.  For the latter, you probably know that where Python 2 has two integer types, Python 3 has only one. In Python 3, all integers are longs, and there is no L suffix for integer literals.  This turned out to be trickier in the dbus-python case because dbus supports a numeric stack of various integer widths, and in Python 2 these are implemented as subclasses of the built-in int and long types.  Because there are only longs in Python 3, the inheritance hierarchy a Python application will see changes between Python 2 and Python 3.  This is unavoidable.

I also made the decision to change some object types to longs in both versions of Python, where I thought it was highly unlikely that Python clients would care.  Specifically, many dbus objects have a variant_level attribute, which is usually zero, but can be any positive integer.  For implementation simplicity, I changed these to longs in Python 2 also.
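As a rough illustration of the int/long split from the Python side, here is a minimal sketch of a type check that works on either interpreter.  The names integer_types and is_integer are made up for this example and are not part of dbus-python.

```python
import sys

# Python 2 has both int and long; Python 3 has only int (which behaves
# like Python 2's long).  The name `long` is only looked up on Python 2.
if sys.version_info[0] >= 3:
    integer_types = (int,)
else:
    integer_types = (int, long)  # noqa: F821 (undefined on Python 3)


def is_integer(value):
    """Return True for any built-in integer, excluding bool."""
    return isinstance(value, integer_types) and not isinstance(value, bool)


small = 1
big = 2 ** 70   # a long in Python 2, a plain int in Python 3
```

Code that only asks "is this an integer?" keeps working in both versions; code that checks for int or long specifically is what will notice the hierarchy change.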

Ah, bytes vs. strings is always where things get interesting when porting to Python 3.  It's the single most brain hurty exercise you will have to go through.  Remember that Python 2 lets you cheat.  If you're not sure whether the entity you're dealing with is some bytes, or some (usually ASCII-encoded) string, just use a Python 2 str type (a.k.a. 8-bit string) and let Python's automatic conversion rules change it to a unicode when the two types meet.  You can't get away with this in Python 3 though, for very good reasons - it's error prone, and can lead to data corruption or the annoyingly ubiquitous and hard to predict UnicodeErrors.

In Python 3, you must be clear about what are bytes and what are strings (i.e. unicodes), and you must be explicit when converting between the two.  Yes, this can be painful at times but in my opinion, it's crucial that you do so.  It's that important to eliminate UnicodeErrors that you can't defend against and your users won't understand or be able to correct.  Once you're clear in your own mind as to which are strings and which are bytes, it's usually not that hard to reflect that clearly in your code, especially if you leave Python 2.5 and anything earlier behind, which I highly recommend.
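To make "explicit" concrete, here is a small sketch of converting in both directions, and of what happens when the two types are mixed.  The variable names are only for illustration.

```python
from __future__ import unicode_literals

text = 'caf\u00e9'                 # a unicode string in both Python 2 and 3
data = text.encode('utf-8')        # explicit text -> bytes
round_trip = data.decode('utf-8')  # explicit bytes -> text

# Python 3 refuses to mix the two types and raises TypeError.  Python 2
# attempts an implicit ASCII decode and raises UnicodeDecodeError on the
# non-ASCII byte.  Either way, the mix does not go through.
try:
    mixed = text + data
except (TypeError, UnicodeError):
    mixed = None
```

Note that the text is four characters long while its UTF-8 encoding is five bytes, which is exactly the kind of difference the implicit conversions used to hide.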

dbus-python presented an interesting challenge here.  It has several data types in its C API that are defined as UTF-8 encoded char*'s.  At first blush, it seemed to me that these should be reflected in Python 3 as bytes objects to simplify the conversion in the extension module to and from char*'s.  It turns out that this was a bad idea from an implementation standpoint, and dbus-python's upstream maintainer had already expressed his opinion that these data types should be exposed as unicodes in Python 3.  After having failed at my initial attempts at making them bytes, I now agree that they must be unicodes, both for implementation simplicity and for minimal impact on porting user code.

The biggest problem I ran into with the choice of bytes is that the callback dispatch code in dbus-python is complex, difficult to understand and debug, driven by external data, and written with a deep assumption of operating on strings.  For example, when the dbus C API receives a signal, it must determine whether there is a Python function registered to handle that signal, and it does this by comparing a number of client-registered parameters, such as the method name, the interface, and the object path.  If the dbus C API was turning these parameters into bytes, but the clients had registered strings, then the comparisons in the callback dispatch routines would fail, either loudly with an exception, or silently with failing comparisons.  The former were relatively easy to track down and fix, by explicitly encoding client-registered strings to bytes.  But the latter, silent failures, were nearly impossible to debug.  Add to that the fact that there were so many roads into the registration system, that it was also very difficult to coerce all incoming data early enough so that coercion wasn't necessary at comparison time.  I was left with the unappealing alternative of forcing all client code to also change their data from using strings to using bytes, which I realized would be much too high a burden on clients porting their applications to Python 3.  Simon was right, but it was a useful exercise to fail at anyway.

(By way of comparison, it took me the better part of a week and a half to try to get the test suite passing when these objects were bytes, which I was ultimately unable to do, and about a day to get them passing when everything was unicodes.  That's gotta tell you something right there, and hopefully not that "I suck" :).
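The silent kind of failure is easy to reproduce in miniature.  The sketch below is not dbus-python's dispatch code; the handlers dictionary and the signal name are stand-ins chosen for illustration, and the behavior shown is that of Python 3.

```python
# Hypothetical registry mapping signal names to handlers, keyed on text.
handlers = {'NameOwnerChanged': lambda: 'handled'}

incoming = b'NameOwnerChanged'   # what a bytes-producing C layer would hand over

# On Python 3 both of these fail quietly: no exception, just "no match".
same = (incoming == 'NameOwnerChanged')
found = handlers.get(incoming)

# Decoding once at the boundary makes the lookup succeed.
found_after_decode = handlers.get(incoming.decode('utf-8'))
```

Nothing raises here, so the only symptom is a callback that never fires, which is why these mismatches were so hard to trace.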

Let's look at some practical advice that may help you in your own porting efforts.

  • Target nothing older than Python 2.6 or Python 3.2.  I mentioned this before, but it's really going to make your life easier.  Specifically, drop Python 2.5 and earlier and you will thank yourself[1].  If you absolutely cannot do this, consider waiting to port to Python 3.  Note that while Python 2.7 has a few additional conveniences for supporting both Python 2 and Python 3 in a single code base, I did not find them compelling enough to drop Python 2.6 support.
  • Where you have C types with reprs, make those reprs return unicodes in both versions.  Many dbus-python types have somewhat complicated reprs because they return different strings depending on whether their variant_levels are zero or non-zero.  #ifdef'ing all of these was just too much work. Because most code probably doesn't care about the specific type of the repr, and because Python 2 allows unicode reprs, and because I have a very clever hack for this[2], I decided to make all reprs return unicodes in both versions of Python.
  • Include the following __future__ imports in your Python code: print_function, absolute_import, and unicode_literals.  In Python 2.6 and 2.7, these enable features that are the default in Python 3, and so make it easier to support both with one codebase.  Specifically, change all your print statements to print() functions, and remove all your u'' prefixes from your unicode literals.  Be sure to b'' prefix all your byte literals[3].
  • Wherever possible, in your extension modules, change all your PyInts to PyLongs.  In dbus-python, this means that the variant_level attributes are longs in both Python versions, as are values that represent such things as UNIX file descriptors.  The only place where I kept PyInts in Python 2 (and their requisite #ifdefs to use PyLongs in Python 3) was in the numeric stack inheritance hierarchy, mostly so that Python 2 code which cares about such things would not have to change.
  • Define a Python variable and a C macro for determining whether you're running in Python 2 or Python 3.  The former is used in dbus-python because under Python 3, there is no UTF8String type any more, among other subtle differences.  The latter is used to simplify the #ifdef tests where they're needed[4].
  • In your C code, #include <bytesobject.h>.  This header exposes aliases for all PyString calls so that you can use the Python 3 idiom of PyBytes.  Then globally replace all PyString_Foo() calls with PyBytes_Foo() and the code will look clean and be compilable under both versions of Python.  You may need to add explicit PyUnicode calls where you need to discern between bytes and strings, but again, this code will be completely portable between Python 2 and Python 3.
  • Try to write your functions to accept both unicodes and bytes, but always normalize them to one type or the other for internal use, and choose one or the other to return.  Some Python stdlib methods are polymorphic in that they return bytes when handed bytes, and unicodes when handed unicodes.  This can be convenient in some cases, but problematic in others.  Choose carefully when porting your APIs.
  • Don't use trailing-L long literals if you can help it.
  • Switch to using Py_TYPE() everywhere instead of dereferencing ob_type explicitly.  The structures are laid out differently between Python 2 and Python 3, and this Python-supplied macro hides the ugliness from you.
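For the Python-level advice in the list above (the __future__ imports, and accepting both types while normalizing to one), a minimal sketch might look like this.  The names text_type and to_text are hypothetical helpers, not anything from dbus-python, and UTF-8 is an assumed default encoding.

```python
from __future__ import absolute_import, print_function, unicode_literals

import sys

# On Python 2 the text type is `unicode`; on Python 3 it is `str`.
text_type = str if sys.version_info[0] >= 3 else unicode  # noqa: F821


def to_text(value, encoding='utf-8'):
    """Accept bytes or text, always return text."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    if isinstance(value, text_type):
        return value
    raise TypeError('expected bytes or text, got %s' % type(value).__name__)


print(to_text(b'org.freedesktop.DBus'), to_text('org.freedesktop.DBus'))
```

Normalizing at the edges like this means the rest of the module only ever sees one type, so comparisons and dictionary lookups behave the same on both interpreters.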
Here are a few other miscellaneous issues you should be aware of:

Metaclasses are defined differently in Python 2 and Python 3, and you cannot write any Python code snippet that is even compilable between the two.  That's because the syntax for defining a class that uses a metaclass in Python 3 is illegal in Python 2.  Your module simply won't compile.  My solution was to use exec() on a string.  For this reason, I suggest keeping metaclass subclasses as simple as possible, so that string is nice and small.
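The exec() approach can be sketched roughly as follows.  Meta, Widget, and the tag attribute are made-up names for illustration, not what dbus-python uses; the point is only that each interpreter compiles just the class statement it understands.

```python
import sys


class Meta(type):
    """Hypothetical metaclass that stamps an attribute on its classes."""

    def __new__(mcs, name, bases, namespace):
        namespace.setdefault('tag', 'made-by-Meta')
        return super(Meta, mcs).__new__(mcs, name, bases, namespace)


# Keep the version-specific syntax inside a string so the module itself
# compiles under both Python 2 and Python 3.
if sys.version_info[0] >= 3:
    _source = "class Widget(object, metaclass=Meta):\n    pass\n"
else:
    _source = "class Widget(object):\n    __metaclass__ = Meta\n"

_namespace = {'Meta': Meta, '__name__': __name__}
exec(_source, _namespace)
Widget = _namespace['Widget']
```

The smaller the class body in the string, the less code ends up hidden from editors and linters, which is the reason for keeping such classes simple.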


Get rid of all your uses of iteritems(), iterkeys(), itervalues(), and xrange().  You probably don't need the optimization these provide, and they do not exist in Python 3.  You can conditionalize around them, but I think in most cases it's not worth it.  If you really need the optimization, then you'll have to figure out a way around the missing names in Python 3.  But note that Python 3 is already more efficient for the first three, since you get back dictview objects instead of concrete lists.
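A short sketch of what the plain spellings look like in practice; the counts dictionary is just sample data.

```python
counts = {'signals': 3, 'methods': 5}

# .items() returns a list on Python 2 and a view on Python 3; both iterate fine.
total = 0
for name, count in counts.items():
    total += count

# range() is a list on Python 2 and a lazy sequence on Python 3.
squares = [n * n for n in range(4)]

# Wrap in list() or sorted() where a real list is needed, for example
# for indexing or for mutating the dictionary while looping over its keys.
names = sorted(counts.keys())
```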


PyArg_Parse() and friends lack a 'y' code in Python 2.  In Python 3, these return bytes objects.  Where I absolutely needed bytes in Python 3 and strs in Python 2, I just #ifdef'd around the PyArg_Parse() calls.  In Python 3, there's no equivalent of 'z' for bytes objects (which accept Nones and set the output variable to NULL in that case).  If this is important to you, you might need to write an O& converter.


Watch out for next() vs. __next__() when writing iterators.  Python 2 uses the former while Python 3 uses the latter.  Best to define the method once, and then support compatibility via `next = __next__` in your class definition.
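Here is a minimal sketch of that pattern; Countdown is a hypothetical class used only to show the aliasing.

```python
class Countdown(object):
    """Iterate from `start` down to 1 (hypothetical example class)."""

    def __init__(self, start):
        self._current = start

    def __iter__(self):
        return self

    def __next__(self):          # Python 3 iterator protocol
        if self._current <= 0:
            raise StopIteration
        value = self._current
        self._current -= 1
        return value

    next = __next__              # Python 2 iterator protocol


values = list(Countdown(3))
```

The built-in next() function exists in Python 2.6 and later, so callers can write next(iterator) and never touch either method name directly.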


operator.isSequenceType() is gone in Python 3.  Here's the code I use for compatibility:


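def is_sequence(obj):
    try:
        # Python 3.3 and later; the collections.Sequence alias is gone in 3.10.
        from collections.abc import Sequence
    except ImportError:
        try:
            # Python 2.6 and 2.7.
            from collections import Sequence
        except ImportError:
            # Older Python 2 only.
            from operator import isSequenceType
            return isSequenceType(obj)
    return isinstance(obj, Sequence)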


If you by chance use PyCObjects in your extension module, you'll have to switch these to PyCapsules for Python 3.  If you're lucky enough to be able to drop Python 2.6, you can use PyCapsules everywhere, since they are available in Python 2.7.


Let me close by saying that you shouldn't be frightened off by the prospect either of porting your code to Python 3, or supporting both Python 2 and Python 3 in a single code base.  It's definitely doable, and we in the Python community are gaining more experience at it every day.  I strongly feel that we are well on track toward Guido's original goal of mainstream Python 3 acceptance within 5 years of Python 3's release.  I think we're soon going to see a critical mass of Python 3 ports, after which time, you'll just seem old and creaky if you don't port to Python 3.


There are some other excellent references for helping you port out there on the 'net, and for the most part, I've tried not to duplicate their information.  Here are some useful places to start:
Enjoy!

Footnotes:
[1] It is not impossible to support both Python 3 and versions of Python 2 earlier than 2.6, just more difficult.  Michael Foord has had success doing this for libraries of his such as mock.  I just think it's more trouble than it's worth in most cases.


[2] Here's the clever hack, but first a set-up.  The reprs of many of the dbus-python objects are conditional on whether the variant_level is zero or not.  The variant_level is only included in the repr when it is greater than zero (with zero being the typical value).  This just means there are usually two calls to PyUnicode_FromFormat() in each C repr implementation, and #ifdef'ing them to use PyString_FromFormat() in Python 2 would just double the pain.  In addition, the reprs all include the repr of their parent objects, i.e. their base class repr.  The problem is that these base-class reprs will be PyBytes in Python 2 and PyUnicodes in Python 3, and there's nothing we can do about that.  As it turns out, Python 2.6 and Python 3.2 have a %V format with some very interesting semantics.  %V consumes two arguments, a PyObject* and a char*, but it only uses one of them.  When the first argument is not NULL, it uses that and ignores the second argument.  But when the first argument is NULL, it will use the second argument.


How can this help produce portable code?  I define the following macro and use this everywhere the %V format code is given:
#define REPRV(obj) \
    (PyUnicode_Check(obj) ? (obj) : NULL), \
    (PyUnicode_Check(obj) ? NULL : PyBytes_AS_STRING(obj))
which would be used at a call site something like this:

return PyUnicode_FromFormat("...%V...", REPRV(parent_repr));

In Python 2, where parent_repr is a PyBytes, REPRV() will return NULL as the first argument, and via PyBytes_AS_STRING(), a char* in the second argument. In Python 3, where parent_repr is a PyUnicode, the first argument will just be the object and the second argument will be NULL (but it is ignored by Python).  As long as parent_repr is either a PyUnicode or a PyBytes (a.k.a. PyString), this works perfectly, and keeps the call sites simple and sane.  Beware though because if parent_repr can be any other type, this will crash your program.  Fortunately, Python doesn't allow for arbitrary repr types - they must be bytes or unicodes, so in practice this is pretty safe.


[3] A recent thread in python-dev points out that this recommendation may not be practical if you're building PEP 3333-compliant WSGI applications.  My take on it is that PEP 3333's definition of "native strings" is a mistake, but sadly one that we have to live with for now.


[4] Here's what my Python-level flag looks like:
import sys
is_py3 = getattr(sys.version_info, 'major', sys.version_info[0]) == 3
Now I can use this in other code to switch behavior between Python 2 and Python 3.  For example, in dbus-python to import the UTF8String type in Python 2 only:
from dbus import is_py3
if not is_py3:
    from _dbus_bindings import UTF8String
This is much easier and less error prone than doing the sys.version_info test everywhere.  The other problem is that sys.version_info is a namedtuple only in Python 2.7 and later, so in Python 2.6, it has no attribute called 'major'.


The C-level macro looks like this:
#if PY_MAJOR_VERSION >= 3
#define PY3K
#endif
So now C code only needs to do:

#ifdef PY3K
/* Do something Python 3-ish */
#else
/* Do something Python 2-ish */
#endif

You might also find the six package to be useful here, at least for writing portable Python code.