Some of the changes are specific to the dbus-python project, and I included a detailed summary of those changes and my rationale behind them. There are lots of good lessons learned during this porting exercise that I want to share with you, have a discussion about, and see if there aren't things we core Python developers can do in Python 3.3 to make it even easier to migrate to Python 3.
First, some background. D-Bus is a freedesktop.org project for same-system interprocess communication, and it's an essential component of any Linux desktop. The D-Bus system and C API are mature and well-defined, and there are bindings available for many programming languages, Python included of course. The existing dbus-python package is only compatible with Python 2, and most recommendations are to use the Gnome version of Python bindings should you want to use D-Bus with Python 3. For us in Ubuntu, this isn't acceptable though because we must have a solution that supports KDE and potentially even non-UI based D-Bus Python servers. Several ports of dbus-python to Python 3 have been attempted in the past, but none have been accepted upstream, so naturally I took it as a challenge to work on a new version of the port. After some discussion with the upstream maintainer Simon McVittie, I had a few requirements in mind:
- One code base for both Python 2 and Python 3. It's simply too difficult to support multiple development branches, so one branch must be compilable in both versions of Python. Because dbus-python is not setuptools-based, I chose not to rely on 2to3 to auto-convert the Python layer. This is more difficult, but given the next requirement, entirely possible.
- Minimum Python versions to support are 2.6 and 3.2 (Python 2.7 is also supported). Python 2.6 contains almost everything you need to do a high quality port of both the Python layer and the C extension layer with a single code base. Python 2.7 has one or two additional helpers, but they aren't important enough to count Python 2.6 out. For dbus-python, this specifically means dropping support for Python 2.5, which is more than 5 years old at the time of this writing. Also, it makes no sense to support Python 3.0 or 3.1 as neither of those is in widespread use.
- Minimize any API changes seen by Python 2 code, and minimize the changes needed to port clients to Python 3. For the former, this means everything from keeping Python APIs unchanged to keeping the inheritance hierarchy the same. Python 2 programs will see a few small changes after the application of my patches; I'll describe them below but they should be inconsequential for the vast majority of Python 2 applications. While it's unavoidable that Python 3 applications will see a different API, these differences have been minimized.
I also made the decision to change some object types to longs in both versions of Python, where I thought it was highly unlikely that Python clients would care. Specifically, many dbus objects have a variant_level attribute, which is usually zero, but can be any positive integer. For implementation simplicity, I changed these to longs in Python 2 also.
Ah, bytes vs. strings is always where things get interesting when porting to Python 3. It's the single most brain hurty exercise you will have to go through. Remember that Python 2 lets you cheat. If you're not sure whether the entity you're dealing with is some bytes, or some (usually ASCII-encoded) string, just use a Python 2 str type (a.k.a. 8-bit string) and let Python's automatic conversion rules change it to a unicode when the two types meet. You can't get away with this in Python 3 though, for very good reasons - it's error prone, and can lead to data corruption or the annoyingly ubiquitous and hard to predict UnicodeErrors.
In Python 3, you must be clear about what are bytes and what are strings (i.e. unicodes), and you must be explicit when converting between the two. Yes, this can be painful at times but in my opinion, it's crucial that you do so. It's that important to eliminate UnicodeErrors that you can't defend against and your users won't understand or be able to correct. Once you're clear in your own mind as to which are strings and which are bytes, it's usually not that hard to reflect that clearly in your code, especially if you leave Python 2.5 and anything earlier behind, which I highly recommend.
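To make the point about explicit conversion concrete, here is a minimal sketch. The names text, data, and try_mix are purely illustrative, and the result noted in the comment describes Python 3 behavior.

```python
from __future__ import print_function, unicode_literals

text = 'caf\u00e9'                 # text: unicode in Python 2, str in Python 3
data = text.encode('utf-8')        # explicit conversion, text -> bytes
roundtrip = data.decode('utf-8')   # explicit conversion, bytes -> text


def try_mix(a, b):
    """Return a + b, or None if the two types refuse to combine."""
    try:
        return a + b
    except (TypeError, UnicodeError):
        # Python 3 raises TypeError when str meets bytes. Python 2 attempts
        # an implicit ASCII decode instead, which fails here with a
        # UnicodeError because the bytes are not pure ASCII.
        return None


mixed = try_mix(text, data)        # None under Python 3
print(repr(data), roundtrip == text, mixed)
```

The explicit encode() and decode() calls are the only places where the two types meet, which is exactly the discipline Python 3 enforces.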
dbus-python presented an interesting challenge here. It has several data types in its C API that are defined as UTF-8 encoded char*'s. At first blush, it seemed to me that these should be reflected in Python 3 as bytes objects to simplify the conversion in the extension module to and from char*'s. It turns out that this was a bad idea from an implementation standpoint, and dbus-python's upstream maintainer had already expressed his opinion that these data types should be exposed as unicodes in Python 3. After having failed at my initial attempts at making them bytes, I now agree that they must be unicodes, both for implementation simplicity and for minimal impact on porting user code.
The biggest problem I ran into with the choice of bytes is that the callback dispatch code in dbus-python is complex, difficult to understand and debug, driven by external data, and written with a deep assumption of operating on strings. For example, when the dbus C API receives a signal, it must determine whether there is a Python function registered to handle that signal, and it does this by comparing a number of client-registered parameters, such as the method name, the interface, and the object path. If the dbus C API was turning these parameters into bytes, but the clients had registered strings, then the comparisons in the callback dispatch routines would fail, either loudly with an exception, or silently with failing comparisons. The former were relatively easy to track down and fix, by explicitly encoding client-registered strings to bytes. But the latter, silent failures, were nearly impossible to debug. Add to that the fact that there were so many roads into the registration system, that it was also very difficult to coerce all incoming data early enough so that coercion wasn't necessary at comparison time. I was left with the unappealing alternative of forcing all client code to also change their data from using strings to using bytes, which I realized would be much too high a burden on clients porting their applications to Python 3. Simon was right, but it was a useful exercise to fail at anyway.
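The silent failure mode is easy to reproduce in miniature. The following is a toy sketch, not dbus-python's actual dispatch code: handlers, register, and dispatch are hypothetical names, and the interface and member strings are made up. Under Python 3, a lookup with bytes never matches a handler registered with strings, and nothing raises.

```python
handlers = {}  # hypothetical registry, keyed by (interface, member)


def register(interface, member, callback):
    """Record a callback the way a client might, using text strings."""
    handlers[(interface, member)] = callback


def dispatch(interface, member):
    """Call the matching handler; return None if nothing matches."""
    callback = handlers.get((interface, member))
    if callback is None:
        return None  # no exception, the signal is simply dropped
    return callback()


register('com.example.Demo', 'Ping', lambda: 'pong')

hit = dispatch('com.example.Demo', 'Ping')      # str keys match
miss = dispatch(b'com.example.Demo', b'Ping')   # bytes keys do not (Python 3)
print(hit, miss)
```

Because bytes and str never compare equal in Python 3, the second call falls through without any error, which is the kind of failure that is so hard to trace back to its cause.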
(By way of comparison, it took me the better part of a week and a half to try to get the test suite passing when these objects were bytes, which I was ultimately unable to do, and about a day to get them passing when everything was unicodes. That's gotta tell you something right there, and hopefully not that "I suck" :).
Let's look at some practical advice that may help you in your own porting efforts.
- Target nothing older than Python 2.6 or Python 3.2. I mentioned this before, but it's really going to make your life easier. Specifically, drop Python 2.5 and earlier and you will thank yourself[1]. If you absolutely cannot do this, consider waiting to port to Python 3. Note that while Python 2.7 has a few additional conveniences for supporting both Python 2 and Python 3 in a single code base, I did not find them compelling enough to drop Python 2.6 support.
- Where you have C types with reprs, make those reprs return unicodes in both versions. Many dbus-python types have somewhat complicated reprs because they return different strings depending on whether their variant_levels are zero or non-zero. #ifdef'ing all of these was just too much work. Because most code probably doesn't care about the specific type of the repr, and because Python 2 allows unicode reprs, and because I have a very clever hack for this[2], I decided to make all reprs return unicodes in both versions of Python.
- Include the following __future__ imports in your Python code: print_function, absolute_import, and unicode_literals. In Python 2.6 and 2.7, these enable features that are the default in Python 3, and so make it easier to support both with one codebase. Specifically, change all your print statements to print() functions, and remove all your u'' prefixes from your unicode literals. Be sure to b'' prefix all your byte literals[3].
- Wherever possible, in your extension modules, change all your PyInts to PyLongs. In dbus-python, this means that the variant_level attributes are longs in both Python versions, as are values that represent such things as UNIX file descriptors. The only place where I kept PyInts in Python 2 (and their requisite #ifdefs to use PyLongs in Python 3) was in the numeric stack inheritance hierarchy, mostly so that Python 2 code which cares about such things would not have to change.
- Define a Python variable and a C macro for determining whether you're running in Python 2 or Python 3. The former is used in dbus-python because under Python 3, there is no UTF8String type any more, among other subtle differences. The latter is used to simplify the #ifdef tests where they're needed[4].
- In your C code, #include <bytesobject.h>. This header exposes aliases for all PyString calls so that you can use the Python 3 idiom of PyBytes. Then globally replace all PyString_Foo() calls with PyBytes_Foo() and the code will look clean and be compilable under both versions of Python. You may need to add explicit PyUnicode calls where you need to discern between bytes and strings, but again, this code will be completely portable between Python 2 and Python 3.
- Try to write your functions to accept both unicodes and bytes, but always normalize them to one type or the other for internal use, and choose one or the other to return. Some Python stdlib methods are polymorphic in that they return bytes when handed bytes, and unicodes when handed unicodes. This can be convenient in some cases, but problematic in others. Choose carefully when porting your APIs.
- Don't use trailing-L long literals if you can help it.
- Switch to using Py_TYPE() everywhere instead of dereferencing ob_type explicitly. The structures are laid out differently between Python 2 and Python 3, and this Python-supplied macro hides the ugliness from you.
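As an illustration of the advice about accepting both types and normalizing to one, here is a minimal Python-level sketch. The helpers to_text and make_object_path are hypothetical and are not part of dbus-python. The function accepts bytes or unicodes at the boundary, works only with unicodes internally, and always returns a unicode. It also uses the __future__ imports recommended in the list, so the same source runs under Python 2.6, 2.7, and 3.

```python
from __future__ import absolute_import, print_function, unicode_literals


def to_text(value, encoding='utf-8'):
    """Return value as a unicode string, decoding it if it is bytes."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


def make_object_path(*parts):
    """Join path components, given as bytes or text, into one text path."""
    # Normalize every component first so that bytes and text never mix.
    return '/' + '/'.join(to_text(part) for part in parts)


path = make_object_path(b'org', 'example', b'Demo')
print(path)
```

Choosing a single return type avoids the polymorphic behavior mentioned above, at the cost of one decode at the boundary.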
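def is_sequence(obj):
    try:
        # Preferred: the abstract base class, available since Python 2.6.
        from collections import Sequence
    except ImportError:
        # Fallback for Pythons that lack collections.Sequence.
        from operator import isSequenceType
        return isSequenceType(obj)
    else:
        return isinstance(obj, Sequence)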
http://python3porting.com/toc.html
http://docs.python.org/howto/cporting.html
http://docs.python.org/py3k/c-api/index.html
#define REPRV(obj) \
    (PyUnicode_Check(obj) ? (obj) : NULL), \
    (PyUnicode_Check(obj) ? NULL : PyBytes_AS_STRING(obj))

return PyUnicode_FromFormat("...%V...", REPRV(parent_repr));
import sys
is_py3 = getattr(sys.version_info, 'major', sys.version_info[0]) == 3

from dbus import is_py3
if not is_py3:
    # UTF8String only exists under Python 2.
    from _dbus_bindings import UTF8String
#if PY_MAJOR_VERSION >= 3
#define PY3K
#endif
#ifdef PY3K
/* Do something Python 3-ish */
#else
/* Do something Python 2-ish */
#endif
You might also find the six package to be useful here, at least for writing portable Python code.
