Tuesday, April 24, 2012

Python 3 on the desktop for Quantal Quetzal

So, now all the world now knows that my suggested code name for Ubuntu 12.10, Qwazy Quahog, was not chosen by Mark.  Oh well, maybe I'll have more luck with Racy Roadrunner.

In any case, Ubuntu 12.04 LTS is to be released any day now so it's time for my semi-annual report on Python plans for Ubuntu.  I seem to write about this every cycle, so 12.10 is no exception.  We've made some fantastic progress, but now it's time to get serious.

For Ubuntu 12.10, we've made it a release goal to have Python 3 only on the desktop CD images.  The usual caveats apply: Python 2.7 isn't going away; it will still probably always be available in the main archive.  This release goal also doesn't affect other installation CD images, such as server, or other Ubuntu flavors.  The relatively modest goal then only affects packages for the standard desktop CD images, i.e. the alternative installation CD and the live CD.

Update 20120425: To be crystal clear,  if you depend on Python 2.7, the only thing that changes for you is that after a fresh install from the desktop CD on a new machine, you'll have to explicitly apt-get install python2.7.  After that, everything else will be the same.

This is ostensibly an effort to port a significant chunk of Ubuntu to Python 3, but it really is a much wider, Python-community driven effort.  Ubuntu has its priorities, but I personally want to see a world where Python 3 rules the day, and we can finally start scoffing at Python 2 :).

Still, that leaves us with about 145 binary packages (and many fewer source packages) to port.  There are a few categories of packages to consider:

  • Already ported and available.  This is the good news, and covers packages such as dbus-python.  Unfortunately, there aren't too many others, but we need to check with Debian and make sure we're in sync with any packages there that already support Python 3 (python3-dateutil comes to mind).
  • Upstream supports Python 3, but it is not yet available in Debian or Ubuntu.  These packages should be fairly easy to port, since we have pretty good packaging guidelines for supporting both Python 2 and Python 3.
  • Packages with better replacements for Python 3.  A good example is the python-simplejson package.  Here, we might not care as much because Python 3 already comes with a json module in its standard library, so code which depends on python-simplejson and is required for the desktop CD, should be ported to use the stdlib json module.  python-gobject is another case where porting is a better option, since pygi (gobject-introspection) already supports Python 3.
  • Canonical is the upstream.  Many packages in the archive, such as python-launchpadlib and python-lazr.restfulclient are developed upstream by Canonical.  This doesn't mean you can't or shouldn't help out with the porting of those modules, it's just that we know who to lean on as a last resort.  By all means, feel free to contribute to these too!
  • Orphaned by upstream.  These are the most problematic, since there's essentially no upstream maintainer to contribute patches to.  An example is python-oauth.  In these cases, we need to look for alternatives that are maintained upstream, and open to porting to Python 3.  In the case of python-oauth, we need to investigate oauth2, and see if there are features we're using from the abandoned package that may not be available in the supported one.
  • Unknowns.  Well, this one's the big risky part because we don't know what we don't know.
We need your help!  First of all, there's no way I can personally port everything on our list, including both libraries and applications.  We may have to make some hard choices to drop some functionality from Ubuntu if we can't get it ported, and we don't want to have to do that.  So here are some ways you can contribute:
  • Fill in the spreadsheet with more information.  If you're aware of an upstream or Debian port to Python 3, let us know.  It may make it easier for someone else to enable the Python 3 version in Debian, or to shepherd the upstream patch to landing on their trunk.
  • Help upstream make a Python 3 port available.  There are lots of resources available to help you port some code, from quick references to in-depth guides.  There's also a mailing list (and Gmane newsgroup mirror) you can join to get help, report status, and have other related discussions. Some people have asked Python 3 porting questions on StackOverflow, using the tags #python, #python-3.x, and #porting
  • Join us on the #python3 IRC channel on Freenode.
  • Subscribe to the python-porting mailing list.
  • Get packages ported in Debian.  Once upstream supports Python 3, you can extend the existing Debian package to expose this support into Debian.  From there, you or we can make sure that gets sync'd into Ubuntu.
  • Spread the word!  Even if you don't have time to do any ports yourself, you can help publicize this effort through social media, mailing lists, and your local Python community.  This really is a Python-wide effort!
Python 3.3 is scheduled to be released later this year.  Please help make 2012 the year that Python 3 reached critical mass!

 -----------------------------

On a more personal note, I am also committed to making Mailman 3 a Python 3 application, but right now I'm blocked on a number of dependencies.  Here are the list of dependencies from the setup.py file, and their statuses.  I would love it if you help get these ported too!
Of course, these are only the direct dependencies.  Others that get pulled in include:


Wednesday, January 18, 2012

Debian packaging for Python 2 and 3

Time for another installment of my ongoing mission to convert the world to Python 3!  This time, a little Debian packaging-fu for modifying an existing Python 2 package to include support for Python 3 from the same source package.

Today, I added a python3-feedparser package to Ubuntu Precise.  What's interesting about this is that, despite various reported problems, upstream feedparser 5.1 claims to support Python 3, via 2to3 conversion.  And indeed it does (although the test suite does not).

Before today, Ubuntu had feedparser 5.0.1 in its archive, and while some work has been done to update the Debian package to 5.1, this has not been released.  The uninteresting precursor to Python 3 packaging was to upgrade the Ubuntu version of the python-feedparser source package to 5.1.  I'll spare you the boring details about missing data files in the upstream tarball, and other problems, since they don't really relate to the Python 3 effort.

The first step was to verify that feedparser 5.1 works with Python 3.2 in a virtualenv, and indeed it does.  This is good news because it means that the setup.py does the right thing, which is always the best way to start supporting Python 3.  I've found that it's much easier to build a solid Debian package if you have a solid setup.py in upstream to begin with.

Now, what I'd like to do is to give you a recipe for modifying your existing debian/ directory files to add Python 3 support to a package that already exists for Python 2.  This is a little trickier for feedparser because it used an older debhelper standard, and carried some crufty old stuff in its rules file.  My first step was to update this to debhelper compatibility level 8 and greatly simplify the debian/rules file.  Here's what it might have looked like with just Python 2 support, so let's start there.


#!/usr/bin/make -f
export DH_VERBOSE=1

%:
    dh $@ --with python2

override_dh_auto_clean:
    dh_auto_clean
    rm -rf build .*egg-info

override_dh_auto_test:
ifeq (,$(filter nocheck,$(DEB_BUILD_OPTIONS)))
    cd feedparser && python ./feedparsertest.py
else
    @echo "nocheck set, not running tests"
endif

override_dh_installdocs:
    dh_installdocs -Xtests

This is all pretty standard stuff.  dh_python2 is used (the --with python2 option to dh), and we just provide a couple of overrides for idiosyncrasies in the feedparser package.  We clean a couple of extra things that aren't cleaned automatically, and we run the test suite in the slightly non-standard way that upstream requires.  Also, we override the installation of a huge amount of test files that would otherwise get installed as documentation (they aren't docs).

So far so good.  What do we have to do to add support for Python 3?

First, we need to make a few modifications to the debian/control file.  The current convention with dh_python2 is to use an X-Python-Version header in the source package stanza, so we just need to add this header to the same stanza for Python 3:

X-Python3-Version: >= 3.2

This just says we support any Python 3 version from 3.2 onwards.  You also need to add a few additional packages to the Build-Depends.  In the feedparser case, I added the following build dependencies: python3, python3-chardet, python3-setuptools.  Even though for Python 2 there are a couple of other build dependencies (e.g. python-libxml2 and python-utidylib) these aren't available for Python 3, but lucky for us, they are optional anyway.

Next, you need to add a new binary package stanza.  There was already a python-feedparser binary package stanza for Python 2 support.  In Debian, Python 3 is provided as a separate stack, meaning packages for Python 3 will always start with the python3- prefix.  Thus, it is pretty easy to just copy the python-feedparser stanza and paste it to the bottom of debian/rules, changing the package name to python3-feedparser.  You have to update the Depends line to use ${python3:Depends} and I updated the Recommends line to name python3-chardet, and that was about it.  Here's what the new stanza looks like:


Package: python3-feedparser
Architecture: all
Depends: ${misc:Depends}, ${python3:Depends}
Recommends: python3-chardet
Description: Universal Feed Parser for Python
 Python module for downloading and parsing syndicated feeds. It can
 handle RSS 0.90, Netscape RSS 0.91, Userland RSS 0.91, RSS 0.92, RSS
 0.93, RSS 0.94, RSS 1.0, RSS 2.0, Atom, and CDF feeds.
 .
 It provides the same API to all formats, and sanitizes URIs and HTML.
 .
 This is the Python 3 version of the package.
Again, so far so good.  Now let's look at the debian/rules file.

The first thing to do is to add support for dh_python3, which is analogous to dh_python2, and is the only accepted helper for Python 3.  The rules line then becomes:


%:
    dh $@ --with python2,python3
Now, one problem with debhelper is that it doesn't have any built-in support for Python 3 like it does for Python 2.  This means dh will not automatically build or install any Python 3 packages, so you have to do this manually.  Eventually, this will be fixed, and fortunately with a solid setup.py file, you don't have to do to much, but it's something to be aware of.  In the feedparser case, we need to add overrides for dh_auto_build and dh_auto_install.  Here's what these rules look like:


override_dh_auto_build:
    dh_auto_build
    set -ex; for python in $(shell py3versions -r); do \
        $$python setup.py build; \
    done;

override_dh_auto_install:
    dh_auto_install
    set -ex; for python in $(shell py3versions -r); do \
        $$python setup.py install --root=$(CURDIR)/debian/tmp --install-layout=deb; \
    done;
    cp feedparser/sgmllib3.py $(CURDIR)/debian/tmp/usr/lib/python3/dist-packages/feedparser_sgmllib3.py
Not too bad, eh?  You'll notice that the first thing these rules do is call the standard dh_auto_build and dh_auto_install respectively.  This preserves the Python 2 support.  Then we just loop over all the available Python 3 versions, doing a fairly normal equivalent of setup.py install (split into a build step and an install step).  The install rule looks a little odd, but should be familiar to Debian Python hackers.  It just installs the package into the proper Debian locations, and will pretty much be the same for any Python 3 package you build.

The one odd bit is the last line in the override_dh_auto_install rule.  This is there just to work around an peculiarity in the feedparser 5.1 upstream package, where it depends on sgmllib.py, but that is no longer in the Python standard library in Python 3.  Upstream provides an already 2to3 converted version of it, and recommends you install the module as sgmllib.py somewhere on your Python 3 sys.path.  Well, I don't like the namespace pollution that would cause, so I install the file as feedparser_sgmllib3.py and add a quilt patch to the package to try an import of that module if importing sgmllib fails (as it will on Python 3).

An aside: If you look in the debian/rules file for what I actually uploaded, you'll see some additional modifications to override_dh_auto_test.  This just works around the upstream bug where some test suite data files were accidentally omitted from the release tarball.  You can pretty much ignore those lines for the purposes of this article.

We're almost done.  The last thing we need to do is make sure that debhelper installs the right files into the right binary packages.  We want the python-feedparser binary package to include only the Python 2 files, and the python3-feedparser binary package to only include the Python 3 files.  Keep in mind that when a source package builds only a single binary package (as was the case before I added Python 3 support), debhelper will include everything under the build directory's debian/tmp subdirectory in the single binary package.  That's why you see things get installed into $(CURDIR)/debian/tmp.  But when a source package builds multiple binary packages, as is now the case here, we have to tell debhelper which files go into which binary packages.  We do this by adding two new files to the debian directory: python-feedparser.install and python3-feedparser.install

Reading the manpage for dh_install will explain the reasons for this, and describe the format of the file contents.  In our case, we're really lucky, because for Python 2, everything gets installed under usr/lib/python2.* and in Python 3, everything gets installed under usr/lib/python3 (relative to $(CURDIR)/debian/tmp).  You'll notice a few things here.  Because we could be building for multiple versions of Python 2, we have to wildcard the actual directory under usr/lib, e.g. it might be python2.6 or python2.7.  But because we have PEP 3147 and PEP 3149 in Python 3.2, there's only one directory for all supported versions of Python 3, so we don't need to wildcard the subdirectory.  Also, if you look at the actual .install files in the package, you'll see a few other trailing path components, so the actual contents of the files are:


usr/lib/python2.*/*-packages/*
and


usr/lib/python3/*-packages/*
for the python-feedparser.install and python3-feedparser.install files respectively.  The trailing bits just wildcard what on a Debian system will always be dist-packages, just for safety (cargo culting FTW!).

And that really is it!  Of course, things could be a little more complicated if you have extension modules, but maybe not that much more so, and if the package you're adding Python 3 support to isn't setuptools-based, you may have more work to do even still.  The feedparser package has a few other oddities that are really unrelated to adding Python 3 support, so I'm ignoring them here, but feel free to ask for additional details in the comments, in IRC, or in email.

Hopefully this gives you some insight into how to extend an existing Python 2 Debian package into including Python 3 support, given that your upstream already supports Python 3.  Now, go forth and hack!

Addendum: my colleague Colin Watson just today packaged up Benjamin Peterson's very fine Python package called six.  This is a nice package that provides some excellent Python 2 and 3 compatibility utilities.  You may find this helpful if you're trying to support both Python 2 and Python 3 in a single code base, especially if you have to support back to Python 2.4 (poor you :).  This will be available in Ubuntu Precise, although if you're submitting patches back upstream, you may have to convince the upstream author to accept the additional dependency.  It's worth it to add a little more Python 3 love to the world.

Friday, January 6, 2012

Python 3 Porting Fun Redux

My last post on Python 3 porting got some really great responses, and I've learned a lot from the feedback I've seen.  I'm here to rather briefly outline a few additional tips and tricks that folks have sent me and that I've learned by doing other ports since then.  Please keep them coming, either in the blog comments or to me via email.  Or better yet, blog about your experiences yourself and I'll link to them from here.

One of the big lessons I'm trying to adopt is to support Python 3 in pure-Python code with a single code base.  Specifically, I'm trying to avoid using 2to3 as much as possible.  While I think 2to3 is an excellent tool that can make it easier to get started supporting both Python 2 and Python 3 from a single branch of code, it does have some disadvantages.  The biggest problem with 2to3 is that it's slow; it can take a long time to slog through your Python code, which can be a significant impediment to your development velocity.  Another 2to3 problem is that it doesn't always play nicely with other development tools, such as python setup.py test and virtualenv, and you occasionally have to write additional custom fixers for conversion that 2to3 doesn't handle.

Given that almost all the code I'm writing these days targets Python 2.6 as the minimal supported Python 2 version, 2to3 may just be unnecessary.  With my dbus-python port to Python 3, and with my own flufl packages, I'm experimenting with ignoring 2to3 and trying to write one code base for all of Python 2.6, 2.7, and 3.2.  My colleague Michael Foord has been pretty successful with this approach going back all the way to Python 2.4, so 2.6 as a minimum should be no problem!  C extensions are pretty easy because you have the C preprocessor to help you.  But it turns out that it's usually not too difficult in pure-Python either.  I've done this in my latest release of the flufl.bounce package, and intend to eliminate 2to3 in my other flufl packages soon too.

The first thing I've done is add print_function to the __future__ import in all my modules.  Previously, I was only importing unicode_literals and absolute_import.  But doctests tend to use a lot of print statements, so switching to the print() function explicitly removes one big 2to3 conversion.  Aside from having to unlearn decades of print statement muscle memory, the print() function is actually rather nice.  So my module template now looks like this (with the copyright comment block omitted):

from __future__ import absolute_import, print_function, unicode_literals

__metaclass__ = type
__all__ = [
    ]

Speaking of doctests, you really want them to have the same set of future imports as all your other code.  I'll talk more about how my own packages set up doctests later, but for now, it's useful to know that I create a doctest.DocFileSuite for every doctest in my package.  These suites all have a setup() function and Python's testing framework will call these at the appropriate time, passing in a testobj parameter.  This argument has a globs attribute which serves as the module globals for the doctest.  All you need to do to enable the future imports in your doctests is to do something like this:

def setup(testobj):
    try:
        testobj.globs['absolute_import'] = absolute_import
        testobj.globs['print_function'] = print_function
        testobj.globs['unicode_literals'] = unicode_literals
    except NameError:
        pass

The try-except really is only necessary if you keep using 2to3, since that tool will remove the future imports from all the modules it processes.  The future imports still exist in Python 3 of course, since future imports are never, ever removed.  So if you ditch 2to3, you can get rid of the try-except too.

In the latest release of flufl.bounce, I changed the API so that the detected email addresses are all explicitly bytes objects in Python 3 (and 8-bit strings in Python 2).  This caused some problems with my doctests because the repr of Python 3 bytes objects is different than the repr of 8-bit strings in Python 2.  When you print the object in Python 2, you get just the contents of the string, but when you print them in Python 3, you get the b''-prefix.

% python
Python 2.7.2+ (default, Dec 18 2011, 17:30:39)
[GCC 4.6.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> print b'foo'
foo
>>>
% python3
Python 3.2.2+ (default, Dec 19 2011, 12:03:32)
[GCC 4.6.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> print(b'foo')
b'foo'


This means your doctest cannot be written to easily support both versions of Python when bytes/8-bit strings are used.  I use the following helper to get around this:


def print_bytes(obj):
    if bytes is not str:
        obj = repr(obj)[2:-1]
    print(obj)


Remember that in Python 2, bytes is just an alias for str so this code only gets invoked in Python 3.

Another fun bytes/8-bit-string issue is that in Python 3, bytes objects have no .format() method.  So if you're doing something like b'foo {0}'.format(obj) this will work in Python 2, but fail in Python 3. The best I've come up with for this is to use concatenation instead, or do the format using unicodes and then encode them to their bytes object (but then you have the additional fun of choosing an appropriate encoding!).

Did you know that the re module can scan either unicodes or bytes in Python 3?  The switch is made by passing in either a bytes pattern or a str pattern, and then passing in the appropriate type of object to parse.  But, if you use the r''-prefix (i.e. raw strings) for saner handling of backslashes, you've got another problem when you want to parse bytes.  Python does not support rb''-prefixes, meaning you can have either raw string literals or bytes string literals but not both.  You have to forgo one or the other, and I usually come down on the side of ditching the raw strings and suffering the pain of backslash proliferation.

Some of the code I was porting was using itertools.izip_longest(), but this doesn't exist in Python 3.  Instead you have itertools.zip_longest().  You'll have to do a conditional import (i.e. try-except) around this to get the right version.

Do you use zope.interfaces?  You'll be interested to know that the syntax we've long been accustomed to for declaring that a class implements an interface does not work in Python 3.  For example:

from zope.interface import Interface, implements
class MyInterface(Interface):
    pass
class MyClass:
    implements(MyInterface)

This is because the stack hacking that implements() uses doesn't work in Python 3.  Fortunately, the latest version of zope.interface has a new class decorator that you can use instead.  This works in Python 2.6 and 2.7 too, so change your code to use this:

from zope.interface import Interface, implementer
class MyInterface(Interface):
    pass
@implementer(MyInterface)
class MyClass:
    pass

I kind of like the use of class decorators better anyway.

Here's a tricky one.  Did you know that Python 2 provides some codecs for doing interesting conversions such as Caeser rotation (i.e. rot13)?  Thus, you can do things like:

>>> 'foo'.encode('rot-13')
'sbb'

This doesn't work in Python 3 though, because even though certain str-to-str codecs like rot-13 still exist, the str.encode() interface requires that the codec return a bytes object. In order to use str-to-str codecs in both Python 2 and Python 3, you'll have to pop the hood and use a lower-level API, getting and calling the codec directly:

>>> from codecs import getencoder
>>> encoder = getencoder('rot-13')
>>> rot13string = encoder(mystring)[0]

You have to get the zeroth-element from the return value of the encoder because of the codecs API.  A bit ugly, but it works in both versions of Python.

That's all for now.  Happy porting!

Wednesday, December 7, 2011

Lessons in porting to Python 3

Yesterday, I completed my port of dbus-python to Python 3, and submitted my patch upstream.  While I've yet to hear any feedback from Simon about my patch, I'm fairly confident that it's going in the right direction.  This version should allow existing Python 2 applications to run largely unchanged, and minimizes the differences that clients will have to make to use the Python 3 version.

Some of the changes are specific to the dbus-python project, and I included a detailed summary of those changes and my rationale behind them.  There are lots of good lessons learned during this porting exercise that I want to share with you, have a discussion about, and see if there aren't things we core Python developers can do in Python 3.3 to make it even easier to migrate to Python 3.

First, some background.  D-Bus is a freedesktop.org project for same-system interprocess communication, and it's an essential component of any Linux desktop.  The D-Bus system and C API are mature and well-defined, and there are bindings available for many programming language, Python included of course.  The existing dbus-python package is only compatible with Python 2, and most recommendations are to use the Gnome version of Python bindings should you want to use D-Bus with Python 3.  For us in Ubuntu, this isn't acceptable though because we must have a solution that supports KDE and potentially even non-UI based D-Bus Python servers.  Several ports of dbus-python to Python 3 have been attempted in the past, but none have been accepted upstream, so naturally I took it as a challenge to work on a new version of the port.  After some discussion with the upstream maintainer Simon McVittie, I had a few requirements in mind:
  • One code base for both Python 2 and Python 3.  It's simply too difficult to support multiple development branches, so one branch must be compilable in both versions of Python.  Because dbus-python is not setuptools-based, I not to rely on 2to3 to auto-convert the Python layer.  This is more difficult, but given the next requirement, entirely possible.
  • Minimum Python versions to support are 2.6 and 3.2 (Python 2.7 is also supported).  Python 2.6 contains almost everything you need to do a high quality port of both the Python layer and the C extension layer with a single code base.  Python 2.7 has one or two additional helpers, but they aren't important enough to count Python 2.6 out.  For dbus-python, this specifically means dropping support for Python 2.5, which is more than 5 years old at the time of this writing.  Also, it makes no sense to support Python 3.0 or 3.1 as neither of those are in wide-spread use.
  • Minimize any API changes seen by Python 2 code, and minimize the changes needed to port clients to Python 3.  For the former, this means everything from keeping Python APIs unchanged to keeping the inheritance hierarchy the same.  Python 2 programs will see a few small changes after the application of my patches; I'll describe them below but they should be inconsequential for the vast majority of Python 2 applications.  While it's unavoidable that Python 3 applications will see a different API, these differences have been minimized.
There are two main issues that had to be sorted out for this port, and in general for most ports to Python 3: bytes vs. strings, and ints vs. longs.  For the latter, you probably know that where Python 2 has two integer types, Python 3 has only one. In Python 3, all integers are longs, and there is no L suffix for integer literals.  This turned out to be trickier in the dbus-python case because dbus supports a numeric stack of various integer widths, and in Python 2 these are implemented as subclasses of the built-in int and long types.  Because there are only longs in Python 3, the inheritance hierarchy a Python application will see changes between Python 2 and Python 3.  This is unavoidable.

I also made the decision to change some object types to longs in both versions of Python, where I thought it was highly unlikely that Python clients would care.  Specifically, many dbus objects have a variant_level attribute, which is usually zero, but can be any positive integer.  For implementation simplicity, I changed these to longs in Python 2 also.

Ah, bytes vs. strings is always where things get interesting when porting to Python 3.  It's the single most brain hurty exercise you will have to go through.  Remember that Python 2 lets you cheat.  If you not sure whether the entity you're dealing with is some bytes, or some (usually ASCII-encoded) string, just use a Python 2 str type (a.k.a. 8-bit string) and let Python's automatic conversion rules change it to a unicode when the two types meet.  You can't get away with this in Python 3 though, for very good reasons - it's error prone, and can lead to data corruption or the annoyingly ubiquitous and hard to predict UnicodeErrors.

In Python 3, you must be clear about what are bytes and what are strings (i.e. unicodes), and you must be explicit when converting between the two.  Yes, this can be painful at times but in my opinion, it's crucial that you do so.  It's that important to eliminate UnicodeErrors that you can't defend against and your users won't understand or be able to correct.  Once you're clear in your own mind as to which are strings and which are bytes, it's usually not that hard to reflect that clearly in your code, especially if you leave Python 2.5 and anything earlier behind, which I highly recommend.

dbus-python presented an interesting challenge here.  It has several data types in its C API that are defined as UTF-8 encoded char*'s.  At first blush, it seemed to me that these should be reflected in Python 3 as bytes objects to simplify the conversion in the extension module to and from char*'s.  It turns out that this was a bad idea from an implementation stand point, and dbus-python's upstream maintainer had already expressed his opinion that these data types should be exposed as unicodes in Python 3.  After having failed at my initial attempts at making them bytes, I now agree that they must be unicodes, both for implementation simplicity and for minimal impact on porting user code.

The biggest problem I ran into with the choice of bytes is that the callback dispatch code in dbus-python is complex, difficult to understand and debug, driven by external data, and written with a deep assumption of operating on strings.  For example, when the dbus C API receives a signal, it must determine whether there is a Python function registered to handle that signal, and it does this by comparing a number of client-registered parameters, such as the method name, the interface, and the object path.  If the dbus C API was turning these parameters into bytes, but the clients had registered strings, then the comparisons in the callback dispatch routines would fail, either loudly with an exception, or silently with failing comparisons.  The former were relatively easy to track down and fix, by explicitly decoding client-registered strings to bytes.  But the latter, silent failures, were nearly impossible to debug.  Add to that the fact that there were so many roads into the registration system, that it was also very difficult to coerce all incoming data early enough so that coercion wasn't necessary at comparison time.  I was left with the unappealing alternative of forcing all client code to also change their data from using strings to using bytes, which I realized would be much too high a burden on clients porting their applications to Python 3.  Simon was right, but it was a useful exercise to fail at anyway.

(By way of comparison, it took me the better part of a week and a half to try to get the test suite passing when these objects were bytes, which I was ultimately unable to do, and about a day to get them passing when everything was unicodes.  That's gotta tell you something right there, and hopefully not that "I suck" :).

Let's look at some practical advice that may help you in your own porting efforts.

  • Target nothing older than Python 2.6 or Python 3.2.  I mentioned this before, but it's really going to make your life easier.  Specifically, drop Python 2.5 and earlier and you will thank yourself[1].  If you absolutely cannot do this, consider waiting to port to Python 3.  Note that while Python 2.7 has a few additional conveniences for supporting both Python 2 and Python 3 in a single code base, I did not find them compelling enough to drop Python 2.6 support.
  • Where you have C types with reprs, make those reprs return unicodes in both versions.  Many dbus-python types have somewhat complicated reprs because they return different strings depending on whether their variant_levels are zero or non-zero.  #ifdef'ing all of these was just too much work. Because most code probably doesn't care about the specific type of the repr, and because Python 2 allows unicode reprs, and because I have a very clever hack for this[2], I decided to make all reprs return unicodes in both versions of Python.
  • Include the following __future__ imports in your Python code: print_function, absolute_import, and unicode_literals.  In Python 2.6 and 2.7, these enable features that are the default in Python 3, and so make it easier to support both with one codebase.  Specifically, change all your print statements to print() functions, and remove all your u'' prefixes from your unicode literals.  Be sure to b'' prefix all your byte literals[3].
  • Wherever possible, in your extension modules, change all your PyInts to PyLongs.  In dbus-python, this means that the variant_level attributes are longs in both Python versions, as are values that represent such things as UNIX file descriptors.  The only place where I kept PyInts in Python 2 (and their requisite #ifdefs to use PyLongs in Python 3) was in the numeric stack inheritance hierarchy, mostly so that Python 2 code which cares about such things would not have to change.
  • Define a Python variable and a C macro for determining whether you're running in Python 2 or Python 3.  The former is used in dbus-python because under Python 3, there is no UTF8String type any more, among other subtle differences.  The latter is used to simply the #ifdef tests where they're needed[4].
  • In your C code, #include <bytesobject.h> .  This header exposes aliases for all PyString calls so that you can use the Python 3 idiom of PyBytes.  Then globally replace all PyString_Foo() calls with PyBytes_Foo() and the code will look clean and be compilable under both versions of Python.  You may need to add explicit PyUnicode calls where you need to discern between bytes and strings, but again, this code will be completely portable between Python 2 and Python 3.
  • Try to write your functions to accept both unicodes and bytes, but always normalize them to one type or the other for internal use, and choose one or the other to return.  Some Python stdlib methods are polymorphic in that they return bytes when handed bytes, and unicodes when handed unicodes.  This can be convenient in some cases, but problematic in others.  Choose carefully when porting your APIs.
  • Don't use trailing-L long literals if you can help it.
  • Switch to using Py_TYPE() everywhere instead of de-references ob_type explicitly.  The structures are laid out differently between Python 2 and Python 3, and this Python-supplied macro hides the ugliness from you.
Here are a few other miscellaneous issues you should be aware of:

Metaclasses are defined differently in Python 2 and Python 3, and you cannot write any Python code snippet that is even compilable between the two.  That's because the syntax for defining a class that derives from a metaclass in Python 3 is illegal syntax in Python 2. Your module simply won't compile.  My solution was to use exec() on a string.  For this reason, I suggest keeping metaclass subclasses as simple as possible, so that string is nice and small.


Get rid of all your uses of iteritems(), iterkeys(), itervalues(), and xrange().  You probably don't need the optimization these provide, and they do not exist in Python 3.  You can conditionalize around them, but I think in most cases it's not worth it.  If you really need the optimization, then you'll have to figure out a way around the missing names in Python 3.  But note that Python 3 is already more efficient for the first three, since you get back dictview objects instead of concrete lists.


PyArg_Parse() and friends lack a 'y' code in Python 2.  In Python 3, these return bytes objects.  Where I absolutely needed bytes in Python 3 and strs in Python 2, I just #ifdef'd around the PyArg_Parse() calls.  In Python 3, there's no equivalent of 'z' for bytes objects (which accept Nones and set the output variable to NULL in that case).  If this is important to you, you might need to write an O& converter.


Watch out for next() vs. __next__() when writing iterators.  Python 2 uses the former while Python 3 uses the latter.  Best to define the method once, and then support compatibility via `next = __next__` in your class definition.


operator.isSequenceType() is gone in Python 3.  Here's the code I use for compatibility:


def is_sequence(obj):
    try:
        from collections import Sequence
    except ImportError:
        from operator import isSequenceType
        return operator.isSequenceType(obj)
    else:
        return isinstance(obj, Sequence)


If you by chance use PyCObjects in your extension module, you'll have to switch these to PyCapsules for Python 3.  If you're lucky enough to be able to drop Python 2.6, you can use PyCapsules everywhere, since they are available in Python 2.7.


Let me close by saying that you shouldn't be frightened off by the prospect either of porting your code to Python 3, or supporting both Python 2 and Python 3 in a single code base.  It's definitely doable, and we in the Python community are gaining more experience at it every day.  I strongly feel that we are well on the track of Guido's original goal of mainstream Python 3 acceptance within 5 years of Python 3's release.  I think we're soon going to see a critical mass of Python 3 ports, after which time, you'll just seem old and creaky if you don't port to Python 3.


There are some other excellent references for helping you port out there on the 'net, and for the most part, I've tried not to duplicate their information.  Here are some useful places to start:
Enjoy!

Footnotes:
[1] It is not impossible to support both Python 3 and versions of Python 2 earlier than 2.6, just more difficult.  Michael Foord has had success doing this for libraries of his such as mock.  I just think it's more trouble than it's worth in most cases.


[2] Here's the clever hack, but first a set-up.  The reprs of many of the dbus-python objects are conditional on whether the variant_level is zero or not.  The variant_level is only included in the repr when it is greater than zero (with zero being the typical value).  This just means there are usually two calls to PyUnicode_FromFormat() in each C repr implementation, and #ifdef'ing them to use PyString_FromFormat() in Python 2 would just double the pain.  In addition, the reprs all include the repr of their parent objects, i.e. their base class repr.  The problem is that these base-class reprs will be PyBytes in Python 2 and PyUnicodes in Python 3, and there's nothing we can do about that.  As it turns out, Python 2.6 and Python 3.2 have a %V format with some very interesting semantics.  %V consumes two arguments, a PyObject* and a char*, but it only uses one of them.  When the first argument is not NULL, it uses that and ignores the second argument.  But when the first argument is NULL, it will use the second argument.


How can this help produce portable code?  I define the following macro and use this everywhere the %V format code is given:
#define REPRV(obj) \
    (PyUnicode_Check(obj) ? (obj) : NULL), \
    (PyUnicode_Check(obj) ? NULL : PyBytes_AS_STRING(obj))
which would be used at a call site something like this:

return PyUnicode_FromFormat("...%V...", REPRV(parent_repr));

In Python 2, where parent_repr is a PyBytes, REPRV() will return NULL as the first argument, and via PyBytes_AS_STRING(), a char* in the second argument. In Python 3, where parent_repr is a PyUnicode, the first argument will just be the object and the second argument will be NULL (but it is ignored by Python).  As long as parent_repr is either a PyUnicode or a PyBytes (a.k.a. PyString), this works perfectly, and keeps the call sites simple and sane.  Beware though because if parent_repr can be any other type, this will crash your program.  Fortunately, Python doesn't allow for arbitrary repr types - they must be bytes or unicodes, so in practice this is pretty safe.


[3] A recent thread in python-dev points out that this recommendation may not be practical if you're building PEP 3333-compliant WSGI applications.  My take on it is that PEP 3333's definition of "native strings" is a mistake, but sadly one that we have to live with for now.


[4] Here's what my Python-level flag looks like:
import sys
is_py3 = getattr(sys.version_info, 'major', sys.version_info[0]) == 3
Now I can use this in other code to switch behavior between Python 2 and Python 3.  For example, in dbus-python to import the UTF8String type in Python 2 only:
from dbus import is_py3
if is_py3:
   from _dbus_bindings import UTF8String
This is much easier and less error prone then doing the sys.version_info test everywhere.  The other problem is that sys.version_info is a namedtuple only in Python 2.7, so in Python 2.6, it has no attribute called 'major'.


The C-level macro looks like this:
#if PY_MAJOR_VERSION >= 3
#define PY3K
#endif
 So now C code only needs to do:

#ifdef PY3K
/* Do something Python 3-ish */
#else
/* Do something Python 2-ish */
#endif

You might also find the six package to be useful here, at least for writing portable Python code.

Wednesday, November 23, 2011

An update on Ubuntu's Python plans

Earlier this month, I attended UDS-P (the Ubuntu Developers Summit for 12.04 Precise Pangolin).  12.04 is an LTS release, or Long Term Support, meaning it will be officially supported on both the desktop and server for five years.

During the summit we reiterated our plans for Python on Ubuntu, both for 12.04 LTS, and our vision of Python two years from now for the next LTS, 14.04.  I'm here to provide an update from my last report, which was written after UDS-O for 11.10.

As per those previous plans, we've removed Python 2.6 from Ubuntu 12.04.  Now you have only Python 2.7 and 3.2.  Dropping Python 2.6 may cause some inconvenience for data centers which like to upgrade only between LTS's but don't want to have to upgrade both their operation system and their Python version.  Canonical services such as Launchpad and Landscape fall into this camp.  The biggest problem is that Python 2.7 is not available for the last LTS, i.e. 10.04.  To mitigate this, we decided to create a PPA containing Python 2.7 and a bunch of packages that services such as Launchpad will need to do their porting.  This now exists, although it hasn't yet been tested with any development branches of Launchpad or Landscape as far as I'm aware.  If you have your own data center porting task ahead of you, you can also use this PPA, and if there are additional packages you need for 10.04, you can create your own PPA which depends on ours, and build those extra packages there.

But that's all boring stuff.  Let's have some fun!

The other decision we made concerns Python 3.  I strongly feel that the Python community is very close to the tipping point in Python 3 adoption.  Yes, there's still tons of code that needs porting, but we're seeing more and more activity every day.  Even large frameworks such as Twisted are working on Python 3 branches.  I wrote PEP 404 that (somewhat tongue in check) expresses official Python pronouncement on the migration path from Python 2.7 to Python 3.  There will not be a Python 2.8, so the time to begin porting your code is now.

And it's really not that difficult.  There are several great guides available to help you with porting your pure Python packages to Python 3, available on both python.org and python3porting.com.  There are packages such as six available to ease the trickier bits of porting, and I have another blog post coming that describes my experiences in porting dbus-python. This latter task is made more complex by being a fairly involved C extension module.  Again, both python.org and python3porting.com have useful guides for porting extension modules, although they have missed a few things.  I wrote a mailing list article on some of the things that I had to do, but the blog post will include more detail.

Other distros such as Fedora are also starting to work on porting essential packages to Python 3, and here's where free software and open source rules.  Our distros can work together to help nudge the Python community over that tipping point.  Fedora is working from a bottom-up approach, where they are porting packages as needed by request, and we in Ubuntu are working top-down, by identifying several desktop applications that we'll port, along with their dependency stack.  We'll both be pushing our changes to the upstream projects so every Python developer can benefit, no matter what platform they are on.

My long term vision is that Ubuntu 14.04 won't even have Python 2 on the CD images (if we even have CD images in two years), and we'll also move Python 2.7 off to universe.  It will never completely go away, but the operating system and all the applications you get in a default install will run on Python 3.  I'm much more sanguine about this goal now that I was even six months ago, but that's not to say it still won't be a lot of work.

The time is now to move to Python 3, if for no other reason than to fix all those pesky UnicodeErrors.  I think there's a lot more of value in Python 3 too of course, and Python 3.3 will be even more awesome.  I expect Python 3.3 or 3.4 to be in Ubuntu 14.04.

The place to be for all Python 3 porting efforts is the python-porting mailing list.

UPDATE: Fixed Python 3.3/3.4 plans to correctly read for Ubuntu 14.04 (thanks Nick!)

Wednesday, October 12, 2011

Python internationalization without exploding your brain

Today I helped a colleague out with a pesky internationalization (i18n) problem in a Python 2 application.  The problem was subtle enough that I thought I'd write this article to save others from the same brain exploding pain.

First, your Python application is multi-language friendly, right?  I mean, I'm as functionally monolinguistic as most Americans, but I love the diversity of languages we have in the world, and appreciate that people really want to use their desktop and applications in their native language.  Fortunately, it's not that hard to write good i18n'd Python code, and there are many tools available for helping volunteers translate your application, such as Pootle and Launchpad translations.

So there really is no excuse not to i18n your Python application.  In fact, GNU Mailman has been i18n'd for many years, and pioneered the supporting code in Python's standard library, namely the gettext module.  As part of the Mailman 3 effort, I've also written a higher level library called flufl.i18n which makes it even easier to i18n your application, even in tricky multi-language contexts such as server programs, where you might need to get a German translation and a French translation in one operation, then turn around and get Japanese, Italian,and English for the next operation.

In any case, my colleague was having a problem in a typically simple command line program.  What's common about these types of applications is that you fire them up once, they run then exit, and they only have to deal with one language during the entire execution of the program, specifically the language defined in the user's locale.  If you read the gettext module's documentation, you'd be inclined to do this at the very start of your application:

from gettext import gettext as _
gettext.textdomain(my_program_name)

then, you'd wrap translatable strings in code like this:

print _('Here is something I want to tell you')

What gettext does is look up the source string in a translation catalog, returning the text in the appropriate language, which will then be printed.  There are some additional details regarding i18n that I won't go into here.  If you're curious, ask in the comments, and I'll try to fill things in.

Anyway, if you do write the above code, you'll be in for a heap of trouble, as my colleague soon found out.  Just running his program with --help in a French locale, he was getting the dreaded UnicodeEncodeError:

"UnicodeEncodeError: 'ascii' codec can't encode character"

I've also seen reports of such errors when trying to send translated strings to a log file (a practice which I generally discourage, since I think log messages usually shouldn't be translated).  In any case, I'm here to tell you why the above "obvious" code is wrong, and what you should do instead.

First, why is that code wrong, and why does it lead to the UnicodeEncodeErrors?  What might not be obvious from the Python 2 gettext documentation is that gettext.gettext() always returns 8-bit strings (a.k.a. byte strings in Python 3 terminology), and these 8-bit strings are encoded with the charset defined in the language's catalog file.  Now, it's generally best practice in Python to always deal with human readable text using unicodes, converting any bytes to unicode as early as possible.  This is traditionally more problematic in Python 2, where English programs can cheat and use 8-bit strings and usually not crash, since their character range overlaps with ASCII and you only ever print to English locales, which are compatible with ASCII.  As soon as you use text that's not ASCII though, you're probably going to run into trouble.  By using unicodes everywhere, you can generally avoid such problems, and in fact it will make your life much easier when you eventually switch to Python 3.

So the 8-bit strings that gettext.gettext() hands you will really just hurt you, and to avoid the pain, you'd want to convert them back to unicodes before you print them to stdout or a log file.  However, converting to unicodes makes the i18n APIs much less convenient, so no one does it until there's way too much code to fix.

What you really want in Python 2 is something like this:

from gettext import ugettext as _

which you'd think you should be able to do, the "u" prefix meaning "give me unicode".  But for reasons I can only describe as based on our misunderstandings of unicode and i18n back in the days this module was originally written, you can't actually do that, because ugettext() is not exposed as a module-level function.  It is available in the class-based API, but that's a more advanced API that again almost no one uses.  Sadly, it's too late to fix this in Python 2.  The good news is that in Python 3 it is fixed, not by exposing ugettext(), but by changing the most commonly used gettext module APIs to return unicode strings directly, as it always should have done.  In Python 3, the obvious code just works:

from gettext import gettext as _

What can you do in Python 2 then?  Here's what you should use instead of the two lines of code at the beginning of this article:

_ = gettext.translation(my_program_name).ugettext

and now you can wrap all your translatable strings in _('Foo') and it should Just Work.  Or you can use the flufl.i18n API, which always uses ugettext and always returns unicode strings.

The one-liner above was in fact the solution my colleague implemented to fix his bug, so I didn't really help him much.  I just explained why his fix was the correct one and the original code was buggy.  At least now you know too!

(The fact that this works correctly in Python 3 is yet another reason to switch to Python 3!)

Aside: It's really fun to install a desktop in a language you cannot read.  Fortunately, French, like other Latin-drived languages, has enough familiarity that I can mostly get by with the standard Ubuntu installer.  It also doesn't hurt that I've done about a bazillion installs of Ubuntu Oneiric (due out tomorrow!) so I pretty much know what each screen and prompt means even without going to Google Translate.

Also interesting was that I could never reproduce the crash when ssh'd into the French locale VM. It would only crash for me when I was logged into a terminal on the VM's graphical desktop.  The only difference between the two that I could tell was that in the desktop's terminal, locale(8) returned French values (e.g. fr_FR.UTF-8) for everything, but in the ssh console, it returned the French values for everything except the LC_CTYPE environment variable.  For the life of me, I could not get LC_CTYPE set to anything other than en_US.UTF-8 in the ssh context, so the reproducible test case would just return the English text, and not crash.  This happened even if I explicitly set that environment variable either as a separate export command in the shell, or as a prefix to the normally crashing command.  Maybe there's something in ssh that causes this, but I couldn't find it.

Thursday, September 1, 2011

sbuild with local, newer, dependencies

sbuild is an excellent tool for locally building Ubuntu and Debian packages.  It fits into roughly the same problem space as the more popular pbuilder, but for many reasons, I prefer sbuild.  It's based on schroot to create chroot environments for any distribution and version you might want.  For example, I have chroots for Ubuntu Oneiric, Natty, Maverick, and Lucid, Debian Sid, Wheezy, and Squeeze, for both i386 and amd64.  It uses an overlay filesystem so you can easily set up the primary snapshot with whatever packages or prerequisites you want, and the individual builds will create a new session with an overlaid temporary filesystem on top of that, so the build results will not affect your primary snapshot.  sbuild can also be configured to save the session depending on the success or failure of your build, which is fantastic for debugging build failures.  I've been told that Launchpad's build farm uses a customized version of sbuild, and in my experience, if you can get a package to build locally with sbuild, it will build fine in the main archive or a PPA.

Right out of the box, sbuild will work great for individual package builds, with very little configuration or setup.  The Ubuntu Security Team's wiki page has some excellent instructions for getting started (you can stop reading when you get to UMT :).

One thing that sbuild doesn't do very well though, is help you build a stack of packages.  By that I mean, when you have a new package that itself has new dependencies, you need to build those dependencies first, and then build your new package based on those dependencies.  Here's an example.

I'm working on bug 832864 and I wanted to see if I could build the newer Debian Sid version of the PySide package.  However, this requires newer apiextractor, generatorrunner, and shiboken packages (and technically speaking, debhelper too, but I'm working around that), so you have to arrange for the chroot to have those newer packages when it builds PySide, rather than the ones in the Oneiric archive.  This is something that PPAs do very nicely, because when you build a package in your PPA, it will use the other packages in that PPA as dependencies before it uses the standard archive.  The problem with PPAs though is that when the Launchpad build farm is overloaded, you might have to wait several hours for your build.  Those long turnarounds don't help productivity much. ;)

What I wanted was something like the PPA dependencies, but with the speed and responsiveness of a local build.  After reading the sbuild manpage, and "suffering" through a scan of its source code (sbuild is written in Perl :), I found that this wasn't really supported by sbuild.  However, sbuild does have hooks that can run at various times during the build, which seemed promising.  My colleague Kees Cook was a contributor to sbuild, so a quick IRC chat indicated that most people create a local repository, populating it with the dependencies as you build them.  Of course, I want to automate that as much as possible.  The requisite googling found a few hints here and there, but nothing to pull it all together.  With some willful hackery, I managed to get it working.

Rather than post some code that will almost immediately go out of date, let me point you to the bzr repository where you can find the code.  There are two scripts: prep.sh and scan.sh, along with a snippet for your ~/.sbuildrc file to make it even easier.  sbuild will call scan.sh first, but here's the important part: it calls that outside the chroot, as you (not root). You'll probably want to change $where though; this is where you drop the .deb and .dsc files for the dependencies.  Note too, that you'll need to add an entry to your /etc/schroot/default/fstab file so that your outside-the-chroot repo directory gets mapped to /repo inside the chroot.  For example:
# Expose local apt repository to the chroot
/home/barry/ubuntu/repo    /repo    none   rw,bind  0 0
An apt repository needs a Packages and Packages.gz file for binary packages, and a Sources and Sources.gz file for the source packages.  Secure APT also requires a Release and Release.gpg file signed with a known key.  The scan.sh file sets all this up, using the apt-ftparchive command.  The first apt-ftparchive call creates the Sources and Sources.gz file.  It scans all your .dsc files and generates the proper entries, then creates a compressed copy, which is what apt actually "downloads".  The tricky thing here is that without changing directories before calling apt-ftparchive, your outside-the-chroot paths will leak into this file, in the form of Directory: headers in Sources.gz.  Because that path won't generally be available inside the chroot, we have to get rid of those headers.  I'm sure there's an apt-ftparchive option to do this, but I couldn't find it.  I accidentally discovered that cd'ing to the directory with the .dsc files was enough to trick the command into omitting the Directory: headers.

The second call to apt-ftparchive creates the Packages and Packages.gz files.  As with the source files, we get some outside-the-chroot paths leaking in, this time as path prefixes to the Filename: header value.  Again, we have to get rid of these prefixes, but cd'ing to the directory with the .deb files doesn't do the trick.  No doubt there's some apt-ftparchive magical option for this too, but sed'ing out the paths works well enough.

The third apt-ftparchive file creates the Release file.  I shameless stole this from the security team's update_repo script.  The tricky part here is getting Release signed with a gpg key that will be available to apt inside the chroot.  sbuild comes with its own signing key, so all you have to do is specify its public and private keys when signing the file.  However, because the public file from
/var/lib/sbuild/apt-keys/sbuild-key.pub
won't be available inside the chroot, the script copies it to what will be /repo inside the chroot.  You'll see later how this comes into play.

Okay, so now we have the repository set up well enough for sbuild to carry on.  Later, before the build commences, sbuild will call prep.sh, but this script gets called inside the chroot, as the root user.  Of course, at this point /repo is mounted in the chroot too.  All prep.sh needs to do is add a sources.list.d entry so apt can find your local repository, and it needs to add the public key of the sbuild signing key pair to apt's keyring.  After it does this, it needs to do one more apt-get update.  It's useful to know that at the point when sbuild calls prep.sh, it's already done one apt-get update, so this does add a duplicate step, but at least we're fortunate enough that prep.sh gets called before sbuild installs all the build dependencies.  Once prep.sh is run, the chroot will have your overriding dependent packages, and will proceed with a normal build.

Simple, huh?

Besides getting rid of the hackery mentioned above, there are a few things that could be done better:
  • Different /repo mounts for each different chroot
  • A command line switch to disable the /repo
  • Automatically placing .debs into the outside-the-chroot repo directory

Anyway, it all seems to hang together.  Please let me know what you think, and if you find better workarounds for the icky hacks.