Wednesday, January 18, 2012

Debian packaging for Python 2 and 3

Time for another installment of my ongoing mission to convert the world to Python 3!  This time, a little Debian packaging-fu for modifying an existing Python 2 package to include support for Python 3 from the same source package.

Today, I added a python3-feedparser package to Ubuntu Precise.  What's interesting about this is that, despite various reported problems, upstream feedparser 5.1 claims to support Python 3, via 2to3 conversion.  And indeed it does (although the test suite does not).

Before today, Ubuntu had feedparser 5.0.1 in its archive, and while some work has been done to update the Debian package to 5.1, this has not been released.  The uninteresting precursor to Python 3 packaging was to upgrade the Ubuntu version of the python-feedparser source package to 5.1.  I'll spare you the boring details about missing data files in the upstream tarball, and other problems, since they don't really relate to the Python 3 effort.

The first step was to verify that feedparser 5.1 works with Python 3.2 in a virtualenv, and indeed it does.  This is good news because it means that the setup.py does the right thing, which is always the best way to start supporting Python 3.  I've found that it's much easier to build a solid Debian package if you have a solid setup.py in upstream to begin with.

Now, what I'd like to do is to give you a recipe for modifying your existing debian/ directory files to add Python 3 support to a package that already exists for Python 2.  This is a little trickier for feedparser because it used an older debhelper standard, and carried some crufty old stuff in its rules file.  My first step was to update this to debhelper compatibility level 8 and greatly simplify the debian/rules file.  Here's what it might have looked like with just Python 2 support, so let's start there.


#!/usr/bin/make -f
export DH_VERBOSE=1

%:
    dh $@ --with python2

override_dh_auto_clean:
    dh_auto_clean
    rm -rf build *.egg-info

override_dh_auto_test:
ifeq (,$(filter nocheck,$(DEB_BUILD_OPTIONS)))
    cd feedparser && python ./feedparsertest.py
else
    @echo "nocheck set, not running tests"
endif

override_dh_installdocs:
    dh_installdocs -Xtests

This is all pretty standard stuff.  dh_python2 is used (the --with python2 option to dh), and we just provide a couple of overrides for idiosyncrasies in the feedparser package.  We clean a couple of extra things that aren't cleaned automatically, and we run the test suite in the slightly non-standard way that upstream requires.  Also, we override the installation of a huge number of test files that would otherwise get installed as documentation (they aren't docs).

So far so good.  What do we have to do to add support for Python 3?

First, we need to make a few modifications to the debian/control file.  The current convention with dh_python2 is to use an X-Python-Version header in the source package stanza, so we just need to add this header to the same stanza for Python 3:

X-Python3-Version: >= 3.2

This just says we support any Python 3 version from 3.2 onwards.  You also need to add a few additional packages to the Build-Depends.  In the feedparser case, I added the following build dependencies: python3, python3-chardet, python3-setuptools.  Even though for Python 2 there are a couple of other build dependencies (e.g. python-libxml2 and python-utidylib) these aren't available for Python 3, but lucky for us, they are optional anyway.

Next, you need to add a new binary package stanza.  There was already a python-feedparser binary package stanza for Python 2 support.  In Debian, Python 3 is provided as a separate stack, meaning packages for Python 3 will always start with the python3- prefix.  Thus, it is pretty easy to just copy the python-feedparser stanza and paste it to the bottom of debian/control, changing the package name to python3-feedparser.  You have to update the Depends line to use ${python3:Depends} and I updated the Recommends line to name python3-chardet, and that was about it.  Here's what the new stanza looks like:


Package: python3-feedparser
Architecture: all
Depends: ${misc:Depends}, ${python3:Depends}
Recommends: python3-chardet
Description: Universal Feed Parser for Python
 Python module for downloading and parsing syndicated feeds. It can
 handle RSS 0.90, Netscape RSS 0.91, Userland RSS 0.91, RSS 0.92, RSS
 0.93, RSS 0.94, RSS 1.0, RSS 2.0, Atom, and CDF feeds.
 .
 It provides the same API to all formats, and sanitizes URIs and HTML.
 .
 This is the Python 3 version of the package.

Again, so far so good.  Now let's look at the debian/rules file.

The first thing to do is to add support for dh_python3, which is analogous to dh_python2, and is the only accepted helper for Python 3.  The rules line then becomes:


%:
    dh $@ --with python2,python3

Now, one problem with debhelper is that it doesn't have any built-in support for Python 3 like it does for Python 2.  This means dh will not automatically build or install any Python 3 packages, so you have to do this manually.  Eventually, this will be fixed, and fortunately with a solid setup.py file, you don't have to do too much, but it's something to be aware of.  In the feedparser case, we need to add overrides for dh_auto_build and dh_auto_install.  Here's what these rules look like:


override_dh_auto_build:
    dh_auto_build
    set -ex; for python in $(shell py3versions -r); do \
        $$python setup.py build; \
    done;

override_dh_auto_install:
    dh_auto_install
    set -ex; for python in $(shell py3versions -r); do \
        $$python setup.py install --root=$(CURDIR)/debian/tmp --install-layout=deb; \
    done;
    cp feedparser/sgmllib3.py $(CURDIR)/debian/tmp/usr/lib/python3/dist-packages/feedparser_sgmllib3.py

Not too bad, eh?  You'll notice that the first thing these rules do is call the standard dh_auto_build and dh_auto_install respectively.  This preserves the Python 2 support.  Then we just loop over all the available Python 3 versions, doing a fairly normal equivalent of setup.py install (split into a build step and an install step).  The install rule looks a little odd, but should be familiar to Debian Python hackers.  It just installs the package into the proper Debian locations, and will pretty much be the same for any Python 3 package you build.

The one odd bit is the last line in the override_dh_auto_install rule.  This is there just to work around a peculiarity in the feedparser 5.1 upstream package, where it depends on sgmllib.py, but that is no longer in the Python standard library in Python 3.  Upstream provides an already 2to3 converted version of it, and recommends you install the module as sgmllib.py somewhere on your Python 3 sys.path.  Well, I don't like the namespace pollution that would cause, so I install the file as feedparser_sgmllib3.py and add a quilt patch to the package to try an import of that module if importing sgmllib fails (as it will on Python 3).
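
The quilt patch itself isn't reproduced here, but the fallback it describes can be sketched roughly as follows.  The import_first() helper is hypothetical (it is not part of feedparser), it assumes importlib is available (Python 2.7, or 3.1 and later), and feedparser_sgmllib3 is only importable once the package has installed it as described above:

```python
import importlib


def import_first(*names):
    """Return the first module in `names` that can be imported.

    Hypothetical helper for illustration; not part of feedparser.
    """
    for name in names:
        try:
            return importlib.import_module(name)
        except ImportError:
            continue
    raise ImportError(
        'none of these modules could be imported: ' + ', '.join(names))


try:
    # Python 2 finds sgmllib in the standard library; on Python 3 the
    # renamed copy installed by the package is tried second.
    sgmllib = import_first('sgmllib', 'feedparser_sgmllib3')
except ImportError:
    # Neither module is available in this environment.
    sgmllib = None
```

Trying the standard library name first means Python 2 behaves exactly as before, and only Python 3 ever touches the renamed module.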

An aside: If you look in the debian/rules file for what I actually uploaded, you'll see some additional modifications to override_dh_auto_test.  This just works around the upstream bug where some test suite data files were accidentally omitted from the release tarball.  You can pretty much ignore those lines for the purposes of this article.

We're almost done.  The last thing we need to do is make sure that debhelper installs the right files into the right binary packages.  We want the python-feedparser binary package to include only the Python 2 files, and the python3-feedparser binary package to include only the Python 3 files.  Keep in mind that when a source package builds only a single binary package (as was the case before I added Python 3 support), debhelper will include everything under the build directory's debian/tmp subdirectory in the single binary package.  That's why you see things get installed into $(CURDIR)/debian/tmp.  But when a source package builds multiple binary packages, as is now the case here, we have to tell debhelper which files go into which binary packages.  We do this by adding two new files to the debian directory: python-feedparser.install and python3-feedparser.install.

Reading the manpage for dh_install will explain the reasons for this, and describe the format of the file contents.  In our case, we're really lucky, because for Python 2, everything gets installed under usr/lib/python2.* and in Python 3, everything gets installed under usr/lib/python3 (relative to $(CURDIR)/debian/tmp).  You'll notice a few things here.  Because we could be building for multiple versions of Python 2, we have to wildcard the actual directory under usr/lib, e.g. it might be python2.6 or python2.7.  But because we have PEP 3147 and PEP 3149 in Python 3.2, there's only one directory for all supported versions of Python 3, so we don't need to wildcard the subdirectory.  Also, if you look at the actual .install files in the package, you'll see a few other trailing path components, so the actual contents of the files are:


usr/lib/python2.*/*-packages/*

and

usr/lib/python3/*-packages/*

for the python-feedparser.install and python3-feedparser.install files respectively.  The trailing bits just wildcard what on a Debian system will always be dist-packages, just for safety (cargo culting FTW!).

And that really is it!  Of course, things could be a little more complicated if you have extension modules, but maybe not that much more so, and if the package you're adding Python 3 support to isn't setuptools-based, you may have more work to do even still.  The feedparser package has a few other oddities that are really unrelated to adding Python 3 support, so I'm ignoring them here, but feel free to ask for additional details in the comments, in IRC, or in email.

Hopefully this gives you some insight into how to extend an existing Python 2 Debian package into including Python 3 support, given that your upstream already supports Python 3.  Now, go forth and hack!

Addendum: my colleague Colin Watson just today packaged up Benjamin Peterson's very fine Python package called six.  This is a nice package that provides some excellent Python 2 and 3 compatibility utilities.  You may find this helpful if you're trying to support both Python 2 and Python 3 in a single code base, especially if you have to support back to Python 2.4 (poor you :).  This will be available in Ubuntu Precise, although if you're submitting patches back upstream, you may have to convince the upstream author to accept the additional dependency.  It's worth it to add a little more Python 3 love to the world.

Friday, January 6, 2012

Python 3 Porting Fun Redux

My last post on Python 3 porting got some really great responses, and I've learned a lot from the feedback I've seen.  I'm here to rather briefly outline a few additional tips and tricks that folks have sent me and that I've learned by doing other ports since then.  Please keep them coming, either in the blog comments or to me via email.  Or better yet, blog about your experiences yourself and I'll link to them from here.

One of the big lessons I'm trying to adopt is to support Python 3 in pure-Python code with a single code base.  Specifically, I'm trying to avoid using 2to3 as much as possible.  While I think 2to3 is an excellent tool that can make it easier to get started supporting both Python 2 and Python 3 from a single branch of code, it does have some disadvantages.  The biggest problem with 2to3 is that it's slow; it can take a long time to slog through your Python code, which can be a significant impediment to your development velocity.  Another 2to3 problem is that it doesn't always play nicely with other development tools, such as python setup.py test and virtualenv, and you occasionally have to write additional custom fixers for conversion that 2to3 doesn't handle.

Given that almost all the code I'm writing these days targets Python 2.6 as the minimal supported Python 2 version, 2to3 may just be unnecessary.  With my dbus-python port to Python 3, and with my own flufl packages, I'm experimenting with ignoring 2to3 and trying to write one code base for all of Python 2.6, 2.7, and 3.2.  My colleague Michael Foord has been pretty successful with this approach going back all the way to Python 2.4, so 2.6 as a minimum should be no problem!  C extensions are pretty easy because you have the C preprocessor to help you.  But it turns out that it's usually not too difficult in pure-Python either.  I've done this in my latest release of the flufl.bounce package, and intend to eliminate 2to3 in my other flufl packages soon too.

The first thing I've done is add print_function to the __future__ import in all my modules.  Previously, I was only importing unicode_literals and absolute_import.  But doctests tend to use a lot of print statements, so switching to the print() function explicitly removes one big 2to3 conversion.  Aside from having to unlearn decades of print statement muscle memory, the print() function is actually rather nice.  So my module template now looks like this (with the copyright comment block omitted):

from __future__ import absolute_import, print_function, unicode_literals

__metaclass__ = type
__all__ = [
    ]

Speaking of doctests, you really want them to have the same set of future imports as all your other code.  I'll talk more about how my own packages set up doctests later, but for now, it's useful to know that I create a doctest.DocFileSuite for every doctest in my package.  These suites all have a setup() function and Python's testing framework will call these at the appropriate time, passing in a testobj parameter.  This argument has a globs attribute which serves as the module globals for the doctest.  All you need to do to enable the future imports in your doctests is to do something like this:

def setup(testobj):
    try:
        testobj.globs['absolute_import'] = absolute_import
        testobj.globs['print_function'] = print_function
        testobj.globs['unicode_literals'] = unicode_literals
    except NameError:
        pass

The try-except really is only necessary if you keep using 2to3, since that tool will remove the future imports from all the modules it processes.  The future imports still exist in Python 3 of course, since future imports are never, ever removed.  So if you ditch 2to3, you can get rid of the try-except too.

In the latest release of flufl.bounce, I changed the API so that the detected email addresses are all explicitly bytes objects in Python 3 (and 8-bit strings in Python 2).  This caused some problems with my doctests because printing a bytes object in Python 3 gives different output than printing an 8-bit string in Python 2.  When you print the object in Python 2, you get just the contents of the string, but when you print it in Python 3, you get its repr, including the b''-prefix and the quotes.

% python
Python 2.7.2+ (default, Dec 18 2011, 17:30:39)
[GCC 4.6.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> print b'foo'
foo
>>>
% python3
Python 3.2.2+ (default, Dec 19 2011, 12:03:32)
[GCC 4.6.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> print(b'foo')
b'foo'


This means your doctest cannot be written to easily support both versions of Python when bytes/8-bit strings are used.  I use the following helper to get around this:


def print_bytes(obj):
    if bytes is not str:
        obj = repr(obj)[2:-1]
    print(obj)


Remember that in Python 2, bytes is just an alias for str, so the repr slicing only happens in Python 3.  The slice strips the leading b' and the trailing quote.

Another fun bytes/8-bit-string issue is that in Python 3, bytes objects have no .format() method.  So if you're doing something like b'foo {0}'.format(obj) this will work in Python 2, but fail in Python 3. The best I've come up with for this is to use concatenation instead, or do the format using unicodes and then encode them to their bytes object (but then you have the additional fun of choosing an appropriate encoding!).
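
Both workarounds are sketched below.  The format_as_bytes() helper is a hypothetical name, and the ASCII encoding is purely an assumption for the example; pick whatever encoding fits your data:

```python
from __future__ import print_function, unicode_literals


def format_as_bytes(template, *args):
    """Format as text first, then encode the result to bytes.

    ASCII is an assumption made for this sketch; choose the encoding that
    matches your data.
    """
    return template.format(*args).encode('ascii')


# Option 1: plain concatenation of bytes objects.
joined = b'anne' + b'@' + b'example.com'

# Option 2: format using unicodes, then encode to bytes.
formatted = format_as_bytes('{0}@{1}', 'anne', 'example.com')
```

Both spellings produce the same bytes object, and both run unchanged on Python 2.6, 2.7, and 3.2.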

Did you know that the re module can scan either unicodes or bytes in Python 3?  The switch is made by passing in either a bytes pattern or a str pattern, and then passing in the appropriate type of object to parse.  But, if you use the r''-prefix (i.e. raw strings) for saner handling of backslashes, you've got another problem when you want to parse bytes.  Python 3.2 does not support the rb''-prefix.  The reversed spelling, br'', is accepted by Python 2.6, 2.7, and 3.2, so put the b first if you want a literal that is both raw and bytes.  Otherwise you have to forgo one or the other, ditching the raw strings and suffering the pain of backslash proliferation.
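
Here's a small sketch of the str-versus-bytes switch on Python 3.  It sticks to non-raw literals with doubled backslashes, and the address-matching pattern is invented for the example:

```python
import re

# The same pattern written twice: once as text, once as bytes.  Neither is a
# raw literal, so each backslash is doubled.
text_pattern = re.compile('(\\w+)@([\\w.]+)')
bytes_pattern = re.compile(b'(\\w+)@([\\w.]+)')

# Each pattern scans only its own type of object.
text_match = text_pattern.search('From: anne@example.com')
bytes_match = bytes_pattern.search(b'From: anne@example.com')

try:
    # On Python 3, a str pattern cannot scan a bytes object.
    text_pattern.search(b'From: anne@example.com')
    mixed_ok = True
except TypeError:
    mixed_ok = False
```

The groups come back in the same type as the pattern, so bytes_match.group(1) is a bytes object while text_match.group(1) is a str.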

Some of the code I was porting was using itertools.izip_longest(), but this doesn't exist in Python 3.  Instead you have itertools.zip_longest().  You'll have to do a conditional import (i.e. try-except) around this to get the right version.
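
A minimal sketch of that conditional import, binding whichever name exists to zip_longest so the rest of the module doesn't care which Python it's running on:

```python
try:
    # Python 3 name.
    from itertools import zip_longest
except ImportError:
    # Python 2 fallback: same function, older name.
    from itertools import izip_longest as zip_longest

# Shorter inputs are padded with the fill value.
pairs = list(zip_longest('ab', 'xyz', fillvalue='-'))
# pairs == [('a', 'x'), ('b', 'y'), ('-', 'z')]
```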

Do you use zope.interface?  You'll be interested to know that the syntax we've long been accustomed to for declaring that a class implements an interface does not work in Python 3.  For example:

from zope.interface import Interface, implements
class MyInterface(Interface):
    pass
class MyClass:
    implements(MyInterface)

This is because the stack hacking that implements() uses doesn't work in Python 3.  Fortunately, the latest version of zope.interface has a new class decorator that you can use instead.  This works in Python 2.6 and 2.7 too, so change your code to use this:

from zope.interface import Interface, implementer
class MyInterface(Interface):
    pass
@implementer(MyInterface)
class MyClass:
    pass

I kind of like the use of class decorators better anyway.

Here's a tricky one.  Did you know that Python 2 provides some codecs for doing interesting conversions such as Caesar rotation (i.e. rot13)?  Thus, you can do things like:

>>> 'foo'.encode('rot-13')
'sbb'

This doesn't work in Python 3 though, because even though certain str-to-str codecs like rot-13 still exist, the str.encode() interface requires that the codec return a bytes object. In order to use str-to-str codecs in both Python 2 and Python 3, you'll have to pop the hood and use a lower-level API, getting and calling the codec directly:

>>> from codecs import getencoder
>>> encoder = getencoder('rot-13')
>>> rot13string = encoder(mystring)[0]

You have to get the zeroth element from the return value of the encoder because the codecs API returns a tuple of the encoded object and the length of input consumed.  A bit ugly, but it works in both versions of Python.
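
Wrapped up as a tiny helper (rot13() is just a name chosen for this sketch), the same low-level calls look like this:

```python
from codecs import getencoder


def rot13(text):
    """Apply the rot-13 str-to-str codec via the low-level codecs API."""
    encoder = getencoder('rot-13')
    # The encoder returns (output, length consumed); keep only the output.
    return encoder(text)[0]


encoded = rot13('foo')
# encoded == 'sbb'
```

Since rot13 is its own inverse, calling rot13() on the result gives back the original string.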

That's all for now.  Happy porting!