2014/03/15

Faster Python deployments with wheels

One of the most annoying issues I have with the Python packaging system is the time it takes to deploy any non-trivial app. Recent projects I was working on have a long list of packages they depend on,
which in turn have their own dependencies. This is typically specified in a requirements.txt file that can be processed by pip (pip install -r requirements.txt), and which may look like this:

django==1.6
djangorestframework==2.3.10
psycopg2==2.5.2
south==0.8.4
...

(small tip: if you want to quickly discover the latest version of a package, use yolk).

Such a list tends to grow with your project, and it's hard to ever remove anything from it.
The main problems with the way pip handles it are:

  1. Pip processes it sequentially, so your 16 cores and your network pipe are underutilised, and all download times just add up.
  2. Compilation of complex extensions takes forever (and Amazon micro instances require setting up swap to even be able to do that).

There are a few ways to deal with this problem. You can start by setting up a download cache for pip,
which obviously will help with download times. You can create and reuse a single environment, which will store all packages, so that pip only installs or updates packages that were changed or added. This approach generally works, but once in a while an update goes wrong and you may spend a long time trying to figure out how to fix it, so I prefer to build a fresh environment every time. Or you can invent your own way of doing it. Either way, until very recently setting up deployment properly required a certain amount of tinkering with the way Python packages are built and deployed. (Well, it still does, but the amount has been greatly reduced.)
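
As a rough sketch of the download cache option: pip 1.x accepts a --download-cache flag (there is also a PIP_DOWNLOAD_CACHE environment variable). The helper name and the cache path below are made up for illustration, and newer pip releases handle caching differently, so check what your pip version supports before relying on this.

```python
import os


def pip_install_command(req_file, cache_dir):
    """Build a 'pip install' command line that uses a download cache.

    cache_dir is a hypothetical location; --download-cache is the pip 1.x flag.
    """
    return ['pip', 'install',
            '--download-cache', os.path.expanduser(cache_dir),
            '-r', req_file]


# Example path only; pick whatever directory suits your build host.
cmd = pip_install_command('requirements.txt', '~/.pip/download-cache')
```

The resulting list can be passed to subprocess.call, the same way the script further down calls pip.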

So if you hate the wasted time and complexity introduced by compiling C extensions on every host, you will (almost) love wheel. Wheel is a new format for storing and deploying Python packages, and its main advantage is that it can include compiled code. So it's finally possible to compile all packages on a build machine
and easily deploy them in binary form to all target hosts. This is still a bit bleeding edge, as only the recently released pip 1.5.4 fixed a bug related to downloading dependencies that was making wheels practically useless.

It is, however, working properly now, so let's enter the brave new world:

mkdir test && cd test
virtualenv .
. ./bin/activate
pip install wheel
pip install --upgrade 'pip>=1.5.4'
mkdir wheels

and finally the most important bit:

pip wheel --wheel-dir wheels -r requirements.txt

(one more tip: with recent versions of pip and certain packages you may run into problems with pip refusing to download externally hosted files. In that case you may want to add them as exceptions with the --allow-external and --allow-unverified flags).

This will create wheels for all required packages (and their dependencies), which can be distributed with your app (at least to machines with the same architecture/OS/library versions, which is all I care about).
The only issue I have is that, for reasons I completely don't understand, the pip wheel command
does not use the wheel directory as a cache and builds everything from scratch every time. Sequentially, of course.
So just putting it into a deployment script will still result in a great amount of wasted time.
Luckily this simple script solves the problem:

$ cat build_new_wheels.py

#!/usr/bin/env python
"""
Obtain packages listed in requirement file
and download/build wheels for them as needed

USAGE:

build_new_wheels.py WHEEL_DIR REQUIREMENTS_FILE

"""
import os
import sys
import subprocess

def check_wheel(pkg, ver, wheels):
    """
    Check if there is a wheel for the given pkg/version. Note that the Python
    version and arch are ignored here, so it will break if you mix them.
    """
    # Wheel file names use underscores where the package name has hyphens.
    _pkg = pkg.lower().replace('-', '_')
    # The trailing '-' keeps e.g. 'django' from matching 'djangorestframework'.
    s = _pkg + '-'
    if ver:
        s = '{0}-{1}-'.format(_pkg, ver)
    for wheel in wheels:
        if wheel.lower().startswith(s):
            return True
    return False


WHEEL_DIR = sys.argv[-2]
WHEELS = os.listdir(WHEEL_DIR)
REQ_FILE = sys.argv[-1]
PACKAGES = []
with open(REQ_FILE) as f:
    lines = f.readlines()
for line in lines:
    line = line.strip()
    # Skip blank lines and comments.
    if line and not line.startswith('#'):
        if '==' in line:
            pkg, ver = line.split('==', 1)
        else:
            pkg, ver = line, None
        PACKAGES.append((pkg, ver))

for pkg, ver in PACKAGES:
    # Only build wheels that are not already present in WHEEL_DIR.
    if not check_wheel(pkg, ver, WHEELS):
        pkg_spec = pkg
        if ver:
            pkg_spec = '{0}=={1}'.format(pkg, ver)
        print('building {0}'.format(pkg_spec))
        exit_code = subprocess.call(['pip', 'wheel', '--wheel-dir', WHEEL_DIR, pkg_spec])
        if exit_code != 0:
            sys.stderr.write('Error building wheel for {0}\n'.format(pkg_spec))
            sys.exit(1)

# Install everything from the local wheel directory only, without using the index.
exit_code = subprocess.call(['pip', 'install', '--no-index', '--find-links', WHEEL_DIR, '-r', REQ_FILE])
if exit_code != 0:
    print('pip exited with non-zero exit code')
    sys.exit(1)

You can use it simply by specifying wheel directory and requirement file:

$ ./build_new_wheels.py wheels requirements.txt

and it will only build wheels that don't exist in the wheels directory. Note that this script is rather a proof of concept and does not support all features that can be used in a requirements file or all wheel options (the >= operator, separation of various Python versions, git or file repositories, mixing Python versions),
but it allows me to build wheels only for packages that were introduced or changed since the last build.
Parallel processing could also be easily added here thanks to the multiprocessing module.
I would really like to see this (or similar) behaviour added to pip, as that would finally make it fully usable without custom work.
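
A minimal sketch of the parallel idea, assuming the same package==version strings that the script above passes to pip wheel. It uses the thread-backed pool from the multiprocessing package, since each worker only waits on a pip subprocess. The function names, the workers default and the injectable runner argument are assumptions made for illustration, and concurrent pip wheel runs that share dependencies may step on each other in the same wheel directory, which this sketch does not handle.

```python
import subprocess
from multiprocessing.pool import ThreadPool


def build_wheel(args):
    """Build one wheel and return (pkg_spec, exit_code)."""
    pkg_spec, wheel_dir, runner = args
    cmd = ['pip', 'wheel', '--wheel-dir', wheel_dir, pkg_spec]
    return pkg_spec, runner(cmd)


def build_wheels_parallel(pkg_specs, wheel_dir, workers=4,
                          runner=subprocess.call):
    """Build wheels concurrently; return the specs whose build failed.

    runner is a hypothetical hook so the pip call can be replaced in tests.
    """
    pool = ThreadPool(workers)
    try:
        jobs = [(spec, wheel_dir, runner) for spec in pkg_specs]
        # map() keeps the results in the same order as pkg_specs.
        results = pool.map(build_wheel, jobs)
    finally:
        pool.close()
        pool.join()
    return [spec for spec, code in results if code != 0]
```

Calling build_wheels_parallel(['django==1.6', 'psycopg2==2.5.2'], 'wheels') would start both builds at once and return an empty list if both succeed.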

If you want to know more about the wheel format, go right here: http://wheel.readthedocs.org/en/latest/
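
As a small taste of the format: the script above ignores the Python version and architecture, but wheel file names carry that information. Per PEP 427 a wheel is named distribution-version(-build)-python-abi-platform.whl. The sketch below splits a name into those parts; the function and tuple names are my own, and the file names mentioned afterwards are only examples.

```python
from collections import namedtuple

WheelName = namedtuple(
    'WheelName', 'distribution version build python abi platform')


def parse_wheel_filename(filename):
    """Split a wheel file name into its PEP 427 components."""
    if not filename.endswith('.whl'):
        raise ValueError('not a wheel: {0}'.format(filename))
    parts = filename[:-len('.whl')].split('-')
    if len(parts) == 5:
        # The build tag is optional; mark it as absent.
        parts.insert(2, None)
    elif len(parts) != 6:
        raise ValueError('unexpected wheel name: {0}'.format(filename))
    return WheelName(*parts)
```

For a name like psycopg2-2.5.2-cp27-none-linux_x86_64.whl this yields the python tag cp27 and the platform tag linux_x86_64, which is what a stricter version of check_wheel could compare against the target host.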

UPDATE: Work on caching wheels is happening here: https://github.com/pypa/pip/pull/1572
