me@machine: ~/glusterfs/ziptest$ dd if=/dev/zero count=1 of=test.dat
1+0 records in
1+0 records out
512 bytes (512 B) copied, 0.000357 seconds, 1.4 MB/s
me@machine: ~/glusterfs/ziptest$ zip test.zip test.dat
  adding: test.dat (deflated 98%)
me@machine: ~/glusterfs/ziptest$ ls -lh
total 16K
-rw-r--r-- 1 kml kml 512 Oct 6 13:47 test.dat
-rw-r--r-- 1 kml kml  67 Oct 6 13:58 test.zip
me@machine: ~/glusterfs/ziptest$ unzip test.zip
Archive:  test.zip
  End-of-central-directory signature not found.  Either this file is not
  a zipfile, or it constitutes one disk of a multi-part archive.  In the
  latter case the central directory and zipfile comment will be found on
  the last disk(s) of this archive.
unzip:  cannot find zipfile directory in one of test.zip or
        test.zip.zip, and cannot find test.zip.ZIP, period.

Although the error pops up while unzipping, the archive itself is corrupted. This can be seen by copying the archive to another, non-glusterfs partition, where the error still occurs. A file zipped on a different partition and copied to glusterfs, however, will unzip nicely.
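The same check can be scripted with Python's standard `zipfile` module, which is handy when comparing copies of an archive on different partitions. This is only a minimal sketch: the helper name `zip_is_intact` is made up for this example, and it tests the archive structure and the stored CRCs, nothing glusterfs-specific.

```python
import zipfile
import zlib


def zip_is_intact(path):
    """Return True if `path` is a zip archive whose members all pass the CRC check."""
    # is_zipfile() looks for the end-of-central-directory record,
    # the structure that unzip reports as missing.
    if not zipfile.is_zipfile(path):
        return False
    try:
        with zipfile.ZipFile(path) as archive:
            # testzip() returns the name of the first bad member, or None.
            return archive.testzip() is None
    except (zipfile.BadZipFile, zlib.error):
        # A damaged directory or a broken deflate stream also counts as corrupt.
        return False
```

Calling `zip_is_intact("test.zip")` on the archive created on glusterfs and on a copy made elsewhere should give the same answer if the damage really is in the file itself.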

I haven't studied the cause of this corruption, but I presume it is connected with the central directory file header. The glusterfs setup in this case uses a simple distributed configuration, so it is not an issue with striping, although I haven't looked into any other configuration options. My personal solution was to abandon the application that required zip, and to use tar with gzip or bzip2 instead.
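For scripts that create the archives themselves, the tar and gzip route is also available from Python through the standard `tarfile` module. The sketch below is roughly what `tar czf` and `tar tf` do; the function names are my own for illustration, and I have not run this particular code on glusterfs.

```python
import os
import tarfile


def make_tar_gz(archive_path, member_paths):
    """Pack the given files into a gzip-compressed tar archive."""
    with tarfile.open(archive_path, "w:gz") as archive:
        for path in member_paths:
            # Store each file under its base name, without directories.
            archive.add(path, arcname=os.path.basename(path))


def list_tar_members(archive_path):
    """Read the archive back and return the names of its members."""
    # "r:*" lets tarfile detect the compression (gzip or bzip2) itself.
    with tarfile.open(archive_path, "r:*") as archive:
        return archive.getnames()
```

Reading the archive back with `list_tar_members` right after writing it is a cheap way to notice corruption early. Using `"w:bz2"` instead of `"w:gz"` switches the compression to bzip2.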

]]>Strictly speaking I will retain the core features of a blog, namely reverse-chronological order and some form of commenting. The infrastructure, however, is based on Blogofile, which is a static website compiler written in Python, and the ubiquitous git revision control system. In fact, I've decided to version control the whole source code for the website and to publish it openly on github.

I will refer to the things I write here as *notes*, and not posts as in typical blogs, because most will be technical and short, and they will probably be published irregularly. Also, I might go back and update, change, correct and extend notes in the future. But since the source is versioned, it is also possible to go back in time if needed.

There are tools for speeding up Python with C, such as weave and Cython, or with Fortran (see f2py). But the whole point of quick and dirty solutions is to stay away from them, since writing pure Python is simpler. The first thing to do is to use NumPy effectively, and that is what this note is about.

We start with a random vector, a one-dimensional `numpy.ndarray` object with three random elements:

>>> import numpy as np
>>> v = np.random.random((3,))
>>> print(v)
[ 0.21683143  0.47678871  0.48953654]

We will need to compare different ways of computing the norm. I will define lambda functions, which can be inspected and timed like this:

>>> import inspect
>>> import timeit
>>> def timenorm(norm, number):
...     # The source of each lambda looks like "name = lambda v: code".
...     source = inspect.getsource(norm)
...     name = source.split("=")[0].strip()
...     code = source.split(":")[1].strip()
...     # Make np and v visible to the timed statement.
...     setup = "from __main__ import np,v"
...     tim = timeit.timeit(code, setup, number=number)
...     value = norm(v)
...     print("%s: %.6f time: %.3f code: %s" % (name, value, tim, code))
...

This timing function will print the value of the norm (a sanity check), the time it took for *number* of repetitions, and the code actually executed.
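If extracting the source text of a lambda is not needed, `timeit` also accepts a callable directly. The following is a minimal sketch of that alternative, not the function used for the numbers in this note; the name `time_callable` is made up here, and the extra lambda wrapper adds a small constant overhead to every call.

```python
import timeit

import numpy as np


def time_callable(func, arg, number):
    """Call func(arg) `number` times and return (value, total seconds)."""
    # Sanity-check value, computed once outside the timed loop.
    value = func(arg)
    # timeit accepts a zero-argument callable instead of a code string.
    seconds = timeit.timeit(lambda: func(arg), number=number)
    return value, seconds


v = np.random.random((3,))
value, seconds = time_callable(np.linalg.norm, v, 1000)
print("norm: %.6f time: %.3f" % (value, seconds))
```

This avoids the `from __main__ import ...` setup string, at the price of not printing the exact code that was timed.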

The first thing to try is the norm provided with numpy.linalg:

>>> mynorm1 = lambda v: np.linalg.norm(v)
>>> timenorm(mynorm1, 5*10**6)
mynorm1: 1.144364 time: 51.915 code: np.linalg.norm(v)

But numpy.linalg does a number of things we don't need, so let's try a few other versions:

>>> mynorm2 = lambda v: np.sqrt(np.sum(np.dot(v,v)))
>>> mynorm3 = lambda v: np.sqrt(np.sum(v*v))
>>> mynorm4 = lambda v: np.sqrt(v[0]*v[0] + v[1]*v[1] + v[2]*v[2])
>>> for mynorm in mynorm1, mynorm2, mynorm3, mynorm4:
...     timenorm(mynorm, 5*10**6)
...
mynorm1: 0.716931 time: 52.390 code: np.linalg.norm(v)
mynorm2: 0.716931 time: 48.008 code: np.sqrt(np.sum(np.dot(v,v)))
mynorm3: 0.716931 time: 46.823 code: np.sqrt(np.sum(v*v))
mynorm4: 0.716931 time: 21.482 code: np.sqrt(v[0]*v[0] + v[1]*v[1] + v[2]*v[2])

It seems that the explicit version, without the additional function calls, is more than twice as fast if you have a single vector, although it only works for vectors with exactly three elements. Reasonable, but not very meaningful, since in all cases we are dealing with a few microseconds per function call.
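To put numbers on "microseconds per function call", the totals printed above can be divided by the number of repetitions. This is just the arithmetic, with the times copied by hand from the output; the variable names are made up for this example.

```python
# Total times in seconds for 5*10**6 calls, copied from the output above.
number = 5 * 10**6
totals = {
    "mynorm1": 52.390,
    "mynorm2": 48.008,
    "mynorm3": 46.823,
    "mynorm4": 21.482,
}

# Convert each total into microseconds per call.
per_call_us = {name: total / number * 1e6 for name, total in totals.items()}

for name, micros in sorted(per_call_us.items()):
    print("%s: %.1f microseconds per call" % (name, micros))
```

That works out to roughly 4 microseconds per call for the explicit version and about 9 to 10 for the others.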

The whole point of NumPy is to get rid of loops by vectorizing, and in practice one typically deals with large sets of different vectors. So it would make more sense to benchmark on an array of vectors:

>>> V = np.random.random((1000,3))
>>> print(V)
[[ 0.73755195  0.78344111  0.02725284]
 [ 0.49455093  0.08837641  0.78106238]
 [ 0.97095203  0.64497806  0.53856876]
 ...,
 [ 0.67676871  0.41127143  0.89213647]
 [ 0.50376334  0.01370871  0.35758737]
 [ 0.05427026  0.42527007  0.88730196]]

Here are `mynorm1` and `mynorm3` translated into list comprehensions:

>>> mynorms1 = lambda V: [np.linalg.norm(v) for v in V]
>>> mynorms2 = lambda V: [np.sqrt(sum(v*v)) for v in V]

In the second case it will be more efficient to take the square root for the whole resulting array at once. Here is that modification, and a similar one for `mynorm4`:

>>> mynorms3 = lambda V: np.sqrt([sum(v*v) for v in V])
>>> mynorms4 = lambda V: np.sqrt([v[0]*v[0] + v[1]*v[1] + v[2]*v[2] for v in V])

And finally, a compact version that sticks with arrays all the way:

>>> mynorms5 = lambda V: np.sqrt((V*V).sum(axis=1))

Now a comparison (`timenorms` is available in the attached script):

>>> for mynorms in mynorms1, mynorms2, mynorms3, mynorms4, mynorms5:
...     timenorms(mynorms, 5*10**6//len(V))
...
mynorms1: 1.076339 time: 52.706 code: [np.linalg.norm(v) for v in V]
mynorms2: 1.076339 time: 45.342 code: [np.sqrt(sum(v*v)) for v in V]
mynorms3: 1.076339 time: 33.971 code: np.sqrt([sum(v*v) for v in V])
mynorms4: 1.076339 time: 13.125 code: np.sqrt([v[0]*v[0] + v[1]*v[1] + v[2]*v[2] for v in V])
mynorms5: 1.076339 time: 0.212 code: np.sqrt((V*V).sum(axis=1))

The speedup `mynorms5` provides here will be even larger if we put all five million vectors into a single array. Of course, compiled C will be even faster, but this is more than enough for most of my quick and dirty scripts.
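For completeness, here are two other array-only ways to get row-wise norms that I know of: the `axis` argument of `np.linalg.norm` (available in reasonably recent NumPy versions) and `np.einsum`. This is a sketch that only checks that all three agree. I have not timed these variants, so no claim is made about which is fastest.

```python
import numpy as np

V = np.random.random((1000, 3))

# The array-only version from this note.
norms_sum = np.sqrt((V * V).sum(axis=1))

# np.linalg.norm can work row by row through its axis argument.
norms_linalg = np.linalg.norm(V, axis=1)

# einsum computes the row-wise dot products without the temporary V*V array.
norms_einsum = np.sqrt(np.einsum("ij,ij->i", V, V))

# All three should give one norm per row and agree to rounding error.
assert np.allclose(norms_sum, norms_linalg)
assert np.allclose(norms_sum, norms_einsum)
```

Unlike the explicit `v[0]*v[0] + ...` form, all three work for rows of any length.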

Although I've used Debian for many years, I've never built packages or maintained them. My entry points were the debichem packaging group, the Debian mentors website, and of course the maintainer's guide. All these were very helpful in getting the job done efficiently.

So, I'm happy to report that Debian users (of the current testing distribution, aka *wheezy*) can install cclib even more easily than before, by typing one command at the terminal:

aptitude install cclib

or via their favorite software package manager. This actually installs two packages: python-cclib, containing the core Python module, and cclib, which carries the user scripts. Due to current and possible future name conflicts, these user scripts have been prefixed with *cclib-*; that means that instead of *ccget* users run *cclib-ccget*, and that *cda* has been changed to *cclib-cda*.

If you are also interested in the logfiles distributed with cclib and the accompanying unit tests, you will need to install cclib-data from the *non-free* repository. This is due to copyright issues, since the log files created by many computational chemistry programs are not free to use and distribute under all conditions (see the debichem-devel mailing list from July 2011 for the relevant discussion).

Using cclib within Python is the same as always. For example, with all packages installed you can type this in the interpreter:

>>> import cclib
>>> cclib.test.testall()

which should run the whole cclib unittest suite.

]]>