I haven't found anything about this on the web yet, and it caused me to lose some data and time, so I'm issuing a warning here. Using zip to pack and compress files on a glusterfs partition can corrupt the resulting archive.
me@machine: ~/glusterfs/ziptest$ dd if=/dev/zero count=1 of=test.dat
1+0 records in
1+0 records out
512 bytes (512 B) copied, 0.000357 seconds, 1.4 MB/s
me@machine: ~/glusterfs/ziptest$ zip test.zip test.dat
  adding: test.dat (deflated 98%)
me@machine: ~/glusterfs/ziptest$ ls -lh
total 16K
-rw-r--r-- 1 kml kml 512 Oct 6 13:47 test.dat
-rw-r--r-- 1 kml kml 67 Oct 6 13:58 test.zip
me@machine: ~/glusterfs/ziptest$ unzip test.zip
Archive: test.zip
  End-of-central-directory signature not found. Either this file is not
  a zipfile, or it constitutes one disk of a multi-part archive. In the
  latter case the central directory and zipfile comment will be found on
  the last disk(s) of this archive.
unzip: cannot find zipfile directory in one of test.zip or
       test.zip.zip, and cannot find test.zip.ZIP, period.
Although the error pops up while unzipping, the archive itself is corrupted. This can be seen by copying the archive to another, non-glusterfs partition, where the error still occurs. A file zipped on a different partition and copied to glusterfs, however, will unzip nicely.
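One way to catch this kind of damage early is to read an archive back right after writing it. The sketch below uses Python's standard zipfile module rather than the zip command from the session above, so it may well not reproduce the glusterfs problem itself. The helper name archive_is_readable and the scratch directory are made up for illustration.

```python
import os
import tempfile
import zipfile


def archive_is_readable(path):
    """Return True if path opens as a zip archive and all members pass their CRC check."""
    if not zipfile.is_zipfile(path):
        # is_zipfile looks for the end-of-central-directory record,
        # the structure unzip reported as missing.
        return False
    with zipfile.ZipFile(path) as archive:
        # testzip() returns the name of the first bad member, or None.
        return archive.testzip() is None


# Scratch directory for the demonstration; point workdir at the
# partition you actually want to test.
workdir = tempfile.mkdtemp()
data_path = os.path.join(workdir, "test.dat")
zip_path = os.path.join(workdir, "test.zip")

with open(data_path, "wb") as handle:
    handle.write(b"\0" * 512)  # the same 512 zero bytes dd produced

with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
    archive.write(data_path, arcname="test.dat")

print(archive_is_readable(zip_path))
```

Running the same check on a copy of the archive on another partition distinguishes a corrupted file from a read problem on the original partition.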
I haven't studied the cause of this corruption, but I presume it is connected with the central directory file header. The glusterfs setup in this case uses a simple distributed configuration, so it is not an issue with striping, although I haven't looked into any other configuration options. My personal solution was to abandon the application that required zip, and to use tar with gzip or bzip2 instead.
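For scripts, the tar-with-gzip route can also be taken from Python with the standard tarfile module. This is only a sketch of the idea, not the commands I actually ran; the file names and the scratch directory are placeholders.

```python
import os
import tarfile
import tempfile

# Scratch directory for the demonstration (placeholder location).
workdir = tempfile.mkdtemp()
data_path = os.path.join(workdir, "test.dat")
tar_path = os.path.join(workdir, "test.tar.gz")

with open(data_path, "wb") as handle:
    handle.write(b"\0" * 512)

# "w:gz" writes a gzip-compressed tar; "w:bz2" would use bzip2 instead.
with tarfile.open(tar_path, "w:gz") as archive:
    archive.add(data_path, arcname="test.dat")

# Read the archive back to confirm the member survived intact.
with tarfile.open(tar_path, "r:gz") as archive:
    restored = archive.extractfile("test.dat").read()

print(len(restored))
```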
Yes and no. For some time I've had an urge to blog. I follow quite a few and most are concerned with science and research, or with computing and programming. Regularly I stumble upon technical information on blogs that I find helpful in my work. In a way, this website is about me giving back to that community. It's also about organizing the various technical notes I've put down in many places.
Strictly speaking I will retain the core features of a blog, namely reverse-chronological order and some form of commenting. The infrastructure, however, is based on Blogofile, which is a static website compiler written in Python, and the ubiquitous git revision control system. In fact, I've decided to version control the whole source code for the website and to publish it openly on github.
I will refer to the things I write here as notes, and not posts as in typical blogs, because most will be technical and short, and they will probably be published irregularly. Also, I might go back and update, change, correct and extend notes in the future. But since the source is versioned, it is also possible to go back in time if needed.
This will be a simple exercise I did a long time ago in speeding up a single operation repeated many times, namely the Euclidean vector norm. It can become quite a bottleneck if done wrong, especially if you deal with millions of vectors in a Python script written quick and dirty.
There are tools for speeding up Python with C, such as weave and Cython, or with Fortran (see f2py). But the whole point in quick and dirty solutions is to stay away from them, since writing pure Python is simpler. The first thing to do is to use NumPy effectively, and that is what this note is about.
A single vector
A random vector, a one-dimensional numpy.ndarray object with three random elements:
>>> import numpy as np
>>> v = np.random.random((3,))
>>> print v
[ 0.21683143 0.47678871 0.48953654]
We will need to compare different ways of computing the norm. I will define lambda functions, which can be inspected and timed like this:
>>> import inspect
>>> import timeit
>>> def timenorm(norm, number):
...     # split a one-line "name = lambda v: body" definition into name and body
...     name = inspect.getsource(norm).split("=")[0].strip()
...     code = inspect.getsource(norm).split(":")[1].strip()
...     setup = "from __main__ import np,v"
...     tim = timeit.timeit(code, setup, number=number)
...     value = eval("%s(v)" %name)
...     print "%s: %.6f time: %.3f code: %s" %(name, value, tim, code)
This timing function will print the value of the norm (a sanity check), the time it took for number of repetitions, and the code actually executed.
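For readers on Python 3, here is a rough equivalent that skips the source inspection and passes the function and the vector in directly. It is a sketch, not the function used for the timings in this note: the name time_norm and its arguments are my own choice here, and wrapping the call in a lambda adds a little overhead that timing the bare expression does not have.

```python
import timeit

import numpy as np


def time_norm(label, norm, vector, number):
    """Time `number` calls of norm(vector) and print the value as a sanity check."""
    elapsed = timeit.timeit(lambda: norm(vector), number=number)
    value = norm(vector)
    print("%s: %.6f time: %.3f" % (label, value, elapsed))
    return value, elapsed


v = np.random.random((3,))
value, elapsed = time_norm("linalg", np.linalg.norm, v, 10000)
```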
The first thing to try is the norm provided with numpy.linalg:
>>> mynorm1 = lambda v: np.linalg.norm(v)
>>> timenorm(mynorm1, 5*10**6)
mynorm1: 1.144364 time: 51.915 code: np.linalg.norm(v)
But numpy.linalg does a number of things we don't need, so let's try a few other versions:
>>> mynorm2 = lambda v: np.sqrt(np.sum(np.dot(v,v)))
>>> mynorm3 = lambda v: np.sqrt(np.sum(v*v))
>>> mynorm4 = lambda v: np.sqrt(v[0]*v[0] + v[1]*v[1] + v[2]*v[2])
>>> for mynorm in mynorm1, mynorm2, mynorm3, mynorm4:
...     timenorm(mynorm, 5*10**6)
...
mynorm1: 0.716931 time: 52.390 code: np.linalg.norm(v)
mynorm2: 0.716931 time: 48.008 code: np.sqrt(np.sum(np.dot(v,v)))
mynorm3: 0.716931 time: 46.823 code: np.sqrt(np.sum(v*v))
mynorm4: 0.716931 time: 21.482 code: np.sqrt(v[0]*v[0] + v[1]*v[1] + v[2]*v[2])
It seems that the explicit version, without the additional function calls, is roughly twice as fast for a single vector (about 21 seconds versus 47 to 52), although it only works for vectors with exactly three elements. That is reasonable, but of little practical consequence, since in all cases we are dealing with a few microseconds per function call.
The whole point of NumPy is to get rid of loops by vectorizing, and in practice one typically deals with large sets of different vectors. So it would make more sense to benchmark on an array of vectors:
>>> V = np.random.random((1000,3))
>>> print V
[[ 0.73755195 0.78344111 0.02725284]
 [ 0.49455093 0.08837641 0.78106238]
 [ 0.97095203 0.64497806 0.53856876]
 ...,
 [ 0.67676871 0.41127143 0.89213647]
 [ 0.50376334 0.01370871 0.35758737]
 [ 0.05427026 0.42527007 0.88730196]]
First, mynorm1 and mynorm3 translated into list comprehensions:
>>> mynorms1 = lambda V: [np.linalg.norm(v) for v in V]
>>> mynorms2 = lambda V: [np.sqrt(sum(v*v)) for v in V]
In the second case it will be more efficient to take the square root of the whole resulting array at once. Here is that modification, and a similar one for mynorm4:
>>> mynorms3 = lambda V: np.sqrt([sum(v*v) for v in V])
>>> mynorms4 = lambda V: np.sqrt([v[0]*v[0] + v[1]*v[1] + v[2]*v[2] for v in V])
And finally, a compact version that sticks with arrays all the way:
>>> mynorms5 = lambda V: np.sqrt((V*V).sum(axis=1))
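To see what the compact version does step by step, here is a small Python 3 sketch on two hand-picked vectors whose norms are easy to check by eye (5 and 3). The last two lines are alternative spellings I am adding for comparison, not versions timed in this note; np.linalg.norm needs a NumPy recent enough to accept the axis argument.

```python
import numpy as np

V = np.array([[3.0, 4.0, 0.0],
              [1.0, 2.0, 2.0]])

squares = V * V                 # elementwise squares, shape (2, 3)
row_sums = squares.sum(axis=1)  # sum along each row, shape (2,)
norms = np.sqrt(row_sums)       # one norm per vector: 5 and 3

print(norms)

# Alternative spellings that should give the same numbers
norms_linalg = np.linalg.norm(V, axis=1)
norms_einsum = np.sqrt(np.einsum("ij,ij->i", V, V))
```

The axis=1 argument is what keeps the vectors separate: without it, sum() would add up all six squares into a single number.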
Now a comparison (timenorms is available in the attached script):
>>> for mynorms in mynorms1, mynorms2, mynorms3, mynorms4, mynorms5:
...     timenorms(mynorms, 5*10**6/len(V))
...
mynorms1: 1.076339 time: 52.706 code: [np.linalg.norm(v) for v in V]
mynorms2: 1.076339 time: 45.342 code: [np.sqrt(sum(v*v)) for v in V]
mynorms3: 1.076339 time: 33.971 code: np.sqrt([sum(v*v) for v in V])
mynorms4: 1.076339 time: 13.125 code: np.sqrt([v[0]*v[0] + v[1]*v[1] + v[2]*v[2] for v in V])
mynorms5: 1.076339 time: 0.212 code: np.sqrt((V*V).sum(axis=1))
The speedup that mynorms5 provides here will be even larger if we put all five million vectors into a single array. Of course, compiled C will be even faster, but this is more than enough for most of my quick and dirty scripts.
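A rough way to check the effect of array size on your own machine is sketched below in Python 3. The array size of 100000 vectors and the function names loop_norms and array_norms are arbitrary choices for this illustration, and the timings will differ from the ones quoted above.

```python
import timeit

import numpy as np

V = np.random.random((100000, 3))


def loop_norms(V):
    # explicit sum of squares per vector, one square root call at the end
    return np.sqrt([v[0]*v[0] + v[1]*v[1] + v[2]*v[2] for v in V])


def array_norms(V):
    # arrays all the way, no Python-level loop
    return np.sqrt((V * V).sum(axis=1))


loop_time = timeit.timeit(lambda: loop_norms(V), number=1)
array_time = timeit.timeit(lambda: array_norms(V), number=10) / 10

print("loop:  %.4f s per pass" % loop_time)
print("array: %.4f s per pass" % array_time)
```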
I'm involved in the development of cclib, which is a Python library for parsing computational chemistry output files, and progress has been sporadic at best. Nonetheless, after a few years the version number is above 1.0, the interface is quite stable and most of the commits now are bugfixes. More importantly, it seems we have acquired quite a few users, especially via GaussSum and QMForge, which are basically graphical user interfaces for cclib. This year I decided it was time to finally introduce cclib into Debian, my Linux distribution of choice.
Although I've used Debian for many years, I've never built packages or maintained them. My entry points were the debichem packaging group, the Debian mentors website, and of course the maintainer's guide. All these were very helpful in getting the job done efficiently.
So, I'm happy to report that Debian users (of the current testing distribution, aka wheezy) can install cclib even more easily than before, by typing one command at the terminal:
aptitude install cclib
or via their favorite software package manager. This actually installs two packages: python-cclib, containing the core Python module, and cclib, which carries the user scripts. Due to current and possible future conflicts in names, these user scripts have been prefixed with cclib-; that means that instead of ccget users run cclib-ccget, and that cda has been changed to cclib-cda.
If you are also interested in the logfiles distributed with cclib and the accompanying unittests, you will need to install cclib-data from the non-free repository. This is due to copyright issues, since the log files created by many computational chemistry programs are not free to use and distribute under all conditions (see debichem-devel mailing list from July 2011 for the relevant discussion).
Using cclib within Python is the same as always. For example, with all packages installed you can type this in the interpreter:
>>> import cclib
>>> cclib.test.testall()
which should run the whole cclib unittest suite.