This will be a simple exercise I did a long time ago in speeding up a single operation repeated many times, namely the Euclidean vector norm. It can become quite a bottleneck if done wrong, especially if you deal with millions of vectors in a Python script written quick and dirty.
There are tools for speeding up Python with C, such as weave and Cython, or with Fortran (see f2py). But the whole point in quick and dirty solutions is to stay away from them, since writing pure Python is simpler. The first thing to do is to use NumPy effectively, and that is what this note is about.
A single vector
A random vector, a one dimensional
numpy.ndarray object with three random elements:
>>> import numpy as np >>> v = np.random.random((3,)) >>> print v [ 0.21683143 0.47678871 0.48953654]
We will need to compare different ways of computing the norm. I will define lambda functions, which can be inspected and timed like this:
>>> import inspect >>> import timeit >>> def timenorm(norm, number): ... name = inspect.getsource(norm).split("=").strip() ... code = inspect.getsource(norm).split(":").strip() ... setup = "from __main__ import np,v" ... tim = timeit.timeit(code, setup, number=number) ... value = eval("%s(v)" %name) ... print "%s: %.6f time: %.3f code: %s" %(name, value, tim, code)
This timing function will print the value of the norm (a sanity check), the time it took for number of repetitions, and the code actually executed.
The first thing to try is the norm provided with numpy.linalg:
>>> mynorm1 = lambda v: np.linalg.norm(v) >>> timenorm(mynorm, 5*10**6) mynorm1: 1.144364 time: 51.915 code: np.linalg.norm(v)
But numpy.linalg does a number of things we don't need, so let's try a few other versions:
>>> mynorm2 = lambda v: np.sqrt(np.sum(np.dot(v,v))) >>> mynorm3 = lambda v: np.sqrt(np.sum(v*v)) >>> mynorm4 = lambda v: np.sqrt(v*v + v*v + v*v) >>> for mynorm in mynorm1, mynorm2, mynorm3, mynorm4: ... timenorm(mynorm, 5*10**6) mynorm1: 0.716931 time: 52.390 code: np.linalg.norm(v) mynorm2: 0.716931 time: 48.008 code: np.sqrt(np.sum(np.dot(v,v))) mynorm3: 0.716931 time: 46.823 code: np.sqrt(np.sum(v*v)) mynorm4: 0.716931 time: 21.482 code: np.sqrt(v*v + v*v + v*v)
It seems that the explicit version, without the additional function calls, is somewhat quicker if you have a single vector, although it will not support vectors of arbitrary length. Reasonable, but meaningless since in all cases we are dealing with microseconds per function call.
The whole point of NumPy is to get rid of loops by vectorizing, and in practice one typically deals with large sets of different vectors. So it would make more sense to benchmark on an array of vectors:
>>> V = np.random.random((1000,3)) >>> print V [[ 0.73755195 0.78344111 0.02725284] [ 0.49455093 0.08837641 0.78106238] [ 0.97095203 0.64497806 0.53856876] ..., [ 0.67676871 0.41127143 0.89213647] [ 0.50376334 0.01370871 0.35758737] [ 0.05427026 0.42527007 0.88730196]]
mynorm3 translated into list comprehensions:
>>> mynorms1 = lambda V: [np.linalg.norm(v) for v in V] >>> mynorms2 = lambda V: [np.sqrt(sum(v*v)) for v in V]
In the second case it will be more efficient to take the square root for the whole resulting array at once. Here is that modification, and a similar one for
>>> mynorms3 = lambda V: np.sqrt([sum(v*v) for v in V]) >>> mynorms4 = lambda V: np.sqrt([v*v + v*v + v*v for v in V])
And finally, a compact version that sticks with arrays all the way:
mynorms5 = lambda V: np.sqrt((V*V).sum(axis=1))
Now a comparison (
timenorms is available in the attached script):
>>> for mynorms in mynorms1, mynorms2, mynorms3, mynorms4, mynorms5: ... timenorm(mynorms, 5*10**6/len(V)) mynorms1: 1.076339 time: 52.706 code: [np.linalg.norm(v) for v in V] mynorms2: 1.076339 time: 45.342 code: [np.sqrt(sum(v*v)) for v in V] mynorms3: 1.076339 time: 33.971 code: np.sqrt([sum(v*v) for v in V]) mynorms4: 1.076339 time: 13.125 code: np.sqrt([v*v + v*v + v*v for v in V]) mynorms5: 1.076339 time: 0.212 code: np.sqrt((V*V).sum(axis=1))
mynorms5 provides here will be even larger if we put all five million vectors into a single array. Of course, compiled C will be even faster, but this is more than enough for most of my quick and dirty scripts.