September 11, 2007

Benchmarking Parallel Python Against Jython Threading (Benchmarks Take 3)

Follow-up to Benchmarking Parallel Python Against Threading from The Utovsky Bolshevik Show

Having had it pointed out to me that benchmarking against CPython threading is pointless, I am now going to do what I should have done originally (third time's the charm, right?) and benchmark Parallel CPython against Jython threads, in the hope that I will fail less at producing something useful.

Each of these results is the time it takes to sum the prime numbers below each multiple of 10000 between 100000 and 1000000 (i.e. perform the operation 90 times on numbers increasing by 10000 each time). 
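Concretely, the task each run performs looks like this (a minimal sketch; the function names are mine, not necessarily those of the benchmark scripts):

```python
def is_prime(n):
    """Trial division -- deliberately naive, since the benchmark wants CPU-bound work."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True

def sum_primes(limit):
    """Sum of all primes below `limit`."""
    return sum(n for n in range(2, limit) if is_prime(n))

# The benchmark runs this 90 times, on limits 100000, 110000, ..., 990000:
# totals = [sum_primes(limit) for limit in range(100000, 1000000, 10000)]
```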

I'm reusing the Parallel Python results from previously.

I decided to use Tim Lesher's cookbook recipe to test threads, as I already have a script which doesn't require a great deal of rewriting to make it Jython (i.e. CPython 2.2 or so) compatible.
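The shape of that recipe, a fixed pool of worker threads consuming tasks from a shared queue, is roughly this (a sketch in modern Python, not Lesher's exact code, which targeted the 2.2-era API):

```python
import queue
import threading

def _sum_primes(limit):
    """Sum of primes below `limit` (the same CPU-bound task as the benchmark)."""
    return sum(n for n in range(2, limit)
               if all(n % i for i in range(2, int(n ** 0.5) + 1)))

def run_pool(limits, n_workers):
    """Farm each limit out to a pool of n_workers threads; return {limit: sum}."""
    tasks, results = queue.Queue(), queue.Queue()

    def worker():
        while True:
            limit = tasks.get()
            if limit is None:        # poison pill: time to shut down
                break
            results.put((limit, _sum_primes(limit)))

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for limit in limits:
        tasks.put(limit)
    for _ in threads:
        tasks.put(None)              # one pill per worker
    for t in threads:
        t.join()
    return dict(results.get() for _ in limits)
```

Under CPython the GIL serialises these CPU-bound workers, which is why adding workers buys little there; under Jython the identical code can occupy both cores.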

Now, the results:

                   1 Worker   2 Workers   4 Workers
Vanilla CPython
Parallel CPython
Jython Threads

As can be seen here, Jython threads by far and away beat Parallel CPython.  This does not, however, take into account the fact that Parallel Python can use several machines at once, which Jython threading obviously cannot do.

What's interesting to note is that Parallel CPython with one worker performs roughly the same as standard GIL'd CPython (slightly faster, in fact, in this case). So if you need to write and deploy CPython as opposed to Jython, there's no performance cost in writing parallelisable code with Parallel Python, regardless of the end user's hardware (PP, by default, spawns a number of workers equal to the available CPUs).
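For comparison, submitting the same task through Parallel Python looks roughly like this (a sketch based on PP's documented `Server.submit` API; the parallel part is guarded because `pp` is a separate install):

```python
def is_prime(n):
    """Naive trial division."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True

def sum_primes(limit):
    """Sum of primes below `limit`."""
    return sum(n for n in range(2, limit) if is_prime(n))

try:
    import pp
except ImportError:
    pp = None  # Parallel Python not installed; skip the parallel part

if pp is not None:
    server = pp.Server()  # default worker count = number of detected CPUs
    # submit(func, args, depfuncs, modules): PP serialises the function plus
    # its listed dependencies and ships them to worker processes.
    jobs = [server.submit(sum_primes, (limit,), (is_prime,), ())
            for limit in range(100000, 1000000, 10000)]
    totals = [job() for job in jobs]  # calling a job blocks until its result is ready
```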

These statistics were taken on an IBM Thinkpad T60 with a Core Duo T2400 running Ubuntu Feisty GNU/Linux (using the standard packages where available) using the scripts found under . 

Hopefully these are useful statistics and conclusions, as opposed to my previous efforts to produce such. :) 

17 comments
  1. Drew Vogel

    The reason I suggest running an IO-bound test is because in theory the threads running on the stock python interpreter could see a performance increase due to one thread sleeping while the IO blocks, allowing the other thread to execute. This is the approach of the State Threads C library. I’m curious whether the stock python interpreter is able to facilitate this type of “parallelism” or whether the OS will put the entire python interpreter to sleep while the script waits on IO. Therefore be sure to publish what OS you run the benchmarks on. Also post the scripts so that other people can run them on other OSes.

    12 Sep 2007, 00:06

  2. This particular set of tests was run on Ubuntu Feisty GNU/Linux. I’ll look into publishing the scripts, but I think I may have cannibalised one of them to produce another. I’ll check. I should be able to reconstitute the scripts I used regardless.

    As for IO-bound tests, I can’t think of an effective one off the top of my head, so suggestions would be appreciated.

    12 Sep 2007, 00:14

  3. Curious

    Hmm. Puzzling. What does the benchmarked task actually do?

    12 Sep 2007, 00:17

  4. andy

    Hmm… Jython was 2.6x faster than parallel CPython for one worker? Those results seem to show that Jython is just plain faster in general, not that its multithreading is more efficient.

    12 Sep 2007, 00:27

  5. The three scripts used can be found under

    They work out the sum of the prime numbers below each multiple of 10000 between 100000 and 1000000.

    I’ll edit the post to reflect these facts.

    12 Sep 2007, 00:33

  6. Robert

    Yes, but threading was discarded as the “answer”. Real threading is hard. That is the whole reason Bruce is going after parallelization. So while it is nice you did this…it isn’t what was wanted.

    12 Sep 2007, 01:14

  7. Robert: I don’t really understand what you mean. My interest here was primarily to see if Parallel Python was a reasonable solution to this problem, as Bruce suggests it might be here. Currently, as I understand it, threading is ‘the next best thing’ so it makes sense to see how much slower this approach is when compared to that.

    12 Sep 2007, 01:27

  8. Juergen Brendel

    Java has far superior loop and arithmetic performance compared to Python. Even Psyco can’t help much in that respect. Here are some tests I did on that. Sadly, I compared it to pure Java, not Jython. But a recent, unofficial test showed me that Java’s good loop and arithmetic performance translates to Jython as well, as could be expected.

    One of the nice things about Jython is that you can continue to use the easy threading API. With that, you have a choice of using queues for communication between threads, or to use shared data structures with explicit locking if you choose to do so. Does PP give you similar options?

    12 Sep 2007, 02:44

  9. Guido van Rossum

    Thanks for getting this started! Juergen already pointed out that you’re probably testing Java’s arithmetic more than anything else.

    Regarding the effect of the # of workers, how many CPUs (hyperthreaded or not) does your box have?

    How many threads were actually created for the entire test? I’m much more interested in the cost of creating threads than in the performance of a bunch of straight CPU-bound threads that don’t communicate (I trust that once the GIL is gone those scale with the # of workers). The other cost I’m interested in is the cost of communicating.

    I think you’ll need a better benchmark. In talks about thread performance I’ve heard references to a standard algorithm involving a red-black tree; maybe that’s enough to get you started.

    12 Sep 2007, 15:30

  10. Bruno Gomes


    Are you sure you used the scripts found at for your benchmarks?
    First off, the sum_primes function in is different from the one in . Besides, only looks for the sum of prime numbers from 100000 to 100700, using increments of 100, while goes from 100000 to 1000000.
    I also ran the scripts on Windows and the results for CPython and Jython were far from the ones you published. CPython was a lot faster than Jython using 1 worker, and I couldn’t run Jython using 2 workers at all (Java threw an “out of memory” exception). Was anyone able to reproduce these results?

    12 Sep 2007, 19:53

  11. nes

    First of all, congrats for doing this. That said, benchmarking is a tricky business, and doing it for parallel jobs doubly so. So be prepared to ask (and be asked) a lot of questions before coming to any conclusion.

    “As can be seen here, Jython threads by far and away beat Parallel CPython.” I have to disagree with that. What can be seen is that CPython is faster than Jython for this particular program, period. CPython takes 3 times as long in a 1-to-1 comparison with Jython. Since Parallel Python presumably uses CPython, I wouldn’t expect any miracle there. What matters is how well your programs scale when you add more CPUs:
    For example, Parallel Python with 4 workers gives us an improvement of 1153/582, approximately 98 percent; Jython with 4 workers, an improvement of 442/254, approximately 74 percent. You did not tell us if you actually have 4 cores or not, so it is hard at this point to say anything definite, but if actually having 4 times the number of CPUs speeds the program up only 2 times at best, I would say that scalability is not that great with either method.
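    That speedup arithmetic, spelled out (the times are the ones quoted in this comment):

```python
def speedup(t_one_worker, t_four_workers):
    """Classic speedup ratio: one-worker time over four-worker time."""
    return t_one_worker / t_four_workers

pp_speedup = speedup(1153, 582)      # Parallel Python, 1 -> 4 workers
jython_speedup = speedup(442, 254)   # Jython threads, 1 -> 4 workers
print(round(pp_speedup, 2), round(jython_speedup, 2))  # 1.98 1.74
```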

    12 Sep 2007, 22:03

  12. nes

    Errata: of course that should have been: “Jython is faster than CPython for this particular program”

    12 Sep 2007, 22:04

  13. Marco Fabbri

    Guido: a pure communication benchmark, as presented in Joe Armstrong’s book “Programming Erlang – Software for a Concurrent World”, could be a “ring benchmark”, i.e. timing how long it takes to send M messages across a ring of N processes.
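    A thread-and-queue approximation of that ring (a sketch only; Armstrong’s original uses Erlang processes, and his message counts are far larger):

```python
import queue
import threading
import time

def ring_benchmark(n_nodes, n_laps):
    """Time a token making n_laps trips around a ring of n_nodes threads."""
    queues = [queue.Queue() for _ in range(n_nodes)]

    def node(i):
        nxt = queues[(i + 1) % n_nodes]
        while True:
            hops = queues[i].get()
            # Decrement the token and pass it on; a zero token tells
            # each node in turn to forward it once and then exit.
            nxt.put(hops - 1 if hops > 0 else 0)
            if hops <= 0:
                break

    threads = [threading.Thread(target=node, args=(i,)) for i in range(n_nodes)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    queues[0].put(n_nodes * n_laps)  # inject the token
    for t in threads:
        t.join()
    return time.perf_counter() - start
```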

    13 Sep 2007, 10:56

  14. Guido: I have 2 CPUs (it’s a Core Duo). I’m not sure as to how many threads were created. Certainly PP had n+1 processes running at any given time, but that’s not especially helpful I guess.

    Bruno: No, of course not. I’d edited the Parallel Python script to create my pure CPython test and had to redownload the example, forgetting that they used a different set of inputs. I’ve now uploaded a fix.

    nes: As in my response to Guido, I have 2 CPUs. I’ll update the post to reflect this fact.

    13 Sep 2007, 12:50

  15. nes

    For optimum speed you want: number of workers = number of cores. More workers than cores should in theory reduce your overall speed slightly as seen in the threaded example.

    13 Sep 2007, 21:35

  16. nes: That’s the theory. But looking at the Parallel CPython results, it doesn’t necessarily play out in practice (which is why I included a set of data for ‘workers > CPUs’).

    13 Sep 2007, 21:44

  17. Eli

    Thanks a bunch for doing this. I think that both Guido and Bruce Eckel are right; we really need people helping Python libraries get better performance on multi-core systems, and benchmarks are a great way to start.

    14 Sep 2007, 20:15
