Friday, June 3, 2011

Scala / Java Shootout

I should probably comment on this paper from my colleague Robert Hundt, where he writes the same benchmark in Java, Scala, Go, Python and C++ (TL;DR here). Short version: Initially, C++ was faster than Scala, and Scala was faster than Java. I volunteered to make some improvements, which ended up with the C++ and Java implementations being about the same speed. Some C++ hackers then went and improved the C++ benchmark by another ~6x, and I didn't think I should go down that road.

Imagine my surprise when I got this email from a friend of mine this morning:

Date: Fri, 3 Jun 2011 03:20:48 -0700
Subject: Java Tunings
To: Jeremy

"Note that Jeremy deliberately refused to optimize the code further,
many of the C++ optimizations would apply to the Java version as
well."

Slacker.


Gee, thanks for making me look good, Robert. (Actually, I had approved that language, but I didn't realize anyone would actually read the paper. :) )

Anyway, I was not just being obdurate here; there was a thought process involved. I should probably document it.
  1. The benchmark did not follow the rules of Java benchmarking. At some point, I should write a post about the rules of Java benchmarking (maybe I have? I don't know). Basically, it timed the full execution of the program, including startup time, JIT compilation time, and GC time. This isn't how you are supposed to benchmark Java or Scala: you warm the code up until the JIT has compiled it, and then measure the steady state (a minimal sketch of that pattern follows this list). So I was very uncomfortable with the experimental setup. In 2011, if you care about startup time, then you probably shouldn't be using Java. The program ran very briefly, so the effect was measurable (once I got it to run fast enough). ETA: I am informed that later versions of this benchmark took this into account. Apologies for the confusion.

  2. It's kind of silly to compare C++ with Java. The paper is about Scala, of course, but the immediate internal reaction at Google was a discussion of how much worse Java was than C++. As it happens, they really do target two different niches. Programmers are more productive in Java (if they know both languages well). It's easier to test Java code. It scales to large code sizes better. The tradeoff is that you are only going to get gcc -O2 level performance, and you have to be careful about garbage collection. In many, many cases, this tradeoff isn't a big deal. A single-threaded performance arms race between the two doesn't make sense. This is the primary reason I stopped trying to make improvements.

  3. The benchmark did not use Java collections effectively. It was generating millions of unnecessary objects per second, and the runtime was swamped by GC overhead. In addition to the collections-related fixes I made (as described in the paper), the benchmark also used a HashMap<Type, Integer> when it should have just stored a primitive int in Type; it also used a LinkedList where it should have used an ArrayDeque (both fixes are sketched after this list). In fact, after being pressed on the matter by other Googlers (and seeing the response the C++ version got), I did make these changes to the benchmark, which brought the numbers down by another factor of 2-3x. The mail I sent Robert about it got lost in the shuffle (he's a busy guy), but he later said he would update the web site.
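
To make point 1 concrete, here is a minimal sketch of the warmup-then-measure pattern. It is illustrative only: this is not the benchmark's actual driver, and the iteration counts are made up.

    // A minimal warmup-then-measure harness. Illustrative only: this is
    // not the benchmark's actual driver; iteration counts are made up.
    public class Harness {
        // A volatile sink keeps the JIT from dead-code-eliminating the work.
        static volatile int sink;

        static int runOnce() {
            // ... perform loop recognition and return some result ...
            return 0;
        }

        public static void main(String[] args) {
            // Warmup: let the JIT profile and compile the hot methods.
            for (int i = 0; i < 10000; i++) {
                sink = runOnce();
            }
            // Steady state: JVM startup and JIT compilation are now excluded.
            // (Whether to count GC time depends on what you are measuring.)
            long start = System.nanoTime();
            for (int i = 0; i < 50; i++) {
                sink = runOnce();
            }
            long elapsed = System.nanoTime() - start;
            System.out.printf("%.3f ms per iteration%n", elapsed / 50 / 1e6);
        }
    }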
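And for point 3, the two fixes in question, sketched on hypothetical types; the benchmark's real classes are more involved.

    import java.util.ArrayDeque;
    import java.util.Deque;
    import java.util.LinkedList;

    // Hypothetical types illustrating the two fixes.
    class CollectionsFixes {
        // Fix 1: instead of a side table like HashMap<Type, Integer>, which
        // boxes an Integer per entry and hashes on every access, store the
        // primitive directly on the object.
        static class Type {
            int number;  // was: map.put(type, Integer.valueOf(n))
        }

        static void worklist(Type t) {
            // Fix 2: LinkedList allocates a node object per element and
            // chases pointers on traversal...
            Deque<Type> slow = new LinkedList<Type>();
            // ...while ArrayDeque is a cyclic array: no per-element nodes,
            // and cache-friendly traversal.
            Deque<Type> fast = new ArrayDeque<Type>();
            slow.addLast(t);
            fast.addLast(t);
        }
    }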


It is this last point that made the difference between Java and Scala. They use the same runtime, so the compiled performance is going to be roughly the same. However, the choice of data structure matters! From the perspective of Scala vs. Java performance, the takeaway isn't that "Java is slow"; it is that "you can write bad code in any language" (with apologies to Robert, who is one of the smartest guys around, but isn't a Java programmer).

It is also "you can write bad code in any language" that inspired me to make improvements in the first place. Robert was not a C++ newbie, but he was a Java newbie, so the comparisons this paper was inspiring between C++ and Java performance made basically no sense to me. I didn't make the changes to demonstrate that Java could do better than C++, as I know that's not true. I made them to show that the playing field wasn't quite as uneven as people tend to think, and that algorithmic changes tend to make more difference than language changes.

These observations don't actually contradict the point that Robert was trying to make. In fact, they underline it. He was trying to say that Scala is very newbie-friendly when you want performance, as it balances the two well. That was clearly true in this case: as a newbie in both Java and Scala, Robert chose more performance-friendly data structures in Scala than in Java. In fact, in large measure, Scala chose the data structures for him, which is probably even better.

The other languages strike terrible balances in this area. Python and Go aren't on the performance radar at all (Python isn't compiled, and Go is young yet). C++ isn't newbie-friendly at all, ever.

Robert, of course, is a very odd choice of newbie, as he is an experienced compiler writer. Havlak is also an odd choice of newbie program, as it is a relatively sophisticated compiler benchmark. Perhaps the moral turns out to be that newbies who are also experienced compiler writers should look carefully at Scala?

36 comments:

Anonymous said...

Thanks for this post.
I've stopped counting the number of incidents where students came along with this "java is always slower" attitude.
When asked, all of them stated they never look into the API/source and none of them ever used a profiler to improve the code.

Stanimir Simeonoff said...

My experience with java: it's blazing fast... when you know what you're doing. When you don't - not so much (I guess).

It's still surprising that a compiler developer would even consider a linked-list-like structure. Unless insertion/deletion in the middle is the most common operation, a linked list is going to lose badly to a cyclic array (aka ArrayDeque). Even for insertion in the middle, trees (with linked nodes, similar to LinkedHashMap) tend to perform better, due to faster search for the node.

Today I had some bad experience helping a colleague remove leaks due to cyclic references in Objective-C (code ported from java). Memory management is just hard, even w/ built-in ref-counting.
Yet, that's not the point, it's amazing how people still have no clue how to benchmark java (or anything). I'd gladly give up 5-10% speed to get a free memory manager.

Jeremy Manson said...

@Stanimir - you have a very good point. Garbage collection and performance only go well together if you understand how the garbage collector works.

I don't think the blame lies with Robert. He's not a Java programmer, and was using the obvious data structures. Why would he even think that an ArrayDeque exists? Knowing the collections classes well requires some experience.

Stanimir Simeonoff said...

I agree that some knowledge of how Java (+GC/allocation) works is essential for achieving decent results, and I especially agree that one can write horrible code in any language (plus the old saying that you can write Java in any language too).

As for LinkedList vs ArrayDeque: I meant that linked structures are overall worse nowadays, due to a virtually guaranteed cache miss per element on traversal, plus the inability to do efficient random access, on top of higher memory consumption. They are simple but inefficient. Here, I exclude LinkedHashMap and CLQ, since they are special. (Probably LBQ as well.)
I just don't remember the last time I used a vanilla LinkedList for any remotely serious job.
Having a link between the nodes of a different structure, similar to LinkedHashMap (or a custom red-black tree) is very useful, though.

So in any language, if I have to pick a data structure, I look at the implementation (provided it's available) or read the docs about how it's really implemented. Anything linked-list-like would be at the very bottom of the stack of choices, so I'd look for an alternative. Java makes it quite obvious with the name as well.

Rolling custom data structures when the standard APIs don't suit the needs is probably an easy option for a compiler developer. Cyclic arrays are (half) an hour's work. Primitive ones (like double) are possibly the ones that pay off most handsomely. Yet, that's a solution well out of newbie land.
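
Here's roughly what I mean - a minimal cyclic array for primitive ints, in the spirit of ArrayDeque but without boxing. An untested sketch; the class name is my own.

    // Minimal cyclic array (ring buffer) for primitive ints. Untested sketch.
    final class IntDeque {
        private int[] buf = new int[16];   // capacity kept a power of two
        private int head, tail;            // tail is the next free slot

        void addLast(int v) {
            buf[tail] = v;
            tail = (tail + 1) & (buf.length - 1);
            if (tail == head) grow();      // full: double the capacity
        }

        int removeFirst() {
            if (head == tail) throw new java.util.NoSuchElementException();
            int v = buf[head];
            head = (head + 1) & (buf.length - 1);
            return v;
        }

        boolean isEmpty() { return head == tail; }

        private void grow() {
            int n = buf.length;
            int[] bigger = new int[n * 2];
            // Unwrap: copy [head..n) first, then [0..head).
            System.arraycopy(buf, head, bigger, 0, n - head);
            System.arraycopy(buf, 0, bigger, n - head, head);
            buf = bigger;
            head = 0;
            tail = n;
        }
    }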

Isaac Gouy said...

>>The program ran very briefly, so the effect was measurable (once I got it to run fast enough).<<

Here are the numbers (on an older Pentium IV workstation) from Fig. 8

290 seconds Java 32-bit
134 seconds Java 64-bit
106 seconds Java 32-bit GC*
89 seconds Java 32-bit SPEC GC

One and a half minutes doesn't seem very brief compared to JVM startup of about 0.085 seconds on that old hardware.

Jeremy Manson said...

@Isaac - I believe I mentioned that that figure doesn't include my optimizations. When I was done optimizing, on my machine, it was about 9 seconds, compared to C++'s post-optimization 3 seconds. I was using a late-model Core-based Xeon, though, not a P4.

Also, startup costs include JIT quiescence. Did that happen in 0.085 seconds?

Isaac Gouy said...
This comment has been removed by the author.
Stanimir Simeonoff said...

@Isaac
If you need fast startup w/ java you need -client and shared, pre-compiled JRE classes. The client compiler generates subpar code anyway, and it's sort of unavailable for 64-bit. (The JDK ships w/o the client version.)

Measuring the entire running time also includes all the time for profiling/optimization and code generation, plus disk access. If you need fast startup with java you'd be better off running a service that intercepts the code and runs it. That would be the closest you can get to C (with flushing the caches just before the test).

In other words, java consists of a (tiered) compiler + profiler and on-stack replacement. In C you compile the program once and ship the binary; in java it happens each time the program is run.

Isaac Gouy said...

>>Basically, it timed the full execution of the program, including startup time, JIT compilation time, and GC time.<<

1) The paper says -

"... performs loop recognition on it for 15.000 times. The purpose is to force Java/Scala compilation, so that the benchmark itself can be measured on compiled code.

Then the driver constructs a large control flow graph containing 4-deep loop nests with a total of 76000 loops. It performs loop recognition on this graph for 50 times."


2) The code says -

HavlakLoopFinder.java shows currentTimeMillis() recorded at the beginning and end of the findLoops() method.

LoopTesterApp.java shows MaxMillis() and MinMillis() are actually reported for that second "Another 50 iterations..." after the program had already churned 15,000 times.

Jeremy Manson said...

It is certainly possible that I was going off of an earlier version of the benchmark. I'll update accordingly.

Stanimir Simeonoff said...

Isaac, that's the very unoptimized version.
I tried my hand w/o going too deep and shrank it down to around 7-8 sec.
The optimization is here (incl. the sources)

The original code is, indeed, highly unoptimized. Yet the attempt to optimize by using static collections in order to prevent GC (between runs) is somewhat ugly, but I kept it. As Jeremy says, it's a very funny type of benchmark.



Jeremy Manson said...

Actually, the static collections were something I put in there, as the original was using HashMap, and thereby creating millions of Integer objects a second. This was the quickest way I could think of to resolve that problem.
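
The pattern was roughly this (names hypothetical - this is not the actual benchmark code): keep a primitive array alive across runs instead of rebuilding a boxed map on every call.

    // Roughly the pattern: instead of rebuilding a HashMap<BasicBlock,
    // Integer> on every call to findLoops(), keep a primitive array
    // alive across runs and just reset it.
    class LoopFinder {
        private static final int UNVISITED = -1;
        private static int[] number = new int[0];  // reused between runs

        void findLoops(int blockCount) {
            if (number.length < blockCount) {
                number = new int[blockCount];      // grow once, then reuse
            }
            java.util.Arrays.fill(number, 0, blockCount, UNVISITED);
            // ... DFS numbering writes number[block.id] instead of
            // map.put(block, Integer.valueOf(n)) ...
        }
    }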

Isaac Gouy said...

@Stanimir Simeonoff

I'm not sure why you feel the need to teach me Java basics :-)

'A new feature in this version of HotSpot is "tiered" compilation in the Server VM that enables it to start quickly as does the Client VM, while achieving superior peak performance. This feature is enabled by specifying -server and -XX:+TieredCompilation command options.'

Stanimir Simeonoff said...

Jeremy, perhaps you can blog some on proper use of data structures in Java. I believe even "Effective Java" doesn't address the matter properly. Using the right structure is vital for effectiveness. I know it's not related to concurrency, but you do write very well.

The benchmark had it all wrong. HashSet w/o hashCode/equals (although the instances are unique and 'name' would be a good hash). HashMap overall is quite a poor data structure. An IdentityHashMap-backed set would have been better there, although it's guaranteed that the objects are unique, so ArrayList fits best (SimpleLoop children). Most of all: mapping anything to Integer is just bad and inefficient. Using LinkedList tends not to be the optimum in virtually any case, too.
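
E.g. something as trivial as this would do (illustrative only, not the benchmark's actual class):

    // Illustrative: if instances are unique per name, hash on the name
    // instead of falling back to the identity defaults.
    final class SimpleLoopKey {
        final String name;

        SimpleLoopKey(String name) { this.name = name; }

        @Override public boolean equals(Object o) {
            return o instanceof SimpleLoopKey
                && name.equals(((SimpleLoopKey) o).name);
        }

        @Override public int hashCode() {
            return name.hashCode();
        }
    }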

In the end: thanks for posting, it made for a funny Saturday night.

Jeremy Manson said...

"In the end: thanks for posting, it made for a funny Saturday night."

Our efforts, as always, are purely to entertain.

Isaac Gouy said...

1) I'm always entertained by the "The benchmark did not follow the rules of Java benchmarking" excuse.

I'm amused that we should just not count GC time instead of reducing GC, and that JVM startup is so horrendous on your machine that it's a significant portion of even your 9 second measurement :-)

Are you trying to send the message that JVM startup is that bad ;-)

2) "Programmers are more productive in Java (if they know both languages well)."

Do we really know that? Please share.

3) As you said "[t]he benchmark did not use Java collections effectively" and "that made the difference".



With regard to your gloss on "the point that Robert was trying to make" - I haven't been able to find where in his paper he suggests that was the point he was trying to make.

I think the basic problem with the paper is apparent here - "All four implementations stay very close to the formal specification of the algorithm and do not attempt any form of language specific optimization or adaption. ... We believe that this approach ... and allows an almost fair comparison along the dimensions of ..."

What does he do to show that approach allows "an almost fair comparison" - nothing.

It's simply his definition of "fair comparison".

We might think "fair comparison" allows C++ programmers to do things their way and Java programmers to do things their way.

(At least that would have the virtue of reflecting how C++ programmers and Java programmers are likely to behave in a workplace.)

Jeremy Manson said...

I'm not going to dive into this too deeply. I will reiterate that I include time spent interpreting and doing JIT compilation in JVM startup, and that I retracted that statement (as above) when you pointed out that the benchmark does a warmup phase.

I will also point out that my statement that programmers are more productive in Java than in C++ is simply born of my experience with both languages. At Google, I've had a closeup view of a shop that does a LOT of coding in both. I work in our Engineering Productivity division. The level of frustration with C++ is very high. I could go into details about my particular views - the lack of decent development tools; the need for everyone to memorize several volumes of Effective C++; the STL; header files - but if your experience is otherwise, you are welcome not to agree. I think it is Conventional Wisdom (tm), and, while it may be wrong, it is not worth debating in the comment thread for this post.

FWIW, Robert drew the same conclusion when writing the paper (although it might not have been stated thus in the paper).

It is possible that I may have reflected something that isn't explicitly stated in the paper when talking about Robert's point. I know that he feels that Scala has the potential to improve the level of abstraction for programmers, while maintaining at least Java-like performance. Much of his excitement focuses on Scala's ability to embed DSLs, and I'm pretty sure that he does think writing, say, library code in Scala is more complicated.

Isaac Gouy said...

OK.

I'll just finish with the suggestion that your "didn't realize anyone would actually read the paper" quip is probably quite close to what's happened.

Seems like few of us read the paper well; we kind of glanced at it. I needed half-a-dozen attempts before I succeeded in finding Figure 3.

And I'm still not convinced that I understand exactly what the run-time measurements shown in Fig 8. actually correspond to in the Java source code at code.google.com

Jeremy Manson said...

@Isaac - I'm not sure what the status of the code is, either. There are at least three possible versions:

1) Robert's initial pass
2) My initial revisions
3) My subsequent revisions

My initial revisions have hand-built versions of HashMap that permit scalars instead of boxed Integers. My subsequent revisions use ArrayDeque as well. The paper uses (2) as Java Pro; he said he was going to put (3) online, but I haven't checked.

Pretty much everything in Figure 3 originally used boxed values, so you can imagine the GC overhead.

Jeremy Manson said...

I *think* Figure 8 reflects the original code. He also mentions a -XX:+CycleTime flag without noting that it is a Google-specific flag I added.

Pierre Queinnec said...

Hi Jeremy,

Thanks for your post. I tried to find CycleTime in arguments.cpp but couldn't, obviously as per your last comment. Would you care to explain what it does?

Thanks a lot!

Jeremy Manson said...

@Pierre - it uses a System.currentTimeMillis/nanoTime mechanism based on the processor timestamp counter and some magic kernel juju, rather than the overweight mechanism that is the current implementation.

Cliff Click talks in general about the problem that this fixes at 19:45 in this video:

http://www.youtube.com/watch?v=uL2D3qzHtqY

Pierre Queinnec said...

Thanks so much for the info! That's what I thought of when trying to think about "CycleTime"; a quicker currentTimeMillis based on TSC is definitely a big win. Do you think this could land in the 'official' Mercurial?

Jeremy Manson said...

I doubt it. It requires some assumptions about the operating environment that shouldn't get made if you don't control said operating environment. IIRC, Cliff has a good bit in the talk about what it would take to make it widely practicable.

Sebastiano Vigna said...

As soon as I read the paper, I knew there was something horribly wrong. I just changed the collections used in the code to fastutil's (overall 20 changed lines!) and now C++ is not even three times faster.

I could play a bit with the profiler to eliminate obvious bottlenecks, but the point here is that Java's built-in collections are deadly slow on primitive types, and everybody knows it. That's why HPPC and fastutil and GNU Trove exist.

Isaac Gouy said...

I don't suppose Robert Hundt realized that he was providing the world with Google's official assessment of programming languages :-)

Google Rates C++ As The Most Complex, Highest Performing Language

C++ clear winner in Google language tests

Sebastiano Vigna said...

A final comment. After having implemented findSet() recursively, and having replaced the creation of >200000 objects at each call to findLoops() with object reuse, the wall-clock time of Java is *smaller* than that of C++. User time is marginally higher, due to the parallel garbage collector.

Java is faster than C++ on this benchmark.

Jeremy Manson said...

@Sebastiano - Are you looking at the final optimized version of C++? When I tried fastutil, it was:

- Faster than C++'s original version,
- Slower than the handmade data structures I created, and
- Slower than C++ with all of the optimizations.

Sebastiano Vigna said...

Unfortunately I discovered your version only after having made the post. :)

Yes, obviously fastutil is slower, as it still provides a hash table. The set you implemented is an array-based set, well suited for this case, in which the sets are ridiculously small. But your IntegerSet class has linear insertion and query time, and it would never work as a generic data structure as soon as the size of the set grew beyond a few hundred elements.
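
For reference, such an array-backed set is only a few lines. This is a sketch of the idea, not Jeremy's actual IntegerSet:

    // Array-backed int set: O(n) contains/add, no hashing, no boxing.
    // Great for tiny sets, hopeless once they grow.
    final class IntArraySetSketch {
        private int[] elements = new int[8];
        private int size;

        boolean contains(int v) {
            for (int i = 0; i < size; i++) {
                if (elements[i] == v) return true;
            }
            return false;
        }

        boolean add(int v) {
            if (contains(v)) return false;     // linear scan per insert
            if (size == elements.length) {
                elements = java.util.Arrays.copyOf(elements, size * 2);
            }
            elements[size++] = v;
            return true;
        }

        int size() { return size; }
    }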

The point of my implementation was not to write the best possible Java code for this application, as I wrote to Robert, but just to show that by substituting good off-the-shelf components for the bad classes, and changing just a few lines that were trivial garbage-collection mistakes, Java was faster than C++. I modified 48 lines, and 50% of the modifications are just changes of class names. This is a ridiculously small changeset (really, you can't call it "optimization" in any sense), and nonetheless Java was already faster than C++.

Unfortunately the optimized C++ version is not available, as it is based on Google's internal data structures. If you can give me an idea of the relative timings on your hardware, I could come up with some further improvements. It is also important to understand whether the algorithm has been modified, as this is not language optimization—it is changing the algorithm, and the same changes should be applied to the Java version.

But what is bad - really, really bad - is that the net is flooded with IT blogs and sites saying things like "C++ clear winner in Google language tests". This is ridiculous. It's not Google that made the test; it's a single Google employee. And the benchmarks are really, really flawed. And the conclusions are wrong. Unfortunately, Google has built an image that is very similar to the status of Aristotle in the Middle Ages: "ipse dixit". If someone in some way connected to Google said it, it must be true. Nobody will even check the benchmarks... :(

Sebastiano Vigna said...

BTW, I forgot (you know, it's almost 2000 classes :) that fastutil has array-based sets, too. I replaced IntOpenHashSet with IntArraySet in the same positions in which you use your IntegerSet, and I got a 20% improvement (now I'm about 10% slower than your version). The number of lines changed is of course still the same.

Jeremy Manson said...

For what it is worth, Google didn't say it. Robert said it in a paper he recognizes isn't exactly comprehensive. It was just a bit of fun that went too far. Google has no official opinion about which programming language you should use. That's why NativeClient exists. :)

My recollection was that on my machine, the optimized version I first gave Robert ran in about 20 seconds, as did the original C++ code. The optimized C++ code ran in about 3 seconds, and my second optimized version (after adding ArrayDeques and so on) ran in about 9 seconds. As I said, I doubt that this is really worth pursuing further.

Charles said...

"The benchmark did not follow the rules of Java benchmarking. At some point, I should write a post about the rules of Java benchmarking (maybe I have? I don't know). Basically, it timed the full execution of the program, including startup time, JIT compilation time, and GC time."

Why shouldn't it include GC time? As with C++, allocating and deallocating memory (and other resources) has a cost. Ignoring it would make for an uneven comparison.

Jeremy Manson said...

@Charles - Sometimes it should, sometimes it shouldn't. It depends on the benchmark (or, rather, it depends on what you are benchmarking). See my non-existent blog post for details. :)

In this case, it probably should include that overhead, but clearly GC hadn't been taken into account when writing the code.

Neale said...

What I find quite amazing about this sort of comparison is the failure to recognise the benefits of the classloading approach.

Because I can write an aspect to do some load-time weaving on Java, I can apply caching based on class and/or method name pattern matching across a whole application. I can also do that on the basis of picking up certain annotations at runtime.

For testing multi-threaded applications, there are libraries that will force yields between threads by doing bytecode manipulation.

On the JVM this is child's play. For C++, my distant memory of when I was writing graphics engines in it still doesn't enlighten me on how I'd achieve something comparable.
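
For instance, a caching aspect along these lines (a sketch; the pointcut, package names, and cache policy are invented for illustration):

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    import org.aspectj.lang.ProceedingJoinPoint;
    import org.aspectj.lang.annotation.Around;
    import org.aspectj.lang.annotation.Aspect;

    // Illustrative load-time-woven caching aspect: memoizes any
    // single-String-argument method on classes whose names match the
    // pattern. Pointcut and cache policy are made up for the example.
    @Aspect
    public class CachingAspect {
        private final Map<String, Object> cache =
            new ConcurrentHashMap<String, Object>();

        @Around("execution(* com.example..*Service.*(String)) && args(key)")
        public Object memoize(ProceedingJoinPoint pjp, String key)
                throws Throwable {
            String cacheKey = pjp.getSignature().toLongString() + "#" + key;
            Object cached = cache.get(cacheKey);
            if (cached != null) {
                return cached;
            }
            Object result = pjp.proceed();  // run the real method
            if (result != null) {
                cache.put(cacheKey, result);
            }
            return result;
        }
    }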

Rex said...

The benchmark results are somewhat interesting, but I do wonder how they are significantly more illuminating than the various benchmarks on the Computer Language Benchmarks Game. Do you know if Robert was simply unaware of the site, or if there was a particular question that could not be answered by referring to the results there? I couldn't find a clear answer to this in the paper.

Jeremy Manson said...

@Rex - Robert was just having a bit of fun. I don't think he meant for any of this to be taken as seriously as it was.

But in direct answer to your question: he never mentioned that site when discussing this paper with me.