Friday, June 3, 2011

Scala / Java Shootout

I should probably comment on this paper from my colleague Robert Hundt, where he writes the same benchmark in Java, Scala, Go, Python and C++ (TL;DR here). Short version: Initially, C++ was faster than Scala, and Scala was faster than Java. I volunteered to make some improvements, which ended up with the C++ and Java implementations being about the same speed. Some C++ hackers then went and improved the C++ benchmark by another ~6x, and I didn't think I should go down that road.

Imagine my surprise when I got this email from a friend of mine this morning:

Date: Fri, 3 Jun 2011 03:20:48 -0700
Subject: Java Tunings
To: Jeremy

"Note that Jeremy deliberately refused to optimize the code further,
many of the C++ optimizations would apply to the Java version as
well."

Slacker.


Gee, thanks for making me look good, Robert. (Actually, I had approved that language, but I didn't realize anyone would actually read the paper. :) )

Anyway, I was not just being obdurate here; there was a thought process involved. I should probably document it.
  1. The benchmark did not follow the rules of Java benchmarking. At some point, I should write a post about the rules of Java benchmarking (maybe I have? I don't know). Basically, it timed the full execution of the program, including startup time, JIT compilation time, and GC time. This isn't what you are supposed to do in Java or Scala, so I was very uncomfortable with the experimental setup. In 2011, if you care about startup time, then you probably shouldn't be using Java. The program ran very briefly, so the effect was measurable (once I got it to run fast enough).ETA: I am informed that later versions of this benchmark took this into account. Apologies for the confusion.

  2. It's kind of silly to compare C++ with Java. The paper is about Scala, of course, but the immediate internal reaction at Google was a discussion of how much worse Java was than C++. As it happens, they really do target two different niches. Programmers are more productive in Java (if they know both languages well). It's easier to test Java code. It scales to large code sizes better. The tradeoff is that you are only going to get gcc -O2 level performance, and you have to be careful about garbage collection. In many, many cases, this tradeoff isn't a big deal. A single-threaded performance arms race between the two doesn't make sense. This is the primary reason I stopped trying to make improvements.

  3. The benchmark did not use Java collections effectively. It was generating millions of unnecessary objects per second, and the runtime was swamped by GC overhead. In addition to the collections-related fixes I made (as described in the paper), the benchmark also used a HashMap<Type, Integer> when it should have just stored a primitive int in Type; it also used a LinkedList where it should have used an ArrayDeque. In fact, after being pressed on the matter by other Googlers (and seeing the response the C++ version got), I did make these changes to the benchmark, which brought the numbers down by another factor of 2-3x. The mail I sent Robert about it got lost in the shuffle (he's a busy guy), but he later said he would update the web site.


It is this last point that made the difference between Java and Scala. They use the same runtime, so the compiled performance is going to be roughly the same. However, the choice of data structure matters! From the perspective of Scala vs. Java performance, the takeaway isn't that "Java is slow", it is that "you can write bad code in any language" (with apologies to Robert, who is one of the smartest guys around, but isn't a Java programmer).

It is also "you can write bad code in any language" that inspired me to make any improvements. Robert was not a C++ newbie, but he was a Java newbie, so I thought that the comparisons that this was inspiring between C++ and Java performance basically made no sense. I didn't make the changes to demonstrate that Java could do better than C++, as I know that's not true. I made them to show that the playing field wasn't quite as uneven as people tend to think, and that algorithmic changes tend to make more difference than language changes.

These observations don't actually contradict the point that Robert was trying to make. In fact, they underline it. He was trying to say that Scala is very newbie-friendly when you want performance, as it balances the two well. That was clearly true in this case: as a newbie in both Java and Scala, Robert chose more performance-friendly data structures in Scala than in Java. In fact, in large measure, Scala chose the data structures for him, which is probably even better.

The other languages strike terrible balances in this area. Python and Go aren't on the performance radar at all (Python isn't compiled, and Go is young, yet). C++ isn't newbie friendly at all, ever.

Robert, of course, is a very odd choice for newbie, as he is an experienced compiler writer. Havlak is also an odd choice for newbie program, as it is a relatively sophisticated compiler benchmark. Perhaps the moral turns out to be that newbies who are also experiened compiler writers should look carefully at Scala?

Friday, May 13, 2011

Java Puzzlers@Google I/O

Josh Bloch and I gave one of his Java Puzzlers talk at Google I/O this year. If you hate Java, you can waste a perfectly good hour listening to us make unfunny jokes at its expense.

Tuesday, April 5, 2011

Cliff Click in "A JVM Does What?"

Cliff Click gave a very good talk at Google last week. For your viewing enjoyment:





The slides are available here.

(For those who watched it: The inventors of bytecode were in the audience. I think he was just used to making that joke elsewhere.)

Friday, August 27, 2010

JavaOne Talk Cancelled

Because I am a Google employee, my talks are not happening. Apologies to any interested parties.

I'm going to try to give the talks in another venue, or at least to use the blog as a venue for some of the topics I wanted to cover.

Monday, August 2, 2010

Why Many Profilers have Serious Problems (More on Profiling with Signals)

This is a followon to a blog post I made in 2007 about using signal handling to profile Java apps. I mentioned in another post why this might be a good idea, but I wanted to expand on the theme.

Hiroshi Yamauchi and I are giving a talk at this year's JavaOne. Part of this talk will be about our experiences writing profilers at Google, and in preparing for it, I realized my previous entry on this subject only told half the story.

The topic of the original post is that there is an undocumented JVMTI call in OpenJDK that allows you to get a stack trace from a running thread, regardless of the state of that thread. In Unix-like systems, you can use a SIGPROF signal to call this function at (semi-)regular intervals, without having to do anything to your code. The followon post, intended to describe why you would use this, described a couple of things that are true:

  • That it tells you what is actually running on your CPU, not just what might be scheduled, which is what profilers that take a snapshot of every running thread do. This increases its accuracy dramatically. For example, there are many profilers that report a great deal of your time spent in blocking mechanisms, simply because threads frequently become scheduled as they exit blocking functions. However, those sorts of threads are all "ready to be scheduled" rather than "actually running". As a result, CPU time is charged to those threads inaccurately.

  • That it reports time spent in GC and in other VM-specific activities (like JIT compiling). At Google, we find that substantial amounts of our users' CPU time is spent in VM activities. In fact, GC brings the occasional Google service to its knees (not mentioning any names).

I now realize that I left all of the JIT compiler-related reasons out of that post. This post is an attempt to repair that problem, and describe why profiling with AsyncGetCallTrace and signals is a better approach than the typical state of the art.

Sampling profilers that take thread dumps from individual threads often do so by injecting thread stack trace gathering code into your code. This is suboptimal in many ways:

  • It introduces overhead into your code. This is the most obvious way. As soon as you introduce overhead, you are changing the performance characteristics.

  • Changing the size of the code changes the optimizing decisions made by the JIT. For example, one of the most important performance improvements made by a just-in-time compiler is when it inlines small methods. It won't inline if the method gets too large. Introducing sampling into the methods changes the size, and therefore affects inlining decisions, changing the performance characteristics of your code.

  • Changing the size of the code affects code layout. For relatively obvious reasons of caching and memory alignment, the placement of your code can affect its performance. Bigger code often means worse performance.

At this point, you might be thinking that the best thing to do is use something like Thread.getAllStackTraces and take your chances with its inaccuracy (which is what the aforementioned sampling profilers do). However, there is one more factor, which is significant in any of the documented and built-in stack trace sampling methods - that system-wide stack trace sampling happens for any given thread when it is at a safe point. Safe points are places in the code that the VM knows it can do a whole host of things - like initiate garbage collection - safely. The location of these safe points is determined by the JIT. It often puts them in places that aren't ideal for CPU profiling. For example, there may be a hot loop in your code that the JIT decides should not be interrupted by a safe point. If you use most standard profilers, this hot loop will never get profiled! As a result, the placement of safe points affects the sampling quality of standard sampling profiling techniques.

(Another interesting point, as made by Todd Mytkowicz and Amer Diwan of the University of Colorado: since JIT behavior really depends on everything in the system, and most profilers are part of the system, the decision about where to put a safe point will end up depending on which profiler you are using. This can make the results of the profilers clash violently: because of their differing behaviors, the safe points end up in different places, and the profilers end up tracking different places. See Mytkowicz and Diwan's recent PLDI paper for details.)

All of this JIT talk basically mirrors the Heisenberg Uncertainty Principle, but for profiling. You can either have exact information, or you can have your code behave as it is supposed to behave, but not both. You need to minimize the effects, so you want a profiler that doesn't interfere with or depend on JIT decisions at all. AsyncGetCallTrace fits this bill - you call it, and it returns a result. You don't call it directly from your code. It doesn't wait for a safe point. It doesn't change your code. The JIT doesn't care.

ETA: I believe that the profiler bundled with Sun Studio uses the AsyncGetCallTrace method call, but I'm not exactly sure how.

Sunday, February 7, 2010

Garbage Collection, [Soft]References, Finalizers and the Memory Model (Part 2)

In which I ramble on at greater length about finalizers and multithreading, as well as applying that discussion to SoftReferences. Read Part 1 first. It explains why finalizers are more multithreaded than you think, and why they, like all multithreaded code, aren't necessarily going to do what you think they should do.

I mentioned in the last entry that a co-worker asked me a question about when objects are allowed to be collected. My questioner had been arguing with his colleagues about the following example:

Object f() {
Object o = new Object();
SoftReference<Object> ref = new SoftReference<Object>(o);
return ref.get();
}
He wanted to know whether f() was guaranteed to return o, or whether the new Object could be collected before f() returned. The principle he was operating under was that because the object reference was still on the stack, the object would not be collected. Those of you who read the last entry will know better, of course. To put it in the words of the Java Language Specification, optimizing transformations of a program can be designed that reduce the number of objects that are reachable to be less than those which would naively be considered reachable.

In practice, what this means is that the VM can:
  1. Decide the variable o is no longer used after the last time it appears in the program,

  2. Clear the SoftReference immediately after it is constructed, and

  3. Return null

Those of you who read my entry on how Hotspot decides to clear SoftReferences will know that this is extremely unlikely in Hotspot, where a SoftReference always survives for at least one collection. But in other VMs? I have no idea.

Anyway, the obvious question is: how do you prevent the collector from collecting this object? There are a few ways:

  1. The obvious answer is to return o instead of ref.get() (this was harder in the original code, which I've simplified to make my point).

  2. Another way is way is to make o reachable from a static field. That makes the object rather obviously reachable.

  3. A third way is to give the object a finalizer, and do synchronization in that finalizer, and then do synchronization on that
    object when you want the object to stay alive.

That last one probably requires a few more words. We wrote that into the spec so that the Finalizer Guardian idiom would work. The Finalizer Guardian idiom is described in Josh Bloch's invaluable book Effective Java (lgt amazon. Seriously, if you don't have a copy, go get one).

Anyway, we put a nice little exegesis on the link between the Finalizer Guardian and this rule into the language spec, so just go ahead and read it there.

Having said all of that, I will agree that this is a massive pain. It would be nice if keeping objects reachable-and-not-collected were a little easier. Doug Lea has proposed a method in his new Fences API called reachabilityFence. This method would keep an object reachable (in a garbage collection sense) until execution reached that point in the code. I'm on the fence about the Fences API, but I think that this particular method (which he alternatively called keepAlive) would be a useful tool.

Sunday, January 31, 2010

Garbage Collection, References, Finalizers and the Memory Model (Part 1)

A little while ago, I got asked a question about when an object is allowed to be collected. It turns out that objects can be collected sooner than you think. In this entry, I'll talk a little about that.

When we were formulating the memory model, this question came up with finalizers. Finalizers run in separate threads (usually they run in a dedicated finalizer thread). As a result, we had to worry about memory model effects. The basic question we had to answer was, what writes are the finalizers guaranteed to see? (If that doesn't sound like an interesting question, you should either go read my blog entry on volatiles or admit to yourself that this is not a blog in which you have much interest).

Let's start with a mini-puzzler. A brief digression: I'm calling it a mini-puzzler because in general, for puzzlers, if you actually run them, you will get weird behavior. In this case, you probably won't see the weird behavior. But the weird behavior is perfectly legal Java behavior. That's the problem with multithreading and the memory model — you really never know what the results of a program will be from doing something as distasteful as actually running it.

Anyway, suppose you have a class that looks like this:

class FinalizableObject {
int i; // set in the constructor
int j; // set by the setter, below
static int k; // set by direct access
public FinalizableObject(int i) {
this.i = i;
}
public void setJ(int j) {
this.j = j;
}
public void finalize() {
System.out.println(i + " " + j + " " + k);
}
}

And then you use this class, thus:

void f() {
FinalizableObject fo = new FinalizableObject(1);
fo.setJ(2);
FinalizableObject.k = 3;
}

Let's say that by some miracle the finalizer actually runs (Rule 1 of why you don't use finalizers: they are not guaranteed to run in a timely fashion, or, in fact, at all). What do you think the program is guaranteed to print?

Those of you who are used to reading these entries will realize immediately that unless they actually already know the answer, they have no idea. Let's try to reason it out, then.

First, we notice that the object reference fo is live on the stack when all three variables are set. So, the object shouldn't get garbage collected, right? The finalizer should print out 1 2 3, yes?

Would I have asked if that were the answer?

It turns out that the VM, as usual, is going to play some tricks here. In the words of the JLS, optimizing transformations of a program can be designed that reduce the number of objects that are reachable to be less than those which would naively be considered reachable. What this means is that the VM is going to make your object garbage sooner than you think.

The VM can do a few things to effect this (yes, this is the correct spelling of effect). First, it can notice that the
object is never used after the call to setJ, and null out the reference to fo immediately after that. It's reasonably clear that if the finalizer ran immediately after that, you would see 1 2 0.

That's not the end of it, though. The VM can notice that:

  • This thread isn't using the value written by that write to j, and

  • There is no evidence that synchronization will make this write visible to another thread.

The VM can then decide that the write to j is redundant, and eliminate it that write altogether. Woosh! You get 1 0 0.

At this point, you are probably expecting me to say that you can also get 0 0 0, because the programmer isn't actually using the write to i, either. As a matter of fact, I'm not going to say that. It turns out that the end of an object's constructor happens-before the execution of its finalize method. In practice, what this means is that any writes that occur in the constructor must be finished and visible to any reads of the same variable in the finalizer, just as if those variables were volatile. This paragraph originally read incorrectly

The immediate question is, how does the programmer avoid this insanity? The answer is: don't use finalization!

Okay, that's not enough of an answer. Sometimes you need to use finalization. There's a hint several paragraphs up. The finalizer takes place in a separate thread. It turns out that what you need to do is — exactly what you would do to make the code thread-safe. Let's do that, and look at the code again.

class FinalizableObject {
static final Object lockObject = new Object();
int i; // set in the constructor
int j; // set by the setter, below
static int k; // set by direct access
public FinalizableObject(int i) {
this.i = i;
}
public void setJ(int j) {
this.j = j;
}
public void finalize() {
synchronized (lockObject) {
System.out.println(i + " " + j + " " + k);
}
}
}

And then you use this class, thus:

void f() {
synchronized (lockObject) {
FinalizableObject fo = new FinalizableObject(1);
fo.setJ(2);
FinalizableObject.k = 3;
}
}

The finalizer is now guaranteed not to execute until all of the fields are set. When that sucker runs, you will see 1 2 3.

Oddly, I've been writing for almost an hour and I haven't gotten to my coworker's question yet. In the interests of brevity, I'll make this a series. More later.