Tuesday, February 12, 2008

Causality and the Java Memory Model

NB: I started writing this last August, and sent it to a friend of mine for thoughts. I just realized that he never got back to me on it, and I never posted it. I don't much feel like editing it now, so here it is in its original glory

Okay, I know. It is nearly impossible to read and understand the causality bits of the Java memory model. That's all that stuff about "Executions and Causality Requirements" on pages 568 – 570 of the Java Language Specification. Bill Pugh, Sarita Adve and I wrote it. Basically, in the last three years, there have been three kinds of reactions to it:

  1. ZOMGWTF!? (I'm a normal human being. Give me a break.)

  2. This is incomprehensible gibberish. (I'm a pretty smart cookie and I think that everything I can't understand in three seconds is worthless.)

  3. Okay, I think I get it, but I have some questions. (I'm a pretty smart cookie and I spent several days trying to understand it.)

I find that the last group usually pretty much understands it, with a few minor clarifications. They are also overwhelmingly graduate students, for what it's worth. They have found a few flaws in it, which is also nice.

(Wait, I hear you ask, nice? Alas, I must admit that I never thought the JMM would be bulletproof. It was no more likely to be bulletproof than version 1.0 of a major software release. The main goal was to create a memory model which described the problem space completely, was practical for a real programming language and was rigorous (so that it could be completely understood with sufficient effort). The fact that it was possible to find a few flaws in specific cases (rather than by saying "this is a complete mess", which used to be the case) means that people could understand it and reason about it, and that therefore the mission was accomplished. No one had ever done that for a real language before, and I'm proud of the work we did.)

That still leaves us with the original problem, which is that it is really, really hard to understand. What I'm going to do in this post is to try to explain the key intuition behind it very briefly. If it still doesn't make sense, then tell me about it, and I'll try to clean it up.

(The prerequisite to understanding what I am about to say is an understanding of the happens-before order. If you understand that, and you understand that the lack of that in your program can lead to weird effects, then the rest of the post may interest you. If you don't understand that, this is not the post for you.)

There are two basic questions that we need to address to understand the memory model:

Why is the happens-before relationship not sufficient?

The main guarantee we want to give programmers is something called "Sequential Consistency for Data Race Free programs" (SC for DRF, for short). What this means, in brief, is that if your program is free from data races, then you won't see any weird effects. If you are still reading, you know that a data race is two accesses, at least one of which is a write, to the same variable, which are not ordered by happens-before. Now, consider the example:


0: x == y == 0

Thread 1:
1: r1 = x;
   if (r1 == 1)
2:   y = 1;

Thread 2:
3: r2 = y;
   if (r2 == 1)
4:   x = 1;

If all of this program's statements were to execute, there would be happens-before relationships between 0 and 1, 1 and 2, 0 and 3, 3 and 4. But there would be no happens-before relationships between Thread 1 and Thread 2.

Now, an intuitive model, built entirely around happens-before, basically says that if you are trying to figure out the value that can be seen by a read, you can either look at the last write to a given variable via happens-before, or a write that is in a data race with the read. That would mean that the read at 3 could see the write at 2, and the read at 1 can see the write at 4, because those writes are in a data race with those reads.

We can now say that the read of x at 1 sees 1, and the read of y sees 1, and have that be legal under the happens-before model.

Do you see what happened there? The read of x at 1 saw the value 1, then wrote 1 to y at 2. The read of y at 3 saw the value 1, and wrote 1 to x at 4. That justified the read of x at 1 seeing the value 1.

We can't allow this, because it violates the SC for DRF property. A program is data race free if, under all SC executions -- those are the executions without any weird reorderings -- there are no data races. There are no data races under any SC execution of this program -- the reads will both return 0, so the writes won't occur, so there is no data shared between the two threads. Therefore, this program is data race free, and it has to have only SC executions. The execution I just described -- where the writes occur -- is not a SC execution.

Since the happens-before model allows this example, it doesn't enforce the SC for DRF property. We therefore had to make some additional guarantees.

Um... okay. What are the additional guarantees?

Now that we have decided that there are actions that are not allowed to happen --in the above example, the writes are not allowed to occur -- the trick is to figure out when an action is allowed to occur, and when it is not.

Now, if you are writing code, and want to figure out if an action can occur, what do you do? Well, you look at all of the actions that came before it. So let's look at the actions that happen-before the action you want to justify, and determine whether those actions justify it.

In the above example, we look at instruction 2, and then look at the actions that happen-before it, which are 1 and 0. 1 and 0 don't justify 2 happening, so it is not allowed to happen.

Let's try this on a different example. This is a good one:

0: x == y == 0

Thread 1:
1: r1 = x;
2: y = 1;

Thread 2:
3: r2 = y;
4: x = r2;

The interesting result here is r1 == r2 == 1, which is possible in a real execution if Thread 1's actions are reordered by the compiler. Basically, the write to y occurs, then Thread 2 reads the 1 from y and writes it to x, and then Thread 1 reads that value for x.

The first instruction we want to execute is the one labeled "2". If you look at the actions that happen-before 2 (again, 1 and 0), they will always lead into 2 being executed. So 2 is allowed to happen.

That's well and good for action 2, but the interesting result requires that the value 1 be read from y in instruction 3. This isn't justifiable based on what happens-before 3 -- if we are only looking at what happens-before 3, the only value that can be read from y is 0.

So we need some way of justifying the read of y in 3 returning the value 1. It can only do this if something wrote the value 1 to y. But wait -- we've already said that it was justifiable to write 1 to y! So if, in addition to being able to justify actions based on the actions that happen-before them, we also say that it is possible to justify actions based on actions you have already justified, then we have now allowed our result!

So there is the intuition behind the model. When you want to justify the fact that a read returns the value of a write, you can do so if:

a) That write happened before it, or
b) You have already justified the write.

The way that the model works is that you start with an execution that has all of the actions justified by property a), and you start justifying actions iteratively. So, in the second example above, you create successive executions that justify 0, then 2, then 3, then 4, then 1.

This definition has the lovely property of being reasonably intuitive and also guaranteeing SC for DRF programs.

That's the most complicated bit of the model (perhaps excluding the stuff about observable actions and infinite executions, but they are much less important).

I hope that helps.