Thursday, August 30, 2007

Atomicity, Visibility and Ordering

(Note: I've cribbed this from my doctoral dissertation. I tried to edit it heavily to ease up on the mangled academic syntax required by thesis committees, but I may have missed some / badly edited in places. Let me know if there is something confusingly written or just plain confusing, and I'll try to untangle it.)

There are these three concepts, you see. And they are fundamental to correct concurrent programming. When a concurrent program is not correctly written, the errors tend to fall into one of the three categories: atomicity, visibility, or ordering.

Atomicity deals with which actions and sets of actions have indivisible effects. This is the aspect of concurrency most familiar to programmers: it is usually thought of in terms of mutual exclusion. Visibility determines when the effects of one thread can be seen by another. Ordering determines when actions in one thread can be seen to occur out of order with respect to another. Let's talk about them.

Atomicity

Everyone doing any serious concurrent programming knows what atomicity is (or will when I describe it) — I'm just putting it in for completeness's sake. If an action is (or a set of actions are) atomic, its result must be seen to happen "all at once", or indivisibly. Atomicity is the traditional bugbear of concurrent programming. Enforcing it usually means using locking to enforce mutual exclusion. To see atomicity in action (or in inaction, perhaps), consider this code:

class BrokenBankAccount {
  private int balance;

  synchronized int getBalance() {
    return balance;
  }

  synchronized void setBalance(int x) 
    throws IllegalStateException {
    balance = x;
    if (balance < 0) {
      throw new IllegalStateException("Negative Balance");
    }
  }

  void deposit(int x) {
    int b = getBalance();
    setBalance(b + x);
  }

  void withdraw(int x) {
    int b = getBalance();
    setBalance(b - x);
  }
}

Since all accesses to the shared variable balance are guarded by locks, this code is free of what are called data races. A data race occurs when two threads access the same variable concurrently, at least one of those accesses is a write, and the accesses are not ordered by synchronization (or volatile, or the like). When code is free from data races, we say it is correctly synchronized.
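
For contrast, here is a minimal sketch (the class name is mine) of code that does contain a data race:

class Racy {
  private int x; // not volatile, no locking

  void writer() { x = 1; }   // called by one thread
  int reader() { return x; } // called concurrently by another thread:
                             // an unordered concurrent read and write
                             // of the same variable is a data race
}

Our bank account, though, has no such unordered accesses: it is correctly synchronized. So the code is correct, right?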

No, of course not. This code is not at all correct, in the sense that it doesn't necessarily do what we want it to do. Think about what happens if one thread calls deposit(5) and another calls withdraw(5), starting from an initial balance of 10. Ideally, at the end of these two calls, there would still be a balance of 10. However, consider what would happen if:
  1. The deposit() method sees a value of 10 for the balance, then

  2. The withdraw() method sees a value of 10 for the balance and withdraws 5, leaving a balance of 5, and finally

  3. The deposit() method uses the balance it originally saw (10) to calculate a new balance of 15.
As a result of this lack of "atomicity", the balance is 15 instead of 10. This effect is often referred to as a lost update, because the withdrawal is lost. A programmer writing multi-threaded code must use synchronization carefully to avoid this sort of error. In Java, declaring the deposit() and withdraw() methods synchronized ensures that locks are held for their duration: the actions of those methods will be seen to take place atomically.
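
For concreteness, here is a sketch of the repaired version (the class name is mine). The lock is held across each entire read-modify-write, so the interleaving above becomes impossible:

class FixedBankAccount {
  private int balance;

  synchronized int getBalance() {
    return balance;
  }

  synchronized void setBalance(int x)
    throws IllegalStateException {
    balance = x;
    if (balance < 0) {
      throw new IllegalStateException("Negative Balance");
    }
  }

  // Java locks are reentrant, so the nested calls to the
  // synchronized getBalance() and setBalance() are fine.
  synchronized void deposit(int x) {
    setBalance(getBalance() + x);
  }

  synchronized void withdraw(int x) {
    setBalance(getBalance() - x);
  }
}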

Atomicity, of course, is only guaranteed when all the threads use synchronization correctly. If someone comes along and decides to read the balance without acquiring the lock, they can see all sorts of confusing results.

Atomicity failures are the most common problem people think about when they use synchronization. It is a common mistake to think that they are the only problem; they are not. Here's another one:

Visibility

What's visibility, you ask? Well, if an action in one thread is visible to another thread, then the result of that action can be observed by the second thread. To guarantee that the results of one action are observable to a second action, you have to use some form of synchronization to make sure that the second thread sees what the first thread did.

(Note: when I say synchronization in this post, I don't actually mean locking. I mean anything that guarantees visibility or ordering in Java. This can include final and volatile fields, as well as class initialization and thread starts and joins and all sorts of other good stuff.)

Here's an example of code with visibility problems:

class LoopMayNeverEnd { 
  boolean done = false; 

  void work() { 
    while (!done) { 
      // do work 
    } 
  } 
 
  void stopWork() { 
    done = true; 
  } 
} 


In this code, imagine that two threads are created; one thread calls work, and at some point, the other thread calls stopWork on the same object. Because there is no synchronization between the two, the thread in the loop may never see the update to done performed by the other thread. In practice, this may happen if the compiler detects that no writes are performed to done in the first thread; the compiler may decide that the program only has to read done once, transforming it into an infinite loop.

(By the way, "compiler" in this context doesn't mean javac — it actually means the JVM itself, which plays lots of games with your code to get it to run more quickly. In this case, if the compiler decides that you are reading a variable that you don't have to read, it can eliminate the read quite nicely. As described above.)

To ensure that this does not happen, you have to use a mechanism that provides synchronization between the two threads. In LoopMayNeverEnd, if you want to do this, you can declare done to be volatile. Conceptually, all actions on volatiles happen in a single order, where each read sees the last write in that order. In other words, the compiler can't prevent a read of a volatile from seeing a write performed by another thread.
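
Concretely, the fix is a one-word change (sketched here under a new class name):

class LoopWillEnd {
  // volatile guarantees that the read of done in work()
  // eventually sees the write performed in stopWork()
  volatile boolean done = false;

  void work() {
    while (!done) {
      // do work
    }
  }

  void stopWork() {
    done = true;
  }
}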

There is a side issue here; some architectures and virtual machines may execute this program without providing a guarantee that the thread that executes work will ever give up the CPU and allow other threads to execute. This would prevent the loop from ever terminating because of scheduling, not because of a lack of visibility guarantees. This is typically called cooperative multithreading.

There's one more problem that crops up:

Ordering

Ordering constraints describe what order things are seen to occur. You only get intuitive ordering constraints by synchronizing correctly. Here's an example of when ordering problems can bite you:

class BadlyOrdered {
  boolean a = false;
  boolean b = false;

  void threadOne() {
    a = true;
    b = true;
  }

  boolean threadTwo() {
    boolean r1 = b; // sees true
    boolean r2 = a; // sees false
    boolean r3 = a; // sees true
    return (r1 && !r2) && r3; // returns true
  }

}

Consider what happens if threadOne() gets invoked in one thread and threadTwo() gets invoked on the same object in another. Would it be possible for threadTwo() to return the value true? If threadTwo() returns true, it means that the thread saw both updates by threadOne, but that it saw the change to b before the change to a.

Well, this code fragment does not use synchronization correctly, so surprising things can happen! It turns out that Java allows this result, contrary to what a programmer might have expected.

The assignments to a and b in threadOne() can be seen to be performed out of order. Compilers have a lot of freedom to reorder code in the absence of synchronization; they could either reorder the writes in threadOne or the reads in threadTwo freely.

How do you fix it? Synchronize your code carefully! In this case, you can guard threadOne and threadTwo with the same lock, or you can declare both a and b to be volatile, and get the ordering you want.
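
Here is the volatile version of the fix (a sketch; the class name is mine):

class ProperlyOrdered {
  volatile boolean a = false;
  volatile boolean b = false;

  void threadOne() {
    a = true;
    b = true; // volatile writes cannot appear reordered to a reader
  }

  boolean threadTwo() {
    boolean r1 = b; // if this read sees true...
    boolean r2 = a; // ...then this read must see true as well...
    boolean r3 = a;
    return (r1 && !r2) && r3; // ...so this can never return true
  }
}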




So keep an eye peeled for these things, and don't assume that your intuition about the way your code is executed will be correct. Everyone knows about atomicity concerns, but many fewer people know about visibility and ordering. Keep your nose clean, and you may get out of this thing alive...

Tuesday, August 21, 2007

Volatile Does Not Mean Atomic!

Here's a question I get a lot. Is the following code snippet "thread safe"?


volatile int v = 0;

Thread 1:
v++;

Thread 2:
v--;

The question asks what the possible results of this code are; the questioners usually want the answer "v can only be 0 after this code is run".

This isn't the way it works! If you do an increment of a volatile integer, you are actually performing three separate operations:

  1. Read the integer to a local.

  2. Increment the local.

  3. Write the integer back out to the volatile field.


So what you really have is this:

volatile int v = 0;

Thread 1:
r1 = v;
r2 = r1 + 1;
v = r2;

Thread 2:
r3 = v;
r4 = r3 - 1;
v = r4;

So, if Threads 1 and 2 both read v and see the value 0, then Thread 1 will write 1 to it and Thread 2 will write -1 to it. You are not guaranteed to see the value 0!

If you want an atomic increment (or decrement), you have to use the java.util.concurrent.atomic classes, which allow you to create objects that represent numbers that can be incremented or decremented atomically. The VM is smart enough to replace the objects with plain ol' numbers (a process that I claim is called intrinsification), which it then manipulates with atomic machine instructions.
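
For example (a minimal sketch; the class is mine): if Thread 1 calls increment() once and Thread 2 calls decrement() once on the same Counter, get() is guaranteed to return 0 after both have finished.

import java.util.concurrent.atomic.AtomicInteger;

class Counter {
  private final AtomicInteger v = new AtomicInteger(0);

  void increment() { v.incrementAndGet(); } // an indivisible read-modify-write
  void decrement() { v.decrementAndGet(); }
  int get() { return v.get(); }
}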

So beware!

ETA: There was a bit of confusion about atomicity after I posted this. For more on atomicity, visibility and ordering, check out this post.

Tuesday, August 14, 2007

Java Puzzlers at Google

Those of you who attend JavaOne will know Josh Bloch, Neal Gafter and Bill Pugh's series of highly entertaining Java Puzzlers. Now, for the first time, one of these talks is available to people who are not JavaOne attendees: Josh and Bill gave this year's Java Puzzlers talk at Google.

I wouldn't dare spoil the fun by posting any of the puzzlers or their solutions (although Josh blows the surprise on one of them). This is definitely worth the time of anyone who does any programming in Java (and has 1 hour and 14 minutes to kill).

I would be remiss if I didn't point out the Java puzzlers web site, where a book is available.

Wednesday, August 1, 2007

Mark Moir on Transactional Memory

This is another one of those Google talks. This time, it is Mark Moir talking about the transactional memory work he is doing at Sun. It is very useful if you are interested in listening to a strong statement about building hybrid software / hardware transactional memory systems.

As someone who has written papers on transactional memory systems, I am still unconvinced that they are the panacea that some of their supporters seem to claim (not Mark, as far as I can tell). Pluses:

a) Deadlock avoidance (if the semantics are chosen correctly),

b) More easily programmed non-blocking algorithms (if the system is implemented incredibly carefully).

Minuses:

a) They are still prey to the other problems of parallel shared memory systems, like data races and atomicity problems,

b) I have yet to hear a convincing argument as to what their actual semantics should be, especially as regards nested transactions, but even for ordinary transactions.

One language design (which shall remain nameless) had a built-in transactional mechanism. When I asked the designer how it worked, he said that they implemented it with a single big lock, and hoped to improve the performance later. This basically throws out any atomicity guarantees, and is also a potentially fatal performance flaw. It seems to me that before TM advocates can talk about benefits for programmability, they have to straighten out what it means to program one of these things.

I could go on about this for a very long time, but there really isn't much of a point. I'm actually on a major tangent from the main point, which is that Mark is doing very good work, and (wisely) avoided most of these sorts of questions in the talk.

Monday, June 25, 2007

Signals and Java

Ooookay. You are now, presumably, fresh from reading the original post about using AsyncGetCallTrace and the more recent post about how to use SIGPROF in C++. The next obvious question is...

Can I Register a Signal Handler in Java?

Um... Kind of!

There is an undocumented and unsupported interface called sun.misc.Signal present in most Java implementations. For those of you who have downloaded the JDK, the files live in /src/share/classes/sun/misc.

If you import sun.misc.Signal and sun.misc.SignalHandler, you can define a signal handler like this:

// Handles SIGHUP
Signal.handle(new Signal("HUP"), new SignalHandler() {
  // Signal handler method
  public void handle(Signal signal) {
    System.out.println("Got signal " + signal);
  }
});

You can raise a signal like this:

Signal.raise(new Signal("HUP"));

You can even register a native signal handler with the Signal.handle0 method (I'll leave that as an exercise for the reader).

Having said that, none of this helps us set timers or do profiling, really. It's just an interesting aside.

More about profiling with SIGPROF

First, read this post about profiling with JVMTI and SIGPROF.

A poster on that entry asked me to elaborate a little on exactly what happens when the timer goes off and the signal is raised. In this post, I'll explain that in a little more detail. First, I'll back up a little.

Edited to add: Okay, I wrote this whole thing, and very, very briefly answered the actual question at the end. So, if you are interested in the answer to the actual question, and not a long discourse on how to use SIGPROF, look at the end of this entry.

What is a Signal?

A signal is a very basic mechanism used to notify a process of unusual behavior on UNIX-like systems; it can be delivered synchronously or asynchronously. If a process divides by zero, for example, a SIGFPE (Floating Point Exception) is synchronously sent to that process. Signals can also be used asynchronously; for example, one process can send another a signal. The receiving process determines how it wants to handle that signal when it receives it: it can either register a signal handler to handle that signal, or let the default action occur.

On most modern UNIX systems, there are 32 signals. If you want to see the full list on a glibc system, see the header file /usr/include/bits/signum.h. They each map to a number; SIGKILL, for example, is number 9. You can use the command line program "kill" to send a signal to any process; "kill -9 <pid>", for example, sends a SIGKILL. For most signals, the process will take the default action (for SIGKILL, terminating the process) unless it has registered a signal handler. (SIGKILL itself is special: it cannot be caught, so the default action always occurs.)

SIGPROF is the one we care about for the purposes of this post. It is used for sample profiling. UNIX systems can be set up to send this signal at a regular interval; the application can register a handler to record profiling information in the handler.

How Do I Register a SIGPROF Signal Handler?

On Linux, there are several steps involved in registering your own signal handler. The basic idea is that a sigaction structure needs to be passed to the sigaction system call. Warning: I am not making a pretense that this is complete. Many man pages should be studied before attempting this. On Linux, a sigaction looks like this:

struct sigaction {
  void (*sa_handler)(int);
  void (*sa_sigaction)(int, siginfo_t *, void *);
  sigset_t sa_mask;
  int sa_flags;
  void (*sa_restorer)(void);
};

The first field, sa_handler, is the handler that you want to define. We're only going to fill in that one — if you are interested in the others, then look at the sigaction man page.

You define the handler by declaring a void function that takes a single integer argument:

void handler(int signal) {
  printf("received signal %d", signal);
}

You then define your sigaction struct, and pass it to sigaction:

struct sigaction sa;
sa.sa_handler = &handler;
sa.sa_flags = 0;          // sa is a stack variable, so clear the flags
sigemptyset(&sa.sa_mask); // Oh, just look it up.
struct sigaction old_handler;
sigaction(SIGPROF, &sa, &old_handler);

If there was already a registered handler, it will be in old_handler. You then set a timer to go off at a specified interval, and send your process a SIGPROF:

static struct itimerval timer;
timer.it_interval.tv_sec = 1; // number of seconds is obviously up to you
timer.it_interval.tv_usec = 0; // as is number of microseconds.
timer.it_value = timer.it_interval;
setitimer(ITIMER_PROF, &timer, NULL);

setitimer tells the system to deliver a SIGPROF at the interval specified by the timer. The system will wait for the timer to go off, deliver a SIGPROF, and repeat until the process dies or the itimer is replaced.

At this point, your signal handler is complete!

Or...?

Well, it turns out that it isn't that simple (and bear in mind that we haven't even gotten to the Java part, yet). There is the issue of threading. The signal is handled in-process. So which thread actually handles it? Well, it turns out that this is not well-specified.

In Linux 2.4 and earlier, each thread was a separate process. Because of this, you need to set up a separate timer for every thread that you want to be profiled. In other words, it is more-or-less completely unusable unless you have a very tightly controlled environment.

In Linux 2.6, a randomly chosen, currently executing thread is picked to execute the signal handler. This is awfully useful, considering that what we want to do is find out what the currently executing threads are doing.

There are a lot of pitfalls here, so proceed with caution. For example, in C code, you can't really grab an arbitrary lock in a signal handler, because the thread handling the signal might already hold that lock; because most C lock implementations aren't reentrant, this can cause serious problems.

There is also the issue of what happens if you only want some threads to be able to handle the signal, and you want to disable it for the rest. They invented pthread_sigmask() for this purpose. If you want to disable the handling of a signal altogether, you can use sigprocmask().

What's the point of all of this?

The original questioner asked me what exactly happens when the timer goes off and the signal gets raised by the OS. He thought that perhaps it raised a Java exception. It doesn't. It just delivers the signal to the process. So I wrote a C++ handler that called AsyncGetCallTrace, and loaded it in the Agent_OnLoad part of a JVMTI agent.

Saturday, June 9, 2007

Answer to Weekend Multithreaded Puzzler

This is the answer to the riddle posed in my earlier puzzler posting. You should probably look at that question before looking at this answer.

I suppose I should call it something other than a puzzler, to avoid getting hit by Josh and Neal's angry team of vicious, snarling lawyers...

This program certainly looks straightforward. It just looks as if two threads are writing to two variables. In fact, you probably expected me to say something "who-cares"-oriented about compiler optimizations at this point. Well, they are volatile variables, so you can worry a lot less about potential reorderings. In fact, this one has absolutely nothing to do with program transformations; if you ran the program on my laptop, you would find that it hangs!

It turns out that this is the result of one of those vicious little static initialization quirks that provide so many hours of headaches. What happens is something very like the following. The first thread encounters A; this class has not been initialized, so that thread tries to initialize it. When you try to initialize a class, you acquire a lock on that class, so that no one else can initialize it at the same time. Are you starting to see where this could lead to problems?

At the same time as this, the second thread encounters B, and so acquires the lock on the B class. It then runs the static initializer for B, which encounters A. "Wait!" it says -- "A hasn't been initialized! Better acquire that initialization lock..." It tries to acquire the lock, but the first thread already has it, so it waits for the first thread to finish.

Meanwhile, the same process goes on in the first thread. It runs the static initializer for A, which encounters B. "Wait!" it says -- "B hasn't been initialized! Better acquire that initialization lock..." It tries to acquire the lock, but the second thread already has it, so it waits for the second thread to finish.

Result: Both threads wait forever. Deadlock!

This whole process is scheduling / hardware / OS / JVM dependent, of course. If the first thread runs to completion without letting the second thread start, then it will quite happily initialize both A and B without the other thread acquiring any locks. This will avoid deadlock nicely. This seems to happen on Linux, but not OS X.

How do you avoid this? Well, that's a little tricky. In this case, you would probably rewrite the code so that it doesn't perform the initialization in two separate threads. That's not always a general-purpose solution, though. Your best bet, in general, is to avoid having circularities like this in your static initializers. As with constructors, it is important to keep your static initializers as simple as possible.

Keeping it simple might mean not doing anything that might trigger subsequent static initialization. That's a good first option, if you can manage it, but it is not always possible.

The second option is to make sure that you have an order over your classes. For example, if you have three classes, A, B and C, you could structure them so that C can refer to A and B in its static initializer, B can only refer to A, and A can only refer to itself. This will prevent deadlocks by enforcing a strict order over when the implicit locks can be acquired.

The final option -- if you know this will be a problem -- is to make sure that the initialization can only happen in a single thread. This may mean having to force it to occur earlier, by referencing the class earlier than it would otherwise have been referenced.
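
For instance, with the classes from the puzzler, you could force initialization from the main thread before any workers start (a sketch; Class.forName runs a class's static initializers by default):

public class Main {
  public static void main(String[] args) throws Exception {
    // Initialize A and B here, in a single thread, before any
    // other thread can touch them; no lock cycle is possible.
    Class.forName("A");
    Class.forName("B");
    // ... now it is safe to start the threads ...
  }
}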

I feel like there should be a moral. The moral is that static initialization has all sorts of hidden gotchas, and you should look out for them.

Weekend Multithreaded Puzzler Fun!

Although I didn't go this year, one of my favorite parts of JavaOne is always Josh Bloch and Neal Gafter's talk on Java Puzzlers. (This year, Bill Pugh, my graduate advisor, stepped in for Neal.) A puzzler is a short snippet of code which has unexpected results, usually because some language or API feature behaves in a strange way. I enjoy them because I always think it is truly wonderful to have the depth of my ignorance exposed.

Josh and Neal wrote an excellent book with all of the Java Puzzlers through 2005, which is highly recommended, and occupies a place of honor in the stack of books in my bathroom.

The point of all of this is that occasionally, I will send Josh a multithreaded puzzler, and he will tell me it is no good, because you can't reproduce it every time. Here's one I sent him a couple of months ago.

It turns out that the following snippet of code displays the odd behavior 100% of the time under the JVM on my MacBook Pro (JDK 1.5.0_07), and won't display it at all on Linux. I haven't tried Windows. Can you figure out what the odd behavior will be? If you have an Intel-based Mac, you can probably even reproduce it.


class A {
  volatile static int x = 1;
  static {
    B.y = 1;
  }
}

class B {
  volatile static int y = 2;
  static {
    A.x = 2;
  }
}

public class Test {
  public static void main(String[] args) {
    Thread t1, t2;
    (t1 = new Thread() {
      public void run() {
        A.x = 1;
      }
    }).start();
    (t2 = new Thread() {
      public void run() {
        B.y = 2;
      }
    }).start();
    try {
      t1.join(); t2.join();
    } catch (InterruptedException e) {}
  }
}

Monday, June 4, 2007

More thoughts on SIGPROF, JVMTI and stack traces

This is a follow up from my previous post about profiling with SIGPROF and JVMTI.

There are, in fact, a very large number of ways of getting stack traces out of a JVM.

  • If you send a system-dependent signal to your JVM process, it will spit out a stack dump of every currently live thread. On Solaris and Linux, the signal is a SIGQUIT, which you can send by using kill -3 on the JVM PID (or the PID of the parent JVM process under Linux 2.4), or by hitting Control-\. On Windows, you can achieve the same effect by hitting Ctrl-Break.

  • If you call Thread.dumpStack(), or create a new Throwable, and invoke getStackTrace() on it, you can get the current stack trace programmatically.

  • If you use ThreadMXBean's getThreadInfo methods, you can get stack traces out of any threads you want (see the sketch after this list).

  • If you use JVMTI's GetStackTrace or GetAllStackTraces methods, you can get stack trace information in native code.
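
As a quick illustration of the ThreadMXBean option, here is a minimal sketch (the class name is mine):

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class StackDumper {
  public static void dumpAllStacks() {
    ThreadMXBean mx = ManagementFactory.getThreadMXBean();
    // Grab up to 20 frames for every live thread.
    ThreadInfo[] infos = mx.getThreadInfo(mx.getAllThreadIds(), 20);
    for (ThreadInfo info : infos) {
      if (info == null) continue; // that thread died in the meantime
      System.out.println(info.getThreadName() + ": " + info.getThreadState());
      for (StackTraceElement frame : info.getStackTrace()) {
        System.out.println("    at " + frame);
      }
    }
  }
}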


Most of these methods will tell you if your thread can be scheduled by the JVM or Operating System. This information will be reported as part of the thread info -- the thread will be described as "Runnable".

However, there is a big difference between "Runnable" and "Running". If you send a SIGPROF to your JVM and use AsyncGetCallTrace, you find out exactly what your JVM is doing at precisely the moment you sent the signal.

The difference here is fundamentally that all of those other methods tell you what the JVM could be doing, and this one tells you what it is doing. It will even tell you if it is performing garbage collection. This sort of information can be invaluable when you want to know what is soaking up your CPU cycles.

C++ Threads

There is a very good talk by Lawrence Crowl on the upcoming threading changes to C++. I wrote a brief entry about his talk on C++0x (where they are hoping for x < 10). They have built heavily on the work done for the Java model, so that they could resolve some of the C++ absurdities that inevitably occur. Hans Boehm, who was heavily involved in the Java effort, has been leading the effort.

One neat feature is the proposed atomic keyword. All accesses to a variable declared atomic will be, obviously enough, atomic. It will support features like compare-and-swap and atomic increment (of numerical types). The neat part is that this will work for more than just scalar types (as it does in most current systems). You can declare an entire object to be atomic, and update it all at once. Efficiency depends, of course, on whether the hardware supports such operations, or they need to be emulated in software.

As this is C++, they felt the need to overload operators for atomic support. For example, if you have an atomic int v, then code that reads v++ performs an atomic increment. This is reasonably intuitive; the fact that v++ is not atomic for volatile variables has been a source of confusion for some Java programmers.

The problem is that in order to support this, they have to start having some really messy details. For example, the semantics of the assignment operator (=) usually involve a load followed by a store, and for the operator to return whatever the result of evaluating the RHS was. This makes assignment a two-step process.

Why is this tricky? Let's say we have two atomic integers, a and b. If you say something like a += 4, you simply perform a machine-level atomic increment of a by 4, and it works trivially. On the other hand, if you have a = b, then you would have to assign the value of b to a without letting the value of b change while you are changing a. This is not supported by most architectures. So, they allow overloading of the assignment operator, but only if there is no atomic variable on the RHS of the assignment. How ugly is that?

There are a lot of other interesting details in the talk. It is definitely recommended.

Wednesday, May 23, 2007

Double-Checked Locking and the Problem with Wikipedia

I love Wikipedia, I really do. I use it for everything. Well, not everything. For example, if I want a really good piece of chocolate cake, I tend to use flour, sugar, butter, eggs, chocolate, and not Wikipedia at all. For many things, however, Wikipedia cannot be surpassed.

Imagine, then, my surprise when I found out that their page on Double-Checked Locking gave an example of a version of double-checked locking that does not work.

// Broken -- Do Not Use!
class Foo {
  private Helper helper = null;
  private boolean initialized = false;
  public Helper getHelper() {
    if (!initialized) {
      synchronized (this) {
        if (!initialized) {
          helper = new Helper();
          initialized = true;
        }
      }
    }
    return helper;
  }
}


(For background on Double-Checked Locking, see Bill Pugh's "Double-Checked Locking is Broken" Declaration.)

The problem with this code is exactly the same as the problem with all of the other variants of the double-checked locking idiom. The compiler can reorder the assignment to initialized with the assignment to helper. As a result, another thread might see the value true for initialized and think that the helper field has been correctly initialized, even though it hasn't been!
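
Under the Java 5 memory model, one standard repair is to make the field volatile, which forbids exactly this reordering (a sketch of the usual formulation):

class Foo {
  private volatile Helper helper = null;
  public Helper getHelper() {
    Helper h = helper; // a single volatile read
    if (h == null) {
      synchronized (this) {
        h = helper;
        if (h == null) {
          h = new Helper();
          helper = h; // the volatile write publishes a fully
        }             // constructed Helper to other threads
      }
    }
    return h;
  }
}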

I have since fixed the Wikipedia entry; the old entry can be found here.

Please folks -- stop trying to come up with fancy ways of avoiding the volatile and synchronized keywords. Friends don't let friends avoid necessary synchronization.

Monday, May 14, 2007

Profiling with JVMTI/JVMPI, SIGPROF and AsyncGetCallTrace

This post is not about the memory model, but, instead, is about being able to get, asynchronously, the stack trace of a currently executing thread out of Sun's Java VM. Getting one synchronously is easy, of course.

I was trying to profile some Java applications under Linux, and I wanted to use SIGPROF and JVMTI to find out what code was running at any given second. This is just a posting to document how I did it, as I ended up using some undocumented JVM features, and I can't find a proper reference for them anywhere else.

More information as to why you might want to do this is here.

JVMTI is the Java Virtual Machine Toolkit Interface. It's a C/C++ interface to the virtual machine that allows you to hook in debuggers and profilers. More information can be found here.

SIGPROF is a POSIX signal. The way it is set up under Linux is that you can set a timer that goes off at intervals, and, whenever the timer expires, the SIGPROF signal is sent to the application. The idea is that you can then collect profiling information at fixed intervals.

The signal is caught and handled by a random running thread. That thread will be doing some task related to running the Java application -- whether that is executing user code, running garbage collection, or doing internal VM maintenance.

The problem is that there is no officially documented JVMTI hook that allows the user to find out exactly what the currently running thread is doing that is safe to run in a signal handler. The official way of getting a stack trace for the currently executing thread, the GetStackTrace function, isn't safe to be called in an asynchronous way -- if you try to read the stack when a timer expires, the Java stack could be in a corrupt state, or GC could be running, or any number of other things.

It turns out that the kind folks who wrote the Java virtual machine were fully aware of this, and provided an undocumented interface for this type of profiling, used by their Forte Analyzer (which now operates under the Sun Studio umbrella, I believe). Now that they've open-sourced the JVM, this is public knowledge. For those of you who like to see the source code for such things, it can be found in hotspot/src/share/prims/forte.cpp.

In principle, AsyncGetCallTrace is fairly easy to use. This is less true in practice. Since JVMPI has been removed in JDK6, you start by having to include JVMPI structures in your headers. In JDK5 and earlier, this step is unnecessary (all covered under the GPL):

typedef struct {
  jint lineno;
  jmethodID method_id;
} JVMPI_CallFrame;

typedef struct {
  JNIEnv *env_id;
  jint num_frames;
  JVMPI_CallFrame *frames;
} JVMPI_CallTrace;

Now that you have the JVMPI structures defined, you need a prototype for the undocumented call:

extern "C"
void AsyncGetCallTrace(JVMPI_CallTrace *trace, jint depth,
void* ucontext)
__attribute__ ((weak));

The __attribute__ ((weak)) is only for g++ users -- it tells the compiler not to look for AsyncGetCallTrace at compile or link time. People using other compilers can create a library stub that contains this method -- this is left as an exercise for the reader.

Since this post is too long already, I will assume that you know how to write both JVMTI and signal handlers. If you want me to write about that, leave a response, and I'll get to it at some point.

So, you write a bit of JVMTI that executes this method, and find out that although the method is invoked, it always returns -1 for the stack depth. It turns out that jmethodIDs (which are the internal numerical representation for a method in JVMTI / JVMPI) are allocated lazily. Most JVMTI methods call an internal routine that gets and allocates these method IDs; AsyncGetCallTrace calls a getter, and relies on the jmethodIDs being pre-allocated. The expectation in the code is that if you have a ClassLoad hook in JVMTI, then the method ids will get allocated correctly.

So, you insert a ClassLoad Hook (all error correction code removed -- include your error correction code!):

void JNICALL OnClassLoad(jvmtiEnv *jvmti_env,
                         JNIEnv *jni_env,
                         jthread thread,
                         jclass klass) {
}

...

jvmtiEnv *jvmti;
vm->GetEnv((void **)&jvmti, JVMTI_VERSION);
jvmtiEventCallbacks *callbacks = new jvmtiEventCallbacks();
callbacks->ClassLoad = &OnClassLoad;
jvmti->SetEventCallbacks(callbacks, sizeof(jvmtiEventCallbacks));
jvmti->SetEventNotificationMode(JVMTI_ENABLE,
                                JVMTI_EVENT_CLASS_LOAD, NULL);

I ran it after I did this, and found out that the stack depth was no longer -1, but the method ids were invalid. This seems to violate the contract of AsyncGetCallTrace; maybe one of my audience can tell me that this is wrong. Anyway, in order to "prime the pump" of method ids, I inserted another hook -- this time, to load up the method ids:

void JNICALL OnClassPrepare(jvmtiEnv *jvmti_env,
                            JNIEnv *jni_env,
                            jthread thread,
                            jclass klass) {
  // We need to do this to "prime the pump",
  // as it were -- make sure that all of the
  // methodIDs have been initialized internally,
  // for AsyncGetCallTrace. I imagine it slows
  // down class loading a mite, but honestly,
  // how fast does class loading have to be?
  jint method_count;
  jmethodID *methods;
  jvmti_env->GetClassMethods(klass, &method_count, &methods);
  // Memory handed back by JVMTI should be released with
  // Deallocate, not delete[].
  jvmti_env->Deallocate((unsigned char *)methods);
}

This is pretty dumb, and I'd love to get rid of it. Anyone more familiar with this interface should slap me upside the head and tell me how to do it properly.

Anyway, after you do this, AsyncGetCallTrace returns valid results. If the resulting stack depth is negative, it wasn't able to grab a valid Java stack. The most important negative result is -2, which indicates that you sampled in the middle of garbage collection.

Friday, May 11, 2007

Google Talks

By the way, gentle reader, in case you haven't noticed, I am running a series of talks at Google.

All of them can be found here. Updated July 2015: now that Google Video is a long-forgotten service, they seem to have moved to YouTube. They can now be found here.

Roach Motels and The Java Memory Model

I've been asked by a couple of people lately about what a compiler can do with code motion around synchronized blocks. The first person asked me about the following example, so let's run with it:

x = 1;
synchronized (o) {
  z = z + 1;
}
y = 1;

I may have answered this already (at some point), but I feel free to repeat myself.

The JMM has what we call "roach motel" ordering. "Roach Motel" is Black Flag's name for their cockroach trap; in grad school, my apartment had lots and lots and lots of cockroaches, so I find it particularly salient (although I think it was probably Bill Pugh who came up with this association).

In the 1980s, the slogan for Black Flag Roach Motels was "Roaches check in, but they don't check out". The memory model is the same: accesses can be moved into synchronized blocks, but they can't be moved out. So, in this case, you can get:

x = 1;
synchronized (o) {
  z = z + 1;
}
y = 1;

If we want to move the write to y early, this code transforms to:

x = 1;
synchronized (o) {
  z = z + 1;
  y = 1; // code motion
}

and transforms to:

x = 1;
synchronized (o) {
  y = 1; // code motion
  z = z + 1;
}

and can't be transformed any more, because the write to y has been moved into the synchronized block, and can't move out.

Similarly, if we want to move the write to x down:

x = 1;
synchronized (o) {
  z = z + 1;
}
y = 1;

transforms to:

synchronized (o) {
  x = 1; // code motion
  z = z + 1;
}
y = 1;

transforms to:

synchronized (o) {
  z = z + 1;
  x = 1; // code motion
}
y = 1;

... but the write to x can't move any further.

At the end of these sorts of transformations, you can end up with something that looks like this:

synchronized (o) {
  y = 1;
  z = z + 1;
  x = 1;
}

... so the accesses to x, y and z can happen in any order, but once you move them into the synchronized block, you can't move them out again.

This is a tiny bit misleading, of course, because they can go out the direction they came in. But there is no difference between that and not having entered the synchronized block in the first place.

Now, if the compiler can remove the synchronized block altogether, then all the accesses can be moved about freely. This can happen, for example, if o is an object only accessed by a single thread -- the compiler will know that no other thread can lock on it, so it can feel free to remove the synchronization. That's a tale for another day, of course.
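
A sketch of the kind of code where that can happen (assuming the JVM's escape analysis can see that o never leaks to another thread):

void singleThreadedLocking() {
  Object o = new Object(); // o never escapes this method...
  synchronized (o) {       // ...so no other thread can ever lock it,
    // do some work        // and the JVM may remove the lock and move
  }                        // the guarded accesses around freely
}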

Tuesday, May 1, 2007

Software Transactional Memory Talk at Google

Adam Welc, from Intel's Programming Systems Lab, came to Google to give a talk about what they are doing in the realm of Software Transactional Memory. He also talks briefly, at the end, about Intel's hardware plans.

For those who are interested in such things, this is a talk worth seeing.

Click Here.

Tuesday, April 17, 2007

Lock-Free Hash Tables

I forgot to post about this at the time. Cliff Click gave a talk about his lock-free hashtable at Google; it is definitely worth a peek. Available here.

There is a very funny thread on the concurrency-interest mailing list this week about "lock-free mania". Apparently, software engineers are going to start choosing "trendy" lock-free data structures for no valid reason, resulting in lots of fragile code. In related news, recursion, memoization and every data structure more complicated than a hash table just started laughing their arses off. B+ Trees were quoted as saying, "I can't count the number of times some database-implementing hack has thrown together a cheap and poorly tested implementation of me because he thinks it is 'trendy' to put together an indexing mechanism. Oh, wait. Zero is a number. Never mind!"

In all seriousness, I have actually stopped someone from using a lock-free algorithm where a lock would have been more readable, and had the same performance impact. So, as with every other programming technique on the planet, you have to be careful about where you use it.

Personally, I see the complexity of lock-free algorithms as a hack that just needs to hold us over until hardware atomicity is supported for more than 1 word at a time. I actually asked Cliff about this in the talk, because he controls his hardware (I start laughing, he stares at me for a bit, and then I ask that question, which was not the question I was imagining.) He only did it in the way he did because he wanted his hash table to be useful for other Java implementations. That's community spirit for you!

Tuesday, April 10, 2007

Fairness

Okay, so I lied about the contents of my next post. I bet no one even notices. Anyone even reading these? Yeah, right.

I was asked a question about fairness. The notion of fairness is one where there are some number of individual entities trying to make progress (in our case, threads), and there is some level of guarantee that this will actually happen.

Java has no meaningful fairness guarantee. A Java virtual machine is allowed to let a thread hog all of the CPU time. If I have:


Thread 1:
while (true);

Thread 2:
System.out.println("Hi there!");

There is no guarantee that Thread 2 will actually ever print its message. This is so that Java implementors can create cooperative implementations, where context switching only happens when Thread.yield is called, or when blocking occurs.

(In point of fact, purely cooperative implementations are rare these days; the popular VM implementations from Sun and IBM are preemptive.)

You could even have a wait loop:

Thread 1:
synchronized (lock) {
  while (!someBoolean) {
    try { lock.wait(); } catch (InterruptedException e) {}
  }
}

Thread 2:
synchronized (lock) {
  someBoolean = true;
  lock.notify();
}

(wait() and notify() must be invoked while holding the lock on the object they are called on, hence the synchronized blocks.)

This might not even terminate, because wait() can wake up spuriously. The wait loop can sleep and wake up indefinitely, and never let the other thread proceed.

At some point, I will explain what this has to do with the Java memory model.

Saturday, March 31, 2007

Safe Construction and the JMM

Another question from a viewer. We have this code:

class C {
  private int x;

  public C(int v) {
    this.x = v;
  }

  public synchronized int get() {
    return this.x;
  }

  public synchronized void set(int v) {
    this.x = v;
  }
}

(For the actual question as it was worded, the short answer is, "yes")

The question was one I get a lot. He asked how one thread can safely read the initial write of v to x. That is to say, if we have:


Thread 1:
globalRef = new C(5);

Thread 2:
C ref = globalRef;
if (ref != null) {
  int r1 = ref.x;
}


How can we change this code to guarantee that the read in Thread 2 will see the value 5 for x (assuming that ref is not null)?

Let's back up for a minute, and talk about why, in the code as written, the read in Thread 2 is not guaranteed to see the value 5 for x. Under the Java memory model, for one thread to be guaranteed to see an update made to a variable by another thread, there must be a happens-before relationship between the update and the read. This means that sometime after the write, there must be an action that performs what we call a "release", and sometime before the read, there must be an action that performs a corresponding "acquire".

The easiest way to get a release and a corresponding acquire is to use locking on the same variable. In this case, we would do something like this:


Thread 1:
synchronized (C.class) {
  globalRef = new C(5);
}

Thread 2:
synchronized (C.class) {
  C ref = globalRef;
  if (ref != null) {
    int r1 = ref.x;
  }
}


There is a release when the lock is released at the end of Thread 1, and an acquire when the lock is subsequently acquired at the beginning of Thread 2. This provides the guarantee that the updates that happen-before the release will be seen by the reads that happen-after the acquire.

For the really wonky -- what could happen? Well, we have


globalRef = new C(5);


which is really:


localRef = <allocate>;
localRef.x = 5;
globalRef = localRef;


There's no dependency between the second and third lines there, so we could easily have:


localRef = <allocate>;
globalRef = localRef;
localRef.x = 5;


And another thread could read globalRef and see the reference to the object before x is assigned the value 5.

Next post -- why this explanation is bad, bad, bad! Any guesses?

Tuesday, March 27, 2007

Volatile Fields and Synchronization

A question from a viewer. In my talk, I had a class that looked like this:

class Future {
  private volatile boolean ready;
  private Object data;
  ...
  public synchronized void setOnce(Object o) {
    if (ready) throw ...;
    data = o;
    ready = true;
  }
  public Object get() {
    if (!ready)
      return null;
    return data;
  }
}


The setOnce method is executed by one thread, and get is executed by another. The point of this example is to examine what difference the volatile modifier on ready makes. In this case, if ready is not marked volatile, the compiler can reorder the writes to data and ready. This has the result that the thread that invokes get can see the value true when it reads ready, even if data has not been set yet. Messy.

The questioner asked why setOnce is synchronized. This is somewhat orthogonal to the example, which is why I didn't mention it in the talk. However, I thought it was important enough to include it on the slide. If this method is not synchronized, and multiple threads call it, then those threads can interfere with each other. The take-away message here is that volatile is not a magic wand that can take away your concurrency issues with no additional cost.
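
To see the interference, consider one possible interleaving of two unsynchronized calls to setOnce (a sketch):

// With setOnce not synchronized:
//
//   Thread 1: if (ready) ...            // sees false, no exception
//   Thread 2: if (ready) ...            // also sees false, no exception
//   Thread 1: data = o1; ready = true;
//   Thread 2: data = o2; ready = true;  // silently replaces o1
//
// Both calls "succeed", and the set-once invariant is broken.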

Why is it in there, if I specified a single writer when I talked about it? What happened here was that I wrote this, said, "Do people really want to write this code without the synchronized block?", answered "No," and put in the synchronized block.

Monday, March 26, 2007

JMM Talk at Google


This fellow is rather handsome, dashing and well-informed. If you would like a primer on the Java memory model, this is a good place to start. I was running out of time toward the end, though, so the last 15 minutes may be a little rushed. Perhaps I'll do a two-part followup at some point.

One thing I forgot to say, at the very beginning: The Java memory model defines how threads interact through memory. That's why it's called the Java memory model. It has nothing to do with garbage collection or the object model (except where multithreading is important to those things).

If the mood strikes, please feel free to ask questions about the memory model, or anything I said in the talk. I put a note in the talk to ask questions at:

http://jeremymanson.blogspot.com/2007/03/jmm-questions-go-here.html.

Thursday, March 22, 2007

HashMap Behavior in JDK6

It is important to read the documentation. Lots of people don't seem to do this, and they get bitten by it.

The iterator for java.util.HashSet is documented as follows:

Returns an iterator over the elements in this set. The elements are returned in no particular order.

Of course, I've now seen code that ignores this, so I thought I would draw a few underlines for the kind of folks who miss it. I wish to state for the record that I did not write this code.

It turns out that HashSets (and HashMaps) don't just use the user-defined hashCode as the hash value for objects they are passed. They rehash the number that method returns with their own hash function, to defend against poor quality hash functions. This makes a lot of sense, as many people write bad hash functions.

Anyway, this means that when they change this function, objects will hash to different places. Then, when iterators visit these items, they will be visited in a different order.

I wouldn't be posting this unless they had changed it between JDK5 and JDK6, of course. For a quick illustration:

import java.util.HashSet;

public class Test {
  public static void main(String [] args) {
   HashSet<String> set = new HashSet<String>();
   set.add("axons");
   set.add("bandrils");
   set.add("chumblies");
   System.out.println(set);
  }
}

Run that in JDK5, and it will print "[chumblies, bandrils, axons]". Run it in JDK6, and it will print "[axons, bandrils, chumblies]".

If you are having trouble switching from JDK5 to JDK6, and this comes up in your code, smack your forehead with your palm, say, "duh!", and move on with your life.

Wednesday, March 21, 2007

Loop Invariant Code Motion and the Java Memory Model

I gave a talk on the Java memory model this afternoon, and there was a question from the audience which I didn't answer to my own satisfaction, so I am answering it again, for my vast readership of none.

At issue was the following bit of code:

class Animator implements Runnable {
 private volatile boolean stop = false;
 public void stop() { stop = true; }
 public void run() {
  while (!stop) {
   oneStep();
   try { Thread.sleep(100); } …;
  }
 }
 private void oneStep() { /*...*/ }
}

This is a pretty common idiom. One thread calls run(), and another thread (eventually) calls stop(), which sets the stop flag to true, and halts the loop. This is fairly intuitive.

What is not intuitive about this code is that if the variable is not declared volatile, then regardless of whether another thread calls stop(), this code is not guaranteed to terminate. A compiler (in this case, likely Java's JIT compiler) can:
  1. Look at the run() method,
  2. Notice that stop is never updated, and therefore
  3. Hoist the read of stop outside the loop.
The resulting code would look something like this:

public void run() {
 if (!stop) {
  while (true) {
   oneStep();
   try { Thread.sleep(100); } …;
  }
 }
}

... which would likely run forever. But only if stop is not declared volatile.

Now, the questioner asked if I knew of any compilers that were likely to do this. Pretty much every modern compiler does loop invariant code motion -- basically, if it decides that some program instruction will evaluate the same on every loop iteration (as the read of stop does), then it will arrange the code so that program instruction is only executed once, before the loop.

What I said was that I didn't know if any particular compiler did this. What I didn't mention was that this is because, in this case, the compiler has a little extra work to do. It has to determine that the actions in the oneStep() method don't affect the value of stop, and it also has to make the decision to do loop invariant code motion for the loop guard (there is no reason not to do code motion for the loop guard).

You are actually pretty likely to run into this sort of thing in practice. Whether you will on this particular example is a question for debate.

I feel a little better having gotten that off my chest.

JMM questions go here

I just gave a talk in which I said that people could post questions on the Java memory model here. So feel free to post JMM questions below!

Tuesday, March 13, 2007

C++ Is Getting More Like Java Every Day

Lawrence Crowl gave an interesting talk at Google last week. It would seem that the next revision of the C++ standard is getting:
  1. A new memory model,
  2. Garbage collection,
  3. Concepts (which are more-or-less type checking for templates; this roughly boils down to "generics"),
  4. Unicode support,
  5. A regular expression library,
  6. A for-each loop!
This is all starting to sound terribly familiar. I'm starting to like that language! Now all they have to do is get rid of multiple inheritance and their awful operator overloading semantics, and replace the STL, and I'll be sold.

The new standard will be C++0x. They haven't decided whether x is 8 or 9 yet. Knowing the standards process, I imagine it will be 9. I hope they keep x to single digits...

Anyway, in addition to all of this, it is getting move semantics for rvalues, lots of new syntax for initializers, and buckets of other things. Check out the talk for more.