Friday, December 18, 2015

Why I don't like System.gc()

I wrote this for internal-to-my-company consumption, but I couldn't find a reasonable place to put it. And then I remembered I had a blog.

The question to me was whether it ever made sense to do a System.gc(), and, more specifically, whether it made sense to do System.gc() followed by System.runFinalization().

I can't claim it is always wrong. It suffers from what some people call the Politician's syllogism: Something must be done, this is something, therefore we must do it. There's no control over GC other than a big red button labeled "System.gc()", so this is always the something that must be done.

In practice, the biggest problem with System.gc() is that what it does is rather unpredictable. There's no guarantee that it will do anything at all; in fact, Hotspot has a -XX:+DisableExplicitGC flag to stop it from doing anything. So, it can sometimes solve an immediate problem in a system-dependent way, but you are introducing a significant source of overhead with ill-defined behavior into your code, and that behavior can change with a change of flags or a change of runtime.

Under the concurrent collector (our default at G), System.gc() forces a Full, stop-the-world GC instead of a concurrent GC. By calling System.gc(), you are stopping the world for potentially much longer than just letting a concurrent collection happen would. At G, our Full STW GC under CMS is parallel, and the rest of the world's is serial, so it will finish faster for us than it will for everyone else. Except that the rest of the world runs the parallel collector by default, so its full GCs will be parallel anyway. Except that they're changing the default to G1 for Java 9, which has a serial full STW GC, so it won't be parallel.

Of course, if you are using the concurrent collector, it matters whether the user passes -XX:+ExplicitGCInvokesConcurrent, in which case System.gc() triggers a concurrent collection instead. That means the application continues executing, and you will probably call runFinalization() before the GC has even enqueued the finalizable objects.

It should be relatively clear at this point that you can't depend on it for performance, because you can't guarantee its performance, and you can't depend on it for correctness, because you have no idea what it is going to do.

If you really want to wait for a GC to occur and for some percentage of objects to be finalized, you probably need to do something like:
  • Create a CountDownLatch with a count of 1.
  • Create an object whose finalizer calls countDown().
  • Make the object garbage immediately (bearing in mind that object lifetimes aren't what you think they might be).
  • Call System.gc().
  • Call await() on the CountDownLatch.
  • Call runFinalization() to make sure all of the pending finalizers are finished (but do it in a loop with a timeout!)
Which will probably work most of the time, unless System.gc() really doesn't do anything and a GC is never triggered (or, worse, only young-gen GC is ever triggered, so that objects in the old gen aren't collected or finalized, and you think something has been finalized when it hasn't).
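Here is what that recipe might look like as code. This is a sketch only (the class and method names are mine, not from any library), and it inherits every caveat above:

import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// A sketch only: System.gc() may do nothing, and a young-gen-only
// collection may never finalize old-gen objects.
public class GcWaiter {
  public static boolean waitForGcAndFinalization(long timeoutMillis)
      throws InterruptedException {
    final CountDownLatch latch = new CountDownLatch(1);
    Object canary = new Object() {
      @Override protected void finalize() {
        latch.countDown();
      }
    };
    canary = null;               // make the object garbage immediately
    System.gc();                 // request (not demand) a collection
    boolean finalized = latch.await(timeoutMillis, TimeUnit.MILLISECONDS);
    if (finalized) {
      System.runFinalization();  // drain the other pending finalizers
    }
    return finalized;
  }
}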

In short, usage should be carefully considered.

As for runFinalization: it suffers from the same "I have one button for this" issue. What it does (in Hotspot) is add a thread as a worker for processing finalization, and then return when that thread is done (i.e., the finalization queue is empty, although note that the finalization queue being empty is not the same as all finalizers having been run, since there are two other threads processing them).

How much are you buying by blocking until the finalization queue is empty? Do you really want to stop the application for this? Is it going to do what you think it is going to do? Are you sure the finalizer you care about is enqueued?

Hey, if you have lots of GCs, and expensive finalizers, one could even imagine the finalization queue not ever being empty. This is fairly unlikely, in practice, but would be really, really nasty if it happened.

Anyway, let the buyer beware.  Google tells me that the first couple of hits for "System.gc()" are also nasty warnings, but I had already written this, so there is no harm in posting it anyway.

Tuesday, July 9, 2013

Lightweight Asynchronous Sampling Profiler

I finally got off of my lazy elbows and wrote a proof-of-concept sampling profiler for public consumption based on the Hotspot JVM's AsyncGetCallTrace interface.

The benefit of using this interface to write a profiler is that it can gather stack traces asynchronously. This avoids the inaccuracies of only being able to profile code at safe points, and the overhead of having to stop the JVM to gather a stack trace. The result is a more accurate profiler that avoids the 10-20% overhead of something like hprof. For a more thorough explanation as to why this is interesting, see my 2010 writeup on the subject.

Another feature of the profiler is that it will tell you what stack traces are being executed, rather than, like many sampling Java profilers, what stack traces could be executed. That is, with traditional profilers, if you stop the VM and ask it for stack traces, it will tell you all of the stack traces from all of the threads that are not currently blocked. This profiler tells you the stack trace from the thread that got interrupted when you asked for the stack trace. The tradeoff is reasonably important if you are trying to think about thread starvation and scheduling; with this approach, you get to see what isn't being scheduled. With the "get everything" approach, you get more samples, so there is a meaningful benefit to both approaches.

Code is here. I'm not likely to invest a great deal of time in making it wonderful, but patches from interested parties are welcome.

Edited to add: Some of the commenters below wonder why the stack trace grabber can't print out the stack traces when it gets them. The answer to this lies in what I meant when I called this function asynchronous. I was a little (deliberately) vague on what "asynchronous" means. When I say that the timer that goes off to grab a stack trace interrupts a running thread asynchronously, I mean that it can interrupt the thread at any point. The thread can be in the middle of GC, or the middle of adding a frame to the stack, or in the middle of acquiring a lock. The reader can imagine that writing something asynchronous-safe is somewhat more tricky than writing something thread-safe: not only do you have to worry about what other threads are doing, you also have to worry about what the current thread is doing.

Needless to say, there are a lot of things you can't (or shouldn't) do asynchronously. For example, you can't call malloc, because if the thread is already in the middle of executing malloc when you interrupt it, you can destroy malloc's internal data structures. In fact, for the standard C library functions, the POSIX standard specifies a list of functions that have to be async-safe (see man 7 signal for the list, if you are curious). Note that functions like printf are not async-safe, so you can't call them asynchronously. As far as I know, AsyncGetCallTrace is pretty much the only non-trivial async-safe JNI/JVMTI call. Others, including the standard JVMTI stack trace gathering function and the functions that make sense out of what AsyncGetCallTrace returns, are not async-safe, and are likely to crash your program if you try to call them asynchronously.

For this programming exercise, I was lazy, and printed the stack traces at VM exit. That doesn't mean that a sufficiently clever approach can't get the stack traces incrementally. A separate, synchronous thread can read the data that was written out asynchronously. You can't use a standard blocking lock to protect the data, though, because standard lock implementations aren't async-safe. You could use a spin lock, but you would have to be careful about the async handler interrupting a thread that holds the spin lock. In that case, the async handler won't be able to acquire the lock, so it would have to do a trylock and give up if it can't complete the acquire. There are also clever things you can do with atomic operations if you apply some creativity; I'll leave those as an exercise for the reader.

Tuesday, February 7, 2012

Transactional Hardware on x86

While I'm posting anyway, I might as well mention the big concurrency news of the day: the new transactional memory specification from Intel. Transactional memory sounds a lot like what it is. Your code begins a transaction and continues to execute it until it ends, performing some memory writes along the way. If there are no conflicting transactions (i.e., transactions that write to the same memory that your code wrote) executing at the same time as yours, your transaction will end normally and commit your memory writes. If there are conflicting transactions, your transaction will abort and roll back your writes.

Transactional memory on x86 will come in two flavors:
  1. Hardware Lock Elision (HLE), which consists of XACQUIRE and XRELEASE instruction prefixes. These can optimistically turn lock regions into transactions. What's the advantage? Well, transactions can execute concurrently, as long as they aren't writing to the same memory. Locks are serialized: only one thread can be in a lock region at a time. So this mechanism can (hypothetically) execute more code concurrently. Also - and perhaps more importantly - it should be much cheaper to acquire a lock using XACQUIRE than it is to acquire it without, since with XACQUIRE you don't have to do anything other than tell the system to start looking out for conflicts. Between the two advantages, HLE should be able to provide a nice performance boost. These are backwards compatible, so that software written to use them can be run on earlier hardware.
  2. Restricted Transactional Memory (RTM), which consists of XBEGIN, XEND and XABORT instructions. These have the traditional transactional memory semantics: you begin a transaction, and if you abort before you end it, your writes are undone. These are very flexible, but are not backwards compatible (i.e., you can't use them on older hardware).
I like the idea of hardware-level transactions / atomicity. Compilers can be written to take advantage of them, and they can be a win for people implementing synchronization primitives and doing very low-level multithreading. They have the potential of making synchronization very cheap when there isn't much contention. AFAICT, Azul has been claiming very large benefits from their HLE-style synchronization for years (edited: They claim <10%: see this presentation on hardware transactional memory from Cliff Click).

There is also likely to be a renaissance in lock-free data structures. Currently, lock-free data structures are pretty much all built on top of one non-blocking atomic instruction: the compare-and-swap (LOCK CMPXCHG on x86, or AtomicReference.compareAndSet() to Java programmers). Trying to shoehorn everything you might want to do atomically into a single instruction that operates on one word has led to some very interesting data structures over the last 20 years or so, but is really the wrong way to solve the problem.
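For concreteness, here is the shape of the CAS retry loop that underlies most of those data structures; a minimal sketch in Java:

import java.util.concurrent.atomic.AtomicInteger;

// The compare-and-swap retry loop that most lock-free data structures
// are built on today.
class CasCounter {
  private final AtomicInteger value = new AtomicInteger();

  int increment() {
    while (true) {
      int current = value.get();
      int next = current + 1;
      // Succeeds only if no other thread changed the value since we read it.
      if (value.compareAndSet(current, next)) {
        return next;
      }
      // Another thread won the race; reload and retry.
    }
  }
}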

The flip side of this is that I don't think these will have much of an impact on most user-level synchronization. The HLE is really just an optimization for current lock implementations. The RTM aborts if you execute any of a large number of instructions (like PAUSE or CPUID, or many operations that affect control flow), or if there is a conflicting write to the cache line (which may not have any logical connection to the data the programmer cares about): the average programmer who deals with multithreaded code can't reasonably be expected to think about what is going on at the microarchitectural level.

Also, as with most transactional memory systems, RTM only deals with memory writes to a limited number of cache lines. If you have to deal with, for example, I/O, or if you touch too many cache lines, you have to write an abort handler that deals with rollback correctly. My feeling is that this will probably be far too difficult for most programmers.

Transactions' usefulness also depends enormously on the quality of their implementation. I'd love to see some hardware, but it isn't expected to be released until 2013.


Monday, February 6, 2012

Update to the Allocation Instrumenter

A couple of years ago, I open-sourced a tool that lets you instrument all of the allocations in your Java program. Some quick updates today:
  1. It now supports JDK7.
  2. You can now optionally use it to instrument constructors of a particular type, instead of allocations. So, if you wanted to find out where your Threads were being instantiated, you could instrument them with a callback that invokes Thread.dumpStack().
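For the Thread example, the registration looks roughly like this. I'm writing the API names from memory, so treat ConstructorInstrumenter and ConstructorCallback as assumptions and check the project's documentation:

import com.google.monitoring.runtime.instrumentation.ConstructorCallback;
import com.google.monitoring.runtime.instrumentation.ConstructorInstrumenter;

// Sketch only: the class and method names here are my recollection of
// the API, not verified signatures.
public class ThreadTracker {
  public static void install() throws Exception {
    ConstructorInstrumenter.instrumentClass(Thread.class,
        new ConstructorCallback<Thread>() {
          @Override public void sample(Thread t) {
            Thread.dumpStack();  // where is this Thread being constructed?
          }
        });
  }
}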

I know that most of the readers of this blog are more interested in concurrency, and that it has been a very, very long time since I posted. The spirit is willing, but the flesh is insanely busy... I have it in mind to do a post that discusses a bunch of strategies for doing efficient non-blocking concurrency in C++, because the attempts are very amusing, but I haven't had the chance to sit down and do it.

Friday, June 3, 2011

Scala / Java Shootout

I should probably comment on this paper from my colleague Robert Hundt, where he writes the same benchmark in Java, Scala, Go, Python and C++ (TL;DR here). Short version: Initially, C++ was faster than Scala, and Scala was faster than Java. I volunteered to make some improvements, which ended up with the C++ and Java implementations being about the same speed. Some C++ hackers then went and improved the C++ benchmark by another ~6x, and I didn't think I should go down that road.

Imagine my surprise when I got this email from a friend of mine this morning:

Date: Fri, 3 Jun 2011 03:20:48 -0700
Subject: Java Tunings
To: Jeremy

"Note that Jeremy deliberately refused to optimize the code further,
many of the C++ optimizations would apply to the Java version as
well."

Slacker.


Gee, thanks for making me look good, Robert. (Actually, I had approved that language, but I didn't realize anyone would actually read the paper. :) )

Anyway, I was not just being obdurate here; there was a thought process involved. I should probably document it.
  1. The benchmark did not follow the rules of Java benchmarking. At some point, I should write a post about the rules of Java benchmarking (maybe I have? I don't know). Basically, it timed the full execution of the program, including startup time, JIT compilation time, and GC time. This isn't what you are supposed to do in Java or Scala, so I was very uncomfortable with the experimental setup. In 2011, if you care about startup time, then you probably shouldn't be using Java. The program ran very briefly, so the effect was measurable (once I got it to run fast enough). ETA: I am informed that later versions of this benchmark took this into account. Apologies for the confusion.

  2. It's kind of silly to compare C++ with Java. The paper is about Scala, of course, but the immediate internal reaction at Google was a discussion of how much worse Java was than C++. As it happens, they really do target two different niches. Programmers are more productive in Java (if they know both languages well). It's easier to test Java code. It scales to large code sizes better. The tradeoff is that you are only going to get gcc -O2 level performance, and you have to be careful about garbage collection. In many, many cases, this tradeoff isn't a big deal. A single-threaded performance arms race between the two doesn't make sense. This is the primary reason I stopped trying to make improvements.

  3. The benchmark did not use Java collections effectively. It was generating millions of unnecessary objects per second, and the runtime was swamped by GC overhead. In addition to the collections-related fixes I made (as described in the paper), the benchmark also used a HashMap<Type, Integer> when it should have just stored a primitive int in Type; it also used a LinkedList where it should have used an ArrayDeque. In fact, after being pressed on the matter by other Googlers (and seeing the response the C++ version got), I did make these changes to the benchmark, which brought the numbers down by another factor of 2-3x. The mail I sent Robert about it got lost in the shuffle (he's a busy guy), but he later said he would update the web site.


It is this last point that made the difference between Java and Scala. They use the same runtime, so the compiled performance is going to be roughly the same. However, the choice of data structure matters! From the perspective of Scala vs. Java performance, the takeaway isn't that "Java is slow", it is that "you can write bad code in any language" (with apologies to Robert, who is one of the smartest guys around, but isn't a Java programmer).

It is also "you can write bad code in any language" that inspired me to make any improvements. Robert was not a C++ newbie, but he was a Java newbie, so I thought that the comparisons that this was inspiring between C++ and Java performance basically made no sense. I didn't make the changes to demonstrate that Java could do better than C++, as I know that's not true. I made them to show that the playing field wasn't quite as uneven as people tend to think, and that algorithmic changes tend to make more difference than language changes.

These observations don't actually contradict the point that Robert was trying to make. In fact, they underline it. He was trying to say that Scala is very newbie-friendly when you want performance, as it balances the two well. That was clearly true in this case: as a newbie in both Java and Scala, Robert chose more performance-friendly data structures in Scala than in Java. In fact, in large measure, Scala chose the data structures for him, which is probably even better.

The other languages strike terrible balances in this area. Python and Go aren't on the performance radar at all (Python isn't compiled, and Go is young, yet). C++ isn't newbie friendly at all, ever.

Robert, of course, is a very odd choice for a newbie, as he is an experienced compiler writer. Havlak is also an odd choice for a newbie program, as it is a relatively sophisticated compiler benchmark. Perhaps the moral turns out to be that newbies who are also experienced compiler writers should look carefully at Scala?

Friday, May 13, 2011

Java Puzzlers@Google I/O

Josh Bloch and I gave one of his Java Puzzlers talks at Google I/O this year. If you hate Java, you can waste a perfectly good hour listening to us make unfunny jokes at its expense.

Tuesday, April 5, 2011

Cliff Click in "A JVM Does What?"

Cliff Click gave a very good talk at Google last week. For your viewing enjoyment:

(Video embedded in the original post.)

The slides are available here.

(For those who watched it: The inventors of bytecode were in the audience. I think he was just used to making that joke elsewhere.)

Friday, August 27, 2010

JavaOne Talk Cancelled

Because I am a Google employee, my talks are not happening. Apologies to any interested parties.

I'm going to try to give the talks in another venue, or at least to use the blog as a venue for some of the topics I wanted to cover.

Monday, August 2, 2010

Why Many Profilers have Serious Problems (More on Profiling with Signals)

This is a followon to a blog post I made in 2007 about using signal handling to profile Java apps. I mentioned in another post why this might be a good idea, but I wanted to expand on the theme.

Hiroshi Yamauchi and I are giving a talk at this year's JavaOne. Part of this talk will be about our experiences writing profilers at Google, and in preparing for it, I realized my previous entry on this subject only told half the story.

The topic of the original post is that there is an undocumented JVMTI call in OpenJDK that allows you to get a stack trace from a running thread, regardless of the state of that thread. In Unix-like systems, you can use a SIGPROF signal to call this function at (semi-)regular intervals, without having to do anything to your code. The followon post, intended to describe why you would use this, described a couple of things that are true:

  • That it tells you what is actually running on your CPU, not just what might be scheduled, which is what profilers that take a snapshot of every running thread do. This increases its accuracy dramatically. For example, there are many profilers that report a great deal of your time spent in blocking mechanisms, simply because threads frequently become scheduled as they exit blocking functions. However, those sorts of threads are all "ready to be scheduled" rather than "actually running". As a result, CPU time is charged to those threads inaccurately.

  • That it reports time spent in GC and in other VM-specific activities (like JIT compiling). At Google, we find that substantial amounts of our users' CPU time is spent in VM activities. In fact, GC brings the occasional Google service to its knees (not mentioning any names).

I now realize that I left all of the JIT compiler-related reasons out of that post. This post is an attempt to repair that problem, and describe why profiling with AsyncGetCallTrace and signals is a better approach than the typical state of the art.

Sampling profilers that take thread dumps from individual threads often do so by injecting thread stack trace gathering code into your code. This is suboptimal in many ways:

  • It introduces overhead into your code. This is the most obvious way. As soon as you introduce overhead, you are changing the performance characteristics.

  • Changing the size of the code changes the optimizing decisions made by the JIT. For example, one of the most important performance improvements made by a just-in-time compiler is when it inlines small methods. It won't inline if the method gets too large. Introducing sampling into the methods changes the size, and therefore affects inlining decisions, changing the performance characteristics of your code.

  • Changing the size of the code affects code layout. For relatively obvious reasons of caching and memory alignment, the placement of your code can affect its performance. Bigger code often means worse performance.

At this point, you might be thinking that the best thing to do is use something like Thread.getAllStackTraces and take your chances with its inaccuracy (which is what the aforementioned sampling profilers do). However, there is one more factor, which is significant in any of the documented and built-in stack trace sampling methods - that system-wide stack trace sampling happens for any given thread when it is at a safe point. Safe points are places in the code that the VM knows it can do a whole host of things - like initiate garbage collection - safely. The location of these safe points is determined by the JIT. It often puts them in places that aren't ideal for CPU profiling. For example, there may be a hot loop in your code that the JIT decides should not be interrupted by a safe point. If you use most standard profilers, this hot loop will never get profiled! As a result, the placement of safe points affects the sampling quality of standard sampling profiling techniques.

(Another interesting point, as made by Todd Mytkowicz and Amer Diwan of the University of Colorado: since JIT behavior really depends on everything in the system, and most profilers are part of the system, the decision about where to put a safe point will end up depending on which profiler you are using. This can make the results of the profilers clash violently: because of their differing behaviors, the safe points end up in different places, and the profilers end up tracking different places. See Mytkowicz and Diwan's recent PLDI paper for details.)

All of this JIT talk basically mirrors the Heisenberg Uncertainty Principle, but for profiling. You can either have exact information, or you can have your code behave as it is supposed to behave, but not both. You need to minimize the effects, so you want a profiler that doesn't interfere with or depend on JIT decisions at all. AsyncGetCallTrace fits this bill - you call it, and it returns a result. You don't call it directly from your code. It doesn't wait for a safe point. It doesn't change your code. The JIT doesn't care.

ETA: I believe that the profiler bundled with Sun Studio uses the AsyncGetCallTrace method call, but I'm not exactly sure how.

Sunday, February 7, 2010

Garbage Collection, [Soft]References, Finalizers and the Memory Model (Part 2)

In which I ramble on at greater length about finalizers and multithreading, as well as applying that discussion to SoftReferences. Read Part 1 first. It explains why finalizers are more multithreaded than you think, and why they, like all multithreaded code, aren't necessarily going to do what you think they should do.

I mentioned in the last entry that a co-worker asked me a question about when objects are allowed to be collected. My questioner had been arguing with his colleagues about the following example:

Object f() {
  Object o = new Object();
  SoftReference<Object> ref = new SoftReference<Object>(o);
  return ref.get();
}
He wanted to know whether f() was guaranteed to return o, or whether the new Object could be collected before f() returned. The principle he was operating under was that because the object reference was still on the stack, the object would not be collected. Those of you who read the last entry will know better, of course. To put it in the words of the Java Language Specification, optimizing transformations of a program can be designed that reduce the number of objects that are reachable to be less than those which would naively be considered reachable.

In practice, what this means is that the VM can:
  1. Decide the variable o is no longer used after the last time it appears in the program,

  2. Clear the SoftReference immediately after it is constructed, and

  3. Return null

Those of you who read my entry on how Hotspot decides to clear SoftReferences will know that this is extremely unlikely in Hotspot, where a SoftReference always survives for at least one collection. But in other VMs? I have no idea.

Anyway, the obvious question is: how do you prevent the collector from collecting this object? There are a few ways:

  1. The obvious answer is to return o instead of ref.get() (this was harder in the original code, which I've simplified to make my point).

  2. Another way is to make o reachable from a static field. That makes the object rather obviously reachable.

  3. A third way is to give the object a finalizer, and do synchronization in that finalizer, and then do synchronization on that object when you want the object to stay alive.

That last one probably requires a few more words. We wrote that into the spec so that the Finalizer Guardian idiom would work. The Finalizer Guardian idiom is described in Josh Bloch's invaluable book Effective Java (lgt amazon. Seriously, if you don't have a copy, go get one).

Anyway, we put a nice little exegesis on the link between the Finalizer Guardian and this rule into the language spec, so just go ahead and read it there.

Having said all of that, I will agree that this is a massive pain. It would be nice if keeping objects reachable-and-not-collected were a little easier. Doug Lea has proposed a method in his new Fences API called reachabilityFence. This method would keep an object reachable (in a garbage collection sense) until execution reached that point in the code. I'm on the fence about the Fences API, but I think that this particular method (which he alternatively called keepAlive) would be a useful tool.
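To make that concrete, here is how the example from above might look with such a method. Fences.reachabilityFence is only a proposal, so treat this as a sketch of an API that doesn't exist yet:

Object f() {
  Object o = new Object();
  SoftReference<Object> ref = new SoftReference<Object>(o);
  Object result = ref.get();
  // Keep o strongly reachable until this point, so the collector cannot
  // clear the SoftReference before the get() above completes.
  Fences.reachabilityFence(o);  // proposed API, not yet real
  return result;
}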

Sunday, January 31, 2010

Garbage Collection, References, Finalizers and the Memory Model (Part 1)

A little while ago, I got asked a question about when an object is allowed to be collected. It turns out that objects can be collected sooner than you think. In this entry, I'll talk a little about that.

When we were formulating the memory model, this question came up with finalizers. Finalizers run in separate threads (usually they run in a dedicated finalizer thread). As a result, we had to worry about memory model effects. The basic question we had to answer was, what writes are the finalizers guaranteed to see? (If that doesn't sound like an interesting question, you should either go read my blog entry on volatiles or admit to yourself that this is not a blog in which you have much interest).

Let's start with a mini-puzzler. A brief digression: I'm calling it a mini-puzzler because in general, for puzzlers, if you actually run them, you will get weird behavior. In this case, you probably won't see the weird behavior. But the weird behavior is perfectly legal Java behavior. That's the problem with multithreading and the memory model — you really never know what the results of a program will be from doing something as distasteful as actually running it.

Anyway, suppose you have a class that looks like this:

class FinalizableObject {
  int i;        // set in the constructor
  int j;        // set by the setter, below
  static int k; // set by direct access

  public FinalizableObject(int i) {
    this.i = i;
  }

  public void setJ(int j) {
    this.j = j;
  }

  public void finalize() {
    System.out.println(i + " " + j + " " + k);
  }
}

And then you use this class, thus:

void f() {
  FinalizableObject fo = new FinalizableObject(1);
  fo.setJ(2);
  FinalizableObject.k = 3;
}

Let's say that by some miracle the finalizer actually runs (Rule 1 of why you don't use finalizers: they are not guaranteed to run in a timely fashion, or, in fact, at all). What do you think the program is guaranteed to print?

Those of you who are used to reading these entries will realize immediately that, unless you actually already know the answer, you have no idea. Let's try to reason it out, then.

First, we notice that the object reference fo is live on the stack when all three variables are set. So, the object shouldn't get garbage collected, right? The finalizer should print out 1 2 3, yes?

Would I have asked if that were the answer?

It turns out that the VM, as usual, is going to play some tricks here. In the words of the JLS, optimizing transformations of a program can be designed that reduce the number of objects that are reachable to be less than those which would naively be considered reachable. What this means is that the VM is going to make your object garbage sooner than you think.

The VM can do a few things to effect this (yes, this is the correct spelling of effect). First, it can notice that the object is never used after the call to setJ, and null out the reference to fo immediately after that. It's reasonably clear that if the finalizer ran immediately after that, you would see 1 2 0.

That's not the end of it, though. The VM can notice that:

  • This thread isn't using the value written by that write to j, and

  • There is no evidence that synchronization will make this write visible to another thread.

The VM can then decide that the write to j is redundant, and eliminate that write altogether. Woosh! You get 1 0 0.

At this point, you are probably expecting me to say that you can also get 0 0 0, because the programmer isn't actually using the write to i, either. As a matter of fact, I'm not going to say that. It turns out that the end of an object's constructor happens-before the execution of its finalize method. In practice, what this means is that any writes that occur in the constructor must be finished and visible to any reads of the same variable in the finalizer, just as if those variables were volatile. (This paragraph originally read incorrectly.)

The immediate question is, how does the programmer avoid this insanity? The answer is: don't use finalization!

Okay, that's not enough of an answer. Sometimes you need to use finalization. There's a hint several paragraphs up. The finalizer takes place in a separate thread. It turns out that what you need to do is — exactly what you would do to make the code thread-safe. Let's do that, and look at the code again.

class FinalizableObject {
  static final Object lockObject = new Object();
  int i;        // set in the constructor
  int j;        // set by the setter, below
  static int k; // set by direct access

  public FinalizableObject(int i) {
    this.i = i;
  }

  public void setJ(int j) {
    this.j = j;
  }

  public void finalize() {
    synchronized (lockObject) {
      System.out.println(i + " " + j + " " + k);
    }
  }
}

And then you use this class, thus:

void f() {
  synchronized (lockObject) {
    FinalizableObject fo = new FinalizableObject(1);
    fo.setJ(2);
    FinalizableObject.k = 3;
  }
}

The finalizer is now guaranteed not to execute until all of the fields are set. When that sucker runs, you will see 1 2 3.

Oddly, I've been writing for almost an hour and I haven't gotten to my coworker's question yet. In the interests of brevity, I'll make this a series. More later.

Wednesday, January 6, 2010

A Note on the Thread (Un)safety of Format Classes

I recently got a note on another blog post asking this question:

I have a general question on the thread safety and this is not directly related with your blog. I would appreciate if you could post it on your blog. I have a class that has only one static method that accepts two string parameters and does some operation locally (as shown below). Do I need to synchronize this method?

public class A {
  public static boolean getDate(String str, String str2){
    SimpleDateFormat formatter =
      new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.SSS");
    boolean isBefore = false;
    Date date1;
    Date date2;
    try {
      date1 = formatter.parse(str);
      date2 = formatter.parse(str2);
      isBefore = date1.after(date2);
    } catch (Exception e) {
      e.getMessage();
    }
    return isBefore;
  }
}


Since it had nothing to do with the blog post in question, I'll answer it here.

The questioner is clearly asking about the well-documented lack of thread safety of the Format classes. This is a long-standing bug at Sun. If you create a Format object (or a MessageFormat, NumberFormat, DecimalFormat, ChoiceFormat, DateFormat or SimpleDateFormat object), it cannot be shared among threads. The above code does not share its SimpleDateFormat object among threads, so it is safe.

If the formatter field had been a static field of the class, the method would not be thread-safe. However, this way, you have to create a (relatively) expensive SimpleDateFormat object every time you invoke the method. There are many ways around this. One answer (if you want to avoid locking) is to use a ThreadLocal:

private static final ThreadLocal<SimpleDateFormat> formatters =
    new ThreadLocal<SimpleDateFormat>() {
      @Override protected SimpleDateFormat initialValue() {
        return new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.SSS");
      }
    };

public static boolean getDate(String str, String str2) {
  SimpleDateFormat formatter = formatters.get();
  ...
The even better answer is to use the thread-safe Joda time libraries, instead. I hate to recommend using a non-standard library when there is an equivalent standard one, but this Format behavior is pretty broken.

Thursday, December 10, 2009

Allocation Instrumenter for Java

In brief: We've open sourced a tool that allows you to provide a callback every time your program performs an allocation. The Java Allocation Instrumenter can be found here. Give it a whirl, if you are interested.

One thing that crops up a lot at my employer is the need to take an action on every allocation. This can happen in a lot of different contexts:
  1. The programmer has a task, and wants to know how much memory the task allocates, so wants to increment a counter on every allocation.
  2. The programmer wants to keep a histogram of the call sites that allocate most frequently.
  3. The programmer wants to prevent a task from allocating too much memory, so it keeps a counter on every allocation and throws an exception when the counter reaches a certain value.

Because of the demand for this, a few of us put together a tool that instruments your code and invokes a callback on every allocation. The Allocation Instrumenter is a Java agent written using the java.lang.instrument API and ASM. Each allocation in your Java program is instrumented; a user-defined callback is invoked on each allocation.

The easiest way to explain this is with an example. Assume you have a program that creates 10 strings, and you want to instrument it:

public class Test {
  public static void main(String[] args) throws Exception {
    for (int i = 0; i < 10; i++) {
      new String("foo");
    }
  }
}
To do this, you create an instance of the interface Sampler:

import com.google.monitoring.runtime.instrumentation.AllocationRecorder;
import com.google.monitoring.runtime.instrumentation.Sampler;

public class Test {
  public static void main(String[] args) throws Exception {
    AllocationRecorder.addSampler(new Sampler() {
      public void sampleAllocation(int count, String desc,
          Object newObj, long size) {
        System.out.println("I just allocated the object " + newObj +
            " of type " + desc + " whose size is " + size);
        if (count != -1) {
          System.out.println("It's an array of size " + count);
        }
      }
    });
    for (int i = 0; i < 10; i++) {
      new String("foo");
    }
  }
}

You can then compile and run the program:

% javac -classpath path/to/allocation.jar Test.java
% java -javaagent:path/to/allocation.jar Test

The output will look something like this:

I just allocated the object foo of type java/lang/String whose size is 24
I just allocated the object foo of type java/lang/String whose size is 24
I just allocated the object foo of type java/lang/String whose size is 24
I just allocated the object foo of type java/lang/String whose size is 24
I just allocated the object foo of type java/lang/String whose size is 24
I just allocated the object foo of type java/lang/String whose size is 24
I just allocated the object foo of type java/lang/String whose size is 24
I just allocated the object foo of type java/lang/String whose size is 24
I just allocated the object foo of type java/lang/String whose size is 24
I just allocated the object foo of type java/lang/String whose size is 24

So, by my standards, it is really pretty easy to use. If you find it useful, please let me know!

Edited to add I noticed this on Twitter: Cool, even if it uses Ant (so probably I will never try it). This is funny, because I only added an ant buildfile so more people would try it. You can download the source and compile it with javac in about one line.

Monday, September 7, 2009

Cliff Click on Java-vs-C Performance

Cliff Click has a terrific post on Java-vs-native-code performance. Recommended.

Cliff is the Chief JVM Architect at Azul Systems, was the lead designer behind Hotspot's server JIT compiler, and is an all-around smart guy. His main drawback is that he (apparently) gets dragged into flamewars on YouTube's comment section.

Tuesday, July 7, 2009

How Hotspot Decides to Clear SoftReferences

I got asked about this twice in one day, and I didn't know the answer, so I sat down and puzzled it out a bit.

A SoftReference is a reference that the garbage collector can decide to clear if it is the only reference left to an object, and the GC decides (through some undefined process) that there is enough memory pressure to warrant clearing it. SoftReferences are generally used to implement memory-sensitive caches. A mistake many people make is to confuse it with a WeakReference, which is a reference that, if it is the only reference left to the object, the garbage collector will aggressively clear.

I got asked how Hotspot's GC actually decides whether to clear SoftReferences. Twice. On the same day. (Hotspot is the Sun / OpenJDK JVM). The answer was that I had no idea, so I had to go look it up. Here is what I found out; I hope it is a useful reference for someone (pun intended).

First, there is a global clock variable that is set to the current time (in millis) every time a garbage collection occurs. Every SoftReference has a timestamp field that is set to the current value of clock when it is accessed (when it is constructed or when the get() method is called). This gives a very coarse ordering over the SoftReferences; the timestamp indicates the last GC before they were accessed.

When a garbage collection occurs, the decision to clear a SoftReference is based on two factors:
  1. How old the reference's timestamp is, and
  2. How much free space there is in memory.
The calculation is pretty simple. If:
  • free_heap is the amount of free heap space in MB,
  • interval is the time between the last GC's clock and the timestamp of the ref we are currently examining, and
  • ms_per_mb is a constant number of milliseconds to keep around a SoftReference for each free megabyte in the heap
Then the reference is kept if:
interval <= free_heap * ms_per_mb
To take an example, let's say that we have a SoftReference with a timestamp of 2000ms, the last GC's clock time was 5000ms, the ms_per_mb is 1000 and the free space is 1MB. We then test whether:
5000 - 2000 <= 1 * 1000
This is false (3000 > 1000), so we clear the reference.

Now let's say there is more free space — say, 4MB — and so less reason to clear the SoftReferences. The calculation is now
5000 - 2000 <= 4 * 1000
This is true (3000 <= 4000), so we don't clear the reference.
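In code form, the whole policy is roughly the following; this is a sketch in Java for clarity (the real logic lives in Hotspot's native reference-processing code):

// Returns true if the SoftReference survives this collection.
static boolean shouldKeep(long refTimestamp, long lastGcClock,
                          long freeHeapMb, long msPerMb) {
  long interval = lastGcClock - refTimestamp;
  return interval <= freeHeapMb * msPerMb;
}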

One thing to notice about this is that it implies that SoftReferences will always be kept for at least one GC after their last access. Why is that? Well, for the interval, we are using the clock value of the last garbage collection, not the current one. As a result, if a SoftReference has been accessed since the last garbage collection, it will have the same timestamp as that garbage collection, and the interval will be 0. 0 <= free_heap * ms_per_mb for any amount of free_heap, so any SoftReference accessed since the last garbage collection is guaranteed to be kept. This is actually how this question came up; some of my colleagues noticed their SoftReferences weren't being cleared during a garbage collection, and they didn't know why.

ETA: The above paragraph originally said "one Full GC", not "one GC". In our case, it was one full GC, because the objects being allocated were too big to be allocated in the young generation. In most cases, it will be one GC of the generation in which the object was allocated, which will usually be the new generation / TLAB.

Another thing to notice is that the value of 1000 for ms_per_mb is fairly arbitrary. It can be adjusted with the JVM flag -XX:SoftRefLRUPolicyMSPerMB=n. If you adjust it down, then free_heap * ms_per_mb will be smaller, and so SoftReferences are more likely to be cleared. If you adjust it up, you get the opposite effect.

Wednesday, June 17, 2009

Volatile Arrays in Java

I get asked a lot about how the volatile keyword interacts with arrays, so it is probably worth a blog post on the subject.

Those of you who have read my posts on volatile (Volatile Fields and Synchronization, Volatile Does Not Mean Atomic and, most importantly, What Volatile Means in Java) will have a pretty good idea of what volatile means, but it is probably worth it to provide a reminder.

Basically, if you write to a volatile field, and then you have a later read that sees that write, then the actions that happened before that write are guaranteed to be ordered before and visible to the actions that happen after the read. In practice, what this means is that the compiler and the processor can't do any sneaky reordering to move actions that come before the write to after it, or actions that come after the write to before it. See my post on What Volatile Means in Java for more detail.

With that out of the way, let's go through some examples of what you can do with volatile arrays:
volatile int [] arr = new int[SIZE];

arr = arr;
int x = arr[0];
arr[0] = 1;
The first lesson to learn, which will guide us here, is that arr is a volatile reference to an array, not a reference to an array of volatile elements! As a result, the write to arr[0] is not a volatile write. If you write to arr[0], you will get none of the ordering or visibility guarantees that you want from a volatile write.

What examples are there of a volatile write in the code above? Well, both of the writes to arr — the self-assignment and the write of new int[SIZE] — are volatile writes, because they are writing to arr, not one of its elements.

That explains where the volatile writes are. Where are the volatile reads in our example? It turns out that each of the lines after the declaration contains a volatile read:
arr = arr
This one is easy. The volatile read is on the right hand side of the assignment statement.
int x = arr[0];
This one is slightly more subtle. The volatile read is not the read of the array element. The right hand side of that assignment is a two step process. First, you read the array reference, then you read the 0th element of that array. The volatile read is the read of the array reference, not the read of the 0th element.
arr[0] = 1;
The previous example should give you a hint of where the volatile read is on this line. As in that example, the left-hand side is a two step process. First, you read the array reference, then you assign to the 0th element of that array. As odd as it seems, the read of the array reference is a volatile read.

The astute reader will notice that there is no actual way to get volatile write semantics by writing to an element of an array referenced by a volatile field. The easiest way to get volatile array semantics is to use the Atomic[Reference/Long/Integer]Array classes in java.util.concurrent.atomic, which provide volatile semantics for reads and writes of individual elements.
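For example, AtomicIntegerArray gives you per-element volatile semantics directly; this is the standard java.util.concurrent.atomic API:

import java.util.concurrent.atomic.AtomicIntegerArray;

// SIZE as in the examples above. set() is a volatile write of the
// element, and get() is a volatile read of it.
AtomicIntegerArray atomicArr = new AtomicIntegerArray(SIZE);
atomicArr.set(0, 1);       // volatile write of element 0
int x = atomicArr.get(0);  // volatile read of element 0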

(Why not Float/Short/Double Array? With APIs, you never ask "why not", you ask "why". Meanwhile, you have 32- and 64-bit bit patterns, so the Float.floatToIntBits and Float.intBitsToFloat family of functions are your friends.)

These classes are somewhat problematic, though. If nothing else, you are endlessly boxing and unboxing values, which may make access expensive. Ugh — I really do know better than this, really! As a result, there is more to this story.

You may have noticed that I did provide a way of getting a volatile write above with just arrays: by writing out a self-reference. I have been asked if that technique can be leveraged to provide volatile access to array elements. Here's what that would look like:
// I wouldn't do this if I were you.
volatile int [] arr = new int[SIZE];

arr[0] = 1;
arr = arr;
This definitely does provide the volatile write. However, what good does it do you? The virtue of a volatile write is that a corresponding read can detect that it happened, and do something based on the new value. For example, you can use a volatile flag to force one thread to loop indefinitely until another one sets the flag. In this case, you can't actually detect that another thread performed the write, because it is writing the same value to the variable.

You can use sun.misc.Unsafe to emulate the behavior of a volatile array. But not everyone is working on a Sun VM. And they are trying to discourage the adoption of sun.misc.Unsafe, so I'm not going to put the code here.

Don't despair too much, though — Doug Lea and the clever folks involved in JSR-166 are working on better solutions for Java 7. More news as events warrant.

Tuesday, June 9, 2009

Welcome!

Mailinator's Paul Tyma linked to me. If you are following from that link, the relevant blog entry you are looking for is probably this one, specifically, the entry labeled "visibility".

Saturday, April 4, 2009

Faster Logging with Faster Logger Classes

Today, I'll discuss a little tweak I made to java.util.Logging that made my logging throughput double. I want to use it as an illustration that it often isn't very difficult to improve the performance of concurrent code by doing things that are actually pretty easy to do.

So, "I" have an application that is running a couple of hundred threads on an 8-core machine, and it wants to log about 2MB a second using java.util.Logger. When I say "I have", I actually mean "someone else has", because if "I" had to log a megabyte a second, there is no way I would use java.util.Logger to do it. Still, we all make our choices.

When I came to this code, it was already doing sensible things like buffering its output. It wasn't doing something more complicated, like using a dedicated logging thread. It could log about 1MB a second, and was chewing through CPU pretty rapidly. I just decided to run our profiler on the code and see where the bottlenecks were. Our profiler is based on the undocumented AsyncGetCallTrace profiling call in HotSpot, and can actually profile your code without interfering with its performance characteristics. This is nice, because you don't end up optimizing the wrong things. But I digress.

Anyway, the profiler showed that we were spending a LOT of time in Logger.log(), on lines that look fairly harmless, but aren't:

public void log(LogRecord record) {
  if (record.getLevel().intValue() < levelValue || levelValue == offValue) {
    return;
  }
  synchronized (this) {
    if (filter != null && !filter.isLoggable(record)) {
      return;
    }
  }

  // Post the LogRecord to all our Handlers, and then to
  // our parents' handlers, all the way up the tree.

  Logger logger = this;
  while (logger != null) {
    Handler targets[] = logger.getHandlers();

    if (targets != null) {
      for (int i = 0; i < targets.length; i++) {
        targets[i].publish(record);
      }
    }

    if (!logger.getUseParentHandlers()) {
      break;
    }

    logger = logger.getParent();
  }
}
There are no fewer than four locks being acquired in this method. There is the obvious call to synchronized(this). logger.getHandlers() is also a synchronized method. getUseParentHandlers() is synchronized. getParent() is also synchronized. Acquiring and releasing these locks was killing our throughput!

It turns out that there are some very simple things you can do to eliminate the locks in this code without exposing yourself to any correctness issues (a sketch combining the changes follows the list):

  1. The filter field is protected by the lock on this. That lock is really only protecting one field; the lock is held while one line of code writes to it, and one line of code reads from it. If you want to make something that does the same thing (from the perspective of concurrency), you can make that field into a volatile. If you are worried about the filter reference changing between the null check and the use, you can read it into a local variable first:
    Filter theFilter = filter;  // filter is volatile
    if (theFilter != null && !theFilter.isLoggable(record)) {
      return;
    }
  2. There is a global lock protecting the handlers. It turns out that the only thing this is protecting is a rarely-updated array of Handlers. If you have a rarely updated concurrent array, you should be using the non-blocking CopyOnWriteArrayList instead.

  3. getUseParentHandlers() is synchronized for the exact same reason as the filter, and can be replaced with a volatile in the same way.
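
Putting those three changes together, the hot path looks something like the sketch below. It is the shape of the fix, not the actual JDK7 source (field names are illustrative):

import java.util.concurrent.CopyOnWriteArrayList;
import java.util.logging.Filter;
import java.util.logging.Handler;
import java.util.logging.LogRecord;

// Sketch: volatile for the single-field reads, CopyOnWriteArrayList for
// the rarely-updated handler list. Not the actual JDK7 code.
class FastLogger {
  private volatile Filter filter;              // was guarded by synchronized(this)
  private volatile boolean useParentHandlers = true;
  private volatile FastLogger parent;
  private final CopyOnWriteArrayList<Handler> handlers =
      new CopyOnWriteArrayList<Handler>();     // was a lock-protected array

  public void log(LogRecord record) {
    Filter theFilter = filter;                 // one volatile read
    if (theFilter != null && !theFilter.isLoggable(record)) {
      return;
    }
    for (FastLogger logger = this; logger != null; logger = logger.parent) {
      for (Handler handler : logger.handlers) { // iterates a lock-free snapshot
        handler.publish(record);
      }
      if (!logger.useParentHandlers) {
        break;
      }
    }
  }
}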


These are all pretty simple improvements, but they doubled my throughput. The changes are going to be incorporated into JDK7, and are, in fact, already in the downloads you can get from OpenJDK.

I should rush to say that I don't blame the original author for not making these changes; java.util.logging was added in JDK 1.4, before the addition of java.util.concurrent and before a rigorous definition of volatile. Plus, it really isn't designed for throughput.

Why am I blogging about pretty simple improvements? There are a few simple morals here:

  1. Learn your libraries. Don't have a synchronized array that is almost never updated, for example.

  2. Know when you can use tricky language features. Knowing that volatile can be used for simple, single-variable updates is a very useful thing to know. I've written about volatile frequently, go read those posts.

  3. It is actually worth it to participate in OpenJDK. We (or I, at least) have a tendency to assume that JDK code is really high quality and well-optimized. This is probably true in a lot of areas (java.util.concurrent, or java.lang.MostStuff, or java.util.YourFavoriteDataStructure), but there is still plenty of work to be done. If you get involved, you not only help yourself, but you also help everyone.

  4. I don't know, low-hanging fruit == good?

There is, by the way, a very large section of the definitely-recommended book Java Concurrency in Practice (lgt amazon) devoted to clever things you can do to speed up your multithreaded logging.

Saturday, February 28, 2009

Small Language Changes for JDK7

For those who haven't been paying attention, there is a new project from Sun to attract proposals for small language changes to JDK7. Even in the first day of open calls for proposals, the entries have been very interesting:
  • Strings in Switch, a proposal from Joe Darcy to be able to use Strings in switch statements.

  • Automated Resource Blocks, a proposal from Josh Bloch to be able to say things like:

    try (BufferedReader br = new BufferedReader(new FileReader(path))) {
      return br.readLine();
    }
    instead of:

    BufferedReader br = new BufferedReader(new FileReader(path));
    try {
      return br.readLine();
    } finally {
      br.close();
    }
    based on having BufferedReader implement a Disposable interface.

  • Block expressions, a proposal from Neal Gafter to allow things like:
    double pi2 = (double pi = Math.PI ; pi*pi); // pi squared

    Basically, the principle here is that the (parenthesis-delimited) block would return a value.

  • Exception handling improvements, another proposal from Neal Gafter to allow things like:
    try {
      doWork(file);
    } catch (final IOException | SQLException ex) {
      logger.log(ex);
      throw ex;
    }

  • Improved Type Inference, a proposal from me (although I can't claim credit for the idea) to be able to replace things like this:
    Map<String, List<String>> anagrams = new HashMap<String, List<String>>();
    with things like this:
    Map<String, List<String>> anagrams = new HashMap<>();

  • A new syntax for wildcard variance, again from the prolific Neal Gafter, allowing the user to replace "? extends" with "out" and "? super" with "in". (Can you tell that Neal's been working on C#?)

I don't love all of these proposals, but I do love some of them, and I think it is great to see that the pent-up demand for work on the Java language finally has an outlet.

(For those of you following the various Closures proposals: Closures don't count as a small change, and are therefore not in scope for this JSR.)

ETA: The "use Scala instead" and "where is BGGA?" comments have been made, so please read the comments and feedback before reiterating them.

Sunday, December 14, 2008

Data-Race-Ful Lazy Initialization for Performance

I was asked a question about benign data races in Java this week, so I thought I would take the opportunity to discuss one of the (only) approved patterns for benign races. So, at the risk of encouraging bad behavior (don't use data races in your code!), I will discuss the canonical example of "benign races for performance improvement". Also, I'll put in another plug for Josh Bloch's new revision of Effective Java (lgt amazon), which I continue to recommend.

As a reminder: basically, a data race is when you have one (or more) writes, and potentially some reads; they are all to the same memory location; they can happen at the same time; and there is nothing in the program to prevent it. This is different from a race condition, which is when you just don't know the order in which two actions are going to occur. I've put more discussion of what a data race actually is at the bottom of this post.

A lot of people think that it is okay to have data races all over your code. By and large, it is a lot more dangerous than those people think it is. For example, the version of double checked locking that doesn't use volatile has a data race in it. As I've discussed before, that version is broken.

So, having said that, when can you have a harmless data race in your code? Here's some code from Sun's implementation of java.lang.String, which is available under the GPL:

public final class String {
  /** The value is used for character storage. */
  private final char value[];

  /** The offset is the first index of the storage that is used. */
  private final int offset;

  /** The count is the number of characters in the String. */
  private final int count;

  /** Cache the hash code for the string */
  private int hash; // Default to 0

  public int hashCode() {
    int h = hash;
    if (h == 0) {
      int off = offset;
      char val[] = value;
      int len = count;

      for (int i = 0; i < len; i++) {
        h = 31*h + val[off++];
      }
      hash = h;
    }
    return h;
  }

  // ...
}
The principle here is pretty simple. The program calls hashCode(). If it hasn't calculated the hash code yet, then it does, and then stores the value for future use. The people who wrote String didn't bother to calculate the hash code in the constructor, because that would make construction (which is done all the time) more expensive. Because Strings are immutable, we don't need to worry about the hash code potentially changing in the future.

Do you see the potential data race? Here's some code that triggers it:

class BenignDataRace {
  final String s = "blah";

  public void main() {
    class CalcHash implements Runnable {
      public void run() {
        int h = s.hashCode();
      }
    }
    Thread t = new Thread(new CalcHash());
    Thread t2 = new Thread(new CalcHash());

    t.start();
    t2.start();
  }
}
Two threads access s.hash without doing synchronization, and at least one of those threads is going to write to it. Item 71 in Effective Java calls this idiom racy single-check. In this case, though, it actually works. Why is this safe, when the double-checked locking example, which uses synchronization, isn't?

There are basically two parts to the answer, and they are both pretty simple:

  1. In this case, we don't care how many times we actually calculate the answer and assign it to the hash field. It just isn't a big deal if we happen to calculate it multiple times. Furthermore,

  2. Java has a special guarantee for fields that are 32 bits or narrower and for object references: a thread will always see some value that was actually assigned to that field. As a result of this, a thread reading hash will either see 0 or the calculated value. Edited to add: I just noticed that this was misleading. An object reference will point to the right object, but the contents of the object aren't guaranteed to be correct unless the object is immutable. Please see my series on immutability.

That second item might sound obvious, but bear in mind that 64-bit values don't get that guarantee; this code would not work if hash were a long or a double.

Now, let's break the code:

public int hashCode() {
  if (hash == 0) {
    int off = offset;
    char val[] = value;
    int len = count;

    int h = 0;
    for (int i = 0; i < len; i++) {
      h = 31*h + val[off++];
    }
    hash = h;
  }
  return hash;
}
All I did was make it so that the temporary variable was only used for calculating the hash code! How could that break the code?

What I've done here is to add an additional read: the second read of hash, before the return. As odd as it sounds, and as unlikely as it is to happen, the first read can return the correctly computed hash value, and the second read can return 0! This is allowed under the memory model because the model allows extensive reordering of operations. The second read can actually be moved, in your code, so that your processor does it before the first!

(Side note: the first version is better anyway, because it performs fewer memory reads. If you really care that much about performance, that's the way to go.)

What's the moral here? The moral is that this stuff is very confusing, and you have to work very hard to get it right. Make sure you understand exactly what your code does. Avoid data races wherever possible.




Here are the details on data races that I promised above:

A program has a data race in it when it is possible to have an execution of that program where:

  1. All of the instructions in that program occur in the order they occurred in the program (this is important for technical reasons),

  2. There are two accesses to the same field / array element,

  3. At least one of those accesses is a write, and

  4. There is no happens-before ordering between those actions.


I defined "happens-before ordering" in my last post. Basically, it means that there is nothing establishing an order between the two actions; for example, there is no locking protecting the accesses, no volatile writes and reads intervening and no thread starts and joins.