Why Many Profilers have Serious Problems (More on Profiling with Signals)

This is a follow-on to a blog post I wrote in 2007 about using signal handling to profile Java apps. I explained in another post why this might be a good idea, but I wanted to expand on the theme.

Hiroshi Yamauchi and I are giving a talk at this year's JavaOne. Part of this talk will be about our experiences writing profilers at Google, and in preparing for it, I realized my previous entry on this subject only told half the story.

The topic of the original post is that there is an undocumented JVMTI call in OpenJDK that allows you to get a stack trace from a running thread, regardless of the state of that thread. On Unix-like systems, you can use a SIGPROF signal to call this function at (semi-)regular intervals, without having to do anything to your code. The follow-on post, intended to explain why you would use this, described a couple of things that are true:

  • That it tells you what is actually running on your CPU, not just what might be scheduled, which is what profilers that take a snapshot of every running thread do. This increases its accuracy dramatically. For example, there are many profilers that report a great deal of your time spent in blocking mechanisms, simply because threads frequently become scheduled as they exit blocking functions. However, those sorts of threads are all "ready to be scheduled" rather than "actually running". As a result, CPU time is charged to those threads inaccurately.

  • That it reports time spent in GC and in other VM-specific activities (like JIT compiling). At Google, we find that a substantial amount of our users' CPU time is spent in VM activities. In fact, GC brings the occasional Google service to its knees (not mentioning any names).

I now realize that I left all of the JIT compiler-related reasons out of that post. This post is an attempt to repair that problem, and describe why profiling with AsyncGetCallTrace and signals is a better approach than the typical state of the art.

Sampling profilers that take thread dumps from individual threads often do so by injecting stack-trace-gathering code into your code. This is suboptimal in many ways:

  • It introduces overhead into your code. This is the most obvious way. As soon as you introduce overhead, you are changing the performance characteristics.

  • Changing the size of the code changes the optimizing decisions made by the JIT. For example, one of the most important performance improvements made by a just-in-time compiler is when it inlines small methods. It won't inline if the method gets too large. Introducing sampling into the methods changes the size, and therefore affects inlining decisions, changing the performance characteristics of your code.

  • Changing the size of the code affects code layout. For relatively obvious reasons of caching and memory alignment, the placement of your code can affect its performance. Bigger code often means worse performance.
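To make the size argument concrete, here is a minimal sketch of a tiny accessor before and after a sampling hook is injected. The hook, the SAMPLE_EVERY period, and the class and method names are all hypothetical and are not taken from any real profiler. The point is only that the instrumented method has a noticeably larger body and does extra work on every call, which is the kind of change that can push a method past a JIT's inlining size limits.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Contrast between a tiny method and the same method with a hypothetical sampling hook injected. */
class InstrumentationSketch {
    static final AtomicLong calls = new AtomicLong();
    static final AtomicLong samplesTaken = new AtomicLong();
    static final int SAMPLE_EVERY = 1000; // hypothetical sampling period, in calls

    private final int value = 42;

    // Original: a handful of bytecodes, an easy inlining candidate.
    int getValue() {
        return value;
    }

    // Instrumented: same result, but a larger body and a shared counter update on every call.
    int getValueInstrumented() {
        if (calls.incrementAndGet() % SAMPLE_EVERY == 0) {
            // Stand-in for injected stack-gathering code.
            StackTraceElement[] stack = Thread.currentThread().getStackTrace();
            if (stack.length > 0) {
                samplesTaken.incrementAndGet();
            }
        }
        return value;
    }
}
```

Both methods return the same value, so the program's results do not change. What changes is the bytecode size and the per-call cost of the method being measured.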

At this point, you might be thinking that the best thing to do is use something like Thread.getAllStackTraces and take your chances with its inaccuracy (which is what the aforementioned sampling profilers do). However, there is one more factor, and it affects every documented, built-in stack trace sampling method: a thread's stack trace is only sampled when that thread is at a safe point. Safe points are places in the code where the VM knows it can safely do a whole host of things, like initiate garbage collection. The location of these safe points is determined by the JIT, which often puts them in places that aren't ideal for CPU profiling. For example, there may be a hot loop in your code that the JIT decides should not be interrupted by a safe point. If you use most standard profilers, this hot loop will never get profiled! As a result, the placement of safe points affects the sampling quality of standard sampling profiling techniques.
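For reference, a naive sampler built on Thread.getAllStackTraces might look roughly like the sketch below. The class name, the sample method, and its parameters are made up for illustration, and this is not how any particular profiler is implemented. It takes a snapshot of every thread at a fixed interval and counts which method sits at the top of each RUNNABLE thread's stack.

```java
import java.util.HashMap;
import java.util.Map;

/** Naive sampling profiler sketch built on the documented Thread.getAllStackTraces API. */
class NaiveSampler {

    /**
     * Takes up to `samples` snapshots, `intervalMillis` apart, and counts how often each
     * method appears as the top frame of a RUNNABLE thread other than the sampler itself.
     */
    static Map<String, Integer> sample(int samples, long intervalMillis) {
        Map<String, Integer> hits = new HashMap<>();
        for (int i = 0; i < samples; i++) {
            // In HotSpot, stacks are only walked once threads have reached a safe point,
            // so code between safe points is never seen as the top frame.
            for (Map.Entry<Thread, StackTraceElement[]> e : Thread.getAllStackTraces().entrySet()) {
                Thread t = e.getKey();
                StackTraceElement[] stack = e.getValue();
                if (t == Thread.currentThread() || stack.length == 0) {
                    continue;
                }
                // RUNNABLE means "eligible to run", not "currently on a CPU".
                if (t.getState() != Thread.State.RUNNABLE) {
                    continue;
                }
                String top = stack[0].getClassName() + "." + stack[0].getMethodName();
                hits.merge(top, 1, Integer::sum);
            }
            try {
                Thread.sleep(intervalMillis);
            } catch (InterruptedException ex) {
                Thread.currentThread().interrupt(); // preserve the interrupt and stop sampling
                break;
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        Thread worker = new Thread(() -> {
            long x = 0;
            while (!Thread.currentThread().isInterrupted()) {
                x += System.nanoTime() % 7;
            }
        }, "worker");
        worker.setDaemon(true);
        worker.start();
        sample(50, 10).forEach((method, count) -> System.out.println(count + "\t" + method));
        worker.interrupt();
    }
}
```

Both problems described in this post are visible in the sketch: the RUNNABLE check cannot tell a thread that is on a CPU from one that is merely ready to run, and the stack contents reflect where the thread reached a safe point rather than where it was executing when the sample was requested.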

(Another interesting point, as made by Todd Mytkowicz and Amer Diwan of the University of Colorado: since JIT behavior really depends on everything in the system, and most profilers are part of the system, the decision about where to put a safe point will end up depending on which profiler you are using. This can make the results of the profilers clash violently: because of their differing behaviors, the safe points end up in different places, and the profilers end up tracking different places. See Mytkowicz and Diwan's recent PLDI paper for details.)

All of this JIT talk basically mirrors the Heisenberg Uncertainty Principle, but for profiling. You can either have exact information, or you can have your code behave as it is supposed to behave, but not both. You need to minimize the effects, so you want a profiler that doesn't interfere with or depend on JIT decisions at all. AsyncGetCallTrace fits this bill - you call it, and it returns a result. You don't call it directly from your code. It doesn't wait for a safe point. It doesn't change your code. The JIT doesn't care.

ETA: I believe that the profiler bundled with Sun Studio uses the AsyncGetCallTrace method call, but I'm not exactly sure how.

Comments

Noel Grandin said…
Any chance this profiler work will show up in an open-source project?
Jeremy Manson said…
We're considering it. The problem is that our team is very good at figuring out JIT issues and POSIX semantics, but not so great with the user interfaces. :)
pveentjer said…
Hi Jeremy,

What is your opinion about profilers that are able to integrate with the CPU to read out the performance monitors available there, e.g. Intel VTune?
Jeremy Manson said…
Hi Peter -

My understanding is that the hardware performance counters for VTune are associated with a particular method, not a stack trace / call graph. AFAICT - I'm neither a VTune developer nor a user - VTune's stack trace / call graph tracking mechanism uses JVMTI to be informed when a method is invoked, and then VTune keeps track of that information internally (that approach is accurate, but slooooow).

For the hardware counters, they use a different mechanism to associate hardware counters with specific methods very accurately and (I assume) quickly.
kontiky said…
Hello, Jeremy!
We've found a strange problem in our real system:
http://forums.sun.com/thread.jspa?threadID=5443540
Could you please explain this strange performance degradation and suggest a solution?
Unknown said…
Hi Jeremy,
what about VisualVM Sampler plugin? As I understand, it uses some sort of sampling, both stacktraces and heap dumps.
Did not find official page, but here's an announcement from author: http://java.dzone.com/announcements/visualvm-12-great-java
Jeremy Manson said…
This comment has been removed by the author.
Jeremy Manson said…
@Palo - VisualVM uses the same, documented methods, and has the same limitations (Sorry about my earlier post - I misread your comments).
bluedavy said…
Do you mean Google wrote a profiler for Java, just like Google perftools, that can be used on online apps and does CPU and memory profiling?
If yes, can you say more about how the CPU and memory profiling is implemented, and share some screenshots?
Anonymous said…
Jeremy: I want to second the plea for you guys to open source your profiler. Please do this even if the UI is not a slick GUI--I need the functionality now.

I was really impressed with that Mytkowicz paper when it came out some months ago. It was brilliant in more ways than you mentioned. In particular, the whole way that they define actionable profiles is spot on. Also, how they introduce deliberate delays in methods to test the hypothesis of what the profiler thinks are hot methods is way cool.
Jeremy Manson said…
@kontiky - That looks like a pretty poorly written microbenchmark. I would suggest you look at Cliff Click and Brian Goetz's various talks on how to write microbenchmarks in Java - it is kind of a black art.

@anonymous - we're working on it (it's a background task). We do suffer the problem of regular sampling, because our sampler samples once every 10ms without adding a jitter. We really should fix that, though.
Jeremy Manson said…
@BlueDavy - Yup, we use a profiler that works internally with our pprof / Google perftools framework. Screenshots are pretty pointless, since they are just as ugly as the perftools stuff. However, no one is doing the updates on the public perftools framework to make it work with Java.
Do you use Oracle Studio Analyzer? I think you mention it in your other blog post.

I am trying to understand if I can run a false-sharing Java program and still look at the thread IDs and cache lines to demonstrate false sharing. Are you aware of any other tools apart from Intel's?
Jeremy Manson said…
I don't currently use Oracle Studio, although I have tried it in the past.

I don't know of any publicly available tools, other than Intel's, to do hardware level profiling on a per-thread basis.

I guess that's not so helpful. Sorry!
Mike Dunlavey said…
The Mytkowicz and Diwan paper really bothers me. For example, its definition of "hotness" seems to mean "self time percent". The whole reason gprof was invented 30 years ago was that self time is an inadequate diagnostic. Another way it bothers me is that it concentrates on methods, rather than lines of code. Yet another way is its shallow understanding of sampling statistics. Look here: http://scicomp.stackexchange.com/a/2719/1262
Jeremy Manson said…
I think it does a good job of summarizing some of the problems of the existing profilers, though, which is why I reference it.
