Monday, August 2, 2010

Why Many Profilers have Serious Problems (More on Profiling with Signals)

This is a followon to a blog post I made in 2007 about using signal handling to profile Java apps. I mentioned in another post why this might be a good idea, but I wanted to expand on the theme.

Hiroshi Yamauchi and I are giving a talk at this year's JavaOne. Part of this talk will be about our experiences writing profilers at Google, and in preparing for it, I realized my previous entry on this subject only told half the story.

The topic of the original post is that there is an undocumented JVMTI call in OpenJDK that allows you to get a stack trace from a running thread, regardless of the state of that thread. In Unix-like systems, you can use a SIGPROF signal to call this function at (semi-)regular intervals, without having to do anything to your code. The followon post, intended to describe why you would use this, described a couple of things that are true:

  • That it tells you what is actually running on your CPU, not just what might be scheduled, which is what profilers that take a snapshot of every running thread do. This increases its accuracy dramatically. For example, there are many profilers that report a great deal of your time spent in blocking mechanisms, simply because threads frequently become scheduled as they exit blocking functions. However, those sorts of threads are all "ready to be scheduled" rather than "actually running". As a result, CPU time is charged to those threads inaccurately.

  • That it reports time spent in GC and in other VM-specific activities (like JIT compiling). At Google, we find that substantial amounts of our users' CPU time is spent in VM activities. In fact, GC brings the occasional Google service to its knees (not mentioning any names).

I now realize that I left all of the JIT compiler-related reasons out of that post. This post is an attempt to repair that problem, and describe why profiling with AsyncGetCallTrace and signals is a better approach than the typical state of the art.

Sampling profilers that take thread dumps from individual threads often do so by injecting thread stack trace gathering code into your code. This is suboptimal in many ways:

  • It introduces overhead into your code. This is the most obvious way. As soon as you introduce overhead, you are changing the performance characteristics.

  • Changing the size of the code changes the optimizing decisions made by the JIT. For example, one of the most important performance improvements made by a just-in-time compiler is when it inlines small methods. It won't inline if the method gets too large. Introducing sampling into the methods changes the size, and therefore affects inlining decisions, changing the performance characteristics of your code.

  • Changing the size of the code affects code layout. For relatively obvious reasons of caching and memory alignment, the placement of your code can affect its performance. Bigger code often means worse performance.

At this point, you might be thinking that the best thing to do is use something like Thread.getAllStackTraces and take your chances with its inaccuracy (which is what the aforementioned sampling profilers do). However, there is one more factor, which is significant in any of the documented and built-in stack trace sampling methods - that system-wide stack trace sampling happens for any given thread when it is at a safe point. Safe points are places in the code that the VM knows it can do a whole host of things - like initiate garbage collection - safely. The location of these safe points is determined by the JIT. It often puts them in places that aren't ideal for CPU profiling. For example, there may be a hot loop in your code that the JIT decides should not be interrupted by a safe point. If you use most standard profilers, this hot loop will never get profiled! As a result, the placement of safe points affects the sampling quality of standard sampling profiling techniques.

(Another interesting point, as made by Todd Mytkowicz and Amer Diwan of the University of Colorado: since JIT behavior really depends on everything in the system, and most profilers are part of the system, the decision about where to put a safe point will end up depending on which profiler you are using. This can make the results of the profilers clash violently: because of their differing behaviors, the safe points end up in different places, and the profilers end up tracking different places. See Mytkowicz and Diwan's recent PLDI paper for details.)

All of this JIT talk basically mirrors the Heisenberg Uncertainty Principle, but for profiling. You can either have exact information, or you can have your code behave as it is supposed to behave, but not both. You need to minimize the effects, so you want a profiler that doesn't interfere with or depend on JIT decisions at all. AsyncGetCallTrace fits this bill - you call it, and it returns a result. You don't call it directly from your code. It doesn't wait for a safe point. It doesn't change your code. The JIT doesn't care.

ETA: I believe that the profiler bundled with Sun Studio uses the AsyncGetCallTrace method call, but I'm not exactly sure how.


Noel Grandin said...

Any chance this profiler work will show up in an open-source project?

Jeremy Manson said...

We're considering it. The problem is that our team is very good at figuring out JIT issues and POSIX semantics, but not so great with the user interfaces. :)

pveentjer said...

Hi Jeremy,

what is your opinion about profilers that are able to integrate into the cpu to read out performance monitors available there. E.g. like Intel VTune.

Jeremy Manson said...

Hi Peter -

My understanding is that the hardware performance counters for VTune are associated with a particular method, not a stack trace / call graph. AFAICT - I'm neither a VTune developer or user - VTune's stack trace / call graph tracking mechanism uses JVMTI to be informed when a method is invoked, and then VTune keeps track of that information internally (that approach is accurate, but slooooow).

For the hardware counters, they use a different mechanism to associate hardware counters with specific methods very accurately and (I assume) quickly.

kontiky said...

Hello, Jeremy!
We've found strange problem in our real system
Could you, please, explane this strange performance degradation and advise problem solution?

Unknown said...

Hi Jeremy,
what about VisualVM Sampler plugin? As I understand, it uses some sort of sampling, both stacktraces and heap dumps.
Did not find official page, but here's an announcement from author:

Jeremy Manson said...
This comment has been removed by the author.
Jeremy Manson said...

@Palo - VisualVM uses the same, documented methods, and has the same limitations (Sorry about my earlier post - I misread your comments).

bluedavy said...

does u mean google wrote a profiler for java just like google perf-tools,it can be used for online app,and do cpu profile and memory profile?
If yes,can u say more about the cpu & memory profile how to realize and give some screenshots?

Anonymous said...

Jeremy: I want to second the plea for you guys to open source your profiler. Please do this even if the Ui is not a slick GUI--I need the functionality now.

I was really impressed with that Mytkowicz paper when it came it some months ago. It was brilliant in more ways than you mentioned. In particular, the whole way that they define actionable profiles is spot on. Also, how they introduce deliberate delays in methods to test the hypothesis of what the profiler thinks are hot methods is way cool.

Jeremy Manson said...

@kontiky - That looks like a pretty poorly written microbenchmark. I would suggest you look at Cliff Click and Brian Goetz's various talks on how to write microbenchmarks in Java - it is kind of a black art.

@anonymous - we're working on it (it's a background task). We do suffer the problem of regular sampling, because our sampler samples once every 10ms without adding a jitter. We really should fix that, though.

Jeremy Manson said...

@BlueDavy - Yup, we use profiler that works internally with our pprof / Google perftools framework. Screenshots are pretty pointless, since they are just as ugly as the perftools stuff. However, no one is doing the updates on the public perftools framework to make it work with Java.

Mohan Radhakrishnan said...

Do you use Oracle studio analyzer ? I think you mention it in your other blog post.

I am trying to understand if I can run a false sharing Java program and still look at the thread id's, cache lines to demonstrate false sharing. Are you aware of any other tools apart from intel's ?

Jeremy Manson said...

I don't currently use Oracle Studio, although I have tried it in the past.

I don't know of any publicly available tools, other than Intel's, to do hardware level profiling on a per-thread basis.

I guess that's not so helpful. Sorry!

Mike Dunlavey said...

The Mytkowicz and Diwan paper really bothers me. For example, it's definition of "hotness" seems to mean "self time percent". The whole reason gprof was invented 30 years ago was that self time is an inadequate diagnostic. Another way it bothers me is that it concentrates on methods, rather than lines of code. Yet another way is its shallow understanding of sampling statistics. Look here:

Jeremy Manson said...

I think it does a good job of summarizing some of the problems of the existing profilers, though, which is why I reference it.