This is a follow-on to a blog post I made in 2007 about using signal handling to profile Java apps. I mentioned in another post why this might be a good idea, but I wanted to expand on the theme.
Hiroshi Yamauchi and I are giving a talk at this year's JavaOne. Part of this talk will be about our experiences writing profilers at Google, and in preparing for it, I realized my previous entry on this subject only told half the story.
The topic of the original post is that there is an undocumented JVMTI call in OpenJDK that allows you to get a stack trace from a running thread, regardless of the state of that thread (a rough sketch of the interface appears after the list below). On Unix-like systems, you can use a SIGPROF signal to call this function at (semi-)regular intervals, without having to do anything to your code. The follow-on post, intended to describe why you would use this, made a couple of points that are true:
- That it tells you what is actually running on your CPU, not just what might be scheduled, which is all that profilers taking a snapshot of every running thread can tell you. This increases its accuracy dramatically. For example, many profilers report a great deal of your time as spent in blocking mechanisms, simply because threads frequently become scheduled as they exit blocking functions. But those threads are merely "ready to be scheduled" rather than "actually running", so CPU time gets charged to them inaccurately.
- That it reports time spent in GC and in other VM-specific activities (like JIT compiling). At Google, we find that substantial amounts of our users' CPU time is spent in VM activities. In fact, GC brings the occasional Google service to its knees (not mentioning any names).
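For reference, here is a rough sketch of the interface as it is commonly reproduced by the open-source profilers that call it; the symbol is not declared in jvmti.h, so agents typically resolve it at runtime (e.g. with dlsym). Treat these declarations as illustrative rather than official:

```cpp
#include <jni.h>

// One frame of a Java stack: the method and the bytecode index
// (negative values are used for native and other special frames).
typedef struct {
  jint lineno;
  jmethodID method_id;
} ASGCT_CallFrame;

// One trace: the target thread's JNIEnv, the number of frames actually
// collected (negative values are error codes), and a caller-supplied buffer.
typedef struct {
  JNIEnv* env_id;
  jint num_frames;
  ASGCT_CallFrame* frames;
} ASGCT_CallTrace;

// The undocumented call itself; 'ucontext' is the context delivered to a
// signal handler, which is what lets it walk a thread mid-execution.
extern "C" void AsyncGetCallTrace(ASGCT_CallTrace* trace, jint depth,
                                  void* ucontext);
```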
Sampling profilers that take thread dumps from individual threads often do so by injecting stack-trace-gathering code into your code. This is suboptimal in many ways:
- It introduces overhead into your code. This is the most obvious problem: as soon as you introduce overhead, you are changing the performance characteristics of your code.
- Changing the size of the code changes the optimizing decisions made by the JIT. For example, one of the most important performance improvements made by a just-in-time compiler is when it inlines small methods. It won't inline if the method gets too large. Introducing sampling into the methods changes the size, and therefore affects inlining decisions, changing the performance characteristics of your code.
- Changing the size of the code affects code layout. For relatively obvious reasons of caching and memory alignment, the placement of your code can affect its performance. Bigger code often means worse performance.
(Another interesting point, made by Todd Mytkowicz and Amer Diwan of the University of Colorado: since JIT behavior depends on everything in the system, and most profilers are part of the system, the decision about where to put a safe point ends up depending on which profiler you are using. This can make the profilers' results disagree violently: because of their differing behaviors, the safe points end up in different places, so the profilers end up sampling different places. See Mytkowicz and Diwan's recent PLDI paper for details.)
All of this JIT talk basically mirrors the Heisenberg Uncertainty Principle, but for profiling. You can either have exact information, or you can have your code behave as it is supposed to behave, but not both. You want to minimize the effects, so you want a profiler that neither interferes with nor depends on JIT decisions at all. AsyncGetCallTrace fits this bill: the profiler calls it (from a signal handler, not from code injected into your program), and it returns a result. It doesn't wait for a safe point. It doesn't change your code. The JIT doesn't care.
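To make that concrete, here is a minimal, hedged sketch of the signal-driven approach: a SIGPROF handler that passes the ucontext it receives to AsyncGetCallTrace, armed by a profiling timer. It assumes the ASGCT declarations sketched above; CurrentThreadEnv() is a hypothetical helper standing in for a per-thread cached JNIEnv (a real agent would cache it at the JVMTI ThreadStart event, since looking it up is not async-signal-safe).

```cpp
#include <jni.h>
#include <signal.h>
#include <string.h>
#include <sys/time.h>

static const int kMaxFrames = 128;

// Hypothetical helper: returns the JNIEnv cached for the current thread.
extern JNIEnv* CurrentThreadEnv();

static void prof_handler(int signo, siginfo_t* info, void* ucontext) {
  ASGCT_CallFrame frames[kMaxFrames];
  ASGCT_CallTrace trace;
  trace.env_id = CurrentThreadEnv();
  trace.frames = frames;
  AsyncGetCallTrace(&trace, kMaxFrames, ucontext);
  if (trace.num_frames > 0) {
    // Record the frames somewhere async-signal-safe (e.g. a lock-free
    // buffer). Resolving jmethodIDs to names must happen later, outside
    // the signal handler.
  }
}

static void StartProfiling() {
  struct sigaction sa;
  memset(&sa, 0, sizeof(sa));
  sa.sa_sigaction = prof_handler;
  sa.sa_flags = SA_SIGINFO | SA_RESTART;
  sigemptyset(&sa.sa_mask);
  sigaction(SIGPROF, &sa, NULL);

  // Deliver SIGPROF roughly every 10ms of CPU time the process consumes.
  struct itimerval timer;
  timer.it_interval.tv_sec = 0;
  timer.it_interval.tv_usec = 10000;
  timer.it_value = timer.it_interval;
  setitimer(ITIMER_PROF, &timer, NULL);
}
```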
ETA: I believe that the profiler bundled with Sun Studio uses the AsyncGetCallTrace method call, but I'm not exactly sure how.
Comments
What is your opinion of profilers that integrate with the CPU to read out the hardware performance monitors available there, e.g. Intel VTune?
My understanding is that the hardware performance counters for VTune are associated with a particular method, not a stack trace / call graph. AFAICT (I'm neither a VTune developer nor a user), VTune's stack trace / call graph tracking mechanism uses JVMTI to be informed when a method is invoked, and then VTune keeps track of that information internally (that approach is accurate, but slooooow).
For the hardware counters, they use a different mechanism that associates the counters with specific methods very accurately and (I assume) quickly.
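(For readers unfamiliar with it, the JVMTI mechanism alluded to above looks roughly like the sketch below: the agent asks to be called back on every single method entry, which is why it is accurate but slow, and why it largely defeats inlining. The names are illustrative; this is not VTune's code.)

```cpp
#include <jvmti.h>
#include <string.h>

static void JNICALL OnMethodEntry(jvmtiEnv* jvmti, JNIEnv* jni,
                                  jthread thread, jmethodID method) {
  // Record (thread, method) in the profiler's own call-graph bookkeeping.
}

extern "C" JNIEXPORT jint JNICALL
Agent_OnLoad(JavaVM* vm, char* options, void* reserved) {
  jvmtiEnv* jvmti = NULL;
  if (vm->GetEnv(reinterpret_cast<void**>(&jvmti), JVMTI_VERSION_1_0) != JNI_OK)
    return JNI_ERR;

  // Ask for permission to receive method-entry events.
  jvmtiCapabilities caps;
  memset(&caps, 0, sizeof(caps));
  caps.can_generate_method_entry_events = 1;
  jvmti->AddCapabilities(&caps);

  // Register the callback and turn the event stream on.
  jvmtiEventCallbacks callbacks;
  memset(&callbacks, 0, sizeof(callbacks));
  callbacks.MethodEntry = &OnMethodEntry;
  jvmti->SetEventCallbacks(&callbacks, sizeof(callbacks));
  jvmti->SetEventNotificationMode(JVMTI_ENABLE, JVMTI_EVENT_METHOD_ENTRY, NULL);
  return JNI_OK;
}
```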
We've found a strange problem in our real system:
http://forums.sun.com/thread.jspa?threadID=5443540
Could you please explain this strange performance degradation and advise on a solution?
What about the VisualVM Sampler plugin? As I understand it, it uses some sort of sampling for both stack traces and heap dumps.
I did not find an official page, but here's an announcement from the author: http://java.dzone.com/announcements/visualvm-12-great-java
If so, can you say more about how the CPU and memory profiling is realized, and give some screenshots?
I was really impressed with that Mytkowicz paper when it came out some months ago. It was brilliant in more ways than you mentioned. In particular, the whole way that they define actionable profiles is spot on. Also, how they introduce deliberate delays in methods to test the hypothesis of what the profiler thinks are hot methods is way cool.
@anonymous - we're working on it (it's a background task). We do suffer from the problem of regular sampling, because our sampler samples once every 10ms without adding any jitter. We really should fix that, though.
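(For what it's worth, one way to add that jitter, sketched here purely as an illustration, is to re-arm a one-shot ITIMER_PROF with a randomized interval on each tick instead of using a fixed 10ms it_interval.)

```cpp
#include <stdlib.h>
#include <sys/time.h>

// Arm a one-shot profiling timer with a 10ms base interval +/- up to 2ms of
// jitter. Because it_interval stays zero, each tick re-arms with fresh
// jitter. rand() is used only for illustration; it is not async-signal-safe,
// so a real handler would use a simpler PRNG or a timer thread.
static void ArmJitteredTimer() {
  long usec = 10000 + (rand() % 4001) - 2000;
  struct itimerval timer = {};
  timer.it_value.tv_sec = 0;
  timer.it_value.tv_usec = usec;
  setitimer(ITIMER_PROF, &timer, NULL);
}
```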
I am trying to understand whether I can run a Java program that exhibits false sharing and still look at the thread IDs and cache lines to demonstrate the false sharing. Are you aware of any tools other than Intel's?
I don't know of any publicly available tools, other than Intel's, to do hardware level profiling on a per-thread basis.
I guess that's not so helpful. Sorry!