Tuesday, July 9, 2013

Lightweight Asynchronous Sampling Profiler

I finally got off of my lazy elbows and wrote a proof-of-concept sampling profiler for public consumption based on the Hotspot JVM's AsyncGetCallTrace interface.

The benefit of using this interface to write a profiler is that it can gather stack traces asynchronously. This avoids the inaccuracies of only being able to profile code at safe points, and the overhead of having to stop the JVM to gather a stack trace. The result is a more accurate profiler that avoids the 10-20% overhead of something like hprof. For a more thorough explanation as to why this is interesting, see my 2010 writeup on the subject.

Another feature of the profiler is that it will tell you what stack traces are being executed, rather than, like many sampling Java profilers, what stack traces could be executed. That is, with traditional profilers, if you stop the VM and ask it for stack traces, it will tell you all of the stack traces from all of the threads that are not currently blocked. This profiler tells you the stack trace from the thread that got interrupted when you asked for the stack trace. The tradeoff is reasonably important if you are trying to think about thread starvation and scheduling; with this approach, you get to see what isn't being scheduled. With the "get everything" approach, you get more samples, so there is a meaningful benefit to both approaches.

Code is here. I'm not likely to invest a great deal of time in making it wonderful, but patches from interested parties are welcome.

Edited to add: Some of the commenters below wonder why the stack trace grabber can't print out the stack traces when it gets them. The answer to this lies in what I meant when I called this function asynchronous. I was a little (deliberately) vague on what "asynchronous" means. When I say that the timer that goes off to grab a stack trace interrupts a running thread asynchronously, I mean that it can interrupt the thread at any point. The thread can be in the middle of GC, or the middle of adding a frame to the stack, or in the middle of acquiring a lock. The reader can imagine that writing something asynchronous-safe is somewhat more tricky than writing something thread-safe: not only do you have to worry about what other threads are doing, you also have to worry about what the current thread is doing.

Needless to say, there are a lot of things you can't (or shouldn't) do asynchronously. For example, you can't call malloc, because if the thread is already in the middle of executing malloc when you interrupt it, you can destroy malloc's internal data structures. In fact, for the standard C library functions, the POSIX standard specifies a list of functions that have to be async-safe (see man 7 signal for the list, if you are curious). Note that functions like printf are not async-safe, so you can't call them asynchronously. As far as I know, AsyncGetCallTrace is pretty much the only non-trivial async-safe JNI/JVMTI call. Others, including the standard JVMTI stack trace gathering function and the functions that make sense out of what AsyncGetCallTrace returns, are not async-safe, and are likely to crash your program if you try to call them asynchronously.

For this programming exercise, I was lazy, and printed the stack traces at VM exit. That doesn't mean that a sufficiently clever approach can't get the stack traces incrementally. A separate, synchronous thread can read the data that was written out asynchronously. You can't use a standard blocking lock to protect the data, though, because standard lock implementations aren't async-safe. You could use a spin lock, but you would have to be careful about the async handler interrupting a thread that holds the spin lock. In that case, the async handler won't be able to acquire the lock, so it would have to do a trylock and give up if it can't complete the acquire. There are also clever things you can do with atomic operations if you apply some creativity; I'll leave those as an exercise for the reader.

24 comments:

Adam Hawthorne said...

Hi Jeremy,

Having some trouble building on a Mac. Tried with default Apple g++-4.2 and it failed with a large number of "unsupported options". Installed MacPorts gcc48 and it errors with:

lightweight-java-profiler-read-only/src/globals.h:56:16: error: typedef 'VerifySizesAreEqual' locally defined but not used [-Werror=unused-local-typedefs]
typedef char VerifySizesAreEqual[sizeof(Dest) == sizeof(Source) ? 1 : -1];
^
cc1plus: all warnings being treated as errors

What version did you build with?

Jeremy Manson said...

I've only tested on Linux (Ubuntu Precise). A mac port may require various patches. In fact, I have no idea if it will even work!

Having said that, I've fixed that particular error, so feel free to try again.

Jeremy Manson said...

Oh, and I didn't answer your question: I built with gcc 4.6.3.

Oleg Mazurov said...

Hi Jeremy,

It's good to see somebody has figured AGCT out. I'm the engineer behind its use in Sun Studio Analyzer. Though that is now a thing of the past. One tidbit: having been implemented on top of essential low level mechanisms in HotSpot, AGCT becomes naturally available on every platform HS is ported to.

Jeremy Manson said...

Hi Oleg! Thanks for doing it, because it saved us an enormous amount of time and effort. We should talk some time...

Mike Dunlavey said...

I just posted a comment on your 2010 post. I'm glad you are doing stack sampling. I've had a lot to say on this: http://scicomp.stackexchange.com/a/2719/1262
It isn't that a lot of stack samples are needed. It's that you only need a small number, but the ones you do get need to be examined carefully, not just blindly summarized.

Romain G said...

Hi,

Thx for this it looks interesting !
I've compiled it and passed it as a Java option, but it looks like in my case stack traces are only gathered when the VM is shut down. Until the VM shutdown is requested, the stacks.txt file stays empty. Is there an other way to trigger the stack collection ? I've tried with OpenJDK 1.7.0.25 provided as a Fedora RPM.

I also needed to make these small modifications to compile on Fedora 19, GCC 4.8.1 :
One is to compile it, the other is regarding the stacks.txt file path which is apparently missing a '/' between the path and the filename.

Let me know,
Cheers,
Romain.

Index: Makefile
===================================================================
--- Makefile (revision 5)
+++ Makefile (working copy)
@@ -8,7 +8,8 @@
CC=g++
AGENT=liblagent.so
LIBS=-ldl
-BITS?=32
+#BITS?=32
+BITS?=64
BUILD_DIR ?= $(shell mkdir build-$(BITS) 2> /dev/null ; echo build-$(BITS))
SRC_DIR:=${PWD}/src
OPT?=-O2
Index: src/entry.cc
===================================================================
--- src/entry.cc (revision 5)
+++ src/entry.cc (working copy)
@@ -1,6 +1,7 @@
#include
#include
#include
+#include

#include

@@ -191,6 +192,7 @@
exit(0);
}
size_t pathlen = strlen(path);
+ strncat(path, "/", PATH_MAX - pathlen);
strncat(path, kDefaultOutFile, PATH_MAX - pathlen);
Globals::OutFile = fopen(path, "w+");
}

Adam Hawthorne said...

> I've only tested on Linux (Ubuntu Precise). A mac port
> may require various patches. In fact, I have no idea if
> it will even work!
>
> Having said that, I've fixed that particular error, so
> feel free to try again.

Sorry for the late reply - no notification:(

Anyway, thanks, I did actually get further on my own to the point where it was building the library, but it failed with a segfault immediately. I haven't had time to revisit it since, but we have some performance testing that needs exactly this kind of tool, so I'll be trying again on a Linux box soon.

Alexandre de Champeaux said...

Hi Jeremy,

Thanks for sharing the code.
I have been trying to implement a real time version of it, but encountered an issue : calling jvmti->GetMethodName in the handler of the SIGPROF signal ends up triggering a segmentation fault. Any leads on why it is safe to call it on vm death, but not while executing?

Jeremy Manson said...

@Romain: can you send me the patch via email? Blogger thought the the > and < chars in your post were HTML tags, and stripped them out.

I'll update the post with more detail on what "asynchronous" implies in terms of what you can invoke and what you can't, since that seems to be a source of confusion.

Jeremy Manson said...

I've updated the post with an explanation of why things like printf and JNI/JVMTI invocations crash your programs.

Cesar said...
This comment has been removed by a blog administrator.
burtonator said...

This is really interesting and I might end up diving in here and seeing what I can do to improve this.

I've wanted this for a LONG time.

My thinking is something along the lines of continuous integration and continuous delivery and I'm been cautiously referring to it as continuous profiling.

Basically the idea that you asynchronously gather stack traces from a running app and build up a profile on a production application.

This way your app developers can easily see your profile at any time.

Maybe having a web interface or API for obtaining the profile data for visualization.

Jeremy Manson said...

@burtonator - what you propose is very similar to how we use this technology at Google.

Mike Dunlavey said...

@burtonator: I'm with you all the way up to "build up a profile on a production application". That sounds like some kind of summarization (into call tree, graph, whatever). My take is that the summarization is where the rich information in the samples themselves is lost. If the goal is to find bottlenecks, the high measurement precision you get with lots of samples is being bought at the expense of actually finding bottlenecks to fix, because the measurements don't tell you what's being done that you don't need.

Zoltan Farkas said...

Good work!
This might complement the tools that I developed for continuous profiling (www.spf4j.org). Serializing the stack samples to protobuf format supported by spf4j is all that needs to be done...
I have recently made some progress in visualizing the samples and improved on brendan greg's flame graphs(http://dtrace.org/blogs/brendan/2011/12/16/flame-graphs/) see detail in my latest blog: http://blogs.zoltran.com.

actually

keep up the good work!

Zoltan Farkas said...

if somebody would write some code to summarize the stack samples into the protobuffer format: http://code.google.com/p/spf4j/source/browse/trunk/spf4j-core/src/main/proto/proto_sample_nodes.proto, they could be visualized with the spf4j UI: http://code.google.com/p/spf4j/downloads/list... I am throwing this idea out in case somebody has more time than I do...

Jeremy Manson said...

@Zoltan - I'm not likely to have time to do it, so anyone else around here who wants to do it should just let me know.

Arno Haase said...

Great stuff, thanks for sharing!

One thing is bugging me though: Profiler::Handle seems to sample just the thread that happened to be interrupted by the signal. Am I missing something? Is there some underlying mechanism at the OS level that distributes the signals evenly across the JVM's threads?

Jeremy Manson said...

@Arno - Whether this works depends on the OS. It works just fine on Linux, where the OS sends the signal to whatever happens to be on the processor at the time the signal fires. OS X always sends the signal to the main thread, which is annoying - I have a workaround, but I haven't had a chance to debug it yet.

Arno Haase said...

That makes sense. But I was wondering about distribution across cores on multicore CPUs, with JVM threads potentially biased towards or bound to a specific CPU.

Jeremy Manson said...

I recall asking a kernel expert about that at one point, but I don't recall what the answer was. It was certainly good enough for us at Google.

comain said...

Jeremy, good job.
Regarding the incremental part, I think we can use two buffers and an atomic indicator. Let the async writer to swap the buffers periodically. Do you think that will work?

Also, there is no native stack, do you have any hint on implementing this?

Jeremy Manson said...

@comain:

1) Thanks.

2) Having two arrays and a CAS is the right kind of thing. You have to figure out what to do with the first array once you've switched to the second, because you have to decode it synchronously.

3) We solved the native method issue internally by adjusting the JVM to return the PCs for native frames. That's not such a great solution for tools vendors, although it is fine if you control the JVM. We're planning on OSSing that patch in a few months (once we bring it up to date with JDK8/Hotspot 25).

If you can't do that, one possibility is to grab native stack frames at the same time and use some heuristics to line them up with the Java frames.