The benefit of using this interface to write a profiler is that it can gather stack traces asynchronously. This avoids the inaccuracies of only being able to profile code at safe points, and the overhead of having to stop the JVM to gather a stack trace. The result is a more accurate profiler that avoids the 10-20% overhead of something like hprof. For a more thorough explanation as to why this is interesting, see my 2010 writeup on the subject.
Another feature of the profiler is that it will tell you what stack traces are being executed, rather than, like many sampling Java profilers, what stack traces could be executed. That is, with traditional profilers, if you stop the VM and ask it for stack traces, it will tell you all of the stack traces from all of the threads that are not currently blocked. This profiler tells you the stack trace from the thread that got interrupted when you asked for the stack trace. The tradeoff is reasonably important if you are trying to think about thread starvation and scheduling; with this approach, you get to see what isn't being scheduled. With the "get everything" approach, you get more samples, so there is a meaningful benefit to both approaches.
Code is here. I'm not likely to invest a great deal of time in making it wonderful, but patches from interested parties are welcome.
Edited to add: Some of the commenters below wonder why the stack trace grabber can't print out the stack traces when it gets them. The answer to this lies in what I meant when I called this function asynchronous. I was a little (deliberately) vague on what "asynchronous" means. When I say that the timer that goes off to grab a stack trace interrupts a running thread asynchronously, I mean that it can interrupt the thread at any point. The thread can be in the middle of GC, or the middle of adding a frame to the stack, or in the middle of acquiring a lock. The reader can imagine that writing something asynchronous-safe is somewhat more tricky than writing something thread-safe: not only do you have to worry about what other threads are doing, you also have to worry about what the current thread is doing.
Needless to say, there are a lot of things you can't (or shouldn't) do asynchronously. For example, you can't call malloc, because if the thread is already in the middle of executing malloc when you interrupt it, you can destroy malloc's internal data structures. In fact, for the standard C library functions, the POSIX standard specifies a list of functions that have to be async-safe (see man 7 signal for the list, if you are curious). Note that functions like printf are not async-safe, so you can't call them asynchronously. As far as I know, AsyncGetCallTrace is pretty much the only non-trivial async-safe JNI/JVMTI call. Others, including the standard JVMTI stack trace gathering function and the functions that make sense out of what AsyncGetCallTrace returns, are not async-safe, and are likely to crash your program if you try to call them asynchronously.
For this programming exercise, I was lazy, and printed the stack traces at VM exit. That doesn't mean that a sufficiently clever approach can't get the stack traces incrementally. A separate, synchronous thread can read the data that was written out asynchronously. You can't use a standard blocking lock to protect the data, though, because standard lock implementations aren't async-safe. You could use a spin lock, but you would have to be careful about the async handler interrupting a thread that holds the spin lock. In that case, the async handler won't be able to acquire the lock, so it would have to do a trylock and give up if it can't complete the acquire. There are also clever things you can do with atomic operations if you apply some creativity; I'll leave those as an exercise for the reader.
Comments
Having some trouble building on a Mac. Tried with default Apple g++-4.2 and it failed with a large number of "unsupported options". Installed MacPorts gcc48 and it errors with:
lightweight-java-profiler-read-only/src/globals.h:56:16: error: typedef 'VerifySizesAreEqual' locally defined but not used [-Werror=unused-local-typedefs]
typedef char VerifySizesAreEqual[sizeof(Dest) == sizeof(Source) ? 1 : -1];
^
cc1plus: all warnings being treated as errors
What version did you build with?
Having said that, I've fixed that particular error, so feel free to try again.
It's good to see somebody has figured AGCT out. I'm the engineer behind its use in Sun Studio Analyzer. Though that is now a thing of the past. One tidbit: having been implemented on top of essential low level mechanisms in HotSpot, AGCT becomes naturally available on every platform HS is ported to.
It isn't that a lot of stack samples are needed. It's that you only need a small number, but the ones you do get need to be examined carefully, not just blindly summarized.
Thx for this it looks interesting !
I've compiled it and passed it as a Java option, but it looks like in my case stack traces are only gathered when the VM is shut down. Until the VM shutdown is requested, the stacks.txt file stays empty. Is there an other way to trigger the stack collection ? I've tried with OpenJDK 1.7.0.25 provided as a Fedora RPM.
I also needed to make these small modifications to compile on Fedora 19, GCC 4.8.1 :
One is to compile it, the other is regarding the stacks.txt file path which is apparently missing a '/' between the path and the filename.
Let me know,
Cheers,
Romain.
Index: Makefile
===================================================================
--- Makefile (revision 5)
+++ Makefile (working copy)
@@ -8,7 +8,8 @@
CC=g++
AGENT=liblagent.so
LIBS=-ldl
-BITS?=32
+#BITS?=32
+BITS?=64
BUILD_DIR ?= $(shell mkdir build-$(BITS) 2> /dev/null ; echo build-$(BITS))
SRC_DIR:=${PWD}/src
OPT?=-O2
Index: src/entry.cc
===================================================================
--- src/entry.cc (revision 5)
+++ src/entry.cc (working copy)
@@ -1,6 +1,7 @@
#include
#include
#include
+#include
#include
@@ -191,6 +192,7 @@
exit(0);
}
size_t pathlen = strlen(path);
+ strncat(path, "/", PATH_MAX - pathlen);
strncat(path, kDefaultOutFile, PATH_MAX - pathlen);
Globals::OutFile = fopen(path, "w+");
}
> may require various patches. In fact, I have no idea if
> it will even work!
>
> Having said that, I've fixed that particular error, so
> feel free to try again.
Sorry for the late reply - no notification:(
Anyway, thanks, I did actually get further on my own to the point where it was building the library, but it failed with a segfault immediately. I haven't had time to revisit it since, but we have some performance testing that needs exactly this kind of tool, so I'll be trying again on a Linux box soon.
Thanks for sharing the code.
I have been trying to implement a real time version of it, but encountered an issue : calling jvmti->GetMethodName in the handler of the SIGPROF signal ends up triggering a segmentation fault. Any leads on why it is safe to call it on vm death, but not while executing?
I'll update the post with more detail on what "asynchronous" implies in terms of what you can invoke and what you can't, since that seems to be a source of confusion.
I've wanted this for a LONG time.
My thinking is something along the lines of continuous integration and continuous delivery and I'm been cautiously referring to it as continuous profiling.
Basically the idea that you asynchronously gather stack traces from a running app and build up a profile on a production application.
This way your app developers can easily see your profile at any time.
Maybe having a web interface or API for obtaining the profile data for visualization.
This might complement the tools that I developed for continuous profiling (www.spf4j.org). Serializing the stack samples to protobuf format supported by spf4j is all that needs to be done...
I have recently made some progress in visualizing the samples and improved on brendan greg's flame graphs(http://dtrace.org/blogs/brendan/2011/12/16/flame-graphs/) see detail in my latest blog: http://blogs.zoltran.com.
actually
keep up the good work!
One thing is bugging me though: Profiler::Handle seems to sample just the thread that happened to be interrupted by the signal. Am I missing something? Is there some underlying mechanism at the OS level that distributes the signals evenly across the JVM's threads?
Regarding the incremental part, I think we can use two buffers and an atomic indicator. Let the async writer to swap the buffers periodically. Do you think that will work?
Also, there is no native stack, do you have any hint on implementing this?
1) Thanks.
2) Having two arrays and a CAS is the right kind of thing. You have to figure out what to do with the first array once you've switched to the second, because you have to decode it synchronously.
3) We solved the native method issue internally by adjusting the JVM to return the PCs for native frames. That's not such a great solution for tools vendors, although it is fine if you control the JVM. We're planning on OSSing that patch in a few months (once we bring it up to date with JDK8/Hotspot 25).
If you can't do that, one possibility is to grab native stack frames at the same time and use some heuristics to line them up with the Java frames.
I've developed a profiler with asyncGetCallTrace which is triggered by SIGPROF just like yours. But I implemented some JNI function like FindClass and GetStaticMethodID in signal handler. It turns out that the target program crashed once the profiler attached. What I found in hs_err_pid.log is that the crash course is always signal SIGFPE or SIGSEGV (depends on what JNI call I invoke).
I couldn't find out why this happens, could you kindly explain it?
Many functions are not async safe. The problem is that the main program could be in the middle of modifying the same data that the function modifies. Functions that acquire and release locks and alter a shared heap - like JNI functions - are very likely not to be async safe.
There's a Linux man page that discusses this in some depth, and gives a list of libc functions you can call in signal handlers:
https://man7.org/linux/man-pages/man7/signal-safety.7.html