Friday, December 18, 2015

Why I don't like System.gc()

I wrote this for internal-to-my-company consumption, but I couldn't find a reasonable place to put it. And then I remembered I had a blog.

The question to me was whether it ever made sense to call System.gc(), and, more specifically, whether it made sense to call System.gc() followed by System.runFinalization().

I can't claim it is always wrong. It suffers from what some people call the Politician's syllogism: Something must be done, this is something, therefore we must do it. There's no control over GC other than a big red button labeled "System.gc()", so this is always the something that must be done.

In practice, the biggest problem with System.gc() is that its behavior is unpredictable. There's no guarantee that it will do anything at all; Hotspot even has a -XX:+DisableExplicitGC flag that turns it into a no-op. So it can sometimes solve an immediate problem in a system-dependent way, but you are introducing a significant source of overhead with ill-defined behavior into your code, and that behavior can change with a change of flags or a change of runtime.
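One rough way to see whether a particular System.gc() call did anything is to compare the JVM's own collection counters before and after it. The sketch below uses the standard java.lang.management API; the class and method names are made up for illustration, and the difference it reports is only a signal, not proof: it also counts any unrelated collection that happened during the call, and a concurrent cycle may not be counted yet when the call returns.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class ExplicitGcProbe {
    // Total number of collections reported by every collector in this JVM.
    public static long totalCollections() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long count = gc.getCollectionCount(); // -1 means "undefined" for this collector
            if (count > 0) {
                total += count;
            }
        }
        return total;
    }

    // Collections recorded while System.gc() ran. Zero suggests the request was
    // ignored (for example under -XX:+DisableExplicitGC) or has not been counted
    // yet; a nonzero value may include unrelated collections.
    public static long collectionsDuringSystemGc() {
        long before = totalCollections();
        System.gc();
        return totalCollections() - before;
    }

    public static void main(String[] args) {
        long delta = collectionsDuringSystemGc();
        if (delta == 0) {
            System.out.println("System.gc() appears to have been ignored");
        } else {
            System.out.println("collections recorded during System.gc(): " + delta);
        }
    }
}
```

Running the same program with and without -XX:+DisableExplicitGC is a quick way to watch the "big red button" stop doing anything.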

Under the concurrent collector (CMS, our default at G), System.gc() forces a full, stop-the-world GC instead of a concurrent one. By calling System.gc(), you are stopping the world for potentially much longer than just letting a concurrent collection happen would. At G, our full STW GC under CMS is parallel, and the rest of the world's is serial, so it will finish faster for us than it will for everyone else. Except the rest of the world has the parallel GC by default, so it will be parallel. Except that they're changing the default to G1 for Java 9, which has a serial full STW GC, so it won't be parallel.
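The cost is easy to observe for yourself. Below is a minimal sketch that times how long the calling thread sits inside System.gc(); the class name and the amount of live data are arbitrary, and comparing runs under different collector flags (for example -XX:+UseParallelGC versus -XX:+UseG1GC) is just one illustrative way to use it. The numbers depend heavily on heap size and live data, and the measurement covers only the caller's wait.

```java
import java.util.ArrayList;
import java.util.List;

public class ExplicitGcTimer {
    // Wall-clock time the calling thread spends inside System.gc(), in milliseconds.
    // With a stop-the-world collector this roughly tracks the pause every other
    // application thread sees too; with a concurrent mode it may not.
    public static double millisInSystemGc() {
        long start = System.nanoTime();
        System.gc();
        return (System.nanoTime() - start) / 1_000_000.0;
    }

    public static void main(String[] args) {
        // Keep some data reachable so the full collection has a live set to trace.
        List<byte[]> live = new ArrayList<>();
        for (int i = 0; i < 20_000; i++) {
            live.add(new byte[1024]);
        }
        double ms = millisInSystemGc();
        System.out.printf("System.gc() took %.2f ms with %d live arrays%n", ms, live.size());
    }
}
```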

Of course, if you are using the concurrent collector, it also matters whether the user passes -XX:+ExplicitGCInvokesConcurrent. If they do, System.gc() triggers a concurrent collection instead, so the application keeps executing, and you will probably call runFinalization() before the GC has enqueued the finalizable objects.
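Since the same call means different things under different flags, code that insists on calling System.gc() can at least check how the running JVM is configured. This sketch relies on the HotSpot-specific com.sun.management.HotSpotDiagnosticMXBean, so it only works on HotSpot-derived JVMs; the wrapper class and method names are illustrative. Knowing a flag's value still doesn't tell you exactly what a given collector and JDK version will do with the request.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;
import java.util.Optional;

public class ExplicitGcFlags {
    // Current value of a HotSpot -XX flag, or empty if this JVM does not expose
    // HotSpotDiagnosticMXBean or does not recognize the flag name.
    public static Optional<String> hotspotFlag(String name) {
        HotSpotDiagnosticMXBean bean =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        if (bean == null) {
            return Optional.empty();
        }
        try {
            return Optional.of(bean.getVMOption(name).getValue());
        } catch (IllegalArgumentException unknownFlag) {
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        for (String flag : new String[] {"DisableExplicitGC", "ExplicitGCInvokesConcurrent"}) {
            System.out.println(flag + " = " + hotspotFlag(flag).orElse("(unavailable)"));
        }
    }
}
```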

It should be relatively clear at this point that you can't depend on it for performance, because you can't guarantee its performance, and you can't depend on it for correctness, because you have no idea what it is going to do.

If you really want to wait for a GC to occur and for some percentage of objects to be finalized, you probably need to do something like:
  • Create a CountDownLatch with a count of 1.
  • Create an object whose finalizer calls countDown().
  • Make the object garbage immediately (bearing in mind that object lifetimes aren't what you think they might be).
  • Call System.gc().
  • Call await() on the CountDownLatch.
  • Call runFinalization() to make sure all of the pending finalizers have finished (but do it in a loop with a timeout!)
This will probably work most of the time, unless System.gc() really doesn't do anything and a GC is never triggered (or, worse, only a young-gen GC is ever triggered, so objects in the old gen are never collected or finalized, and you think something has been finalized when it hasn't).
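The steps above can be sketched in Java roughly as follows. It is a best-effort sketch, not a guarantee: the class and method names are invented, the 50 ms polling interval is arbitrary, and instead of literally looping on runFinalization() it runs a single runFinalization() on a daemon helper thread and stops waiting at the deadline, which is one way to get the timeout the last step asks for. Finalization is deprecated in newer JDKs, hence the warning suppressions.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

public class GcBarrier {
    // Garbage object whose only job is to count down the latch when finalized.
    private static final class Sentinel {
        private final CountDownLatch finalized;

        Sentinel(CountDownLatch finalized) {
            this.finalized = finalized;
        }

        @Override
        @SuppressWarnings({"deprecation", "removal"}) // finalization is deprecated
        protected void finalize() {
            finalized.countDown();
        }
    }

    // Allocate in a separate method so no local in the caller keeps it reachable.
    private static void allocateSentinel(CountDownLatch finalized) {
        new Sentinel(finalized);
    }

    /**
     * Best effort: returns true if the sentinel was finalized and a
     * runFinalization() pass completed before the timeout; false otherwise
     * (for example, if explicit GC is disabled and no other GC happens in time).
     * A true result says nothing about garbage the collection did not reach,
     * such as old-gen objects after a young-only collection.
     */
    @SuppressWarnings({"deprecation", "removal"}) // runFinalization is deprecated
    public static boolean awaitGcAndFinalization(long timeoutMillis) {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        CountDownLatch finalized = new CountDownLatch(1);
        allocateSentinel(finalized);
        try {
            System.gc();
            // Wait for the sentinel's finalizer, re-requesting a GC in case the
            // previous request was a no-op or did not reach the sentinel.
            while (!finalized.await(50, TimeUnit.MILLISECONDS)) {
                if (System.nanoTime() - deadline >= 0) {
                    return false;
                }
                System.gc();
            }
            // runFinalization() has no timeout of its own, so run it on a daemon
            // thread and stop waiting once the deadline passes.
            Thread drainer = new Thread(System::runFinalization, "run-finalization");
            drainer.setDaemon(true);
            drainer.start();
            long remainingMillis = TimeUnit.NANOSECONDS.toMillis(deadline - System.nanoTime());
            drainer.join(Math.max(1, remainingMillis));
            return !drainer.isAlive();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for callers
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("finalized in time: " + awaitGcAndFinalization(5_000));
    }
}
```

Returning false rather than throwing keeps the caller honest: it has to decide what to do when the GC simply didn't cooperate.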

In short, usage should be carefully considered.

As for runFinalization(): it suffers from the same "I have one button for this" issue. What it does (in Hotspot) is add a thread as a worker for processing finalization, and then return when that thread is done, i.e., when the finalization queue is empty. Note that the finalization queue being empty is not the same as all finalizers having been run, since there are two other threads processing them.

How much are you buying by blocking until the finalization queue is empty? Do you really want to stop the application for this? Is it going to do what you think it is going to do? Are you sure the finalizer you care about is enqueued?

Hey, if you have lots of GCs and expensive finalizers, one could even imagine the finalization queue never being empty. This is fairly unlikely in practice, but it would be really, really nasty if it happened.

Anyway, let the buyer beware. Google tells me that the first couple of hits for "System.gc()" are also nasty warnings, but I had already written this, so there's no harm in posting it anyway.

10 comments:

Unknown said...

There are a few good reasons - on Android, you receive calls when the OS considers you a good target for OOM killing - "release memory or I'll kill you". You can flush a cache, but if there's no allocation pressure, that memory won't actually be returned to the OS. Waiting for runFinalization to finish is a way of signalling to the OS that you've done what you can to release memory.

There's another non-Android case I had: a system that implemented Pregel on EC2, where a GC on one worker would halt the whole computation while it ran. With System.gc(), I was able to coordinate GC across the whole fleet.

Jeremy Manson said...

@Unknown - There are certainly times to use it. I just don't like it. It's bad code smell, as people say.

In the cases where you want to coordinate with a System.gc() like that (for example, servers that do full GCs every N hours because of CMS failures, where the natural impulse is to divert traffic from them and clean them up every N-1 hours), System.gc() usually seems to be a band-aid. In practice, you probably want to figure out why you have sporadic runaway GC like that.

You are correct that the issues are different on Android. If nothing else, on Hotspot, running a GC doesn't actually return memory to the system (we sent in a patch for that, but it was rejected).

Andrew Haley said...

You need System.gc() for things like JDK-6913047, which is a native memory leak due to objects with finalizers being moved into the dense prefix of the old gen and never collected. I spent considerable time looking at this, and it's a very hard problem. I did wonder, in the end, if maybe objects with finalizers could be treated specially by the collector.

Jeremy Manson said...

@Andrew: That's a fundamental design flaw with objects using finalizers as the only mechanism to clean up native resources associated with unreachable Java objects. We've encountered it many, many times in servers. Rather than grinding the process to a halt to call System.gc() (which is a terrible solution to this problem), in practice, such APIs should have close() methods. When we can't add our close methods (e.g., to the class libraries), we often tinker with the innards reflectively.
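The close()-based approach described above can be sketched as follows, assuming a hypothetical class that owns one native resource (simulated here with a counter rather than a real JNI handle). Callers release the resource explicitly, typically with try-with-resources, so nothing depends on the GC finding the object or on a finalizer ever running.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

public class NativeBuffer implements AutoCloseable {
    // Hypothetical stand-in for native allocations: a real class would hold a
    // handle obtained through JNI (or similar) instead of bumping a counter.
    private static final AtomicInteger LIVE_HANDLES = new AtomicInteger();

    private final AtomicBoolean closed = new AtomicBoolean(false);

    public NativeBuffer() {
        LIVE_HANDLES.incrementAndGet(); // "allocate" the native resource
    }

    public static int liveNativeHandles() {
        return LIVE_HANDLES.get();
    }

    public void write(byte b) {
        if (closed.get()) {
            throw new IllegalStateException("buffer already closed");
        }
        // A real implementation would copy b into native memory here.
    }

    // Deterministic release that does not depend on when, or whether, the GC
    // notices this object. Idempotent, so closing twice is harmless.
    @Override
    public void close() {
        if (closed.compareAndSet(false, true)) {
            LIVE_HANDLES.decrementAndGet();
        }
    }

    public static void main(String[] args) {
        try (NativeBuffer buf = new NativeBuffer()) {
            buf.write((byte) 42);
        } // close() runs here, even if write() throws
        System.out.println("live native handles: " + liveNativeHandles()); // prints "live native handles: 0"
    }
}
```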

In either case, System.gc() fulfills the exact role there that I'm describing: it's a blunt instrument - a big red button that you push because you have no other choice.

Andrew Haley said...

Oh yes, you're absolutely right: using finalizers as the only mechanism to clean up native resources associated with unreachable Java objects is always a bug. In this case it's an impedance mismatch between the design of the Java Cryptography Extension and that of a session-oriented API such as PKCS #11. Sure, the API has a close() method, but the JCE provider for PKCS #11 shares PKCS #11 sessions (for efficiency, I guess) and recycles them once they are unreachable. I vaguely thought I'd rip out all the sharing and have a 1:1 mapping between Java and PKCS #11 sessions, but some PKCS #11 hardware modules have a very limited number of concurrent sessions available. So we need to share them, but that is not a good match with JCE. Sigh...

Anonymous said...

A bit of a digression, but is there a place where one can get G's custom OpenJDK builds to play with? Thanks

Jeremy Manson said...

@Anonymous - You have to work at Google. After that, it's easy. :)

Anonymous said...

I can understand it is not right to take the children's bread and toss it to the dogs. For some of us folks who might not make it through the "gates", can the crumbs be available for us to feed on? It would have been nice if there were an OpenJDK build with your patches. Any likelihood this might be available in future?

Jeremy Manson said...

@Anonymous - I'm sorry if it feels as if we're keeping some special bread to ourselves (and I'm sorry if my initial response seemed dismissive). To be honest, we would happily contribute everything we have to OpenJDK. It saves us (substantial) time and energy when we need to update to a new JDK.

We've contributed a reasonable amount to date. For example, we take a large role in maintaining java.util.concurrent, we contributed the port to little-endian PPC, and we contributed a patch to parallelize the CMS initial mark (which is present in JDK 8).

The biggest problem is that OpenJDK maintainers at Oracle are in the unenviable position of having to support every patch that gets submitted, so they need to make sure that they can thoroughly understand our changes. This tends to mean that they are pretty happy to take bug fixes and smaller changes that are easy to review, but cast a pretty skeptical eye on larger, potentially destabilizing changes.

The change I mentioned above - Parallelizing the Full STW pause under CMS - is a large one. I think their lack of interest in it is reasonable, especially given that it is in a codebase they don't really want to support (see JEP 291).

The other possibility is for us to provide our patches in an open source repo. That's a fair amount of work to keep up to date (we would need to update multiple repos whenever we have a change or when the patches need to be rebased to a new JDK), and we don't have a large team or a business case, so it has never really been a priority.

Frankly, even if it involved no work on our part, I would have a lot of doubt as to how much use it would be to provide our patches like that - very few people are going to be brave enough to patch in our random changes and use them for anything meaningful. There isn't a lot of value in our doing it just to satisfy people's curiosity.

Anonymous said...

Jeremy, thanks for the feedback. Your points are fair enough. Your blog posts have been educational.