Using Byteman to detect native memory leaks

In RHQ we use the Augeas library to parse and update configuration files in some of the plugins. Augeas itself is pretty cool, and its language for describing the structure of arbitrary configuration files and how to update them is pretty powerful. The only downside is that Augeas is a C library, so we have to bind to it and use it carefully so that we don’t leak its native resources, which aren’t under the control of the JVM’s garbage collector.

It all boils down to just calling the close() method on the Augeas instance whenever we’re done with it.
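
A minimal sketch of that pattern, assuming a stand-in class rather than the real binding: NativeHandle and ConfigReader are hypothetical names invented for this illustration, and the only thing carried over from Augeas is the idea of an explicit close() method. The try/finally makes sure the handle is released even when the work in between throws.

```java
// Hypothetical stand-in for a wrapper around a native library handle.
final class NativeHandle {
    private boolean open = true;

    String get(String path) {
        if (!open) {
            throw new IllegalStateException("handle already closed");
        }
        if (path == null) {
            throw new IllegalArgumentException("path must not be null");
        }
        return "value of " + path;
    }

    // Releases the (pretend) native resources; calling it twice is harmless.
    void close() {
        open = false;
    }

    boolean isOpen() {
        return open;
    }
}

// Hypothetical caller showing the close-in-finally discipline.
final class ConfigReader {
    static String readAndClose(NativeHandle handle, String path) {
        try {
            return handle.get(path);
        } finally {
            // Runs on normal return and on exception alike.
            handle.close();
        }
    }
}
```

The leaks described next are exactly the cases where some code path skips that finally block, or never had one.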

As simple as that may seem, we still managed to mess it up and found memory leaks that caused the RHQ agent’s memory usage to grow slowly (or not so slowly, depending on its configuration). Because the leaked memory is native, the JVM’s maximum heap size couldn’t cap it.

The source code of the Apache plugin isn’t the simplest, and there are many places that invoke Augeas and interact in various ways, so debugging all this isn’t easy. Even harder, we thought, would be to come up with unit tests that make sure we don’t leak Augeas references.

But then a crazy idea entered my mind. I knew Byteman was a tool for bytecode manipulation. My idea was to somehow use it in our tests to do reference counting (by instrumenting the Augeas constructor and close() calls). It turns out that this is very easy to do with Byteman, and I was able to achieve even more than I hoped for.

Byteman integrates quite nicely with TestNG, which we use for our unit tests, so in a couple of steps I was able to implement a reference counter that not only gave me the difference between the number of Augeas instances created and closed BUT also gave me the stack traces of the code that created a reference that wasn’t close()’d afterwards. That, I think, is absolutely cool.

The rules I added to my tests are quite simple:


@BMRules(
    rules = {
        // <init> is the JVM-level name of a constructor; this rule fires when an Augeas instance is created.
        @BMRule(name = "increment reference count on Augeas init",
            targetClass = "net.augeas.Augeas",
            targetMethod = "<init>(String, String, int)",
            helper = "org.rhq.plugins.apache.augeas.CreateAndCloseTracker",
            action = "recordCreate($0, formatStack())"),
        // This rule fires when close() is called on an Augeas instance.
        @BMRule(name = "decrement reference count on Augeas close",
            targetClass = "net.augeas.Augeas",
            targetMethod = "close()",
            helper = "org.rhq.plugins.apache.augeas.CreateAndCloseTracker",
            action = "recordClose($0, formatStack())") })

There is indeed nothing special about them. I tell Byteman to call my helper class’s recordCreate() method whenever an Augeas instance is constructed, passing in the Augeas instance ($0 stands for this in the context of the instrumented method) and a formatted call stack. The second rule merely calls recordClose() on my helper with the Augeas instance that is being closed and, again, the call stack.

You can check out the code for my helper class here. As you might have guessed, it’s little more than a hash map whose keys are the Augeas instances and whose values are the call stacks. By processing this map after all the tests have run, I can quite easily figure out if and where we leak native memory.
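
To make the idea concrete, here is a simplified sketch of such a tracker. It is not the actual RHQ helper: it reuses the class and method names from the rules, but unclosedCreationStacks() and reset() are hypothetical additions for this illustration. The sketch assumes Byteman instantiates the helper when a rule fires, so the state lives in a static map. It also holds strong references to the tracked instances to stay short, whereas a real helper would more likely hold them weakly so that the tracker itself does not keep them alive. In a real setup the class would also need to extend Byteman’s own Helper class so that formatStack() resolves in the rule actions; that is omitted here to keep the example free of dependencies.

```java
import java.util.ArrayList;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;

// Simplified, dependency-free sketch of a create/close tracker.
final class CreateAndCloseTracker {

    // Instance -> call stack captured at creation time. Identity semantics,
    // so two distinct instances are never confused even if equals() says so.
    private static final Map<Object, String> OPEN = new IdentityHashMap<>();

    // Called by the "init" rule: remember where this instance was created.
    public void recordCreate(Object instance, String stack) {
        synchronized (OPEN) {
            OPEN.put(instance, stack);
        }
    }

    // Called by the "close" rule: the instance was released, forget it.
    // The closing stack is accepted to match the rule but not stored here.
    public void recordClose(Object instance, String stack) {
        synchronized (OPEN) {
            OPEN.remove(instance);
        }
    }

    // Hypothetical reporting method: creation stacks of everything that was
    // created but never closed.
    public static List<String> unclosedCreationStacks() {
        synchronized (OPEN) {
            return new ArrayList<>(OPEN.values());
        }
    }

    // Hypothetical method to clear the state between test runs.
    public static void reset() {
        synchronized (OPEN) {
            OPEN.clear();
        }
    }
}
```

After the tests finish, a check along the lines of asserting that unclosedCreationStacks() is empty, and printing its contents when it isn’t, points straight at the code that created the leaked instances.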


2 Responses to “Using Byteman to detect native memory leaks”

  1. Mahesh Says:

    Very interesting post

  2. Andrew Dinn Says:

    Hi Lukáš,

    Thanks for publishing a very nice use case for Byteman. This sort of solution can be generalised to attack quite a lot of hard debugging problems where you can detect an error condition but cannot always work out what caused it. You record context information each time some activity is initiated and then later on when you detect something is wrong you can use data from the error context to locate the point where you got into this mess.

    I once heard Cliff Click (Hotspot chief architect and Azul JVM guru) explain how he used a related technique to debug concurrent programming problems. He would instrument various routines by hand so they inserted tagged messages into a circular buffer. This would record the last N significant events (identified by the tag) plus an associated data value (typed according to the tag) preceding a detectable error. He used a debugger condition to break when he knew something was wrong and by eyeballing the circular buffer obtained a window back into history. For good measure you can append a thread id and/or tick counter to each buffer entry to help unravel concurrency issues. Of course, using Byteman you can do all this instrumentation and the eyeballing automatically without needing to recompile or redeploy your application. And if you compile the rules which do the buffer writes then they run fast enough to avoid Heisenbug’s law.

    I noticed one detail in your Helper class which is worth commenting on. Your equality method for EqualableWeakReference returns false when the reference being compared against has been cleared (its referent is null). The comment says:

    “There is no telling on what this reference pointed to, so it is not possible to determine equality.”

    Actually, it is perfectly possible to determine equality. Since nothing refers any longer to the object behind a reference which has been nilled, that object cannot equal the target of the equals call. So, your code is right to return false.

