This blog post will give a cursory overview of the joys and misadventures of micro-benchmarking, and will be the first in an ongoing treatment of the topic as the author deepens her knowledge over the coming months. It is part of the Anaplan Engineering technical blog series, written by engineers for the enjoyment of engineers.
- Someone who converts Curly Wurlys and Coke Zero into code.
- Typically, someone who has an unhealthy obsession with writing clean, testable code.
Beyond all the temptation-bundling of Curly Wurlys, Coke Zero, and code, Anaplan’s London office is a pretty great place to be. As a relatively new junior engineer at Anaplan, what an eventful few months it has been since joining Anaplan’s newest engineering centre which started out in a basement in Soho. It has been a fun and crazy journey watching the London team grow up, and something I feel very proud to have played a hand in.
So, what’s inside the Anaplan machine? Anaplan developers are a vibrant and diverse bunch who are way too school for cool. We attract all sorts: economists, teachers, theoretical physicists, and of course, hardcore veteran programmers. Simply, they’re inventors, and though I don’t generally believe in multi-tasking, they are innate multi-taskers, because they do it all: they write weapons-grade robust code, they abide by the good boy scout principle, they innovate, they build teams, they mentor, and they share their hard-won wisdom. My keen observation thus far of this rare breed is that they are masterful solo predators who seek to exterminate any and all code smells and anti-patterns and feed on niche JVM knowledge. However, they also thrive in packs and, once they have formed a group, can crank out well-crafted code features at impressive speed.
The diet of an Anaplan Engineer can vary within the pack but staples are usually core Java, mental acrobatics, and some mechanical sympathy. They are unassuming Java heroes walking around in jeans and oversized t-shirts (in some cases, T-shirts with funny software quips). They also hail from very different geographical regions – I myself have made the journey across the Irish Sea where I previously set up sticks in Dublin, and now find myself surrounded by folks from South Africa, India, Italy, Bahrain, Malaysia, and Russia. As one of the more junior coders, I have been exposed to some very powerful, hugely generative coding gurus on the team. They really do eat distributed systems for breakfast!
On top of all this, they are also a very intellectually generous bunch and as an overwhelmed new starter feeling exceedingly imposter-like, one of my South African allies (whose previous incarnation was as a military pilot) reminded me one rainy Monday evening, when the code productivity was low and the fear was high, that it was alright to surface above the clouds and signal for help. And so, I got my own Air Traffic Controller.
However, I digress. This blog does not aim to chronicle the colourful everyday Java heroes and heroines that make up Anaplan. Rather, it will serve as a brief overview of some of the things I learned in my first spike (a research/discovery task in an Agile sprint) into an area of Java programming that is veiled in much dark magic indeed: micro-benchmarking.
Three hard things
So, it turns out there are actually three hard things in computer science. Phil Karlton, former Principal Curmudgeon at Netscape, asserted in c.1996 that there were two hard things in computer science: cache invalidation and naming things. I would add a third hard thing: micro-benchmarking.
What do we mean by micro-benchmarking? Micro-benchmarking is simply measuring the performance of a small piece of code. It is where sound methodology meets statistical prowess meets deep JVM knowledge. All three components are important to empirically evaluate the performance of a piece of code; otherwise, garbage in, garbage out. Your best bet for successful micro-benchmarking is to kidnap an Aleksey Shipilev or a Brian Goetz. But if that’s not an option, how can a humble programmer write a good Java micro-benchmark? More importantly, why is it useful?
Well, let’s start by declaring that writing a good benchmark is not a skill one is born with. It requires a comprehensive knowledge of the innards of the JVM, including JIT compilation and memory management, as well as an appreciation of rigorous methodologies for evaluating your program.
Micro-benchmarking can be used for comparing two implementation approaches for some piece of logic to determine which one is more performant, to help find and optimize bottlenecks in our system, and to safeguard against performance regression in critical parts of our system. The latter use case becomes crucial when you’re constantly churning out new code in an application – code that may lead to “death by a thousand cuts” in the performance of the entire application.
Why do bad micro-benchmarks happen to good programmers?
Micro-benchmarking is a fickle mistress, made that way in large part by the optimizations and subsequent code transformations that the JIT compiler applies to our code. Dynamic compilation introduces further complexities. Like it says on the tin, the HotSpot JVM interprets and profiles our code to detect its “hot spots” (simply the parts of the code that get invoked regularly). These “hot” code paths are compiled to machine code after a certain number of executions, while infrequently exercised code paths continue to run in interpreted mode. Why is it important for micro-benchmarking to be aware that HotSpot is a mixed-mode system? I’m glad you asked. If our benchmark times code in a loop, the early iterations may run in interpreted mode and the measurement may also absorb the cost of compilation itself, so we end up measuring something other than the steady-state performance of the code under test. Thus, we need to be careful that our benchmarks measure only the thing we purport to measure. To account for dynamic compilation, we can perform a “warm-up”: we let the code run a number of times before we start taking measurements. Happy days.
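To make the warm-up idea concrete, here is a minimal hand-rolled sketch in plain Java. It is not how JMH does it, and the names (WarmupSketch, sumTo, averageNanos) and run counts are made up for illustration. It simply runs the code under test a number of times untimed before starting the clock.

```java
/** Hand-rolled timing loop with an explicit warm-up phase (illustration only). */
final class WarmupSketch {

    /** Published result, so the timed work has an observable effect. */
    static volatile long sink;

    /** Hypothetical code under test: sums the integers 0..n-1. */
    static long sumTo(int n) {
        long total = 0;
        for (int i = 0; i < n; i++) {
            total += i;
        }
        return total;
    }

    /**
     * Runs the workload warmupRuns times without timing it, then times
     * measuredRuns further calls and returns the average nanoseconds per call.
     */
    static double averageNanos(int warmupRuns, int measuredRuns, int n) {
        long acc = 0;
        for (int i = 0; i < warmupRuns; i++) {
            acc += sumTo(n); // untimed: gives the JIT a chance to compile sumTo
        }
        long start = System.nanoTime();
        for (int i = 0; i < measuredRuns; i++) {
            acc += sumTo(n);
        }
        long elapsed = System.nanoTime() - start;
        sink = acc; // publish the result so the loops are not trivially dead
        return (double) elapsed / measuredRuns;
    }

    public static void main(String[] args) {
        // Run counts are arbitrary assumptions; real numbers vary by machine and JVM.
        System.out.printf("no warm-up:   %.1f ns/call%n", averageNanos(0, 1_000, 10_000));
        System.out.printf("with warm-up: %.1f ns/call%n", averageNanos(20_000, 1_000, 10_000));
    }
}
```

How many warm-up runs are enough depends on the JVM, its flags, and the code under test, which is one reason to leave this bookkeeping to a harness rather than guess.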
Another gotcha in the murky underworld of micro-benchmarking is a common optimization that the JIT compiler will apply to your code: dead code elimination (DCE). If your test involves a piece of code that does not do anything with the result of its computation, you are once again at the mercy of the JIT. The compiler can partially or fully prune your code in this scenario, giving the false impression that your test is performing very fast. This can be seductive, but don’t be fooled. One way to avoid this code elimination is to make the result of your computation non-trivial in some way – by returning the value, for example.
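A small sketch of the difference, again in plain Java with hypothetical names (DeadCodeSketch, work, sink). The first timing method throws the result away, which leaves the JIT free to prune the computation. The second accumulates the results and publishes them to a volatile field, so the work has an observable effect. Whether and when the first loop is actually eliminated depends on the JVM, so treat this as an illustration of the pattern rather than a guaranteed demonstration.

```java
/** Contrasts a benchmark that discards its result with one that consumes it. */
final class DeadCodeSketch {

    /** Writing to a volatile field makes the result observable. */
    static volatile double sink;

    /** Hypothetical code under test. */
    static double work(double x) {
        return Math.log(x) * Math.sqrt(x);
    }

    /** Risky: the result of work() is never used, so the JIT may remove the call. */
    static long timeDiscarded(int iterations) {
        long start = System.nanoTime();
        for (int i = 1; i <= iterations; i++) {
            work(i); // result dropped on the floor
        }
        return System.nanoTime() - start;
    }

    /** Safer: every result feeds into a value that escapes the method. */
    static long timeConsumed(int iterations) {
        double acc = 0;
        long start = System.nanoTime();
        for (int i = 1; i <= iterations; i++) {
            acc += work(i);
        }
        long elapsed = System.nanoTime() - start;
        sink = acc; // publish, so the computation cannot be treated as dead
        return elapsed;
    }

    public static void main(String[] args) {
        // Iteration count is an arbitrary assumption; timings vary by machine and JVM.
        System.out.println("discarded: " + timeDiscarded(5_000_000) + " ns");
        System.out.println("consumed:  " + timeConsumed(5_000_000) + " ns");
    }
}
```

If the discarded version reports a suspiciously small number, that is the seduction described above, not a fast implementation.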
It’s important to be cognizant of optimisations like DCE, especially when comparing two approaches to solving a problem, as one approach may be more susceptible to certain types of optimisation than the other. There is a host of compiler-induced assembly jump-killers such as method inlining (where a method call is removed and the method body is instead inserted into the call site), constant folding (where expressions whose inputs are known constants are evaluated at compile time and replaced with their result), loop unrolling (where the loop body is replicated so that fewer branches are executed), and null check elimination. What’s more, not only can the compiler optimise, but it can also deoptimise when its speculations turn out to be wrong.
A further complication when writing benchmarks is garbage collection. In real-world code, objects have a mortality rate, and invariably GC will kick in. This is hard to predict, and can be something that introduces another extraneous variable when analysing the performance of code. We can either avoid allocating in our tests, which may not be feasible, or we can run our benchmark long enough so that it’s in a garbage collection steady-state; each execution should spend similar time on GC.
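One rough way to check whether repeated runs spend similar time in GC is to read the JVM’s own collector counters around each round. The sketch below uses the standard java.lang.management API; the class and method names (GcSteadyStateSketch, allocate, gcMillisPerRound) and the allocation sizes are assumptions chosen for illustration.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

/** Records how much GC time each round of an allocating workload incurs. */
final class GcSteadyStateSketch {

    /** Keeps the most recent allocation reachable so it cannot be optimised away. */
    static volatile byte[] lastChunk;

    /** Cumulative GC time in milliseconds across all collectors, as reported by the JVM. */
    static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long millis = gc.getCollectionTime(); // -1 when the collector does not report it
            if (millis > 0) {
                total += millis;
            }
        }
        return total;
    }

    /** Hypothetical allocating workload: creates short-lived 1 KB arrays. */
    static int allocate(int objects) {
        int checksum = 0;
        for (int i = 0; i < objects; i++) {
            byte[] chunk = new byte[1024];
            chunk[0] = (byte) i;
            checksum += chunk[0];
            lastChunk = chunk;
        }
        return checksum;
    }

    /** Runs the workload for several rounds and returns the GC milliseconds seen in each. */
    static long[] gcMillisPerRound(int rounds, int objectsPerRound) {
        long[] deltas = new long[rounds];
        for (int r = 0; r < rounds; r++) {
            long before = totalGcMillis();
            allocate(objectsPerRound);
            deltas[r] = totalGcMillis() - before;
        }
        return deltas;
    }

    public static void main(String[] args) {
        // Round and object counts are arbitrary assumptions.
        long[] deltas = gcMillisPerRound(10, 2_000_000);
        for (int r = 0; r < deltas.length; r++) {
            System.out.println("round " + r + ": " + deltas[r] + " ms in GC");
        }
    }
}
```

If the per-round figures swing wildly, the rounds are probably too short for GC cost to average out, and the timings taken alongside them deserve the same suspicion.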
Micro-benchmarks … it’s not me, it’s you
Okay, perhaps we are mutually responsible. Perhaps we should retreat to the nearest forest, grow our hair long, and pray to the cruel, sporting micro-benchmarking Gods? A less operatic solution might be to look at tools like Google Caliper or the Java Microbenchmark Harness (JMH), which can help the programmer dodge some of the pitfalls inherent in micro-benchmarking. In particular, during my research spike I was able to take some time to try out JMH (see JMH Samples). This is a powerful toolkit developed by the masterminds behind the JIT. Using JMH, we can specify “benchmark modes” to tell JMH which metric we want our test evaluated by: throughput, average time, sample time, single-shot (cold start, no warm-up), or all of the above.
We can also exert some control over the compiler by stating whether we want methods to be inlined or not. As a salve for our dead code elimination problems, we can use JMH’s special Blackhole class, which consumes a result so that the compiler cannot treat the computation that produced it as dead. Using JMH allows us to quickly measure the performance of our code, and a project can be set up with minimal effort using the Maven archetype or the Gradle plugin. There is support for JMH in IDEs, but the recommended approach is to run from the command line, as an IDE may introduce some uncontrolled environmental factors. Tools like JMH help us avoid problems that we don’t know we have (false sharing, for example) by coming pre-baked with safeguards against common benchmarking trip-ups.
Bottom line? Benchmarking is the head-on battle against compiler magic. The dilemma is probably best captured by Dr. Cliff Click, one of the guys who used to write JIT compilers at Sun Microsystems, who said, “Put microtrust in microbenchmarks.” Micro-benchmarking confronts us with the Heisenberg principle – we can’t rule out the fuzziness surrounding the very thing we are trying to measure, but using sophisticated tools like JMH can certainly help us avoid common missteps. Yet, it cannot be a panacea for all our benchmarking woes, and it is still necessary for the programmer to know how to write an effective benchmark. Learning to write a good benchmark takes practice. Luckily, the folks at Anaplan engineering know this, and have allowed unsuspecting and uninitiated engineers, like myself, to embrace this challenge. Creating benchmarks is something we will continue to chisel away at. My first foray into the world of micro-benchmarking has provided some fascinating insights into the internals of the JVM, some useful lessons on how not to micro-benchmark, added some fancy words into my JVM lexicon (hello, “megamorphic dispatch”), and helped me inch closer to that elusive programmer life goal of performance measurement.
It is always sage not to divorce ourselves too much from real-world code when experimenting with JMH, or from the performance characteristics of our real Anaplan code as it will be optimized in production. As an Anaplan engineer, the overarching goal of my micro-benchmarking exercises will be to hunt down the bottlenecks in our system and carry out some performance measurements to improve quality of life for all. There will never be a substitute for measuring the performance of a real application and its common use cases: questions that we don’t have the right to ask just yet. High altitude indeed for any engineer, but hey, what doesn’t kill you makes you stranger, right? I know just the team to do it.
References and further reading
Evans, B. J., & Gough, J. (2016). Optimizing Java. Sebastopol, CA: O’Reilly Media, Inc. http://shop.oreilly.com/product/0636920042983.do
Goetz, B. (2004). Dynamic compilation and performance measurement. IBM developerWorks. Retrieved from: https://www.ibm.com/developerworks/library/j-jtp12214/
Goetz, B. (2005). Java theory and practice: Anatomy of a flawed microbenchmark. IBM developerWorks. Retrieved from: https://www.ibm.com/developerworks/library/j-jtp02225/
Oaks, S. (2013). Java Performance: The Definitive Guide. Sebastopol, CA: O’Reilly Media, Inc. http://shop.oreilly.com/product/0636920028499.do
OpenJDK JMH reference: http://openjdk.java.net/projects/code-tools/jmh/