Performance Whack-a-Mole

Jonathan Vogel is working through a three-part series on Java performance optimization, and the second installment is the one worth reading.

Part one catalogs eight anti-patterns - O(n²) streams, regex recompilation, String.format() in hot paths, and so on. It's a worthwhile checklist, but it covers well-trodden ground, and experienced practitioners will read it thinking "Well, you left out..." as often as they nod along. Part two is where it gets interesting, because he stops listing problems and starts showing his process.
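To make one of those concrete: regex recompilation is an easy one to fall into, because the slow and fast versions look almost identical. A minimal sketch - my illustration, not code from the series:

    import java.util.regex.Pattern;

    class OrderValidator {
        // Anti-pattern: Pattern.compile() reparses the regex on every call,
        // so a hot path pays the compilation cost over and over.
        static boolean isValidIdSlow(String id) {
            return Pattern.compile("ORD-\\d{8}").matcher(id).matches();
        }

        // Fix: compile once and reuse it; Pattern is immutable and thread-safe.
        private static final Pattern ORDER_ID = Pattern.compile("ORD-\\d{8}");

        static boolean isValidId(String id) {
            return ORDER_ID.matcher(id).matches();
        }
    }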

The setup is simple enough: a synthetic order analytics pipeline with 100,000 orders, JFR for recording, and JMC for analysis. One method consumed 71% of CPU samples - a classic stream-inside-a-loop generating a million operations per batch for what should have been a single pass. It's easy to miss in code review because it doesn't look like a nested loop; it's a stream pipeline, and streams are supposed to be efficient because they look cool¹. But it's an obvious performance sinkhole once the flame graph shows you where the heat is.
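A hypothetical reconstruction of that shape - names and structure are mine, not his:

    import java.util.List;

    class BatchStats {
        record Order(double total) {}

        // Anti-pattern: for every order, re-stream the entire batch to
        // recompute the same aggregate. An n-element batch does n*n stream
        // operations instead of one pass, but it reads as idiomatic stream
        // code rather than as a nested loop.
        static long flagAboveAverage(List<Order> batch) {
            long flagged = 0;
            for (Order order : batch) {
                double avg = batch.stream()            // recomputed every iteration
                                  .mapToDouble(Order::total)
                                  .average()
                                  .orElse(0.0);
                if (order.total() > avg) flagged++;
            }
            return flagged;
        }
    }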

He fixed it; elapsed time dropped from around 1,000ms to around 400ms, and the profile looked completely different. That's a substantial improvement - there's no mention of an SLA to say whether it's enough, but even so: it's a good fix.
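The fix for that shape is standard: hoist the aggregate out of the loop so it's computed once. Continuing my hypothetical sketch from above:

    import java.util.List;

    class BatchStatsFixed {
        record Order(double total) {}

        // Compute the aggregate once, then make a single filtering pass:
        // O(n) total instead of O(n^2).
        static long flagAboveAverage(List<Order> batch) {
            double avg = batch.stream()
                              .mapToDouble(Order::total)
                              .average()
                              .orElse(0.0);
            return batch.stream()
                        .filter(o -> o.total() > avg)
                        .count();
        }
    }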

But this is the part that's less often written about: fixing the dominant hotspot doesn't just make the system faster, it changes what the profiler can see. The String.format() calls - 300,000 of them across the run, he says - were always there, always burning CPU, always pressuring the GC. They just looked like noise next to a method eating 71% of samples. Remove the noise floor and suddenly they're visible. Same with the autoboxing in FraudScorer and the string concatenation inside a synchronized block in AnalyticsAccumulator. All present from the start; all invisible until the thing drowning them out was gone.
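For reference, those secondary offenders have familiar shapes too; here's how I'd sketch them (hypothetical code, not lifted from the article):

    import java.util.Map;

    class SecondaryHotspots {
        // String.format() parses its format string on every call; in a hot
        // path, plain concatenation is markedly cheaper.
        static String keySlow(long orderId, int region) {
            return String.format("%d:%d", orderId, region);
        }

        static String keyFast(long orderId, int region) {
            return orderId + ":" + region;
        }

        // Autoboxing: every call boxes the long key and the int count, and
        // those short-lived wrappers are exactly the kind of allocation
        // pressure that reads as GC churn once the bigger hotspot is gone.
        static void count(Map<Long, Integer> counts, long id) {
            counts.merge(id, 1, Integer::sum);  // boxes id and 1 on each call
        }
    }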

This is optimization as whack-a-mole, which is how it actually works in practice. You rarely walk into a system, identify a bottleneck, fix it, and ship. You find the loudest problem, fix it, and then look again at what that problem was covering. Occasionally you get lucky - a single caching fix on a reflection-heavy path and you're done, maybe - but the more common experience is that the profile has layers and you have to work through them.

Optimization ends up being less about just winning the game and more about winning the season.

The section on thread contention is worth calling out specifically as an example of this: he says the synchronized method in AnalyticsAccumulator showed zero notable contention at 100 virtual threads. At 2,500 virtual threads processing 500,000 orders, it produced 842 monitor contention events and over 16 seconds of aggregate blocked time. That's not a bug that code review finds trivially. It's not a bug that low-concurrency profiling finds. It lives exclusively at load, which is to say it lives in production. JFR's contention tab shows it cleanly, but only if you think to look for it - and only if you're testing at realistic concurrency levels.
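A toy version of that failure mode - just the shape, not his AnalyticsAccumulator - looks something like this; run it under JFR and the monitor-contention events only pile up once the thread count climbs:

    import java.util.concurrent.Executors;
    import java.util.stream.IntStream;

    class ContentionDemo {
        private static final StringBuilder log = new StringBuilder();

        // Every caller serializes on this monitor, and the string building
        // inside the lock stretches the hold time. Harmless at low
        // concurrency; a wall of blocked-thread events at high concurrency.
        static synchronized void record(int orderId) {
            log.append("order=").append(orderId).append('\n');
        }

        public static void main(String[] args) {
            // One virtual thread per task; close() waits for all of them.
            try (var pool = Executors.newVirtualThreadPerTaskExecutor()) {
                IntStream.range(0, 500_000)
                         .forEach(i -> pool.submit(() -> record(i)));
            }
        }
    }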

The final numbers: 1,198ms to 239ms elapsed, peak heap from 1GB to 139MB, GC pauses from 19 to 4.

What makes this series useful beyond the numbers is that Vogel shows the tooling clearly enough to replicate: how to capture a JFR recording, what to look at first in JMC's overview tab (CPU-bound, allocation-bound, or contention-bound - pick your entry point), and why you fix one thing at a time rather than everything at once. The process is as transferable as the outcomes.
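For anyone following along at home, the capture side needs nothing beyond a modern JDK; these are the standard incantations (names and durations are placeholders to adjust):

    # record from startup
    java -XX:StartFlightRecording=duration=60s,filename=recording.jfr -jar app.jar

    # or attach to a running JVM
    jcmd <pid> JFR.start name=profile duration=60s filename=recording.jfr

    # open recording.jfr in JMC, or skim specific events from the command line
    jfr print --events jdk.JavaMonitorEnter recording.jfr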

One thing he hasn't discussed is the service level agreement, the SLA. If there's a criticism of the series so far, it's that he hasn't mentioned a target, only a process and results. An SLA would define the load the system is expected to bear and the results it should deliver; he mentions 500,000 orders with 2,500 threads, for example, but never says whether the system is actually expected to handle that load.

If the system is expected to support 20,000 orders at 200 threads, then it's interesting that it can handle the higher load, but the job was done at 20,000 and 200. The extra optimization is potentially wasted effort - a hard thing to say, but it's worth asking whether that effort actually contributes anything, or whether it's just a testament to the urge to write optimal code.

This isn't meant as criticism of Mr. Vogel's efforts or his optimization process - and it's entirely possible that this is meant to be a "how far can we take this?" spike project. But in real projects, doneness is an important measure, and that's what an SLA should define: "when can we say we have accomplished our goals?" Instead, we get the impression that optimization is done when we can't find any more hotspots, and real life doesn't follow that rule.

Part three hasn't been published yet.


  1. "Because they look cool" is unfair. But they're often preferred as the new hotness without a careful understanding of what they're actually doing under the hood; I've had arguments with Python coders about generator efficiency, where they said generators were better because "Python didn't need loops" - when the generator was, in fact, replacing a loop. It just looked better.
