5 Operations metrics that Java web developers should care about
Sometimes it’s nice, as a developer, to ignore the world outside and focus on “I’m going to implement what the specification says I should, and anything else is somebody else’s problem”. And sometimes that’s the right thing to do. But if you really want to make your product better, then real production data can provide valuable insights into whether you’re building the right thing, rather than just building the thing right. Operations groups gather this kind of data as part of normal business*, so here are a handful of ops metrics that I’ve found particularly useful from a Java web application point of view. Note that I’m assuming here that you as the developer aren’t actually responsible for running the application in production – rather, you just want the data so you can make better-informed decisions about design and functionality.
How should you expose this information? Well, emailing round a weekly summary is definitely the wrong thing to do. The agile concept of “information radiators” applies here: Big Visible Charts showing this kind of data in real time will allow it to seep into the subconscious of everyone who passes by.
Request & Error rates
How many requests per second does your application normally handle? How often do you get a 5-fold burst in traffic? How often do you get a 50-fold burst? How does your application perform when this happens? Knowing this kind of thing allows you to make better decisions about which parts of the application should be optimised, and which don’t need it yet.
Request rates for one node on a medium-scale CMS. Note the variance throughout the day. The 5AM spike is someone’s crazy web crawler spidering more than it ought to.
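To see bursts at all, you need per-second (or at least per-few-seconds) resolution rather than hourly averages. As a minimal sketch of what the collection side might look like – the class and method names here are illustrative, not a standard API – a ring of one-second buckets is enough to feed a dashboard:

```java
// Hypothetical per-second request counter: a ring of 60 one-second buckets.
// Call record() from a servlet filter; poll countLast() from your metrics exporter.
public class RequestRateCounter {
    private final long[] counts = new long[60];  // requests seen in each bucket
    private final long[] epochs = new long[60];  // which epoch-second each bucket holds

    public synchronized void record(long epochSecond) {
        int i = (int) (epochSecond % 60);
        if (epochs[i] != epochSecond) {          // bucket is stale: reuse it
            epochs[i] = epochSecond;
            counts[i] = 0;
        }
        counts[i]++;
    }

    // Total requests over the most recent 'seconds' buckets (up to 60).
    public synchronized long countLast(long nowEpochSecond, int seconds) {
        long total = 0;
        for (long s = nowEpochSecond - seconds + 1; s <= nowEpochSecond; s++) {
            int i = (int) (s % 60);
            if (epochs[i] == s) total += counts[i];
        }
        return total;
    }
}
```

Comparing the last second against the average over the last minute is a crude but effective burst detector.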
Error rates – whether you count HTTP 500 responses or stack traces in your logs – are extremely useful for spotting those edge cases that you thought should never happen, but actually turn out to be disappointingly common.
Whoa. What are all those spikes? Better go take a look in the logs…
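The counting itself is trivial – the value is in graphing it. A sketch of a counter you might wrap around your response handling (the names are illustrative):

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical tracker for the proportion of responses that were server errors.
// record() would typically be called from a servlet filter after the response
// status is known.
public class ErrorRateTracker {
    private final AtomicLong total = new AtomicLong();
    private final AtomicLong errors = new AtomicLong();

    public void record(int httpStatus) {
        total.incrementAndGet();
        if (httpStatus >= 500) errors.incrementAndGet();
    }

    // Fraction of requests that resulted in a 5xx response.
    public double errorRatio() {
        long t = total.get();
        return t == 0 ? 0.0 : (double) errors.get() / t;
    }
}
```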
Garbage collection
GC is a very common cause of application slowdowns, but it’s also not unusual to find people blaming GC for problems that are entirely unrelated. Enabling GC logging, and capturing the results, will allow you to get a feel for what “normal” means in the context of your application, which can help both in identifying GC problems and in excluding GC from the list of suspects. The most helpful metrics to track, in my experience, are the minimum heap size (which will vary, but should drop back down to roughly the same value after each full GC) and the frequency of full GCs (which should be low and constant).
A healthy-looking application. Full GCs (the big drops) are happening about once per hour at peak, and although there’s a bit of variance in the minimum heap levels, there’s no noticeable upwards trend.
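Log-file flags such as -verbose:gc (or -Xlog:gc on more recent JVMs) give you the raw data, but you can also sample collector counts from inside the process via the standard java.lang.management API, which makes it easy to push the numbers to whatever metrics system you already have:

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Read cumulative collection counts and times from the JVM's own GC beans.
// Sampling this on a timer and subtracting successive readings gives you
// the full-GC frequency described above.
public class GcStats {
    public static void printGcCounts() {
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.printf("%s: %d collections, %d ms total%n",
                    gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
        }
    }

    public static void main(String[] args) {
        printGcCounts();
    }
}
```

Which beans appear (and which of them represent full collections) depends on the collector your JVM is running.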
Request duration
Request duration is a fairly standard non-functional requirement for most web applications. What tends not to be measured and tracked quite so well is the variance in request times in production. It’s not much good having an average page-load time of 50 milliseconds if 20% of your requests are taking 10 seconds to load. Facebook credit their focus on minimising variance as a key enabler of their ability to scale to the size they have.
The jitter on this graph gives a reasonable approximation of the variance, though it’s not quite as good as a proper statistical measure. Note that the requests have been split into several different kinds, each of which has different performance characteristics.
Request speed with proper variance overlaid
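Percentiles make the mean-versus-variance point concrete. A small sketch (the durations below are made up for illustration) showing how badly an average can hide a slow tail:

```java
import java.util.Arrays;

// Mean vs nearest-rank percentile for a set of request durations.
// One slow request barely moves the mean but dominates the 95th percentile.
public class LatencyStats {
    public static double mean(long[] millis) {
        return Arrays.stream(millis).average().orElse(0.0);
    }

    // Nearest-rank percentile, p in (0, 100].
    public static long percentile(long[] millis, double p) {
        long[] sorted = millis.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p / 100.0 * sorted.length);
        return sorted[Math.max(0, rank - 1)];
    }
}
```

Nine 10 ms requests plus one 1000 ms request gives a mean of 109 ms – but a 95th percentile of a full second, which is what that unlucky 10% of users actually experience.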
Worker pool sizes
How many apache processes does your application normally use? How often do you get to within 10% of the apache worker limit? What about tomcat worker threads? Pooled JDBC connections? Asynchronous worker pools? All of these rate-limiting mechanisms need careful observation, or you’re likely to run into hard-to-reproduce performance problems. Simply increasing pool sizes is inefficient: it will more likely than not just move the bottleneck somewhere else, and will leave you with less protection against request-flooding denial-of-service attacks (deliberate or otherwise).
Apache worker pool, split by worker state. Remember that enabling HTTP keepalive on your front-end makes a huge difference to client performance, but will require a significantly larger pool of connections (most of which will be idle for most of the time)
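For pools inside the JVM, java.util.concurrent’s ThreadPoolExecutor already exposes the numbers you need. A minimal sketch of the “within 10% of the limit” check mentioned above – the class name and the headroom threshold are illustrative choices, not a standard:

```java
import java.util.concurrent.ThreadPoolExecutor;

// Flags when a pool's active worker count approaches its configured maximum.
// Sample this periodically and graph/alert on it.
public class PoolWatcher {
    // headroom = 0.1 means "warn when within 10% of the limit".
    public static boolean nearLimit(ThreadPoolExecutor pool, double headroom) {
        return pool.getActiveCount() >= pool.getMaximumPoolSize() * (1.0 - headroom);
    }
}
```

JDBC connection pools and Tomcat’s worker threads expose equivalent gauges (usually via JMX), so the same sampling approach applies across the stack.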
If you have large numbers of very long-running requests bound by the network speed of your clients (e.g. large file downloads), consider offloading them to your web tier using mod_x_sendfile or similar. For long-running requests that are bound by server-side performance (giant database queries or complex computations), consider making them asynchronous, and having the client poll periodically for status.
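The submit-then-poll pattern can be sketched as follows – this is a minimal illustration using only the standard library, with hypothetical names; a real application would wire it into its controllers and add timeouts and cleanup of finished jobs:

```java
import java.util.UUID;
import java.util.concurrent.Callable;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// The initial request starts the work and returns a job id immediately,
// freeing up the worker thread; subsequent requests poll status(id).
public class JobTracker {
    private final ExecutorService workers = Executors.newFixedThreadPool(4);
    private final ConcurrentMap<String, Future<String>> jobs = new ConcurrentHashMap<>();

    // Called by the initial request: start the work, hand back an id.
    public String submit(Callable<String> work) {
        String id = UUID.randomUUID().toString();
        jobs.put(id, workers.submit(work));
        return id;
    }

    // Called by the polling request.
    public String status(String id) {
        Future<String> f = jobs.get(id);
        if (f == null) return "UNKNOWN";
        if (!f.isDone()) return "PENDING";
        try {
            return "DONE: " + f.get();
        } catch (Exception e) {
            return "FAILED: " + e.getMessage();
        }
    }

    public void shutdown() { workers.shutdown(); }
}
```

The pool size now bounds the expensive server-side work directly, rather than being consumed by requests that are merely waiting.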
Helpdesk call statistics
The primary role of a helpdesk is to ensure that second- and third-tier support isn’t overwhelmed by the volume of calls coming in. And many helpdesks measure performance in terms of the number of calls which agents are able to close without escalation. This can sometimes create a perverse incentive: you release a new, buggy version of your software, users begin calling to complain, but the helpdesk simply issue the same workaround instructions over and over again – after all, that’s what they’re paid for. If only you’d known, you could have rolled back, or rolled out a patch there and then, instead of waiting for the fortnightly call-volumes meeting to highlight the issue. If you can get near-real-time statistics for the volume of calls (ideally, just those related to your service), you can look for sudden jumps and ask yourself what might be causing them.
* “But but… my operations team doesn’t have graphs like this!” you say? Well, if only there were some kind of person who could build such a thing… hold on a minute, that’s you, isn’t it? Building the infrastructure to collect and manage this kind of data is pretty much a solved problem nowadays, and ensuring that it’s all in place really should be part and parcel of the overall development effort for any large-scale web application project. Of course, if operations don’t want to have access to this kind of information, then you have a different problem – which is a topic for a whole ‘nother blog post some other time.