How to look an idiot in 1 easy step
(I debated whether this was one of those things better kept a secret, but in the end I decided it was more fun to share…)
So, we have a little system (based on Nagios) that we use for monitoring all the various servers and applications that we run in webdev. It tracks the status of about 60 assorted web services of one sort and another and, amongst other things, keeps logs of performance. Like this.
Now, a couple of weeks ago I noticed that some of our services seemed to have got a bit more 'spiky'. Not by much – about 20 milliseconds or so more latency, and not terribly consistent.
I didn't pay it too much attention – it wasn't enough to be visible to end users, and we often get short periods of lag like this when the network's playing up.
Then today, it suddenly stopped, and the graph went flat again. Curious, I thought. So I rang up a colleague in the network team, and asked if they'd changed anything that might have improved performance for us. They hadn't, but requested a bunch of extra information so they could do some more diagnostics.
I started gathering some more data, and only then noticed that all of our services seemed to have got 20ms faster at almost exactly the same time, 10am. Surely they must have done something to the network at Westwood (where the monitor is) to have affected so many boxes at once?
Then suddenly I had one of those 'oh bugger' moments, as I recalled going into our machine room at about 9:45 that morning, and noticing that someone (probably me) had left a console logged in to the monitoring server, which was now happily spinning away a screensaver. I logged it out at, yes, exactly 10 am.
So all the lag was nothing whatsoever to do with the network; it was in fact the monitoring server trying to task-switch between an OpenGL screensaver and a timed TCP connection. Ooooopie. Time for a hasty message to the network team to ask them not to look at it any further, and to apologise for wasting their time.
And the moral of this story? There isn't one really. 'Don't run screensavers on your production boxes' is a bit too obvious, isn't it? In my defence, I'd point out that the CPU load that the screensaver imposes is tiny – just a couple of percent – but it's a single-CPU box, and the cost of context-switching back and forth between X and the monitoring process is, oh, I don't know, about 20 milliseconds?
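For anyone wondering how a busy monitor ends up looking like a slow network: a timed check of this sort is basically a stopwatch wrapped around a connection attempt, so any time the monitoring process spends waiting for its turn on the CPU lands in the measurement too. Here's a minimal sketch in Python. It is only an illustration, not how our Nagios checks are actually written, and `timed_connect` is a made-up name.

```python
import socket
import time


def timed_connect(host: str, port: int, timeout: float = 5.0) -> float:
    """Return the wall-clock time, in milliseconds, taken to open a TCP connection.

    The stopwatch runs on the monitoring box, so the result includes any
    local scheduling delay as well as the network round trip.
    """
    start = time.perf_counter()
    with socket.create_connection((host, port), timeout=timeout):
        pass  # we only care how long the connection took to open
    return (time.perf_counter() - start) * 1000.0


if __name__ == "__main__":
    # Throwaway listener on the loopback interface: no real network involved,
    # so any spikes in these samples come from the local machine.
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
    server.listen(16)
    host, port = server.getsockname()

    samples = [timed_connect(host, port) for _ in range(10)]
    server.close()

    print(f"min {min(samples):.3f} ms, max {max(samples):.3f} ms")
```

Run against loopback like this, the numbers can't be blamed on the network at all, which makes it a handy sanity check before ringing anyone up.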