January 15, 2009
OK, so I’m finally (I think) getting to the bottom of our longest-running performance issue.
We have an Apache server which occasionally seems to be unable to handle requests. To begin with, the symptoms were something like this: at certain times of day, the number of Apache worker processes (we’re using the prefork MPM) would go through the roof, but no requests ever completed. Restarting the server seemed to help sometimes; other times we’d restart and the server would just sit there, not spawning any httpds at all. It was all a bit of a mystery.
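A quick way to watch for this is just to count the httpd processes; a one-liner along these lines (nothing fancier than ps and grep) makes the spikes easy to spot:

```shell
# Count running httpd processes -- with the prefork MPM there is one
# process per worker, so a sudden jump here matches the symptom above.
# The [h] trick stops grep from matching its own process.
n=`ps -ef | grep -c '[h]ttpd'`
echo "httpd workers: $n"
```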
The times at which this happened seemed to coincide with when our backups were running, so my first thought was file-locking – perhaps the backups were preventing Apache from getting a lock on a mutex file, or something like that. But disabling the backups didn’t have any effect. Then I wondered if it might be a memory shortage (since we’d had similar problems on another server recently, caused by it running out of swap space due to a leaky httpd). Again, investigations didn’t turn anything up.
Then, I looked in the conf file, and found a couple of proxying redirects, like this:
RewriteRule (.*) http://foo.bar.ac.uk/buz$1 [P]
Alarm bells went off immediately; this is going to require a host name lookup on every request. Now, that ought not to matter, since (on Solaris) nscd should be caching those lookups – but nscd is suspected to have ‘issues’, particularly under heavy concurrent loads.
So, step 1; replace host names with IP addresses. Sure, we might one day need to update them if we ever change DNS, but that’s not something that happens often.
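So the rule above becomes something like this (192.0.2.10 stands in for whatever foo.bar.ac.uk actually resolves to):

```
# Proxy via the IP address instead of the name -- no per-request
# DNS/nscd lookup.  192.0.2.10 is a placeholder address.
RewriteRule (.*) http://192.0.2.10/buz$1 [P]
```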
This certainly helped matters, but didn’t exactly fix them. We got fewer, shorter-lived slowdowns, but they were still there. However, something had changed. Whereas before we were getting loads of httpd processes, now we’d have barely any, until suddenly we’d get 200 being spawned at once (shortly followed by the problem going away).
Running pstack on the top-level apache whilst it was stuck like this was revealing:
feb65bfa door     (4, 8047658, 0, 0, 0, 3)
feaff286 _nsc_try1door (feb8f428, 8047788, 804778c, 8047790, 8047758) + 6c
feaff4f0 _nsc_trydoorcall_ext (8047788, 804778c, 8047790) + 178
feb0c247 _nsc_search (feb8f228, feaf767c, 6, 80477f4) + b5
feb0af3f nss_search (feb8f228, feaf767c, 6, 80477f4) + 27
feaf7c0f _getgroupsbymember (81bd1c0, 867b220, 10, 1) + dc
feb00c5b initgroups (81bd1c0, ea61, 8047c88, 808586e) + 5b
080858a5 unixd_setup_child (0, 0, 0, 0, 0, 867b4b0) + 41
0806d0a3 child_main (10, 1, 1, 0) + e7
0806d52b make_child (0, f, 7, e, 4, 0) + d7
0806df01 ap_mpm_run (80b3d70, 80dfe20, 80b5b50, 80b5b50) + 93d
08072f67 main     (4, 8047e40, 8047e54) + 5cb
08067cb4 _start   (4, 8047ed8, 8047ef5, 8047ef8, 8047efe, 0) + 80
The top-level apache is trying to fork a new worker. But in order to do that, it needs to set the user and group privileges on the new process, and in order to do that, it needs to find the groups that the user belongs to. Since this server uses NIS to specify groups, apache has to make a call to NIS (via nscd), to list all the groups (despite the fact that the web server user isn’t actually a member of any NIS groups – it has to make the call anyway, to verify that this is the case).
So, for some reason, NIS is being slow. Maybe as a result of the high traffic levels that the backups are pushing around, the NIS requests are taking a very long time to process, and that’s preventing apache from forking new workers. When NIS finally comes back, apache has loads of requests stacked up in the listen backlog, so it spawns as many workers as it can to process them – hence the sudden jump just before everything starts working again.
To test this theory out, I wrote a teeny script that just did
time groups webservd
every 30 seconds, and recorded the result. To my dismay, lookups could take anything from 1 second to 5 minutes. Clearly, something’s wrong. Unsurprisingly, the slow lookups coincided with the times that Apache was slow. Running the same check on the NIS server itself revealed no such slowness; lookups were consistently returning in <1 second.
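For the record, the probe looked something like this (a reconstruction, not the original script – "webservd" is the web-server account on our box; substitute your own):

```shell
#!/bin/sh
# Time one group lookup for the web-server user and print the result
# with a timestamp.  Run it in a loop ("while :; do ...; sleep 30; done")
# or from cron to build up a log over the day.
# Note: "date +%s" needs a reasonably modern date(1); on older Solaris
# something like perl -e 'print time' does the same job.
USER_TO_CHECK="${1:-webservd}"

start=`date +%s`
groups "$USER_TO_CHECK" >/dev/null 2>&1
end=`date +%s`
result="`date` ${USER_TO_CHECK} elapsed=`expr $end - $start`s"
echo "$result"
```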
So, a fairly simple solution: make the web server a NIS slave. This appears to have solved the problem (though it’s only had a day or so of testing so far). Why a busy network should cause NIS lookups to be slow on this particular server (other servers in the same subnet were unaffected) I have no idea. It’s not an especially great solution though, particularly if I have to apply it to lots of other servers (NIS replication times scale with the number of slaves, unless we set up slaves-of-slaves).
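For reference, setting up the slave was roughly this (Solaris-style commands; your paths and domain name will differ, and the new slave also has to be listed in the ypservers map on the master):

```
# On the new slave (as root):
domainname our.nis.domain         # must match the master's NIS domain
/usr/sbin/ypinit -s nismaster     # pull all the maps from the master
/usr/lib/netsvc/yp/ypstart        # start ypserv/ypbind locally
```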
A nicer long-term solution would be to disable NIS groups entirely. On an app/web server there’s no great benefit to having non-local groups; it’s not as if we’ve got a large number of local users to manage. Alternatively, using a threaded worker model would sidestep the problem by never needing to do NIS lookups except at startup.
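Concretely, that would just mean dropping nis from the group line in /etc/nsswitch.conf (a sketch – worth checking first that nothing else on the box depends on NIS groups):

```
# /etc/nsswitch.conf
# before:  group: files nis
group: files
```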