January 15, 2009

NIS on app servers is EVIL

OK, so I’m finally (I think) getting to the bottom of our longest-running performance issue.

We have an apache server which occasionally seems to be unable to handle requests. To begin with, the symptoms were something like this; at certain times of day, the number of apache worker processes (we’re using the prefork MPM) would go through the roof, but no requests ever completed. Restarting the server seemed to help sometimes, other times we’d restart and the server would just sit there not spawning any httpd’s at all. It was all a bit of a mystery.

The times at which this happened seemed to co-incide with when our backups were running, so my first thought was file-locking – perhaps the backups were preventing apache from getting a lock on a mutex file, or something like that. But disabling the backups didn’t have any effect. Then I wondered if it might be a memory shortage (since we’d had similar problems on another server recently, caused by it running out of swap space due to a leaky httpd). Again, investigations didn’t show anything up.

Then, I looked in the conf file, and found a couple of proxying redirects, like this:

RewriteRule (.*) http://foo.bar.ac.uk/buz$1 [P] 

Alarm bells went off immediately; this is going to require a host name lookup on every request. Now, that ought not to matter, since (on solaris) ncsd should be cacheing those lookups – but nscd is suspected to have ‘issues’, particularly under heavy concurrent loads.

So, step 1; replace host names with IP addresses. Sure, we might one day need to update them if we ever change DNS, but that’s not something that happens often.

This certainly helped matters, but didn’t exactly fix them. We got fewer, shorter-lived slowdowns, but they were still there. However, something had changed. Whereas before we were getting loads of httpd processes, now we’d have barely any, until suddenly we’d get 200 being spawned at once (shortly followed by the problem going away).

Running pstack on the top-level apache whilst it was stuck like this was revealing:

 feb65bfa door     (4, 8047658, 0, 0, 0, 3)
 feaff286 _nsc_try1door (feb8f428, 8047788, 804778c, 8047790, 8047758) + 6c
 feaff4f0 _nsc_trydoorcall_ext (8047788, 804778c, 8047790) + 178
 feb0c247 _nsc_search (feb8f228, feaf767c, 6, 80477f4) + b5
 feb0af3f nss_search (feb8f228, feaf767c, 6, 80477f4) + 27
 feaf7c0f _getgroupsbymember (81bd1c0, 867b220, 10, 1) + dc
 feb00c5b initgroups (81bd1c0, ea61, 8047c88, 808586e) + 5b
 080858a5 unixd_setup_child (0, 0, 0, 0, 0, 867b4b0) + 41
 0806d0a3 child_main (10, 1, 1, 0) + e7
 0806d52b make_child (0, f, 7, e, 4, 0) + d7
 0806df01 ap_mpm_run (80b3d70, 80dfe20, 80b5b50, 80b5b50) + 93d
 08072f67 main     (4, 8047e40, 8047e54) + 5cb
 08067cb4 _start   (4, 8047ed8, 8047ef5, 8047ef8, 8047efe, 0) + 80

The top-level apache is trying to fork a new worker. But in order to do that, it needs to set the user and group privileges on the new process, and in order to do that, it needs to find the groups that the user belongs to. Since this server uses NIS to specify groups, apache has to make a call to NIS (via nscd), to list all the groups (despite the fact that the web server user isn’t actually a member of any NIS groups – it has to make the call anyway, to verify that this is the case).

So, for some reason, NIS is being slow. Maybe as a result of the high traffic levels that the backups are pushing around, the NIS requests are taking a very long time to process, and that’s preventing apache from forking new workers. When NIS finally comes back, apache has loads of requests stacked up in the listen backlog, so it spawns as many workers as it can to process them – hence the sudden jump just before everything starts working again.

To test this theory out, I wrote a teeny script that just did

time groups webservd

every 30 seconds, and recorded the result. To my dismay, lookups could take anything from 1 second to 5 minutes. Clearly, something’s wrong. Unsurprisingly, the slow lookups coincided with the times that apache was slow. running the same check on the NIS server itself revealed no such slowness; lookups were consistently returning in <1 second.

So, a fairly simple solution: Make the web server a NIS slave. This appears to have solved the problem, so far (though it’s only had a day or so of testing so far). Why a busy network should cause NIS lookups to be slow on this particular server (other servers in the same subnet were unaffected) I have no idea. It’s not an especially great solution though, particularly if I have to apply it to lots of other servers (NIS replication times scale with the number of slaves, unless we set up slaves-of-slaves).

A nicer long-term solution would be to disable NIS groups entirely. On an app/web server there’s no great benefit to having non-local groups, it’s not as if we’ve got a large number of local users to manage. Alterntatively, using a threaded worker model would sidestep the problem by never needing to do NIS lookups except at startup.

- 3 comments by 2 or more people Not publicly viewable

  1. Steve Rumsby

    I’ve never used NIS on critical servers. As you say, it is not as if your really need it, and the consequences of it going wrong make it not worth it. Actually, my opinion of NIS was built when I ran it 10+ years ago and I was thinking it was probably much better now, and that I might start using it again. Thanks for putting me straight…!

    15 Jan 2009, 17:25

  2. lamby

    Somewhat similar to this, there are a whole catalogue of issues where daemons do synchronous reverse DNS lookups upon receiving connections. OpenSSH and MySQL are the worst offenders I’ve seen, although I’d the latter is better since version 5.0. Would be curious to know why you aren’t using the mpm-worker already though.

    15 Jan 2009, 18:10

  3. Chris May

    Would be curious to know why you aren’t using the mpm-worker already though.

    Intertia, pretty much. We’ve been using prefork since forever, and it’s a case of ‘better the devil you know’.

    Every now and then I think about switching, then I read something about some leak or bug in the worker MPM (worker + mod_ssl + solaris seems a bit flaky), then I start wondering if nginx or lighttpd wouldn’t be a better choice, and by this time it’s such a complex set of choices I go back to my trusty prefork again :-)

    Prefork does have a few advantages; leaks are generally easy to contain, it’s easy to debug individual workers, and worker crashes are rarely fatal. But TBH if I were choosing my infrastructure today, for a green-field project, it wouldn’t be top of my list.

    15 Jan 2009, 18:36

Add a comment

You are not allowed to comment on this entry as it has restricted commenting permissions.

Most recent entries


Search this blog

on twitter...


    Not signed in
    Sign in

    Powered by BlogBuilder
    © MMXXI