September 10, 2008

apache "fork: Unable to fork new process" errors on solaris

(note google-bait title; I hope this helps someone else out).

So, we had a problem where, every now and then, a sudden rush of requests to our webserver would lead to apache saying

“fork: Unable to fork new process”

in the error logs, once it tried to spawn more than ~400 httpds – and for a little while no-one got any webbage. I spent some time looking into why this should be, and never really got anywhere. In each case, a hard restart of apache would fix it. I could see that there was a problem with our apache; each httpd process starts out life at ~8MB, but after a few months of running (with lots of “apachectl graceful”s, but no full restarts) would be more like 100MB. But, looking into the process, almost all of that was shared memory:

# pmap -ax 4987
 4987:   /opt/coolstack/apache2/bin/httpd -k start
 Address  Kbytes     RSS    Anon  Locked Mode   Mapped File
0803F000      36      36      12       - rwx--    [ stack ]
08050000     320     312       -       - r-x--  httpd
080AF000      12      12       8       - rwx--  httpd
080B2000       4       4       4       - rwx--  httpd
080B3000  116456  116324     104       - rwx--    [ heap ]
FDFA0000     184     184      16       - rw-s-    [ anon ]
FE000000     504     184       -       - rw-s-    [ anon ]
FE080000      64      16       -       - rwx--    [ anon ]
FE0A0000      24      24       -       - r-x--  mod_proxy_http.so
... other inconsequential items...
-------- ------- ------- ------- -------
total Kb  124144  121736     220       -

so this shouldn’t matter. Even if there were 1000 httpds, with an anonymous allocation of 220K each, that isn’t going to make a dent on our server, which has about 50GB of VM in total.

Additionally, Solaris maintains a cache for ZFS file systems (the ARC), which will, by default, use up almost all of the RAM on the box. However, the cache allocations are special; a call to fork() or malloc() is allowed to eat into cache memory whenever it needs to, and the ARC shrinks to make room. But I could see on our box that the ZFS cache was sat at about 20GB – so if ZFS is still using all this RAM, why can’t apache?
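
(If you want to see how big the ARC actually is at any given moment, kstat will tell you – this should work on any ZFS-enabled Solaris 10 box, though I’m quoting the stat names from memory:

kstat -p zfs:0:arcstats:size
kstat -p zfs:0:arcstats:c_max

– the first is the current size in bytes, the second the ceiling it’s allowed to grow to.)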

Well, predictably enough, my failure to analyse the problem came back to bite us. One day, instead of just apache being unable to fork, the whole box locked up. I couldn’t even run ‘ps’ to find a pid to kill. So, we transferred the service as quickly as possible onto a standby box, and left the wedged server to itself.

Clearly, a deeper understanding was required. I went to chat with our resident solaris guru, who explained what was going on.

When unix fork()s a process, the child gets a copy-on-write duplicate of the parent’s address space. The OS doesn’t yet know how much of that memory will actually stay shared, and how much the child will go on to write to (and so need its own copy of), so it has to reserve backing store for the entire space (i.e. 100MB per process, in the case of our apaches).

In Linux, the OS will ‘overcommit’, and allow processes to carry on forking, even when all the virtual memory has already been allocated. In the unlikely event that all the processes actually need all the space they’ve been allocated, the ‘OOM killer’ comes into play: a kernel routine which picks a process using a large amount of memory and kills it. This makes things very efficient, but a little unpredictable.
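
For comparison, Linux exposes this behaviour as a tunable; a quick sketch (not something we run here):

cat /proc/sys/vm/overcommit_memory

0 is the default heuristic overcommit, 1 means ‘always say yes’, and 2 turns overcommit off in favour of strict accounting – which is much closer to the Solaris behaviour described next.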

On solaris, by comparison, overcommit is not allowed. If you want to fork() a 100MB process, there must be 100MB of free virtual memory left on the system. So it’s now easy to see why our httpds were failing to fork: 100MB * 400 processes = 40GB – once you’ve added in the 10GB of oracle SGA, 5GB of java heap, and sundry other processes, that’s everything all gone.
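
You can watch the reservations mounting up with swap -s, which reports allocated and reserved swap separately from what’s actually in use; when the ‘available’ figure heads towards zero, fork() is about to start failing, however much physical RAM still looks free:

swap -s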

Meanwhile, what about that 20GB of ZFS cache? Well, it turns out that this is allowed to share space with reserved-but-not-used VM. Since all the apaches were only actually using a tiny bit of their reservation, there was plenty of space for the ARC to sit in.

So, there are a couple of solutions:

1) Allocate a shedload of swap space, knowing that it’ll never actually get used; it wouldn’t really hurt to have, say, 100G of swap sitting idle (see the sketch after this list). Except that we’d need to get some more disks.

2) Stop apache leaking. This would be the ideal solution; a webserver that takes 100MB of heap does seem a bit on the excessive side, even to a hardened java programmer like me ;-). But whether it’s possible or not, I don’t know. The standby server has a more up-to-date version of apache, so maybe the problem will magically fix itself…

3) Periodically restart apache. Ugghh. Really? This isn’t windows, you know…periodic hard restarts of user-facing services, with all the associated risk and downtime, are really not something I want to get into.

4) Front apache with squid (or haproxy, varnish, an F5, whatever), and periodically swap between two separate apache instances, allowing either one to be killed off as required. Better, but a helluva lot of extra infrastructure just to fix a leaky webserver.

5) Use lighttpd. Hmm…..
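
For what it’s worth, option 1 is only a couple of commands on a ZFS box – assuming a pool called rpool with the space to spare, something like this should do it (a sketch, not something we’ve actually done yet):

zfs create -V 32g rpool/swap2
swap -a /dev/zvol/dsk/rpool/swap2
swap -l

(On UFS you’d mkfile a big file and swap -a that instead.)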

Update: not quite the same as our problem, but I’m reproducing it here for the benefit of anyone else suffering; a reader contacted me to observe that a recent Sun patch had upped the ServerLimit directive to 2048, and that this had led to very high (>100MB/process) memory use. I can see how this could be the case, particularly if you’re using a multithreaded MPM like Worker, so it’s worth watching out for.


Finding deleted files in solaris

Today I was doing some routine cleaning-up of /tmp directories, when I ran into a problem:

I spotted a big file that could be deleted: /tmp/logfiles/00000001/00000001 . It’s 1.6GB, and all it’s doing is eating up swap space.

So, I run rm /tmp/logfiles/00000001/00000001, and the file goes, but df -h /tmp shows that the space is still in use.

Bugger. Some process somewhere still has an open file handle on the file. The space won’t get reclaimed until I can find and kill that process.

lsof could do this, but for various tedious reasons I don’t have lsof on this box.

However, 1.6GB is unusually big for a file. There can’t be many processes with a handle on a file that’s more than 1GB in size, can there?

Procfs to the rescue! /proc in solaris has a directory for every process running on the system, and within that an fd directory with a magical link to every open file handle. Even if the file has been rm-ed, the fd entry will still let you get at its bytes. fd entries don’t name the files, but they do list their sizes.

So, all I need to do is look through all the FD directories, for a file of ~ 1.6 GB in size, and that will give me the process ID that’s got it open…

take 1:

cd /proc   # the fd links live under /proc/<pid>/fd
for file in `ls -l  | grep root | awk '{print $9}' `; do echo "PID $file"; ls -l $file/fd 2>/dev/null; done | awk '$5 > 1000000 || /^PID/' | grep -v ','

clunky, but it works! It gives me back a PID, and running ‘pargs’ on the PID, I get

-bash-3.00# pargs 1859
1859:   less logfiles/00000001/00000001
argv[0]: less
argv[1]: logfiles/00000001/00000001

- someone {ahem} tried to less the file earlier, and gave up when it didn’t work :-).

A quick kill, and my swap space is back again.

Now, I’m fairly sure that I could do the same thing with a single ‘find’ incantation. That can wait for v2, though…
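
In case it’s useful, something along these lines ought to do it (an untested sketch; Solaris find takes a ‘c’ suffix on -size to mean bytes, so this prints any /proc/<pid>/fd entry over a gigabyte):

find /proc/[0-9]*/fd -type f -size +1000000000c 2>/dev/null

– the pid is right there in the path of anything it prints, ready for pargs and kill.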

