Weird Apache / mod_jk / JBoss keepalive bug
A curious bug manifested itself on our production web server yesterday. The server runs Apache with MaxClients=500 and handles about 3 million hits per day. About 80% of requests are for static files; the remainder are dynamic, handled by mod_jk delegating to two JBoss instances (round-robin load-balancing), each with 250 Tomcat worker threads.
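For reference, a mod_jk setup along these lines is usually wired up via a workers.properties file. This is just an illustrative sketch of that shape of config, not our actual file; the node names, hosts, and ports are made up:

```
# workers.properties -- illustrative sketch, not our actual config
# Two JBoss/Tomcat instances behind a round-robin load-balancer worker
worker.list=loadbalancer

worker.node1.type=ajp13
worker.node1.host=10.0.0.1
worker.node1.port=8009
worker.node1.lbfactor=1

worker.node2.type=ajp13
worker.node2.host=10.0.0.2
worker.node2.port=8009
worker.node2.lbfactor=1

# The "lb" worker round-robins requests across the two AJP workers
worker.loadbalancer.type=lb
worker.loadbalancer.balance_workers=node1,node2
```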
Now, round about mid-morning, we got a fairly big spike in requests per second, and the site very quickly became unresponsive. Apache reported about half of its workers tied up in keepalives, and the remainder waiting on a response from JBoss.
JBoss reported that all of its mod_jk worker threads were busy, but only a very small number were actually involved in servicing requests; most were in a 'K' (keepalive) state. The load on the box was very light (only a few % CPU), and testing via the JBoss HTTP interface (which is unaffected by mod_jk) suggested there were no problems actually handling requests.
Now, a mod_jk/AJP keepalive, as I understand it, isn't supposed to work like an HTTP keepalive. An Apache worker in a keepalive state is reserved for the exclusive use of the client (browser) connected to it; if too many distinct clients connect and hold keepalives, you'll run out of httpds. An AJP keepalive, though, is a keepalive between the Apache server and the JBoss server: although it knows the client's IP address, it shouldn't be limited to serving only that client's requests. And even if it were, it should recognize when Apache terminates the keepalive connection with the client (after 15 seconds or 100 requests, in our case) and make itself available for other client connections.
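The client-side limits mentioned above (15 seconds, 100 requests) correspond to the standard Apache keepalive directives in httpd.conf:

```
# httpd.conf -- the client-side keepalive limits mentioned above
KeepAlive On
KeepAliveTimeout 15        # drop an idle client connection after 15 seconds
MaxKeepAliveRequests 100   # or after 100 requests on the same connection
```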
However, in our setup that didn't seem to be the case. It seems that once an AJP connection is marked as holding a keepalive to a client IP address, it will not service requests from any other IP address for some relatively long period (not infinite, but much longer than Apache's timeout, anyway). The result is that under load it doesn't take long for all of your worker threads to be tied to particular clients, idly waiting for their next request.
We solved the problem, rather kludgily, by simply disabling client keepalives in the web server. This makes page rendering slightly slower (since each request must set up and tear down a TCP connection for every image, stylesheet, and other static resource the page references), but it's not really noticeable. It's had the additional benefit that our Apache server has gone from about 250 active httpds on average to about 50.
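The workaround itself is a one-line change in httpd.conf:

```
# httpd.conf -- the kludge: force a fresh TCP connection for every request
KeepAlive Off
```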
A much better solution (apart from fixing mod_jk), IMO, would be to ditch mod_jk in favour of a connectionless HTTP load balancer like haproxy. Then we could re-enable keepalives between the web server and the client, giving fast loads for the 80% of static content, but disable them between the web server and the JBoss server, preventing JBoss from tying up threads that just sit idle waiting on a client keepalive. Plus, switching to HTTP would allow us to do funky pipeline things like sticking a Squid cache between the web server and JBoss, to speed things up further. Additionally, the overhead of TCP connection establishment, whilst it might be significant for a static request for a 2K stylesheet, simply doesn't figure in the time to render a page with 30-odd database queries and a stack of Java code behind it.
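A sketch of what that haproxy setup might look like; the hostnames, ports, and timeouts are all illustrative assumptions, not a tested config:

```
# haproxy.cfg -- sketch of the proposed setup (hosts/ports are made up)
defaults
    mode http
    option httpclose      # close connections after each request:
                          # no keepalive held open towards the JBoss backends
    timeout connect 5s
    timeout client  30s
    timeout server  30s

# Apache proxies dynamic requests here instead of via mod_jk/AJP
listen jboss_http
    bind 127.0.0.1:8888
    balance roundrobin
    server node1 10.0.0.1:8080 check
    server node2 10.0.0.2:8080 check
```

Because the balancer speaks plain HTTP rather than AJP, a cache like Squid can be dropped between Apache and the backends with no change to either end.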