mod_xSendfile, or how to enjoy a slashdotting
Sitebuilder is our home-grown Content Management System. As these things go, it’s neither very big nor very small – it puts out about 500,000 page views, or 100GB of content per day. It’s a couple of instances of a java web app, fronted by apache, with an oracle back-end. The whole shebang runs on a single quad-opteron Sun V40z server.
Now, late last week I happened to be munging through some of the server stats, and noticed that the app servers spend about 90% of their time serving binary files. About 50% of this is spent serving ‘large’ files (over a few MB in size), which is interesting, because it’s pretty inefficient. Serving content straight out of a java web app is a bad thing because the speed you can serve at is usually constrained by downstream network issues. So you end up with a load of threads in the app server sitting around, doing nothing but shoving bytes onto the network, waiting for an ack, and then repeating. Each of those threads is using up resources, and if you’re not careful and use things like open-connection-in-view, it’ll be using up database connections and the like as well.
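As a rough illustration of what each of those threads is doing, here is a minimal sketch of a user-space copy loop. It is not Sitebuilder's actual code; the class name, method name and buffer size are made up for the example.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

public class StreamingCopy {
    // Hypothetical helper: copies a file to the client through a user-space buffer.
    // The calling thread stays occupied until the last byte has been written.
    static long copy(InputStream in, OutputStream out) throws IOException {
        byte[] buffer = new byte[8192]; // arbitrary buffer size for the sketch
        long total = 0;
        int read;
        while ((read = in.read(buffer)) != -1) {
            // On a real socket this write can block while a slow client catches up.
            out.write(buffer, 0, read);
            total += read;
        }
        out.flush();
        return total;
    }
}
```

With a slow client on the other end, the loop spends nearly all its time blocked in write(), which is why one thread per download gets expensive.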
On the odd occasions when you aren’t limited by the client bandwidth (e.g. when a bunch of students with GB connections straight into the campus network start downloading big files en masse) you run into a different class of problem. Putting a file onto the network in user space is pretty inefficient, with fiddly little byte buffers being copied around all over the place. Java can’t do all the cool stuff with sendfile() and mmap() that native code can do (Tomcat 6 can do some of this with JNI calls to APR, but we’re on v5). Those calls basically allow you to give the kernel a file handle and a network port, and say ‘here, put this in that, and tell me when you’re done’ and then forget about it.
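For comparison, the nearest plain-JDK analogue to that hand-off is FileChannel.transferTo, which the JDK documentation says can, on many operating systems, move bytes from the filesystem cache to the target channel without copying them through the application. The sketch below only shows the shape of the call. It assumes you hold the target channel yourself, which is not the situation inside our Tomcat 5 servlets, and the class and method names are invented.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.WritableByteChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class KernelHandOff {
    // Hypothetical helper: asks the channel layer to move the whole file to the target.
    // Assumes a blocking target channel; a non-blocking one could return 0 repeatedly.
    static long send(Path file, WritableByteChannel target) throws IOException {
        try (FileChannel source = FileChannel.open(file, StandardOpenOption.READ)) {
            long size = source.size();
            long position = 0;
            while (position < size) {
                // transferTo may move fewer bytes than requested, so loop until done.
                position += source.transferTo(position, size - position, target);
            }
            return position;
        }
    }
}
```

Whether the copy is really avoided depends on the operating system and on the type of the target channel, so treat this as a sketch of the idea rather than a guarantee.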
Another downside is that it’s fairly hard to deploy a new version of a java webapp without interrupting running downloads. When we redeploy, we take each app server in turn out of the load-balancer pool, and wait for a minute or so for ‘important’ requests to finish. But big file downloads can go on for hours, so after a minute we just kill the remainder. Users will have to restart their downloads, which is a bit sucky (luckily we support range requests, so they can resume from where they left off, but even so…)
So it would be much better if we could let apache serve the content for us. But alas, we need to do all sorts of permissions checking before serving the file, and hit-logging afterwards, so that’s not really practical.
I mulled over a few possible solutions to this involving serving redirects to caches, short-lived AWS-style temporary URLs, and the like, until the ever-alert Nick reminded me of what the right solution should look like. The LiveJournal guys solved this about 100 years ago with perlbal, a load-balancer that supported ‘reproxying’, which then got re-implemented in lighttpd and nginx as the standardised-non-standard ‘X-sendfile’ header.
Basically, instead of letting your app server stream bytes out via apache to the end user, the app server just sets an HTTP header of “X-sendfile:/path/to/file” and the webserver then serves it exactly like a normal file, using as many clever tricks for large-file-serving as it knows how to do.
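To make the division of labour concrete, here is a minimal sketch of the app-server side. It uses a tiny stand-in Response class rather than the servlet API so that it compiles on its own; the class names and the permission predicate are hypothetical, not our real code.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Predicate;

public class XSendfileHandler {
    /** Minimal stand-in for an HTTP response: a status code plus headers, no body. */
    static final class Response {
        final int status;
        final Map<String, String> headers = new LinkedHashMap<>();
        Response(int status) { this.status = status; }
    }

    // The app keeps the permission check; only the byte-shovelling is delegated.
    static Response handle(String user, String path, Predicate<String> mayRead) {
        if (!mayRead.test(user)) {
            return new Response(403); // refused: no X-Sendfile header is set
        }
        Response ok = new Response(200);
        // The front-end web server sees this header and serves the file itself.
        ok.headers.put("X-Sendfile", path);
        return ok;
    }
}
```

The app server's thread is free again as soon as the headers are written, however long the download itself takes.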
“It strikes me” quoth Howes, “that there might be an apache module to do something like this”. And of course, 2 minutes with google turned up mod_xsendfile. Hurrah!
The module built without mess or fuss on our solaris 10/coolstack apache 2.2.6 boxes, so we plugged it in, and Nick wrote a quick extension to our existing file-serving code to let us switch on or off support for X-Sendfile as required. We checked in the code, and immediately saw a massive improvement on the test server – about three times the throughput, with lower CPU utilisation to boot. The only downside is there’s no support yet for the X-Sendfile-ranges header, so we have to carry on doing range requests within the app server.
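The resulting decision can be sketched roughly like this. It is a guess at the shape of the logic, not Nick's actual extension; the enum, the method and the way the switch is passed in are all assumptions.

```java
public class FileServingStrategy {
    enum Mode { X_SENDFILE, STREAM_IN_APP }

    // Hypothetical decision: delegate to the web server only when the feature
    // switch is on and the client is not asking for a byte range.
    static Mode choose(boolean xSendfileEnabled, String rangeHeader) {
        boolean isRangeRequest = rangeHeader != null && !rangeHeader.isEmpty();
        if (xSendfileEnabled && !isRangeRequest) {
            return Mode.X_SENDFILE;
        }
        // Range requests, and everything when the switch is off, stay in the app server.
        return Mode.STREAM_IN_APP;
    }
}
```

Keeping the switch as a plain boolean means the old streaming path is still there to fall back on if the module misbehaves.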
Now, normally this code would have sat on the test server for a week or so before getting pushed out to live, but a bug elsewhere in the app meant that we had to deploy a new version this morning, including the x-sendfile code (though the feature was turned off). After lunch I took advantage of the usual friday-afternoon lull to restart the production apache, loading in mod_xsendfile. So we were all ready to go, but still it didn’t seem like it was worth switching on just yet.
Then all of a sudden at 3:15pm, something rather unexpected happened.

Network load on the main web server went from a usual friday-afternoon 10Mb/s to 50, then 100, then 200, then 250Mb/s, in the course of about 30 minutes. CPU on the server started climbing, and a quick look showed the two java instances working pretty hard, along with the haproxy load-balancer that sits between them and apache. A look in the logs showed dozens and dozens of requests for the same 200MB video file. WTF?
Well, it turns out that Apple rather like our Shakespeare podcasts. And at 8:15 PST they put a link to this particular one (available through iTunesU) onto their homepage, right at the top of the “hot news” section. As you can imagine, being on Apple’s homepage can cause something of a bump in your traffic stats…
Well, no time like the present. We flipped the x-sendfile switch, and straight away the CPU utilisation stopped climbing, whilst the network continued to rocket up, reaching a peak of about 400Mb/s. App server CPU didn’t decrease immediately, as the app servers were still handling a couple of hundred requests from before the switch, but gradually over the next hour (some people take a long time to download 200MB of content!) the CPU dropped away, whilst the network throughput stabilised at about 300Mb/s. As of now, five hours later, we’ve served almost a terabyte of data from this one podcast alone, and we’re still going strong.
I’m fairly sure that the server would have been OK had we not switched it. But this kind of sustained load is way higher than we ever imagined we’d have to support. The fact that we’d implemented the x-sendfile support literally an hour beforehand strikes me as a splendidly fortunate co-incidence. In the words of the poet: God pats me on the head and says ‘good boy’ :-)
Chris May
Tom Abbott
Love it when things come together.
13 Mar 2009, 21:49
Mathew Mannion
From your V40z post…
And I dare say that transferring over 1TB of data in a day didn’t even make it blink.
14 Mar 2009, 12:32
Phil Wilson
This is a good writeup, and has some handy hints for us at Bath.
Is there any reason you’re load balancing your Tomcat instances rather than clustering through the built-in functionality or something like Terracotta? At the moment we serve off of single app servers(!) and so the Terracotta-alike options seem tempting at the mo.
05 Apr 2009, 22:00
Chris May
Hi Phil, thanks for the comment.
As far as I can see it’s not really an either-or decision.
Terracotta and Tomcat clustering are both means of sharing in-memory state (session state, and/or a DB entity cache, typically) between multiple JVMs so that any JVM can handle any incoming request – but you still need something in front of your JVMs to route those requests to one of the JVMs. So you either have load-balancing + a clustering product, or load-balancing + an architecture with no shared state.
We’ve gone for the no-shared-state (outside the database) architecture, for a few reasons:
1: Simpler architecture. One less thing to worry about – request comes in, we fetch all the data it needs from the DB, we execute the request, The End. If we need to scale it, we just point extra JVMs at the same DB. If a node goes down, the other nodes can pick up its workload just fine.
2: Easier to debug – if you’ve got some odd problem that you’re looking into, all you need is the relevant bit of the database, and the request logs, and you can reproduce it locally. No need to try and guess what state the session is in.
3: Fewer long-lived objects in the JVM – if you don’t hold on to references outside of request scope, and the gap between garbage collections is longer than the duration of your requests, then most of your garbage gets collected out of the eden generation, which is orders of magnitude more efficient than allowing it to move into the tenured generations. This is a very big win if your JVMs are busy.
4: More stateless requests. If you have no session state then it encourages developers to design workflows composed of stateless requests, which tend to be easier to integrate with as APIs.
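To illustrate the no-shared-state idea behind the points above, here is a toy sketch: every node answers purely from a shared store, so a round-robin balancer can send any request to any node, and removing a node changes nothing for the client. All names are invented, and the Map merely stands in for the database.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

public class StatelessRouting {
    // A node holds no per-user state: every answer comes from the shared store.
    static Function<String, String> node(Map<String, String> sharedDb) {
        return key -> sharedDb.getOrDefault(key, "404");
    }

    // Round-robin over whichever nodes are currently in the pool.
    static final class Balancer {
        private final List<Function<String, String>> pool;
        private final AtomicInteger next = new AtomicInteger();

        Balancer(List<Function<String, String>> pool) {
            this.pool = pool;
        }

        String route(String key) {
            // floorMod keeps the index valid even if the counter overflows.
            int i = Math.floorMod(next.getAndIncrement(), pool.size());
            return pool.get(i).apply(key);
        }
    }
}
```

With session state held in the JVM, the same balancer would need sticky routing or a clustering product to give consistent answers.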
There will, presumably, come a point where forcing every request to go back to the database will stop scaling. At that point we’d either need a clustered cache in front of the database, or a clustered database. However, we’re a long way off hitting that – even at peak levels of use, at the start of term (serving thousands of requests/min), the database still isn’t more than about 10% busy. So in theory at least, we could increase our hits by an order of magnitude without needing to worry about extra capacity.
If we did reach that point, I’d be more inclined to look at sharding the application, rather than clustering, simply so that I can keep the current architectural simplicity.
06 Apr 2009, 09:13
Phil Wilson
Thanks for that Chris, very useful indeed.
06 Apr 2009, 10:39