All entries for Thursday 17 February 2005
February 17, 2005
We've been experiencing some performance problems recently with BlogBuilder. Occasionally we've been having periods of up to 10 seconds where BlogBuilder has completly stopped responding. It didn't take too long to work out that this was the the garbage collection (GC) in the JVM. Only problem is…how do you fix it?
1) First things first, analyse the problem.
After a bit of faffing around I found that the easiest way to get some information about what is going on with GC is to set the following flag when you start up your application, in my case JBoss.
This will print out some nice stats such as these:
0.000: [Full GC 31662K->1547K(3114944K), 0.1660533 secs]
46.941: [Full GC 302615K->20372K(3114944K), 0.8736640 secs]
107.826: [Full GC 237781K->33703K(3114944K), 0.9595437 secs]
168.812: [Full GC 997600K->68269K(3114944K), 2.0233878 secs]
230.857: [Full GC 1045518K->85888K(3114944K), 1.9429575 secs]
278.238: [GC 1072896K->102427K(3114944K), 0.3734575 secs]
You can just look at these and see how long garbage collection is taking, in my case around 1 or 2 seconds per full GC. The problem comes once the application has been running for a while.
84564.203: [Full GC 1178937K->867583K(3114944K), 11.3700822 secs]
84618.071: [GC 1854591K->897218K(3114944K), 0.5409472 secs]
84635.640: [Full GC 1253982K->904732K(3114944K), 11.9113516 secs]
84697.271: [GC 1891740K->937871K(3114944K), 0.5528494 secs]
84707.583: [Full GC 1125289K->935547K(3114944K), 12.2230228 secs]
84760.543: [GC 1922555K->960498K(3114944K), 0.4644272 secs]
84779.833: [Full GC 1589435K->981758K(3114944K), 14.6843600 secs]
84854.553: [Full GC 1919262K->970964K(3114944K), 13.7111592 secs]
At a glance we can see things are going wrong, but to really get a good overview of what is happening, HPjTune comes to the rescue. This small java app analyses these logs and prints out some nice graphics for you.
As can be seen from this, the GC duration rises steadily and gets very spikey occasionally as well. From other information jTune provides, you can also find out the memory that is getting used up and freed up on each GC.
There are two main possible solutions to sorting out GC problems. These are:
- Use a better GC algorithm
- Reduce the memory used and number of objects created by your application
Using a better GC algorithm.
Tuning Garbage Collection with the 1.4.2 Java[tm] Virtual Machine tells you pretty much everything you need to know. The conclusion is that if we want GC to stop blocking, we need to use a parallel GC algorithm. This will only work on a multi-processor box though (which is ok because blogs does run on one), but it is difficult to test because we don't have a comparable test machine.
Reduce the memory used
The less memory there is used, the less memory the GC algorithm has to work through to clean it up. Simple. We use up a lot of memory in BlogBuilder because we have a very large second level Hibernate cache (using EHCache). I'm beginning to think it is too big and too long lived. Getting to the root cause of why we need such a big cache lead me to look a bit closer at some logs. It turns out BlogBuilder is pumping out a lot more SQL than it needs.
BlogBuilder is generating SQL inefficiently -> We need big caches so we don't re-run the SQL -> We get long GC periods -> We get bad performance.
It turns out you just can't get around not fixing the initial performance bottle necks, the caches will save you to a certain degree, but once you get a lot of data (and with 3000 blogs, 20000 entries and 40000 comments, we have a lot of data), you need to go back to basics and get the individual pages running efficiently.
3) Improve Hibernate SQL efficiency
BlogBuilder already has all collection relationships set as lazy so that data is only ever loaded when it is needed (this is a huge performance gain), but even so, when the data has to be loaded, there is no getting around the fact that some SQL is going to be issued.
The one optimisation I'd glaringly missed when first developing BlogBuilder was the batch-size settings for collections. If we have a permissions object which has two groups on it, staff and students, then Hibernate might issue three sql queries, one for the permission and one for each of the two groups. In practice, Hibernate is quite clever and will generally just issue two sql queries, one for the permission and then just a single query to get both groups. Good.
However, what setting batch-size does for you is instructs Hibernate to be even more aggressive in optimising the sql.
Now, say you have three permissions objects and each has two groups, you might normally have to issue either nine or six sql statements. If you set batch-size="10" on the groups relationship, Hibernate will try and load all groups across all three permissions in one go, meaning just four queries. Brilliant!
Using this and a couple of other methods, I reduced the number of queries for a single un-cached page down to one third of the original number.
At the end of the day, you must track down performance problems to their root cause, don't just work around them with a higher level fix. Hopefully this will mean that I can be less reliant on the caching for good performance, so I can reduce the cache size and timeouts, leading to better GC performance.