Secret Plans and Clever Tricks

January 15, 2009

NIS on app servers is EVIL

OK, so I’m finally (I think) getting to the bottom of our longest-running performance issue.

We have an apache server which occasionally seems to be unable to handle requests. To begin with, the symptoms were something like this: at certain times of day, the number of apache worker processes (we’re using the prefork MPM) would go through the roof, but no requests ever completed. Restarting the server seemed to help sometimes; other times we’d restart and the server would just sit there, not spawning any httpds at all. It was all a bit of a mystery.

The times at which this happened seemed to coincide with when our backups were running, so my first thought was file-locking – perhaps the backups were preventing apache from getting a lock on a mutex file, or something like that. But disabling the backups didn’t have any effect. Then I wondered if it might be a memory shortage (since we’d had similar problems on another server recently, caused by it running out of swap space due to a leaky httpd). Again, investigations didn’t turn anything up.

Then, I looked in the conf file, and found a couple of proxying redirects, like this:

RewriteRule (.*) http://foo.bar.ac.uk/buz$1 [P] 

Alarm bells went off immediately; this is going to require a host name lookup on every request. Now, that ought not to matter, since (on Solaris) nscd should be caching those lookups – but nscd is suspected to have ‘issues’, particularly under heavy concurrent load.

So, step 1: replace host names with IP addresses. Sure, we might one day need to update them if we ever change DNS, but that’s not something that happens often.

This certainly helped matters, but didn’t exactly fix them. We got fewer, shorter-lived slowdowns, but they were still there. However, something had changed. Whereas before we were getting loads of httpd processes, now we’d have barely any, until suddenly we’d get 200 being spawned at once (shortly followed by the problem going away).

Running pstack on the top-level apache whilst it was stuck like this was revealing:

 feb65bfa door     (4, 8047658, 0, 0, 0, 3)
 feaff286 _nsc_try1door (feb8f428, 8047788, 804778c, 8047790, 8047758) + 6c
 feaff4f0 _nsc_trydoorcall_ext (8047788, 804778c, 8047790) + 178
 feb0c247 _nsc_search (feb8f228, feaf767c, 6, 80477f4) + b5
 feb0af3f nss_search (feb8f228, feaf767c, 6, 80477f4) + 27
 feaf7c0f _getgroupsbymember (81bd1c0, 867b220, 10, 1) + dc
 feb00c5b initgroups (81bd1c0, ea61, 8047c88, 808586e) + 5b
 080858a5 unixd_setup_child (0, 0, 0, 0, 0, 867b4b0) + 41
 0806d0a3 child_main (10, 1, 1, 0) + e7
 0806d52b make_child (0, f, 7, e, 4, 0) + d7
 0806df01 ap_mpm_run (80b3d70, 80dfe20, 80b5b50, 80b5b50) + 93d
 08072f67 main     (4, 8047e40, 8047e54) + 5cb
 08067cb4 _start   (4, 8047ed8, 8047ef5, 8047ef8, 8047efe, 0) + 80

The top-level apache is trying to fork a new worker. But in order to do that, it needs to set the user and group privileges on the new process, and in order to do that, it needs to find the groups that the user belongs to. Since this server uses NIS to specify groups, apache has to make a call to NIS (via nscd), to list all the groups (despite the fact that the web server user isn’t actually a member of any NIS groups – it has to make the call anyway, to verify that this is the case).

So, for some reason, NIS is being slow. Maybe as a result of the high traffic levels that the backups are pushing around, the NIS requests are taking a very long time to process, and that’s preventing apache from forking new workers. When NIS finally comes back, apache has loads of requests stacked up in the listen backlog, so it spawns as many workers as it can to process them – hence the sudden jump just before everything starts working again.

To test this theory out, I wrote a teeny script that just did

time groups webservd

every 30 seconds, and recorded the result. To my dismay, lookups could take anything from 1 second to 5 minutes. Clearly, something’s wrong. Unsurprisingly, the slow lookups coincided with the times that apache was slow. Running the same check on the NIS server itself revealed no such slowness; lookups were consistently returning in <1 second.

So, a fairly simple solution: make the web server a NIS slave. This appears to have solved the problem (though it’s only had a day or so of testing so far). Why a busy network should cause NIS lookups to be slow on this particular server (other servers in the same subnet were unaffected) I have no idea. It’s not an especially great solution though, particularly if I have to apply it to lots of other servers (NIS replication times scale with the number of slaves, unless we set up slaves-of-slaves).

A nicer long-term solution would be to disable NIS groups entirely. On an app/web server there’s no great benefit to having non-local groups; it’s not as if we’ve got a large number of local users to manage. Alternatively, using a threaded worker model would sidestep the problem by never needing to do NIS lookups except at startup.


May 19, 2008

Spring extensible XML and web controllers

So, having got spring XML config extensions working quite nicely, I thought I’d have a go at another area of our codebase that’s been bugging me.

We have a few areas where we re-use the same MVC controller many times, with a different command and a different view. So the config looks something like this:

<bean name="/view/userbreakdown.htm" id="userBreakdownController" class="uk.ac.warwick.sbr.web.hitlog.HitLogStatsController">
        <property name="commandClass" value="uk.ac.warwick.sbr.hitlog.HitLogStatsUserInfoBreakdownCommand"/>
        <property name="statsView" value="hitLogUserBreakdownView"/>
</bean>

<bean name="/view/ipbreakdown.htm" id="ipBreakdownController" class="uk.ac.warwick.sbr.web.hitlog.HitLogStatsController">
        <property name="commandClass" value="uk.ac.warwick.sbr.hitlog.HitlogStatsIPInfoBreakdownCommand"/>
        <property name="statsView" value="hitLogIPBreakdownView"/>
</bean>

... continue for many more HitLogStatsControllers...

Hmm, I thought, wouldn’t this be nicer if we could just say

   <stats:controller name="/view/all.htm" id="summaryController" command="HitLogStatsSummaryCommand" view="hitLogAllView"/>

?

Well, it turns out to be harder than you might think. As you can see from the bean definitions, we’re using the BeanNameUrlHandlerMapping to let SMVC map requests onto controllers. This relies on setting the name attribute of a bean to the URL you want (you can’t use the ID, because slashes are illegal in ID attributes). N.B. this is an attribute of the bean definition, not a property of the bean.

So, we need to set the bean name. But, this doesn’t appear possible using
NamespaceHandlerSupport. The name isn’t actually an attribute of the BeanDefinition itself, rather it’s part of the BeanDefinitionHolder class. You set it using the 3-arg constructor of BeanDefinitionHolder. Alas, all beans defined by non-default XML have their BeanDefinitionHolders created for them in AbstractBeanDefinitionParser.parse, which calls the 2-arg version of the constructor (which doesn’t set a beanName). Default XML elements, by comparison, are created in BeanDefinitionParserDelegate, which uses the 3-arg version.

So, can we fix it? Making the custom parsing code call the 3-arg constructor would involve ripping a great deal of the guts of the XML parsing code out; not something I’d be too keen on. Maybe I should raise a JIRA with the Spring MVC team.

An easier solution might be to write a different HandlerMapping, that used a bean property (“path”, say) rather than bean names/aliases to store the URL path in. This strikes me as a nicer solution (not least because it doesn’t overload the bean name with behaviour that’s nothing to do with names), though I don’t know whether it would perform as well (lookups presumably get cached, though, so it would be a one-off cost).

Alternatively, I could convert the one-controller-several-commands model into several ThrowawayControllers (all inheriting a common base), and then use the annotation-based config to set them all up. This seems like it might be a neater long-term solution, so long as there’s nothing that’s too expensive to set up in the controllers (which can’t be pushed out into an injected service).


May 13, 2008

Making Spring XML config better

One of my pet peeves with Spring is the way that, left unchecked, it can grow yards and yards of inscrutable XML. Over the last few days, amongst other things, I’ve been looking at whether we can improve things.

Here’s an example to get started with. We use classes called ModelAccessors to hold references to data that our Spring WebFlow processes need. A typical flow might have half-a-dozen ModelAccessors, for all of its various bits of state. They look like this in the application context:

  <bean id="emptyFilesAccessor" class="uk.ac.warwick.sbr.webflow.FlowScopeModelAccessor">
    <constructor-arg index="0" value="duplicateFiles"/>
    <constructor-arg index="1" value="java.util.List"/>
  </bean>
  <bean id="invalidFileNamesAccessor" class="uk.ac.warwick.sbr.webflow.FlowScopeModelAccessor">
    <constructor-arg index="0" value="createdFiles"/>
    <constructor-arg index="1" value="java.util.List"/>    
  </bean>

  <bean id="uploadZipFileFormAccessor" class="uk.ac.warwick.sbr.webflow.FlowScopeModelAccessor">
    <constructor-arg index="0" value="uploadZipFileForm"/>
    <constructor-arg index="1" value="uk.ac.warwick.sbr.webflow.action.upload.UploadZipFileForm"/>
  </bean>
  ... continues for many more...

Now, there’s a couple of problems with this:
1) There are 4 lines of XML for every accessor, and only 2 things ever change: the ID and the second constructor argument (the first arg is derivable from the ID). Even the second argument is usually just ‘java.util.List’.

2) They’re not very communicative. The most important thing about this element (that it’s a ModelAccessor) is an attribute. There’s very little information about what those constructor-arguments actually mean. Surely we could have something a bit more expressive?

And luckily, there’s a simple fix for both of these problems. Spring has support for extending the XML context syntax by adding in your own custom namespaces. Define your extension in an XSD schema, write a parser plug-in, and away you go:

<sbr:model-accessor id="uploadZipFileFormAccessor" modelclass="uk.ac.warwick.sbr.webflow.action.upload.UploadZipFileForm"/>
  <sbr:list-model-accessor id="invalidFileNamesAccessor"/>
  <sbr:list-model-accessor id="emptyFilesAccessor" />

- much nicer.

So, how much effort is this to implement? Not that much, as it turns out. The instructions are a pretty good start. The only annoyance you’re likely to face is cryptic SAX parsing errors like this:

 org.springframework.beans.factory.xml.XmlBeanDefinitionStoreException: Line 8 in XML document from class path resource [uk/ac/warwick/sbr/spring/sbr-modelaccessor.xml] is invalid; nested exception is org.xml.sax.SAXParseException: cvc-complex-type.2.4.c: The matching wildcard is strict, but no declaration can be found for element 'sbr:model-accessor'.

What this means is that somewhere in either the header of your context XML file, or in your META-INF/spring.handlers file, there’s a typo, so the parser can’t follow the chain from the xmlns: declaration to the xsi:schemaLocation entry to the spring.handlers mapping.
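For reference, the two mapping files that chain runs through look roughly like this (the namespace URI and class names below are illustrative guesses, not the project’s real ones):

```properties
# META-INF/spring.handlers – maps the namespace URI to a NamespaceHandler class
http\://www.example.org/schema/sbr=uk.ac.warwick.sbr.spring.SbrNamespaceHandler

# META-INF/spring.schemas – maps the xsi:schemaLocation URL to an XSD on the classpath
http\://www.example.org/schema/sbr/sbr.xsd=uk/ac/warwick/sbr/spring/sbr.xsd
```

A typo in the xmlns: declaration, in either of these files, or in the XSD’s targetNamespace will produce exactly the “matching wildcard is strict” error above.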

Spring provides a base class for the NamespaceHandler and BeanDefinitionParser, so there’s really not much work to do in implementing them:

    public void init() {
        registerBeanDefinitionParser("model-accessor", new ModelAccessorBeanDefinitionParser());
    }

    final class ModelAccessorBeanDefinitionParser extends AbstractSingleBeanDefinitionParser {
        protected Class getBeanClass(Element element) {
            return FlowScopeModelAccessor.class;
        }
        protected void doParse(Element element, BeanDefinitionBuilder bean) {
            // first constructor arg (the model name) is derived from the bean id
            bean.addConstructorArgValue(element.getAttribute("id").replaceAll("Accessor$", ""));
            bean.addConstructorArgValue(element.getAttribute("modelclass"));
        }
    }

Easy!


May 01, 2008

The SpringSource Application Platform: why would I want it?

In my experience, software development projects can be divided into two worlds: those that can be done with a small team, and those that need a big team.

By “small team”, I mean about 4 people, and certainly no more than 6. Basically, if you need more than 2 pizzas to feed the team, it’s not small. Big teams, on the other hand, are typically more than 10. That’s because there’s an interesting phenomenon which happens on teams between about 6-10 people, which is that suddenly the effort to communicate between everyone in the team becomes much larger, so much so that adding more people actually makes things worse, until you get above about 10. Then you can start to form a normal hierarchy (2 sub-teams and a co-ordinating team) and get going again.*

Small teams, I have found, are much more productive per person. Before I came to Warwick I was involved in a 100-person development team (40 java coders, 20 business analysts/testers, 20 assorted managers and 20 admins/testers/other hangers-on) which eventually collapsed under its own inefficiency after 18 months. The best 6 people were pulled from the wreckage, and re-implemented the entire project in 6 months flat. This is not unusual. Sure, there are lots of projects that are simply too big for a 6-person team to pull off, but every year the range of stuff that can be accomplished by a small, focussed team gets larger and larger.

There’s a marked difference in the kinds of tools and frameworks that suit ‘small-team’ development and ‘big-team’ development. Small-team frameworks focus on enabling you to do as much as possible, as quickly as possible. The poster-child for a small-team framework is surely Rails, but of course there are many others.
Big-team software is focussed on preventing other people from screwing up your stuff. In a big-team project, individual productivity isn’t that important compared to effective modularisation and decoupling, because you can always just add more programmers to go faster. J2EE is (or at least, was) the ne plus ultra of big-team frameworks, although with JEE 5 (and even more so with 6) it’s making its way back towards the little guys.

It’s a bit like the difference between vertical-scaling software (buy the fastest single server you can) and horizontal-scaling software (make it possible to run your software on lots of servers).

Now, I have a personal preference for small-team development. Slightly unusually, I also have a preference for java – small teams frequently prefer dynamically typed languages, as they fit closer with the ‘sacrifice safety for speed’ philosophy. The single biggest factor that’s enabled me to square this circle has been the Spring Framework – a set of libraries that give me the ability to get a simple web-based project up and running in next to no time, but knowing that whatever I’m going to need in the future, be it asynchronous message processing, WS-* remoting, distributed caching, flexible declarative security, or whatever, it will be available, and it will fit in with everything else.

So, I have to say that I was a teeny bit disappointed when I read about the new SpringSource Application Platform: an application server which is based on OSGi rather than EJB for its modularisation.
Now, I don’t doubt for a moment that OSGi is a much better technology than EJB for modularisation. Lots of folk are using it, for humongously complex projects like Eclipse, and it works really well.

What irks me, though, is this: why would I want OSGi modules? I’m quite happy with 1 great big WAR file, thanks. Neither I, nor my happy few developers, need the ability to break our app up into little bits, version them and then dynamically lazy-load them. In fact, I think lazy-loading a web app is a terrible idea, and I can do all the modularisation I need to with ivy at build time.

Of course, this is just sour grapes. Big-team developers do want this, and SpringSource have every right to give it to them. It’s just that I’d got kinda used to Spring spending time and effort on what I want, not what those enterprisey guys in suits were after.

Spring started out its life as a reaction against the excess baggage that J2EE development entailed. By their own admission, SpringSource have put a lot of time and effort into this product, and doubtless they will need to keep on doing so – and AFAICS that’s time and effort that’s not being spent on making ‘lightweight java’ easier. SpringSource, you have become what you beheld; are you content that you have done right?

* Astute readers will point out that the ‘web development’ team that I run has 12 people in it; but in terms of product development it really runs as 3 or 4 2-3 person dev teams, plus a 3-4 person ops/support team (with some overlap, maths fans). We only function as a team of 12 when we need to take over a corner of the pub.


October 31, 2007

Netbeans surprises me

Follow-up to Netbeans 5: Still not switching from Secret Plans and Clever Tricks

I’ve never been able to get on with Netbeans as a java IDE. Somehow, if you’re used to Eclipse it’s just too weird and alien, and things that ought to be simple seem hard. I’m sure that if you’re used to it, it’s very lovely, but I just can’t get started with it.

However, one thing Eclipse is not very good at, IME, is Ruby development. There are plugins, but I’ve never had much success with them; debugging support is patchy-going-on-broken, syntax highlighting / completion is super-basic, and it’s generally only one (small) step up from Emacs with ruby-mode and pabbrev.

(Note that I’m not talking about Rails development here, I’m talking about using Ruby to write stuff that would previously have been done in perl – sysadmin scripts, monitors, little baby apps and so on. Things of a couple of hundred lines or so – nothing very big, but enough that an unadorned text editor is a bit of a struggle.)

There are other Ruby IDEs of course, but they’re almost all (a) OS X-specific, (b) Windows-specific, (c) proprietary, or (d) crap. I’d like something free, that runs on linux, but doesn’t suck, please.

Now, Sun have been making a big noise about their Ruby support generally for about the last 12 months or so, so I thought I’d grab a copy of the Ruby-specific Netbeans 6 bundle and try it out.

And, surprise surprise, it’s really good. Out of the box it almost just works – the only minor hackery I had to do was a manual install of the fastdebug gem, but the error message linked me to a web page explaining what I had to do and why. Debugging works, you can do simple refactorings, syntax highlighting and code completion are reasonably sophisticated. And it looks nice, performs well, and is all fairly intuitive to use, even for a dyed-in-the-wool eclipse-er like me.

So, three cheers for the Netbeans team, for filling the gaping void in the Ruby IDE space. Development still seems to be pretty active, so hopefully we can expect even more goodness in the months to come.


October 30, 2007

Spring 2.5 web mvc gripe

Spring 2.5 is almost upon us, so I thought I’d grab the RC and have a look at what’s new.

My eye was drawn immediately to the enhancements to the MVC layer; specifically, support for convention-over-configuration and annotation-based configuration. Both of these techniques should help to reduce the yards of XML needed to configure spring web applications (although to be fair, things have been getting better since 2.0 re-worked the config file formats).

Anyway, I started building a little demo, using the sample apps as a template. And came up against an interesting problem almost immediately. Here’s an excerpt from the docs, showing a MultiActionController using annotated configuration:

@Controller
public class ClinicController {

    private Clinic clinic;

    @RequestMapping("/vets.do")
    public ModelMap vetsHandler() {
        return new ModelMap(this.clinic.getVets());
    }

    @RequestMapping("/owner.do")
    public ModelMap ownerHandler(@RequestParam("ownerId") int ownerId) {
        return new ModelMap(this.clinic.loadOwner(ownerId));
    }
}

Spot the obvious mistake? There is no need for the @RequestMapping annotation to repeat the name of the mapping if every method follows the same convention. Just take ‘Handler’ off the end of the method name and use that for the URL. Don’t make me type it twice!
What’s more annoying is that this works in the old-style MultiActionController – but if you go down that route your controller methods have to take HttpServletRequest/Response parameters and you can’t use the lovely new @RequestParam binding annotations. Gah!
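For what it’s worth, the convention I’m asking for amounts to something like this hypothetical helper (the naming rule and the “.do” suffix are my assumptions from the sample above – nothing like this exists in Spring):

```java
// Hypothetical: derive the request URL from the handler method name,
// instead of repeating it in the @RequestMapping annotation.
class HandlerUrlConvention {
    static String urlFor(String methodName) {
        // "vetsHandler" -> "/vets.do", "ownerHandler" -> "/owner.do"
        return "/" + methodName.replaceAll("Handler$", "") + ".do";
    }
}
```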

If you’re content to have a separate controller per URL, with separate functions for GET/POST (SimpleFormController style), then the convention + annotation based approach works pretty well – so it’s a shame that they couldn’t finish the job and sort out MultiActionController as well; then we could have rails-style create/read/update/delete controllers. Oh well…


October 16, 2007

Writing functional Java with Google Collections

I’ve been experimenting with Google’s new Collections API, which is a kind of type-safe, stripped-down version of Jakarta Commons Collections, providing you with some (though not all) of the list-processing features that are commonplace in more functional languages – like each, collect, detect and inject in Ruby, for instance.

In theory, this should give a big win in terms of reducing code complexity and making for a better-decoupled and more testable design.

In practice, this turns out to be true, but with a somewhat unpleasant side effect. Java’s type-checking and lack of support for closures or code blocks means that when you switch to this style of coding, you end up introducing a lot of new little classes, typically anonymous inner classes for things like predicates and functions, which get used once and then thrown away.

For instance: this code has a cyclomatic complexity of 9, which is just barely acceptable. But it’s fairly readable, if you know what the object model looks like.

 for (Content content: page.getContents().values()) {
            for (ContentFetcher cf: content.getContentFetchers()) {
                if (cf instanceof AbstractFileBackedContentFetcher) {
                    String cfFile = ((AbstractFileBackedContentFetcher) cf).getFileName();
                    File directory = new File(rootDir, cfFile).getParentFile();
                    if (directory != null && directory.exists()) {
                        for (File subfile: directory.listFiles()) {
                            if (subfile.isFile() && !filenames.contains(subfile.getAbsolutePath())) {
                                files.add(subfile);
                                filenames.add(subfile.getAbsolutePath());
                            }
                        }
                    }
                }
            }
        }

Listifying it, we get something like this. Here the cyclomatic complexity is about 5 – much better – but we’ve had to introduce two new anonymous inner classes, and there are a lot of awfully long lines of code.

        Predicate<AbstractFileBackedContentFetcher> cfDirectoryExists = new Predicate<AbstractFileBackedContentFetcher>() {
            public boolean apply(AbstractFileBackedContentFetcher cf) {
                File dir = new File(rootDir, cf.getFileName()).getParentFile();
                return dir != null && dir.exists();
            }
        };
        FileFilter okToAddFile = new FileFilter() {
            public boolean accept(File pathname) {
                return pathname.isFile() && !filenames.contains(pathname.getAbsolutePath());
            }
        };

        for (Content content: page.getContents().values()) {

            Iterable<AbstractFileBackedContentFetcher> contentFetchersWithExistingDirs = filter(filter(
                    content.getContentFetchers(), AbstractFileBackedContentFetcher.class), cfDirectoryExists);

            for (AbstractFileBackedContentFetcher cf: contentFetchersWithExistingDirs) {
                File[] subfiles = new File(rootDir, cf.getFileName()).getParentFile().listFiles(okToAddFile);
                for (File subfile: subfiles) {
                    files.add(subfile);
                    filenames.add(subfile.getAbsolutePath());
                }
            }
        }

Is that better? I’m not convinced either way. The second block of code is structurally simpler, but it’s about 50% more code. As a general rule, less code is better code...

I think that if you have frequently-used predicates and transformation functions (the instanceOf predicate which GC uses in Iterables.filter(Iterable,class) for instance), then it’s well worth the effort. But if you’re defining a predicate just to avoid having an if inside a foreach, then it’s less clear. Maybe when java 7 comes along and gives us closures (with some syntactic sugar to keep them concise) things will be better.

A related ugliness here (at least, if you’re used to the Ruby way of doing this) is google’s choice not to extend java.lang.Iterable, but rather to have the collections methods defined static on the Iterables class. So instead of writing something like

GoogleIterable pages = new GoogleIterable(getPagesList())
files = pages.filter(WebPage.class).filter(undeletedPages).transform(new PagesToFilesTransformation())

we have to do

files = Iterables.transform(Iterables.filter(Iterables.filter(getPagesList(), WebPage.class), undeletedPages), new PagesToFilesTransformation())

which to my eye is a lot more confusing – it’s harder to see which parameters match up with which methods.

Update I’ve now written my own InternalIterable class, which wraps an Iterable and provides filter(), transform(), inject(), sort() and find() methods that just delegate to google-collections (except for inject() which is all mine :-) ). It’s not very long or complex, and it’s already tidied up some previously ugly code rather nicely.
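For illustration, here’s a minimal sketch of the wrapper idea (all names here are invented, and the filter/transform bodies are written out longhand – the real class just delegates to google-collections):

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Pre-closures Java: one-method interfaces stand in for code blocks.
interface Pred<T> { boolean apply(T input); }
interface Fn<F, T> { T apply(F input); }

// Wraps an Iterable so that filter/transform chain left-to-right,
// Ruby-style, instead of nesting inside-out.
class FluentIterable<T> implements Iterable<T> {
    private final Iterable<T> delegate;

    FluentIterable(Iterable<T> delegate) { this.delegate = delegate; }

    public Iterator<T> iterator() { return delegate.iterator(); }

    FluentIterable<T> filter(Pred<? super T> predicate) {
        List<T> out = new ArrayList<T>();
        for (T t : delegate) {
            if (predicate.apply(t)) { out.add(t); }
        }
        return new FluentIterable<T>(out);
    }

    <R> FluentIterable<R> transform(Fn<? super T, R> function) {
        List<R> out = new ArrayList<R>();
        for (T t : delegate) { out.add(function.apply(t)); }
        return new FluentIterable<R>(out);
    }

    List<T> toList() {
        List<T> out = new ArrayList<T>();
        for (T t : delegate) { out.add(t); }
        return out;
    }
}
```

With a wrapper like this, chains read left-to-right, so it’s much easier to see which parameters match up with which methods.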


September 24, 2007

Solaris NFS performance weirdness

Spanky new Sun X4600 box. Solaris 10u4. Multipathed e1000g GB interfaces. NFS-mounted volume, totally default.

$ nfsstat -m /package/orabackup
/package/orabackup from nike:/vol/orabackup/dionysus-sbr
 Flags:         vers=4,proto=tcp,sec=sys,hard,intr,link,symlink,acl,rsize=32768,wsize=32768,retrans=5,timeo=600
 Attr cache:    acregmin=3,acregmax=60,acdirmin=30,acdirmax=60

$ /opt/filebench/bin/filebench
filebench> load webserver
filebench> run 60
IO Summary:      255700 ops 4238.6 ops/s, (1366/138 r/w)  23.1mb/s,   4894us cpu/op,  66.0ms latency

mutter, grumble,... remount the NFS vol with -v3

$ nfsstat -m /package/orabackup
/package/orabackup from nike:/vol/orabackup/dionysus-sbr
 Flags:         vers=3,proto=tcp,sec=sys,hard,intr,link,symlink,acl,rsize=32768,wsize=32768,retrans=5,timeo=600
 Attr cache:    acregmin=3,acregmax=60,acdirmin=30,acdirmax=60

$ /opt/filebench/bin/filebench
filebench> load webserver
filebench> run 60

IO Summary:      4397877 ops 72839.3 ops/s, (23495/2351 r/w) 396.4mb/s,    221us cpu/op,   3.1ms latency

What the … ? The default configuration for an NFSv4 mount on this box appears to be 20 times slower than the equivalent NFSv3 mount. How can this be right? Either there’s something very weird going on with our network topology, or there’s something badly broken about the way the mount is configured. Either way, it’s beyond me to work out what it is. NFSv3 ain’t broken (well, not very), so unless Sun support can offer some illumination we’ll be sticking with that.


August 29, 2007

wishlist #2: Sun iLOM RConsole over ssh

Another thing that would make my life easier; Sun have a fantastic java GUI console redirection tool that you can use with their iLOM to get the system graphical console output. But it works by communicating over a variety of ports that our firewall doesn’t allow, so I can’t use it from home.

Surely there should be an option somewhere to tunnel it all over ssh instead? There must be an easier solution than VNC-ing (over ssh) onto my desktop at work and running the rconsole from there?

(Yes, I could ssh straight in to the ilom and get the text-only console. But it seems a shame to let the lovely graphical version go to waste…)


Wishlist: thread-safe hibernate sessions

Hibernate is lovely, but I wish that it would provide me a thread-safe Session implementation. We’ve got a couple of servers which are more or less as fast as you can buy, in terms of single-threaded performance, and a few points in our app which could be faster. Now, these servers have 8 cores each, and typically only have one or two threads in the run queue at any given time; i.e. 6 of the cores are basically idle. So it would make a lot of sense if we could take a request, split it into 4 components, and distribute those components over 4 separate threads. If we could do that without too much overhead, we could actually switch from our relatively power-greedy opteron chips, to something a bit more frugal like a Sun T1, which would be nice.

However, our apps are processing hibernate persistent objects, which means that they need a session to operate within, which means they are bound to a single thread.

We could create 4 threads and give each one a new session, but that means that each thread will need to re-query for its data, since the hibernate L1 cache is bound to the session. So that won’t work. We could deploy a level 2 cache, but then we have to somehow manage to invalidate all the data that a particular request has loaded without affecting other, concurrent requests (we have no shared cache between requests, so that (a) we don’t need much heap, and (b) we can load-balance between multiple machines and JVMs without needing to share the cache).

So we’re stuck in an awkward position. We can’t multithread the app any further, because all the performance we’d gain by spreading processing over more cores, we’d lose by having to re-read the same data over and over again from the DB.

What I really want, I guess, is a Level 1.5 cache – something that’s bound to the operation rather than the session, and that can be shared between multiple threads which are co-operating to do all of the processing that’s needed. Alas, it seems such a thing doesn’t exist.
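To make the wish concrete, here’s a hypothetical sketch of an operation-scoped cache – created per request, handed to each of the co-operating threads, and thrown away at the end. Every name here is invented; nothing like this ships with Hibernate:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Loads an entity on a cache miss (e.g. a DB read).
interface Loader<V> { V load(); }

// "Level 1.5" cache: shared by the threads of one operation, unlike the
// per-thread session (L1) cache; discarded when the operation completes,
// unlike a process-wide L2 cache, so invalidation is trivial.
class OperationScopedCache<K, V> {
    private final ConcurrentMap<K, V> loaded = new ConcurrentHashMap<K, V>();

    V get(K id, Loader<V> loader) {
        V cached = loaded.get(id);
        if (cached != null) { return cached; }
        V fresh = loader.load();              // hit the DB roughly once per id
        V raced = loaded.putIfAbsent(id, fresh);
        return raced != null ? raced : fresh; // first writer wins on a race
    }
}
```

The hard part this sketch glosses over is exactly the one Hibernate doesn’t solve: the cached objects themselves would need to be safe to touch from several threads at once.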

Update It would seem I’m not alone. “Uncle” Bob Martin blogs about exactly the same problem, albeit at a slightly higher level of abstraction.


May 22, 2007

Three cheers for the Fair Share Scheduler

Writing about web page http://www.sun.com/software/solaris/utilization.jsp

The more I use my Solaris Zones boxes, the more (mostly) I like them. Yes, there are some niggles about how you cope with container failures, how you clone zones between boxes, the odd un-killable process, and so on, but for the most part, they just do exactly what you’d expect them to, all the time.

Take the FSS for instance. This little widget takes care of allocating CPU between your zones. A big problem in server consolidation, at least in web-land, is the “spiky” nature of CPU usage; Web-apps tend to be relatively low consumers of CPU most of the time, but occasionally will want a much larger amount.
If you’re consolidating apps, you don’t want one busy app to steal CPU off all the others, but if all the others are idle, then you might as well let the busy app take whatever it needs.

The FSS solves this problem elegantly. Each zone is allocated “shares”, representing the minimum proportion of CPU that it’s entitled to when it needs it. So if I have 3 zones, and give them each 1 share, then if every zone is working flat out, they’ll each get 1/3 of the CPU time. But if one zone goes idle, the other two will get 50% each. If only one zone is busy, it’ll get 100%. Better still, if one zone has 100% of the CPU and another zone becomes busy, the first is reined in instantly to give the other the CPU it’s entitled to.
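The allocation rule is simple enough to sketch as arithmetic (a toy model of my own, ignoring scheduling granularity and CPU caps):

```java
// Fair Share Scheduler arithmetic, as described above: a busy zone's CPU
// fraction is its shares divided by the total shares of all busy zones.
class FairShare {
    static double cpuFraction(int zone, int[] shares, boolean[] busy) {
        if (!busy[zone]) { return 0.0; } // idle zones consume nothing
        int busyTotal = 0;
        for (int i = 0; i < shares.length; i++) {
            if (busy[i]) { busyTotal += shares[i]; }
        }
        return (double) shares[zone] / busyTotal;
    }
}
```

Three zones with 1 share each: all busy gives each 1/3; one idle leaves the other two 1/2 each; a lone busy zone gets the whole machine.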

And does it work in real life? Oh yes…here’s one of our apps getting a bit overloaded. You can see the box load go up to 20 (on an 8-way box; in this case it was about 99% busy for 15-20 minutes), and the zone that’s causing all the trouble gets pretty unresponsive. But the other zone doesn’t even register the extra CPU load. Awesome.

(graphs: maia cpu load; maia forums response time; maia rtv response time)


March 19, 2007

Twitter: Not just yet

Follow-up to Twitter from Secret Plans and Clever Tricks

So, I’ve tried twitter for a few days, and my view is: it’s not ready yet.

I love the idea of twitter, which works in 2 ways:
  • short blog entries – If you’re only blogging 140 characters at a time, you can do so much more informally. If blog entries are journals, twitters are more like the scribbles in the margins. Whether or not you get any benefit from the ‘social’ side of things, these little notes have some value, and I really enjoy creating them.
  • the social thing. Well, I’ve never really had a chance to test this out, because none of my friends are on twitter.

Now, of course I could solve this problem by telling some of my friends about this great new service, but the problem is that it’s not a great service. I couldn’t recommend to a friend that they start using it, because the performance is awful. The load time for the ‘post a new message’ page, from my uber-fat-pipe connection over JANET, is somewhere between 10 seconds and a minute. Early in the morning (when the US is asleep) you can sometimes get it to load in less than that, but in the evenings, it’s often even worse.

Look: right now (8pm GMT)

cusaab:~$ time curl http://twitter.com/chrismay > /dev/null
[...]
real    1m12.486s

1’12”. Not great. A one-off?

cusaab:~$ time curl http://twitter.com/chrismay > /dev/null
[...]
real    0m53.920s

sigh…
If I recommended this to anyone I knew, I’d doubtless get some puzzled emails back again asking whether maybe I’d misquoted the URL? The IM interface has been up and down, mostly down, over the last few days, and I haven’t felt like committing my money to the SMS interface, given the shoddiness of everything else.

So, I’ll be keeping an eye on twitter, but until they manage to sort out their scaling issues, I can’t see me updating it very often. Which is a shame, because I’d really like to be able to get more out of it. But for now I think I’ll go back to keeping my scraps in google notebook. Oh well…


February 20, 2007

Spring and the golden XML hammer

Writing about web page http://www.theserverside.com/tt/articles/article.tss?l=SpringLoadedObserverPattern

This article describes, as best practice, one of the things that I’m really coming to dislike about the Spring Framework – the tendency to use XML for object construction for no better reason than ‘because I can’.

Now, I love Spring; it’s revolutionised the way I, and many others, write code, and for the better. But it does have a tendency to produce reams of XML. As a data format, I think XML is OK. It’s precise, and the tooling is good, though it’s a good deal more verbose than something like JSON or YAML, which, IMO, have 80% of the functionality with 20% of the overhead.

For aspects of an application which are genuinely configuration, such as the mapping of URLs to controllers, or configuration of persistence contexts, XML is better than code; no doubt about it. For the construction of object graphs, XML is sometimes better than code. But this example is just pushing it too far. It describes setting up an observer/observable pair, using the side-effects of spring’s MethodInvokingFactoryBean to call the addListener() method, rather than doing it in code.

Now, this is just clunky. Instead of one line of code that says

townCrier.addListener(townResident);

we have this

<bean id="registerTownResident1" 
class="org.springframework.beans.factory.config.MethodInvokingFactoryBean">
    <property name="targetObject"><ref local="townCrier"/></property>
    <property name="targetMethod"><value>addListener</value></property>
    <property name="arguments">
    <list>
      <ref bean="townResident1"/>
    </list>
    </property>
</bean>

Ten lines of XML. No static type-checking (I hope you’ve got a bunch of tests that verify your contexts…). The addListener invocation, the thing we’re trying to achieve here, is kind of buried; the bean that’s actually generated is never used, and the whole thing is far from obvious in its intent.

The only notional advantage I can see is that you can add and remove listeners without touching the code. But how much of an advantage is that? In most situations, where you’re using a method-invoking synchronous observer/subject pattern like this, listeners are part of the application, and not part of the configuration; you wouldn’t remove one without first consulting a developer anyway. When you’ve got genuinely replaceable listeners, then it’s more common IME to have some kind of an abstraction like a JMS queue or a message bus in between subject and listener, so that the listeners are registered with the queue, not the subject itself.

If it were up to me, I’d probably have a class that’s called when the context is built (via an ApplicationListener, maybe), which explicitly builds up the subject/observer relations. If I had some configurable relationships, I might pass in a list of observers, but that’s about as far as it would go:

// set by IOC
setChangeEventListeners(List<ChangeEventListener> listeners){
   this.changeListenersToRegister = listeners;
}

onContextRefreshed(){

   // configure a subject with a list of observers
   for (ChangeEventListener listener : this.changeListenersToRegister){
       this.changeEventBroadcaster.addListener(listener);
   }

   // now hard-code a subject that won't need to change frequently
   auditLog.addListener(new log4j.Category("AUDIT_LOG"));

   // ... and so on
}

- this object starts to look a bit vague and ill-defined, doing a little with lots of objects, but that’s because really it’s just a part of the context/configuration; it’s not a part of the domain per se.

There are a few other options that, in some situations, might be better than this;

  • Give the subject a constructor that takes a list of observers, and let it wire them at construction time – then pass the list from within your XML context
  • If you can’t modify the subject itself, make a custom FactoryBean that takes the list of observers, constructs the subject and adds all the observers to it
  • One that requires a bit of divergence from the standard Spring usage: have a context that’s defined by a bit of scripting code – JRuby, or BSH, or javascript/rhino – rather than by XML. That way you make your method calls more explicit, and allow developers to easily see what relationships are being built up, whilst still keeping some clear separation between the configuration and the java code. If you had loads of observer/subject configuration to maintain, you could define a little DSL for it (or store it in a database) and have a custom context to parse the DSL and configure the beans.
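To sketch the first of those options, using the article’s town-crier example (the method and interface names here are my own invention):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the constructor-injection option: the subject takes its observers
// up front, so the XML context only needs one bean definition with a
// <constructor-arg> holding a <list> of refs. TownCrier comes from the
// article; hear()/announce() are invented for the sketch.
public class TownCrier {

    public interface Listener {
        void hear(String news);
    }

    private final List<Listener> listeners;

    public TownCrier(List<Listener> listeners) {
        this.listeners = new ArrayList<Listener>(listeners);
    }

    public void announce(String news) {
        for (Listener listener : listeners) {
            listener.hear(news);
        }
    }
}
```

The registration is now statically type-checked, and the intent is right there in the constructor signature instead of being buried in a MethodInvokingFactoryBean.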

January 19, 2007

New tools

I’ve been playing with some new bits of technology. Not very new, but new to me, anyway. JSON and BeautifulSoup.

BeautifulSoup is a Python library, now ported to all good dynamic languages (I’m using the Ruby version), which parses HTML. Its defining feature is that it’s very relaxed about well-formed-ness. If your markup is fully-validating XHTML, all good. If it’s horrible HTML 3.2 tag soup with unbalanced divs and unclosed tables, that’s cool too. Soup will make a pretty good job of it, parsing what it can with a DOM, and falling back to regexes, special cases, and hacks for the rest. Having parsed the markup, it gives you a nice DOM tree, which you can traverse or search as you’d expect.

I’m using it to scrape some information from a webpage, which I then expose as data via a web service, which is where the JSON bit comes in. JSON is a data-transfer language, functionally equivalent to XML, but expressed as Javascript arrays and hashes.
So rather than having to parse a heap of XML, which is awkward and platform-dependent in javascript, you can just eval the JSON string (escaping as needed if you don’t trust your source), and have a pre-loaded object graph spring into existence.

My JSON is loaded after page-loading via a prototype Ajax.Request. The onComplete function evals the JSON, then sets up the innerHTML for a div, based on the objects it got back. It’s really very straightforward – even for a javascript newbie like me.


December 08, 2006

Solaris SMF manifest for a multi-instance jboss service

Today I have mostly been writing SMF manifests. We typically run several JBoss instances per physical server (or zone), using the JBoss service binding framework to take care of port allocations. I couldn’t find a decent SMF manifest that would be safe to use in a context where you’ve got lots of JBosses running, so I wrote my own. Here it is…

It’s still a tad rough around the edges.
  • It assumes you’ll name your SMF instances the same as your JBoss server instances
  • The RMI port for shutdowns is specified as a per-instance property – in theory one could parse it out of the service bindings file, but doing that robustly is just too much like hard work at the moment.
  • It assumes that you’ll want to run the service as a user called jboss, whose primary group is webservd – adjust to suit.
  • The jvm_opts instance property allows you to pass specific options (for example, heap size) into the JVM
  • It assumes that you’ll have a log directory per instance, located in /var/jboss/log/{instance name}-{rmi port}. The PID file is stored there, and the temp file dir is set there too (using /tmp for temporary files is a bad idea if you hoover your temp dir periodically, as you’ll delete useful stuff)
  • The stop method waits for the java process to terminate (otherwise restart won’t work). The start method doesn’t wait for the server to be ready and to have opened its HTTP listener, just for the VM to be created. I might add that next, although given that svcadm invocations are asynchronous there doesn’t seem much point.

The manifest itself:

<?xml version='1.0'?>
<!DOCTYPE service_bundle SYSTEM '/usr/share/lib/xml/dtd/service_bundle.dtd.1'>
<service_bundle type='manifest' name='export'>
  <service name='application/jboss' type='service' version='0'>
    <instance name='default' enabled='true'>
      <dependency name='network' grouping='require_all' restart_on='error' type='service'>
        <service_fmri value='svc:/milestone/network:default'/>
      </dependency>
      <dependency name='sysconfig' grouping='require_all' restart_on='error' type='service'>
        <service_fmri value='svc:/milestone/sysconfig:default'/>
      </dependency>
      <dependency name='fs-local' grouping='require_all' restart_on='error' type='service'>
        <service_fmri value='svc:/system/filesystem/local:default'/>
      </dependency>
      <exec_method name='start' type='method' exec='/usr/local/jboss/bin/svc-jboss start' timeout_seconds='180'>
        <method_context>
            <method_credential user='jboss' group='webservd' />
        </method_context>
      </exec_method>
      <exec_method name='stop' type='method' exec='/usr/local/jboss/bin/svc-jboss stop' timeout_seconds='180'>
        <method_context/>
      </exec_method>
      <property_group name='jboss' type='application'>
        <propval name='instance-rmi-port' type='astring' value='1099'/>
        <propval name='jvm-opts' type='astring' value='-server -Xmx1G -Xms1G'/>
      </property_group>
    </instance>
    <stability value='Evolving'/>
    <template>
      <common_name>
        <loctext xml:lang='C'>JBoss J2EE application server</loctext>
      </common_name>
    </template>
  </service>
</service_bundle>

... and the service method

#!/usr/bin/sh
#

. /lib/svc/share/smf_include.sh

# General config
#
JAVA_HOME=/usr/java/
JBOSS_HOME=/usr/local/jboss
JBOSS_CONSOLE=/dev/null

# instance-specific stuff:
# sed the instance name out of the FMRI
JBOSS_SERVICE=`echo $SMF_FMRI | sed 's/.*:\(.*\)/\1/'`
JBOSS_SERVICE_RMI_PORT=`svcprop -p jboss/instance-rmi-port $SMF_FMRI`
SERVICE_JVM_OPTS=`svcprop -p jboss/jvm-opts $SMF_FMRI`

# Derived stuff
#
JBOSS_VAR=/var/jboss/jboss-3.2.7/${JBOSS_SERVICE}-${JBOSS_SERVICE_RMI_PORT}
PIDFILE=${JBOSS_VAR}/JBOSS_${JBOSS_SERVICE}.PID
JAVA=${JAVA_HOME}/bin/java
JAVA_OPTS="-Djava.io.tmpdir=${JBOSS_VAR} -Djava.awt.headless=true" 

if [ -z "$SMF_FMRI" ]; then
        echo "JBOSS startup script must be run via the SMF framework" 
        exit $SMF_EXIT_ERR_NOSMF
fi

if [ -z "$JBOSS_SERVICE" ]; then
        echo "Unable to parse service name from SMF FMRI $SMF_FMRI" 
        exit $SMF_EXIT_ERR_NOSMF
fi

jboss_start(){
        echo "starting jboss.." 
        JBOSS_CLASSPATH=${JBOSS_HOME}/bin/run.jar:${JAVA_HOME}/lib/tools.jar
        if [ ! -z "$SERVICE_JVM_OPTS" ]; then
           JAVA_OPTS="${JAVA_OPTS} ${SERVICE_JVM_OPTS}" 
        fi

        $JAVA -classpath $JBOSS_CLASSPATH $JAVA_OPTS org.jboss.Main -c ${JBOSS_SERVICE} >$JBOSS_CONSOLE 2>&1 & echo $! >${PIDFILE}
}

jboss_stop(){
        echo "stopping jboss.." 
        stop_service="--server=localhost:${JBOSS_SERVICE_RMI_PORT}" 
        JBOSS_CLASSPATH=${JBOSS_HOME}/bin/shutdown.jar:${JBOSS_HOME}/client/jnet.jar
        $JAVA -classpath $JBOSS_CLASSPATH org.jboss.Shutdown $stop_service
        PID=`cat ${PIDFILE}`
        echo "waiting for termination of process $PID ..." 
        pwait $PID
        rm $PIDFILE
}

case $1 in
'start')
        jboss_start
        ;;

'stop')
        jboss_stop
;;

'restart')
        echo "Restarting jboss" 
        jboss_stop
        jboss_start
        ;;

*)
        echo "Usage: $0 { start | stop | restart }" 
        exit 1
        ;;
esac

        

enjoy!

postscript I wrote above that parsing the service-bindings file to find the RMI port is too hard; this turns out not to be true. Praise be to Blastwave!

pkg-get install xmlstarlet

xml sel -t -v "/service-bindings/server[@name='${INSTANCE_NAME}']/service-config[@name='jboss:service=Naming']/binding/@port" service-bindings.xml 

November 22, 2006

Tuning Java 5 garbage collection for mixed loads

Once again, I find myself glaring balefully at the output of garbage collection logs and wondering where my CPU is going. Sitebuilder2 has a very different GC profile to most of our apps, and whilst it’s not causing user-visible problems, it’s always good to have these things under control.

So, SB2 has an interesting set of requirements. Simplistically, we can say it does 3 things:

1) Serve HTML pages to users
2) Serve files to users
3) Let users edit HTML/Files/etc

these 3 things have interestingly different characteristics. HTML requests generate a moderate amount of garbage, but almost always execute much quicker than the gap between minor collections. So, in principle, as long as our young generation is big enough we should get hardly any old gen. garbage from them. Additionally, HTML requests need to execute quickly, else users will get bored and go elsewhere.

Requests for small files are rather similar to the HTML requests, but most of our file serving time is spent drip-feeding whacking great files (10MB and up) to slow clients. This kind of file-serving generates quite a lot of garbage, and it looks as if a lot of it sticks around for long enough that it ends up in the old gen. Certainly the requests themselves take much longer than the time between minor collects, so any objects which have a lifetime of the HTTP request will end up as heap garbage. Large file serving, though, is mostly unaffected by the odd GC pause. If your 50MB download hangs for a second or two halfway through, you most likely won’t notice.

Edit requests are a bit of a mishmash. Some are short and handle only a little data, others (uploading the aforementioned big files, for instance) are much longer running. But again, the odd pause here and there doesn’t really matter. There are orders of magnitude fewer edit requests than page/file views.

So, the VM is in something of a quandary. It needs a large heap to manage the large amounts of garbage generated from having multiple file serving requests going on at any given time. And it needs to minimise the number of Full GCs so as to minimise pauses for the HTML server. But, the cost of doing a minor collection goes as a function of the amount of old generation allocated, so a big, full heap implies a lot of CPU sucked up by the (parallel) minor collectors. It also means longer-running minor collections, and a greater chance of an unsuccessful minor collect, leading to a full GC.
(For reference, on our 8-way (4 proc) opteron box, a minor collect takes about 0.05s with 100MB of heap allocated, and about 0.7s with 1GB of heap allocated)

So, an obvious solution presents itself. Divide and Conquer.

Have a VM (or several) dedicated to serving HTML. These should have a small heap, and a large young generation, so that parallel GCs are generally fast, and even a full collection is not going to take too long. This VM will be very consistent, since pauses should be minimal.

Secondly, have a VM for serving big files. This needs a relatively big heap, but it can be instructed to do full GCs fairly frequently to keep things under control. There will be the occasional pause, but it doesn’t matter too much. Minor collections on this box will become rather irrelevant, since most requests will outlive the minor GC interval.

Finally, have a VM for edit sessions. This needs a whacking big heap, but it can tolerate pauses as and when required. Since the frequency of editor operations is low, the frequency of minor collects (and hence their CPU overhead) is also low.
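In terms of Java 5 flags, the three profiles might look something like this; the sizes are illustrative, not the values we actually run:

```
# HTML VM: small heap, young gen sized at half of it, so minor GCs stay cheap
java -Xms256m -Xmx256m -XX:NewRatio=1 ...

# File-serving VM: bigger heap; one cheap way to force periodic full GCs is
# RMI's background collection interval (milliseconds)
java -Xms1024m -Xmx1024m -Dsun.rmi.dgc.server.gcInterval=600000 ...

# Edit VM: whacking big heap, pauses tolerated as they come
java -Xms2048m -Xmx2048m ...
```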

The only downside is that we go from having 2 active app server instances to 6 (each function has a pair of VMs so we can take one down without affecting service). But that really only represents a few extra hundred MB of memory footprint, and a couple of dozen more threads on the box. It should, I hope, be a worthwhile trade off.


October 07, 2006

Edgy trivia

  • If you want a kernel post 2.6.17.7 to boot into X on a Toshiba laptop, modify /etc/modprobe.d/toshiba_acpi.modprobe so it says
    options toshiba_acpi hotkeys_over_acpi=0

(toshiba hotkey support is broken in recent kernels)
Update: fixed in 2.6.17.10.29 :-)

  • Flashplayer-nonfree works on XGL again! Just make sure you’ve got 24-bit colour and all the latest xserver-xorg packages, and it’s all good. Ah, youtube, I’ve missed you…

September 19, 2006

Unicode, UCS, and encodings

About once a year or so I come up against some multi-byte text issue that requires me to learn a bit more about how unicode works. Each time, I come away from it thinking ‘yeah; I understand this now’. So far this has always turned out to be a bit wrong. However, I’ve moved forwards a little bit today, by coming to grips with what the UCS is.

This episode was sparked by a question from Hongfeng, who asked ‘how come our web pages are served as iso-8859-1, but they can still display Chinese characters?’

It’s a good question; iso-8859-1 is a very small and parochial encoding, designed specifically for efficient coding of western european text. It specifies encodings for the letters of the roman alphabet, plus their accented versions, and common symbols. But, unlike a multibyte encoding like UTF-16, it doesn’t know anything about non-european alphabets. So when the browser sees a sequence like ‘& # 1488’ (minus the spaces), how does it know that it should convert that into an ‘aleph’ (א) glyph?

The answer is the UCS. The UCS is the uber-charset; a set of about a hundred thousand characters in every alphabet from Klingon to Olde English. Each character in each alphabet is given a unique number in the UCS, so there’s never any question of people disagreeing on whether 1488=’B’ or 1488=’aleph’. (Though it’s quite possible that two code points might refer to two indistinguishable glyphs.)

iso-8859-1 and friends cover small subsets of the UCS, while utf-8 and utf-16 are ways of encoding all of it, each optimised for a particular kind of text. So if you want to send a page of french text, you can do it in far fewer bytes if you encode into iso-8859-1 than if you use UTF-*. But HTML gives you the option of specifying a raw UCS code point, via the &#{number}; notation. So as soon as you come up against a character that’s not in your target charset, just look up its UCS codepoint and encode away.

UCS is therefore somewhat more reliable than using high-byte characters. For instance, if you want to use ‘smart quotes’ in XHTML you can either use the UCS code points 8220 and 8221, or the UTF-8 byte sequences e2 80 9c and e2 80 9d, or the HTML entities &ldquo; and &rdquo;. But if you use the HTML entities you won’t be able to parse your content as XML unless you predefine the entities wherever your markup is to be used; if you use the utf version you won’t be able to use your markup on a page which isn’t utf-8 (which is a pain when you’re syndicating other people’s data). If you use the UCS version, it might cost you a few extra bytes per character, but it’s universally re-usable. Which is good.
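The ‘encode away’ step is mechanical enough to sketch in a few lines of Java (the class and method names are mine, not a real API):

```java
// Illustration of falling back to numeric character references: any code
// point outside US-ASCII is written as &#n; so the result survives in any
// target charset. Handles supplementary characters via codePointAt.
public class Ncr {
    public static String escapeNonAscii(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (cp < 128) {
                out.append((char) cp);          // plain ASCII passes through
            } else {
                out.append("&#").append(cp).append(';'); // everything else becomes a NCR
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }
}
```

Feed it a string containing an aleph and you get `&#1488;` back, whatever charset the page is served as.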

A couple of useful references for more info:

September 12, 2006

Weird Apache / mod_jk / Jboss keepalive bug

A curious bug manifested itself on our production web server yesterday. The server runs apache, with MaxClients=500, and handles about 3 million hits per day. About 80% of requests are for static files, and the remainder are dynamic, handled by mod_jk delegating to 2 jboss instances (round-robin load-balancing), each with 250 tomcat worker threads.

Now, round about mid-morning, we got a fairly big spike in the number of requests per second coming through, and very quickly the site became unresponsive. Apache reported about half of its workers being used for keepalives, and the remainder waiting on a response from jboss.

Jboss reported that all of its mod_jk worker threads were busy, but only a very small number were actually involved in servicing requests; most were in a ‘K’ state (keepalive). The load on the box was very light (only a few % CPU) and testing via the jboss HTTP interface (which was unaffected by mod_jk) suggested that there were no problems actually handling requests.

Now, a mod_jk/AJP keep-alive, as I understand it, isn’t supposed to work like an HTTP keepalive. An apache worker that’s in a keepalive state is reserved for the exclusive use of the client (browser) that’s connecting to it. If you have too many distinct clients connecting and holding keepalives, you’ll run out of httpds. But, an AJP keepalive is a keepalive between the apache server and the jboss server; although it knows the client IP address, it’s not limited to only serving requests from that client. Even if it is, it should recognize when apache terminates the keepalive connection with the client (after 15 seconds or 100 requests in our case) and make itself available for other client connections.

However, in our setup that didn’t seem to be the case. It seems as if, once an AJP connection is marked as holding a keepalive to a client IP address, it will not service requests from any other IP address for some relatively long period of time (Not infinite, but much longer than apache, anyway). The result of this is that it doesn’t take long under load for all of your worker threads to be tied to particular clients, waiting for the next request.

We solved the problem, rather kludgily, by simply disabling client keepalives in the web server. This makes the process of rendering pages slightly slower (since each request must set up and tear down a TCP connection for every image, stylesheet, and other static resource the page references) but it’s not really noticeable. It’s had the additional benefit that our apache server has gone from having about 250 active httpds on average to about 50.
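For reference, the httpd.conf change amounts to this (the ‘before’ values are the 15-second/100-request settings mentioned above):

```
# Before: a client could park an httpd for up to 15s between requests,
# for up to 100 requests per connection
KeepAlive On
KeepAliveTimeout 15
MaxKeepAliveRequests 100

# After: every request pays for its own TCP connection, but no httpd
# (and hence no AJP connection) is left tied to an idle client
KeepAlive Off
```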

A much better solution (apart from fixing mod_jk), IMO, would be to ditch mod_jk in favour of a connectionless HTTP load balancer like haproxy. Then we could re-enable keepalives from the web server to the client, enabling fast loads for the 80% of static content, but disable them from the web server to the JBoss server, thus preventing Jboss from spawning too many threads which are just sitting idle waiting on a client keep-alive. Plus, switching to HTTP would allow us to do funky pipeline things like sticking a Squid cache between web server and JBoss, to further speed things up. Additionally, the overhead of TCP connection establishment, whilst it might be significant on a static request for a 2K stylesheet, simply doesn’t figure in the time to render a page with 30-odd database queries and a stack of java code behind it.


August 01, 2006

Creeping Statefulness

A while ago, when we were designing Sitebuilder 2, one of our design goals was that the app server should be as stateless as it possibly could be. We wanted to end up in a situation where we could scale the app simply by bringing new servers online, with no need to replicate between them. We also wanted to be able to bring individual servers down for maintenance without users noticing, simply redirecting requests onto the remaining servers.

As far as the viewing of pages is concerned, this has worked out pretty well. There are occasional blips when our version of mod_jk fails to realise that a server has dropped out of the cluster, but they're rare and we could work around them quite easily if need be.

But for editing we haven't quite realised our goals. It started out very well, but we were seduced by a bit of technology that nearly did what we needed, but not quite. That technology is based on Spring Web Flow.
What SWF does (amongst other things) is to let you associate an arbitrary bunch of objects with a business process. So if you were, say, uploading and unpacking a zip file onto the server, then the process might include 3 steps – you upload the file, then you choose which files go where, then you're told which ones were successfully uploaded and which weren't.

There's quite a bit of state associated with that process, and SWF does kind of solve the problem of how you could do step 1 on server 1, step 2 on server 2, and step 3 on server 3. It does this by using 'client continuations' – basically, all the server-side objects needed for the process are serialized and the resulting ObjectOutputStream is written into a field on the form, and hence re-submitted when the user processes the next step.

So far so good. But the first hurdle comes when you've got a lot of server-side state - like, say, a 200MB zip file full of MP3s. If you try serializing that back to the client, you'll have a lot of network IO, plus you'll have to post all of your forms as multiparts, which is a bit bogus.
So, when we serialized our objects, we wrote all the files out to some shared file storage, so that any node in the cluster could pick them up (for purposes of disaster planning, our shared storage never fails ;-) )

So far so good; now we have clients with almost all of the server-side state they need to continue the process, and the rest of the state is shared amongst all the nodes.

But spotting the files to store on the server is kind of tricky; sometimes they're buried at the bottom of an object graph in hard-to-find places. So some clever chap hit upon the realisation that, if we're relying on a shared file system for some of our state, we might as well rely on it for all of the state. So instead of serializing the objects and sending them back to the client, we serialize them all to disk, and just send the client a pointer to the file on disk. All good?

Well, then along comes the next problem. Someone adds a non-serializable attribute to one of these objects. Of course, it's buried at the bottom of a huge graph of objects whose main job is something completely different to holding conversational state, so no-one spots it until it gets deployed live, and suddenly all kinds of edit operations are throwing NotSerializableExceptions. Great. Sorry, everyone.

So, we write some mildly heroic custom infrastructure that looks through the objects it's about to serialize, spots any non-serializable classes, and calls a special beforeSerialize / afterDeserialize hook to allow the object to convert itself properly, passing it whatever service objects it might need.
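The hook boils down to something like this – a sketch with invented names, not our actual infrastructure:

```java
import java.io.Serializable;

// Sketch of the before/after serialization hook: a state object that holds a
// non-serializable collaborator nulls it out before being written (so
// writeObject doesn't blow up), and gets a live replacement handed back
// after being read. All names here are invented for the illustration.
public class UploadConversation implements Serializable {

    private String uploadPath;        // ordinary state, serialized as normal
    private Object fileStoreService;  // not Serializable -- would break writeObject

    public UploadConversation(String uploadPath, Object fileStoreService) {
        this.uploadPath = uploadPath;
        this.fileStoreService = fileStoreService;
    }

    /** Called by the serializing infrastructure just before writeObject. */
    public void beforeSerialize() {
        fileStoreService = null;
    }

    /** Called just after readObject, with whatever services the object needs. */
    public void afterDeserialize(Object fileStoreService) {
        this.fileStoreService = fileStoreService;
    }

    public Object getFileStoreService() { return fileStoreService; }
    public String getUploadPath() { return uploadPath; }
}
```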

Now is it all fixed? Well, no, actually it's not. Because if we release a new version of the code, and the new version changes one of the serialized classes, then anyone who's in the middle of an editing process when we release the code is going to find that their nicely serialized state is no longer compatible, and is going to be staring at something like local class incompatible: stream classdesc serialVersionUID = -6586187098630577013, local class serialVersionUID = -8144714009347234947. Great.

There has got to be a better solution to this; and I can't help thinking that it probably just involves a form with a big bunch'o hidden fields with all the previously-submitted data in. Oh well, back to the old skool we go…


June 29, 2006

How to look an idiot in 1 easy step

(I debated whether this was one of those things better kept a secret, but in the end I decided it was more fun to share…)

So, we have a little system (based on Nagios ) that we use for monitoring all the various servers and applications that we run in webdev. It tracks the status of about 60 assorted web services of one sort and another and, amongst other things, keeps logs of performance. Like this.

Now, a couple of weeks ago I noticed that some of our services seemed to have got a bit more 'spiky'. Not by much – about 20 milliseconds or so more latency, and not terribly consistent.

I didn't pay it too much attention – it wasn't enough to be visible to end users, and we often get short periods of lag like this when the network's playing up.

Then today, it suddenly stopped, and the graph went flat again. Curious, I thought. So I rang up a colleague in the network team, and asked if they'd changed anything that might have improved performance for us. They hadn't, but requested a bunch of extra information so they could do some more diagnostics.

I started gathering some more data, and only then noticed that all of our services seemed to have got 20ms faster at almost exactly the same time, 10am. Surely they must have done something to the network at westwood (where the monitor is) to have affected so many boxes at once?

Then suddenly I had one of those 'oh bugger' moments, as I recalled going into our machine room at about 9:45 that morning, and noticing that someone (probably me) had left a console logged in to the monitoring server, which was now happily spinning away a screensaver. I logged it out at, yes, exactly 10 am.

So all the lag was in fact nothing whatsoever to do with the network; it was in fact the monitoring server trying to task-switch between an openGL screensaver and a timed TCP connection. Ooooopie. Time for a hasty message to the network team to ask them not to look at it any further, and to apologise for wasting their time.

And the moral of this story? There isn't one really. 'Don't run screensavers on your production boxes' is a bit too obvious, isn't it? In my defence, I'd point out that the CPU load that the screensaver imposes is tiny – just a couple of percent – but it's a single-CPU box, and the cost of context-switching back and forth between X and the monitoring process is, oh, I don't know, about 20 milliseconds?

sigh…


June 16, 2006

Day 2 Session 5: What's new in Spring 2

  • extensible simplified configuration

util:map and util:list look especially helpful

  • AOP

aop:* tags, AspectJ pointcut language and @Aspect annotation support. Very easy to add AOP functionality

  • MVC updates + SWF
    convention over configuration in MVC, form tag library, Portlet MVC framework

  • Simplified TX
    tx:advice + aop:advisor, or tx:annotation-driven + @Transactional

  • Task Executor

– execute an arbitrary runnable, sync or async, with thread pooling etc.
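The TaskExecutor abstraction itself is Spring's, but the idea it wraps can be sketched with the JDK's own java.util.concurrent (a hypothetical example, not the Spring API):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class TaskExecutorSketch {
    public static int runAsync() throws Exception {
        // A thread pool playing the role of the executor abstraction
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            // submit an arbitrary task; the caller decides whether to block
            Future<Integer> answer = pool.submit(() -> 6 * 7);
            return answer.get(); // blocking here makes it effectively synchronous
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(runAsync()); // prints 42
    }
}
```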

  • Async JMS

MessageListenerContainer – wraps POJOs and binds to a MessageListener

  • Data Access
    JPA support – make it easy to switch between providers, fill in the gaps in the spec. The @Repository annotation automagically (via AOP) does exception translation for you on DAOs. AbstractJPATest allows you to inject EntityManagerFactory, DataSource and TXManager into your tests
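The exception-translation idea behind @Repository can be sketched in plain Java, with a made-up DataAccessException standing in for Spring's real hierarchy (the DAO method and its behaviour here are invented for illustration):

```java
public class ExceptionTranslationSketch {
    // Unchecked exception standing in for Spring's DataAccessException hierarchy
    static class DataAccessException extends RuntimeException {
        DataAccessException(String msg, Throwable cause) { super(msg, cause); }
    }

    // Hypothetical DAO method: checked, vendor-specific exceptions are
    // translated into one unchecked, technology-neutral type
    static String findUser(boolean fail) {
        try {
            if (fail) throw new java.sql.SQLException("table or view does not exist");
            return "alice";
        } catch (java.sql.SQLException e) {
            throw new DataAccessException("query failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(findUser(false));
    }
}
```

With @Repository, Spring does this wrapping for you via an AOP interceptor, so the DAO code never mentions the vendor's exception types.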

  • Multilanguage support

– Groovy, JRuby, bsh (BeanShell), etc.


Day 2 Session 4: Spring usage patterns

Principles

  • SoC
  • Multi-tier applications + layered applications

Component

  • loosely coupled components allow for re-use throughout the enterprise (I think this is a dodgy assertion; code re-use strategies are unsuccessful more often than not)
  • Spring perspective: components are just POJOs – technology neutral and therefore more likely to survive technology shifts. Still requires careful thought to achieve reuse, though

Container patterns:

  • Container provides the technical concerns which have been separated out from the functional components
  • DI/IoC container used to be clearly separated from the app server, but the distinction is becoming more blurred as Spring picks up TaskExecutor, ShadowingClassLoader etc.

Component Interface

  • Component clients only access interfaces, not implementations
  • In Spring you can use a standard Java class's public methods; no need for anything heavier-weight unless the domain model requires it

Component Home

  • Locator for obtaining configured components
  • Largely just disappeared in Spring; DI means that clients don't need to be aware of the BeanFactory

Virtual Instance

  • Stub instances for entities that exist within the system but aren't needed right now: pooling, passivation, etc.
  • Spring support for this is more flexible (not restricted to session beans; can be any expensive object), but rarely used

Interception

  • Container adds code to components by intercepting requests
  • Spring AOP is much more powerful

Lifecycle callback

  • have to be able to tell a component to change its logical identity when used, and also to clean up resources on exit
  • Spring's lifecycle is much simpler than EJB2's: just startup and shutdown events, because most of the complexity went away with virtual instances

Spring API abstractions

  • ExceptionTranslator – convert checked or unhelpful exceptions into sensible ones
  • Template – manage acquisition and release of resources before and after business code
  • Exporter – Adapter to API-required classes, e.g. the JMXExporter converts an arbitrary bean into an MBean
  • Proxy – Adapter from the (remote) return object of an API call, proxies the object and facades away checked exceptions and so on. (Exporter and proxy are 2 sides of the same thing)
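The Template abstraction in particular is easy to sketch in plain Java. This toy version (all class names hypothetical, standing in for e.g. JdbcTemplate and a real connection) shows the acquire / callback / release shape:

```java
import java.util.function.Function;

public class TemplateSketch {
    // A toy "resource" whose acquisition and release we want managed
    static class Connection {
        boolean open = true;
        String query(String sql) { return "rows for: " + sql; }
        void close() { open = false; }
    }

    // The Template: business code runs in the callback; the template
    // guarantees the resource is released whatever happens
    static <T> T execute(Function<Connection, T> callback) {
        Connection conn = new Connection(); // acquire
        try {
            return callback.apply(conn);
        } finally {
            conn.close(); // always released, even if the callback throws
        }
    }

    public static void main(String[] args) {
        String result = execute(c -> c.query("select 1"));
        System.out.println(result);
    }
}
```

The business code never sees the acquisition or release, which is exactly why (per the testing session) there's no need to write tests proving connections don't leak.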


Day 2 Session 3: Unit testing with spring

– Testing == good. Making tests run fast == good. Automated tests == good
– Don't use the container in your unit tests; that's an integration test. Unit tests should be almost instant and they should not require elaborate setup.
– Mocks are helpful but not a silver bullet.
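To make the mock point concrete, here's a minimal hand-rolled mock in plain Java (no mocking library; all names are invented for the example):

```java
import java.util.ArrayList;
import java.util.List;

public class MockSketch {
    interface Mailer { void send(String to, String body); }

    // Class under test: all we care about is that it asks the mailer to send
    static class Notifier {
        private final Mailer mailer;
        Notifier(Mailer mailer) { this.mailer = mailer; }
        void notify(String user) { mailer.send(user, "hello " + user); }
    }

    static List<String> runScenario() {
        // Hand-rolled mock: records the calls instead of sending real mail
        List<String> sent = new ArrayList<>();
        Mailer mock = (to, body) -> sent.add(to + "=" + body);
        new Notifier(mock).notify("bob");
        return sent; // the recorded interaction we can assert on
    }

    public static void main(String[] args) {
        System.out.println(runScenario());
    }
}
```

This tests the interaction, not the behaviour of the real mailer, which is both its strength (fast, isolated) and its limitation (it proves nothing about real mail delivery): hence "not a silver bullet".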

– Some things can't be sensibly unit-tested: configuration, JDBC code, OR mapping, and (of course) how classes work together. We also need to think about how to test non-Java artefacts like DB schemas, config / mapping files, and JSPs / other views.

Some stuff isn't worth testing: Hibernate doesn't leak connections, and there's no point having a test to prove it. Isolate the code that doesn't need testing, or can't be tested, into DAOs or other similar service objects.

Out-of-container testing is vital – much faster, easier to debug, easier to run individual tests. Spring integration testing is a good alternative (org.springframework.test, in spring-mock.jar).

Spring integration testing provides context loading / caching, DI, (n.b. AutoWireUtils in spring 2) data access and TX management

Neat tip for TDD: configure the eclipse template for a new method to throw UnsupportedOperationException


Day 2 Session 2: Patterns of SOA

[I switched to this session from the 'Introduction to Spring 2' session because Gregor gave such an engaging keynote]

SOA is a shift in the way that we think about assembling systems – thinking about coupling, asynchrony, conversations and document exchange. It's really about reducing the degree of control one system exerts on another.

– Exists at a level of abstraction above the code; your compiler can't tell you if you're violating SOA principles

– Looks conceptually simple, but the devil is in the details: object->document mapping is as hard as ORM; declarative programming and document transformation are hard to maintain; event-based programming is an unfamiliar paradigm; and business process modelling is hard when you try to do a complete design (long-running processes and compensating transactions, since 2PC isn't really applicable).

– Graphical tools tend to be a thin veneer over a complex / unfamiliar programming paradigm
– Understanding a technology is a progression something like {syntax => constructs (e.g. in Java: classes, objects, interfaces etc.) => principles (SoC, the open-closed principle) => patterns}
– patterns are expressed in the constructs of the language, to support the principles of the language
– SOA patterns are different to OO patterns because the constructs & vocabulary of SOA are not those of OO

– request-reply pattern – starts out very simple, but gets a bit more complex fairly quickly (a return address and a correlation ID are required to locate endpoints and reproduce state if you have more than one producer / consumer). It's possible to use a pattern like this in reverse; i.e. if you don't have a correlation ID, you can't have multiple providers; if you don't have a return address, you can't have multiple consumers. When you introduce retries it gets more complex again, as you move into heuristics about how long to keep trying.
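The correlation-ID half of request-reply can be sketched in a few lines of Java. This toy version (an in-memory map standing in for a real reply channel; all names invented) shows how replies are matched back to requests:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class CorrelationSketch {
    // Replies keyed by correlation id, standing in for a shared reply channel
    static final Map<String, String> pendingReplies = new ConcurrentHashMap<>();

    // Consumer sends a request tagged with a correlation id...
    static void sendRequest(String correlationId, String body) {
        // ...the provider does its work and posts a reply carrying the same id
        pendingReplies.put(correlationId, "reply-to:" + body);
    }

    // ...and the consumer matches replies back to requests by that id, which
    // is what lets several outstanding requests share one reply channel
    static String collectReply(String correlationId) {
        return pendingReplies.remove(correlationId);
    }

    public static void main(String[] args) {
        sendRequest("req-1", "price of gold");
        sendRequest("req-2", "price of silver");
        System.out.println(collectReply("req-2"));
        System.out.println(collectReply("req-1"));
    }
}
```

Without the id there would be no way to tell whose reply is whose, which is exactly why dropping the correlation ID forces you down to a single provider.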

– dynamic discovery – broadcast request for service, providers respond, requestor chooses 'best' provider from response, requestor starts using provider. DHCP is a good example of this.

– Subscribe-notify – subscriber expresses an interest in receiving notifications, and the provider starts sending messages; the subscriber does not reply. After some event, the provider stops sending messages and notifies the subscriber that it is done.

– sub-patterns for ending / refreshing the subscription, e.g. automatic expiration (e.g. DHCP lease & renew) or a renewal request (e.g. the magazine model). Generally try to avoid letting the subscriber control when the subscription ends, because that doesn't cope with the subscriber going away

Orchestration: a facade co–ordinating the calling of multiple services (which may in turn orchestrate their own conversations)
Orchestration patterns – switches: XOR, OR; merges: synchronized, multiple, discriminator
(not all orchestration environments support all orchestration patterns)

Patterns and standards interact; standards tell you how to express a pattern in code, and patterns act as the requirements for the standard. There's a feedback cycle between patterns and standards, but it happens very slowly when you're dealing with cross-organisational standards.