January 31, 2006

Unsupportable systems

How do systems become unsupportable? If you're planning to implement an email system, or a web site, or a CRM package, or just about any sort of software really, then you probably intend for your software to be robust, secure, continuously available, performant and all those other words that basically come down to meaning "does what it's supposed to".

And yet the long-term trend in a surprisingly large number of cases seems to be something different. Systems get slower, less reliable, less predictable as time passes. Why is this? If you're planning to implement a new system of whatever sort tomorrow, what kind of things should you be watching out for to try and ensure that your system runs as well on day 1000 as it did on day 1? I suspect that several things happen to drag systems down over time:-

1. Increased volume

Everybody who's responsible for introducing a new system tries to figure out what the demands on the system will be, not just when it's first released but also in the future. But this is a notoriously hard problem, and many estimates are really just guesses. If the guess is an under-estimate and the volume grows beyond what the system was designed to handle, then there are going to be problems. If your email system was expected to process 100,000 emails a day, but two years later is processing 2 million, or your web server was expected to get 10,000 hits a day but now gets 500,000, then typically two things happen:-

  1. We fail to allocate the human resources needed to handle the increased load. If your content management system needs (say) half a day a week of attention to keep it running smoothly at 10k hits, then it will need more time and attention at 500k. But organisations aren't good at assigning resources to respond to things which happen gradually and incrementally, so it's common to see systems under-managed right up until the point when they explode, followed by the immediate, but too late, deployment of lots of resources.
  2. We try and cope with the unexpected load by doing things which we hadn't originally planned to. Some of those things might be low risk, straightforward strategies: more disk, more memory. Others attempt to solve the problem by introducing more complexity into the design: mirrored servers, replicas. Now you're not just worrying about how to make one web server work, you're worrying about how to make lots of web servers work together and share the load. The system is more complex and thus more fragile than it was when it started.
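
To see why gradual growth is so easy to under-resource, it helps to work out what the email example implies month by month. What follows is only a rough sketch: it assumes smooth compound growth (real traffic is lumpier), and the function names and the 200,000-a-day design capacity are made up for illustration.

```python
import math


def monthly_growth_rate(start_volume, end_volume, months):
    """Compound monthly growth rate implied by moving from start to end volume."""
    return (end_volume / start_volume) ** (1.0 / months) - 1.0


def months_until_capacity(current_volume, capacity, monthly_rate):
    """Whole months until volume reaches capacity at a steady compound rate.

    Returns 0 if capacity is already reached, or None if the volume is
    not growing and so never gets there.
    """
    if current_volume >= capacity:
        return 0
    if monthly_rate <= 0:
        return None
    months = math.log(capacity / current_volume) / math.log(1.0 + monthly_rate)
    return math.ceil(months)


# Figures from the email example: 100,000 a day growing to 2 million in two years.
rate = monthly_growth_rate(100_000, 2_000_000, 24)

# Hypothetical design capacity of twice the original estimate.
headroom_months = months_until_capacity(100_000, 200_000, rate)

print(f"Implied growth: {rate:.1%} per month")
print(f"Months until a 200,000/day design limit is hit: {headroom_months}")
```

Under those assumptions, the growth in any single month looks modest enough to ignore, yet a system built with double the expected capacity is out of headroom well inside its first year.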
2. Increased complexity

As well as increasing complexity in response to volume problems, systems get more complex over time as they attempt to respond to demands for them to do more things. People want calendaring and shared in-boxes in their email. They want workflow in their CMS. Sometimes these additions have been prepared for from day one, sometimes not. Sometimes they are manageable, linear extensions which don't make the system substantially more complex, but sometimes they are pushing the system somewhere it wasn't originally intended or designed to go. Changing a system is risky both in the short-term – the change might break things – and the long-term – the changed system may be more complex and less well understood than its previous incarnation.

3. Knowledge loss

People come and go. New people know stuff that the old people didn't, and vice versa. Even people who are still around don't know as much about how a three year old system works as they do about how a three week old one does. You can try and solve this with documentation and process, to a point, but no matter what you do, understanding decreases over time, and all you're doing is ameliorating that process.

So what distinguishes systems which perform well in the long term versus ones which don't? I'd look for several indicators:-

  • Slow growth in demand, or simple strategies to cope with increased demand which don't increase the complexity of the system.
  • Small, incremental changes in functionality rather than large new feature sets.
  • An agreed lifespan for the system.
  • Systems which aren't critically dependent on the knowledge and experience of a small number of people.
  • At a meta-level, evidence that there has been some degree of thought about the preceding indicators. I suspect you can get away with breaking one or more of the guidelines if you know you're doing it, and why.

2 comments

  1. Christopher Hinds

    Good post John, and one that emphasises the small bits that people gloss over in the hope that they can be 'fixed later' when all too often they can't be. I think to some extent you can resolve this more in a clustered high availability environment, where you can join another node to the cluster and have it continue to work as the same original 'public' machine (of course you tested this in the lab before you went to the production servers, right?).

    The other thing that I think is missing is that you have to use the right tool for the job, and all too often a cheaper tool wins at the cost of more admin time, more complexity to admin and, worse still, less reliability. The wrong tool, even at a low price, is still the wrong tool.

    31 Jan 2006, 20:32

  2. Chris May

    "I think to some extent you can resolve this more in a clustered high availability environment"

    Danger! Silver bullet detected!

    * yeah, I know you've caveated it; don't spoil my flow :-)

    There's always someone willing to flog a technology-based solution to scalability. When I started out in IT (adopts '4 yorkshiremen' accent…) it was putting more chips into your mainframe. Then it was moving the processor-hungry apps off the mainframe and onto a midrange box. Then it was putting more processors into your midrange boxes and turning it back into a mainframe (Sun E10Ks anyone? I wonder if the e-university has got rid of theirs yet :-) ) Then it was clustering PCs. Then it was blades. I forget where we're up to now, somewhere between Grids and CMT fault-tolerant CPUs I think. They're all useful solutions to a particular class of problems, but they don't address the fundamental issues.

    It's worth pausing for a moment to observe that Google, who have arguably the most scalable technology platform in the world, are currently doubling the size of their workforce (currently around 5000 IIRC) every 12–18 months. Now, some of those people are writing cool new apps. But it doesn't take thousands of people to knock out Yet Another Webmail Client. I don't have any evidence to support this, but I'd hazard a guess that most of those people are recruited to support scaling the infrastructure, one way or another.

    Whatever architecture you use, if you're doing something unusual you will always hit unexpected bottlenecks. You add another 6 boxes to your web cluster and discover that you've saturated your load-balancer, or you add another 5 storage servers and then discover that you're adding content faster than your backup server can cope with, or whatever. If you're doing something new, this will happen; expect it and be prepared to deal with it. Or you can just never innovate and only do stuff that's been done a gazillion times before. (Don't mock; ISPs thrive on this model.)

    You're bang on about the cost of the tool. Capital cost of a piece of software (and hardware, for all but a few edge cases) should be the absolute last thing on the list of considerations, because if you expect to have to live with it for more than a couple of years it will be utterly insignificant in the long run.

    And of course, this whole argument applies only to point 1; points 2 and 3 are in no way affected by architectural choices.

    31 Jan 2006, 21:45

