Power Email Outage
Writing about web page http://www2.warwick.ac.uk/services/its/
It can’t have escaped many people’s notice that a power cut in the Coventry area last Thursday morning caused (amongst other things) the Uni’s mail servers to fall over. It appears that the UPSs that were meant to protect critical systems in just such an incident failed to do so. According to ITS the three AA batteries that they’d connected maintained power for 45 minutes as planned, but the hamster-wheel failed to kick in (apparently 1 hamster isn’t strong enough to maintain power for Group Shite-wise). This caused catastrophic failure to various Group Shite-wise Postboxes and (as it takes approximately 400 years for Group Shite-wise to back itself up), no back-up appears to be forthcoming/in existence.
I’m not going to lie and say I’m shocked, given the litany of IT failures over the years (I’m being careful not to restrict this as being exclusively the fault of ITS), with Group Shite-wise being perhaps the most memorable. However, I am appalled that in this day and age The Administration see it as anything less than unacceptable for an institution such as ours to be without the ability to even send or receive email (let alone retrieve archived items). If this were a business, heads would have rolled long ago, and we wouldn’t be in a situation where, 5 days on, two thirds of staff/research student postboxes are at best described as being in a less than optimal condition.
Moreover, I find it incomprehensible that ITS are in liaison with affected departments to obtain lists of staff and research student user names. YOU WHAT? Surely it is in ITS’s remit to have at least a list of the accounts that it maintains and has f*cked up? I was literally speechless when I heard this.
Finally, (partly for want of not going on and on and on, and partly as I wouldn’t want to post so much that yet more of ITS equipment falls over irretrievably), it would be nice for ITS to keep people informed about the situation. Their vague assertions that staff and research students will be informed about any changes don’t hold much water when they can’t even keep their website updated with sufficient information (if it is up at all). Quite frankly there are three key questions that people want answering:
- When will I be able to send or receive email?
- When will I be able to access my old inbox?
- What the hell exactly is it you are doing about 1 and 2?
So far ITS have failed to answer any of the above satisfactorily!
49 comments by 6 or more people
I have to say, this is the third university I’ve been at in one capacity or another and the IT lags a million miles behind the other two. It would almost be funny if it wasn’t so professionally inconvenient.
For all I know, a postdoc position in Somewhere Nice, California could have been offered and withdrawn due to a lack of reply (since I don’t have internet access at home, the last time I got into my email was last Wednesday lunchtime). Given that I’m unemployed in 6 weeks that’s not good.
How remiss of me to put my work email address on my CV…
13 Nov 2006, 17:04
Mathew Mannion
Do you mean the IT in general, or just the email system (and even then, just GroupWise)? It’s a very sweeping statement to make, particularly given the advanced bespoke software that Warwick has developed in other areas of Information Technology. It’s just a shame that it’s taken a long time to migrate staff users from GroupWise to Exchange, since obviously this is a huge inconvenience.
As a note to the original post, to have a UPS last for 45 minutes is actually extremely rare in my experience; UPSs are generally designed to give enough time for the machine to shut down successfully, but obviously there was added redundancy in these systems that allowed a power cut of up to 45 minutes to be sustained without reverting to the generator. Since this power cut lasted around 90 minutes, the generator should have come online. Unfortunately this did not happen, and the generator failed – since this was at 5am, nobody was around to see the generator’s warning that it was not going to start, and as a result it never did.
I work in web development in ITS so obviously I’m not really in the loop with this (I also don’t use GroupWise, I have my email delivered to a UNIX mailbox which I then forward to a Gmail account) but from what I can gather the email team have been struggling for a hell of a long time with a behemoth GroupWise system that is not fit for the purpose it was designed for, and as a result have spent the last two years perfecting a solution to this in Exchange. Only now are the email team able to confidently transfer users onto Exchange – it’s unfortunate that it has to happen under these circumstances, but it in no way means that the email team is incompetent except in perhaps the lack of communication, although even then there is a notice on the ITS homepage.
13 Nov 2006, 18:13
I agree that a UPS that keeps systems up for 45 mins is rare indeed – we considered getting one after the last power cut (on St Paddy’s day) and one that allows more than about 10 mins on our group server alone to allow graceful shutdown would cost about £20k. I’m even more aware as we then had a period of a fortnight’s uncertainty where the power was being shut down at some point in various places to connect the UPSs (let alone the fortnight’s disruption while they removed the few covered cycle racks near chemistry and built a nuclear bunker in their place outside our office to house it). However, what’s the point of having the system if it doesn’t work? Yes there weren’t any IT staff on hand at 5am to check the generator was up and running and ready to kick in, but Security were – hell, they were the ones that called various ITS staff to alert them to the problem when the genny didn’t fire up.
Why is it always the critical systems that get buggered and machines in Uni House that my colleagues are running calculations on that stay up?
Yes there is a notice on the ITS homepage. I’ve been checking the bastard thing every couple of hours in the vain hope that it might garner some more concrete answers than:
It is exactly because of this kind of problem that various departments around the University, either with the in-house expertise (DCS), or the money to buy it in (WBS) have devolved themselves from the University’s central email system. Yes the guys in ITS have been working hard to deliver a quality replacement email service, but we’ve now been waiting nearly two years and we are yet to see the fruit of their labour.
Finally, I’d love to know whether the VC’s email is up and running – bet you a pint it is…
13 Nov 2006, 18:58
Matthew Jones
Well, seeing as only a third of (GW) staff e-mail is affected by this, it’s twice as likely that it is fine as that it isn’t. Of course that implies that he uses GW, which he might not.
And as a former CS student, I used to forward all my CS mail to GW without problems.
14 Nov 2006, 10:11
Stuart Coles
I have no problems with my email. Score 1 for being an Engineer.
14 Nov 2006, 10:51
True. I suspect that had his email been affected like this, I wouldn’t have needed to post….
That still leaves you on -9 overall…
Oh, and having followed all of the ITS instructions (3 times) I still can’t get into either Outlook Web Access or my mail via POP or IMAP.
14 Nov 2006, 11:25
John Dale
Not so, in fact; the VC and the Registrar and their respective departments were all on Post Office 2, which means that like other affected users they are now using Outlook and Exchange.
14 Nov 2006, 12:24
That shows you what I know (but on the upside someone owes me a pint). However, there is an assertion there that affected users are able to actually use Outlook and Exchange, which isn’t universally true…
If anybody has any useful suggestions, I’m all ears.
14 Nov 2006, 13:25
Given the total number of work hours the University must have lost as a result of the (repeated) disruption that occurs to IT systems as a result of “out of hours” faults, one would have thought it would be an overall saving to employ someone to watch what was going on all the time. Maybe that would require too much coordination (probably from a higher level than ITS).
14 Nov 2006, 13:31
John Dale
I’d suggest calling the helpdesk on 73737. They are aware of the situation with Post Office 2 users who are trying to move to Exchange, and they will pass your problem straight on to the Exchange team. Once the problem has been identified, our experience so far is that it’s normally pretty quick to resolve it. I’d be interested to hear how you get on.
14 Nov 2006, 13:32
Thanks – I guess emailing them this morning was a little naive…
14 Nov 2006, 13:33
John Dale
It sounds tempting, doesn’t it? But I guess just having them watch wouldn’t be much good; if you see that your UPS has kicked in but your generator has failed to start, what you really want is not just an observer, but someone who knows how to start a generator. Except that what if it wasn’t the generator? What if it was one of the core switches in the machine room? Hmm. So we’ll need a network guy too. Except, what if it’s one of the Post Offices running out of disk space? Drat. Now we need a sysadmin and possibly an email guy.
And so on. To deliver effective out-of-hours cover, the slightly unpalatable truth is that you need not just an observer, but also large parts of your Estates and your IT Services staff to be available on call-out 24/7. The university isn’t oblivious to the benefits that this could bring, but it’s a more difficult and costly exercise than it might seem at first sight, and so far, the university has elected not to implement it.
14 Nov 2006, 13:53
Apparently various users’ accounts weren’t created automagically in the transfer process – mine was one of them, as were those of the 3 other people in my office that were on Groupwise. Not that much of a wonder when ITS have to contact our departmental computer advisor to get a list of user codes.
It seems from one of the earlier notices that all it would have taken is somebody checking that the generator started up as the 45 mins of UPS power neared its end. Plus I’d imagine there’s a tangible cost associated with replacing various pieces of hardware that failed, and bringing in the recovery experts. Not to mention the damage to the University’s reputation with employers and potential employees.
I must have a particularly high expectation of IT services (not, in this case, ITS) – The Guardian gives out awards to websites that haven’t been updated for 8 months.
14 Nov 2006, 14:10
Well obviously, and I’m aware that the University has recognized that major IT failure is a serious potential risk to the organisation (or at least it did two years ago when I had some technical responsibility for the organisation). However, the level of action to address that risk never appears to be proportionate to the level of action taken to address other, often lesser, risks. There often seems to be an assumption that major out-of-hours failures don’t happen that often, but they do. Of course, that isn’t an IT issue, it’s a general management issue.
14 Nov 2006, 15:23
Max Hammond
or someone who knows how to use a telephone, to rapidly wake up those who know the answers. Or adequate support and systems, so that your suppliers can fix what they can remotely.
Network components failing is a nuisance, UPS failing wrecks things. Especially since Warwick’s UPS is known to be dodgy, one might consider prioritizing it.
14 Nov 2006, 15:27
Max Hammond
Oh, and Warwick’s outage is a serious and embarrassing enough problem to make it onto The Register :-)
14 Nov 2006, 15:31
Am I right in thinking that the telephone system relies on central campus networking? Power goes off, phones go off. But it’s OK as we have a UPS, and failing that a generator. :-S
Anybody know what this was doing at the time?
14 Nov 2006, 15:52
Now with external publicity http://www.theregister.co.uk/2006/11/14/warwick_email/ (came via my technology feed at work!)
14 Nov 2006, 15:56
Mathew Mannion
Security phoned up the director of ITS almost immediately after the failure occurred, from what I’ve heard. Would hate to have that job.
14 Nov 2006, 16:53
John Dale
Absolutely. But we’re saying the same thing, aren’t we? – that to do out-of-hours support properly, you need technically capable people on call, and Warwick as a matter of policy currently doesn’t do that.
14 Nov 2006, 17:15
Unfortunately, with an organisation like the University, it’s just not possible to quantify the ‘cost’ of downtime across the various user groups versus the cost of maintaining 24h support but I suspect that like most organisations the balance would be toward the former. It’s a bitter pill to swallow though! In the commercial world, one of the major drivers of the outsourcing phenomenon was the attractiveness of offsetting the cost of such ‘redundancies’ (i.e. 4 staff for the 363 nights per year that they have nothing to do but play Quake) by hiring in specialists whose core business is having techies work all hours.
I don’t think that would work particularly well in a university context as centralised control of IT functions is perceived as pretty critical and the user needs are just too diverse (although some functions are already outsourced to the likes of JANET).
14 Nov 2006, 17:37
Matthew
Of course it’s possible to quantify the cost of the current downtime. The recent change to Full Economic Costing insisted upon by the Research Councils has ensured that it would be pretty easy to work it out.
I hope someone does work it out and then bills IT Services.
14 Nov 2006, 18:25
Max Hammond
Really? The cost of IT time in fixing it is easy, but how much staff time has been lost? How do you work out how much longer it has taken the university to conduct its business while STAFF2 has been down? How many lost opportunities have there been? How many students won’t come here because of the cost to Warwick’s reputation caused by this kind of acutely embarrassing failure?
Kind of. I’m saying that some systems may need 24/7/365 support and some may not. Since the power systems seem to be somewhat shaky, and are clearly a single point of failure in the Warwick infrastructure, they might be prioritised, in case of just this kind of incident.
14 Nov 2006, 18:47
>>Now with external publicity http://www.theregister.co.uk/2006/11/14/warwick_email/ (came via my technology feed at work!)
Well, getting noticed on the “Register” is pretty bad news. I would say it wipes away most positive public relations for the past few years. I have found that the “Register” is a prime point of departure when researching background information on individuals and institutions, and essential if it involves IT. The “Register” piece will end up as a pretty high ranking referral from any search-engine search on “Warwick University”. Not quite the international reputation the University wants.
14 Nov 2006, 18:57
Matthew
My (possibly incorrect) understanding of the fEC system, based on various interactions with Research Support Services, is that each university has had to calculate the per hour indirect cost associated with each member of staff. This indirect cost comprises, amongst other things (heating, lighting, electricity, clerical support etc), the cost associated with providing information services. This information has been required on all grant applications for well over a year now.
Consequently, working out the full economic cost associated with the current outage is simple: take the component of the university’s per hour fEC indirect cost associated with email use, multiply by the number of staff and by the number of hours this has been going on for, and there you have it.
But, as you say, this could well be an underestimate given the effects on the university’s reputation. I, for example, had to ring up a journal editor this afternoon to apologise for not getting the corrected proofs of an article back to him within the deadline. I’ve given up trying to protect Warwick’s reputation, and now just say “our IT team are incompetent”.
14 Nov 2006, 19:02
Max Hammond
That approach would by definition result in a measure of the cost of operating the email service, not of the excess cost in not operating it. The benefits of operating email should massively outweigh the costs of running it, so what you’re actually after is the costs incurred through the disbenefit caused by the service not being there, and that is a hard thing to calculate.
I don’t believe that’s true. I don’t know enough about the management of ITS to have any detailed view about where the blame for these repeated incidents should lie, but it’s very unlikely to be with the people who operate or even designed the systems.
As I said in a comment on my own blog earlier, every incident is preventable, it just comes down to how much an organisation is willing to spend to mitigate any given threat. Evidently, Warwick’s power supply and UPS are not really up to task; the failure to provide these systems appropriately is one of management, not technology.
ITS management apparently finds it easy to invest in big, showy projects (blogs, forums, SSO), but not in fundamental infrastructure services, be they hardware (UPS/generators) or systems (replacing Groupwise in a timely manner). I don’t know whether this is a failing of ITS or rather of the university management from whence all funding eventually comes, but it is certainly a failing.
14 Nov 2006, 19:58
The company I now work for has a Groupwise e-mail and calendar system installed. When I first joined, I was very apprehensive about how well it would work based on my experiences at Warwick, but having been there for two months it has yet to go wrong once. Our company is hardly small (although doesn’t have anywhere near as many accounts as Warwick does to manage), but there is nothing wrong with the actual Groupwise program that we have at all. This to my mind reinforces my view that ITS are, for whatever reason, not up to scratch in this department, be it understaffing, poor training, bad management, bad implementation, whatever. But there is nothing actually wrong with Groupwise.
On an unrelated note, does anyone remember something like this happening last year? I believe on that occasion the circuit boards for the UPS blew up. Certainly, two major UPS failures within a year brings into question the quality/state of repair/implementation of Warwick’s hardware.
14 Nov 2006, 20:32
John Dale
At the risk of seeming unduly pedantic, the current incident wasn’t caused by a UPS failure. The UPS systems in the University House machine room and the ITS one both did exactly what they were supposed to: they started up and provided about 40 minutes of power to the two rooms. In University House, the backup generator then took over, but in the ITS machine room, the generator failed to start, so when the UPS batteries ran out, there was an uncontrolled shutdown. No doubt there is lots of blame to go around, but on this occasion, the UPSs are innocent!
14 Nov 2006, 21:50
Mathew Mannion
And that’s where the issue is. Novell will not support a database larger than 1TB, and Warwick’s was much, much bigger than that. As a result, none of Novell’s provided tools to fix the inevitable GroupWise problems can be easily run.
15 Nov 2006, 03:15
Max Hammond
Which begs the question as to why Warwick is running a core service in an unsupported configuration?
15 Nov 2006, 08:33
Mathew Mannion
Because it took x years for Exchange to be a viable alternative, I’d imagine. I’m not an expert on the email system and I don’t claim to be.
15 Nov 2006, 09:43
Max Hammond
It’s not a problem with the email system, Mat, it’s a problem with the management system. Letting a mission-critical service get to a point where it is so comprehensively broken is simply poor management.
I have no doubt that there are many competing factors for time and other resource, but email is something that should be at the top of the list. Saying “Groupwise is rubbish, so we’ll let it fester for two years until we sort out a replacement” is a very poor strategic choice, as is being so clearly demonstrated now.
15 Nov 2006, 14:17
It’s a good starting point, however, there is then the on-cost associated with commercial departments (Warwick Conferences etc.) plus other intangibles such as RSS/WV commercial activities etc that a plc would look at.
15 Nov 2006, 14:46
Allan
..but at least blogs come back OK or you people would have one less way of complaining about Warwick University and IT Services!
15 Nov 2006, 16:09
True, but in the absence of a working email system and with 3000 Groupwise users all waiting in a queue to get hold of the ITS Helpdesk….
15 Nov 2006, 19:09
For what it’s worth, my email is back up and running as of yesterday PM.
Plus it seems somebody has finally realised one of the problems and ITS are seeking ‘professional advice’
16 Nov 2006, 12:15
Edward Ryan
Has the head of IT resigned yet?
If not why not?
If he has not he should be sacked immediately.
Power cuts happen regularly and IT services should have had an effective contingency in place. If the contingency was not in place or did not work, the responsibility must ultimately lie with the head of IT.
After such a fiasco the only honourable thing is for a resignation.
17 Nov 2006, 10:18
John Dale
I would have thought that somebody with such strong views on the subject would at least have taken the time to find out that (a) the head of IT is a woman, not a man, and (b) she is retiring at Christmas, a fact which was announced some time ago.
17 Nov 2006, 10:29
Expert in Groupwise/MS Exchange? Looking for an ‘exciting opportunity’ to join the email Replacement Project ‘team at a critical point?’ Look no further than a job with ITS.
Apparently, ‘the potential of Internet technologies to transform and improve core services is widely recognised.’
17 Nov 2006, 14:24
Edward Ryan
Correcting my previous comment
SHE should resign
20 Nov 2006, 09:38
mick
see http://blogs.warwick.ac.uk/nwake
20 Nov 2006, 09:55
Edward Ryan: perhaps getting your facts right would be a really good idea.
The failure of the email systems was caused by the generator problems. The generators are managed by Estates. Why not ask for the Head of Estates to resign?
I suppose at the end of the day it’s easier to write outraged rants and blame the obvious targets than do a little research.
20 Nov 2006, 16:38
Pete
Maybe the whole place should have stuck with unix mail rather than switching to groupwise a few years ago.
20 Nov 2006, 20:33
mick
Graham
You seem to be missing a fundamental point. There are fundamental weaknesses in the University’s disaster recovery plan which mean that the power problem caused massive user disruption. I don’t think that the head of IT should resign, they should be sacked.
21 Nov 2006, 07:35
Max Hammond
It is the responsibility of the consumer to make sure that the products which they procure are fit for purpose. If estates can’t manage a generator, ITS should be buying what they need. At the very least, they should have a management handle on how mission-critical services are provided.
As I’ve said repeatedly, it wasn’t a problem with the generators, it was a problem with the management.
21 Nov 2006, 09:19
E.ryan
Hear! Hear!
21 Nov 2006, 09:33
Bugger, we should all resign as we are ITS’ customers and we have failed to make sure our products are fit for purpose.
Joking aside, the issue of responsibility is a difficult one. Graham, I can understand you being defensive about your department’s role in the generator failure, but unless you asked Estates for a particular (spec’ed) bit of kit and they provided something sub-standard or substantially different then it comes back to you.
Now, I completely understand John’s point way back about requiring ‘large parts of your Estates and your IT Services staff to be available on call-out 24/7,’ but (forgive me if I’m mistaken) don’t we already have at least half of that situation, with Estates’ maintenance team being on call for assorted emergencies? If a radiator/sink/pipe leaks overnight (and it’s reported to Security) then is it left to run? No, Estates are called and they come and fix it. If somebody breaks a key in their lock in Halls are they left out in the cold? No, a locksmith is called in to replace the lock. The University already provides out-of-hours service through Staff and Contractors, but it seems somebody doesn’t consider IT systems to be important enough to warrant bothering about for the 120-odd hours a week that the Helpdesk isn’t open.
If you have a mission-critical piece of kit, worth tens of thousands of pounds in direct replacement cost alone, and the power to it is compromised, do you leave it to chance that the generator will start up fine and dandy, having not tested it on full load? If you have a mission-critical piece of kit, do you let it get to the state where a back-up takes 2 full weeks to complete and is useless if it hasn’t completed successfully? If you have a mission-critical piece of kit, do you take back-ups with no idea how to use them?
It’s all well and good asking people to do a bit of research, but unfortunately there isn’t a helpful website with:
Joe Bloggs – Responsible for Group Shite-wise.
Jon Bloggs (Estates) – Responsible for feeding the hamster & replacing the AA battery every 3 months.
Jan Bloggs – Responsible for the whole shebang and retiring at Christmas.
Jeff Bloggs – Responsible for finding some poor schmuck dumb enough to take over from Jan Bloggs running a department whose reputation is in tatters.
TBA – Responsible for Exchange. To apply for this post contact Personnel.
Jim Bloggs – Responsible for the migration to Exchange taking 12 months (and counting) longer than originally stated.
Jo Bloggs – Responsible for information about the who, what, when, where and why of systems coming back up not being available.
Yes it’s easy to write outraged rants. But that’s what we, the users, are – outraged. Outraged at one more instance where we have been let down by the technology that is supposed to make things easier and quicker for us. Outraged at one more instance where we have been let down by the people that are employed to make sure that that technology is working and fit for purpose. Outraged that our complaints about Groupwise over the last 2 and a half years have fallen on deaf ears and the Exchange Project hasn’t been fully implemented. Outraged that we have to go, cap in hand, to everybody that might have sent us an important email in the past 5 weeks to ask them to resend them. Outraged that it took a week to get a working email account. Outraged that there was insufficient up to date information about the ongoing problems. Outraged that the people responsible keep passing the buck and outraged that nobody in the Administration seems to give a damn enough to do anything about it save create yet another Committee that meets once a term.
21 Nov 2006, 11:56
To be honest, I don’t know why everyone is blaming ITS for a power cut beyond their control.
Shouldn’t we be wondering why we still have power cuts in 2006 at the UK’s 5th ‘best’ university?
I guess the UPS is kinda ITS’ problem, but still…
27 Nov 2006, 18:07
Sure, you can’t blame ITS for the power cut, but you can’t blame the University either. Heck, you can’t even blame the power company. Power cuts happen – equipment breaks, accidents happen, vandals break in, thefts occur. These are all inescapable and have nothing to do with the day and age that we live in, or our geographical location. The blame is being laid at the feet of ITS because of their failure to adequately manage the situation, and because of the failure of the systems that they put in place.
28 Nov 2006, 11:39