All 35 entries tagged Tech


October 07, 2006

Edgy trivia

  • If you want a kernel post-2.6.17.7 to boot into X on a Toshiba laptop, modify /etc/modprobe.d/toshiba_acpi.modprobe so it says
    options toshiba_acpi hotkeys_over_acpi=0

(Toshiba hotkey support is broken in recent kernels.)
Update: fixed in 2.6.17.10.29 :-)

  • Flashplayer-nonfree works on XGL again! Just make sure you've got 24-bit colour and all the latest xserver-xorg packages, and it's all good. Ah, YouTube, I've missed you…

September 19, 2006

Unicode, UCS, and encodings

About once a year or so I come up against some multi-byte text issue that requires me to learn a bit more about how unicode works. Each time, I come away from it thinking ‘yeah; I understand this now’. So far this has always turned out to be a bit wrong. However, I’ve moved forwards a little bit today, by coming to grips with what the UCS is.

This episode was sparked by a question from Hongfeng, who asked ‘how come our web pages are served as iso-8859-1, but they can still display Chinese characters?’

It's a good question; iso-8859-1 is a very small and parochial encoding, designed specifically for efficient coding of Western European text. It specifies encodings for the letters of the Roman alphabet, plus their accented versions and common symbols. But, unlike a multibyte encoding such as UTF-16, it doesn't know anything about non-European alphabets. So when the browser sees a sequence like ‘&#1488;’, how does it know that it should convert that into an ‘aleph’ (א) glyph?

The answer is the UCS. The UCS is the uber-charset: a set of about a hundred thousand characters covering every alphabet from Klingon to Olde English. Each character in each alphabet is given a unique number (a code point) in the UCS, so there's never any question of people disagreeing on whether 1488 = ‘B’ or 1488 = ‘aleph’. (Though it's quite possible that two code points might refer to two indistinguishable glyphs.)

iso-8859-1 and the other 8-bit charsets cover small subsets of the UCS, optimised for a particular kind of text, while utf-8 and utf-16 are variable-length encodings that can represent the whole thing. So if you want to send a page of French text, you can do it in far fewer bytes if you encode into iso-8859-1 than if you use UTF-*. But HTML gives you the option of specifying a raw UCS code point, via the &#{number}; notation. So as soon as you come up against a character that's not in your target charset, just look up its UCS code point and encode away.

Raw UCS code points are therefore somewhat more reliable than the alternatives. For instance, if you want ‘smart quotes’ in XHTML you can use the UCS code points 8220 and 8221 (&#8220; and &#8221;), or the raw UTF-8 byte sequences e2 80 9c and e2 80 9d, or the HTML entities &ldquo; and &rdquo;. But if you use the HTML entities you won't be able to parse your content as XML unless you predefine the entities wherever your markup is to be used; if you use the UTF-8 version you won't be able to use your markup on a page which isn't utf-8 (which is a pain when you're syndicating other people's data). If you use the numeric UCS version, it might cost you a few extra bytes per character, but it's universally re-usable. Which is good.
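
To make that concrete, here's a minimal Ruby sketch (the function is mine, not from any library) that escapes anything outside iso-8859-1 as a numeric character reference:

# Escape any character outside iso-8859-1 as a UCS numeric
# character reference, so the page can stay in a small charset
# but still carry Hebrew, smart quotes, and so on.
def escape_to_ncr(text)
  text.each_char.map do |ch|
    ch.ord < 256 ? ch : "&##{ch.ord};"
  end.join
end

puts escape_to_ncr("He said \u{201C}shalom\u{201D} (\u{5D0})")
# => He said &#8220;shalom&#8221; (&#1488;)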


September 12, 2006

Weird Apache / mod_jk / JBoss keepalive bug

A curious bug manifested itself on our production web server yesterday. The server runs Apache with MaxClients set to 500, and handles about 3 million hits per day. About 80% of requests are for static files; the remainder are dynamic, handled by mod_jk delegating to two JBoss instances (round-robin load balancing), each with 250 Tomcat worker threads.
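
For reference, a mod_jk setup along those lines looks something like this in workers.properties (the worker names and hosts here are made up, not our real config):

# Two AJP13 workers behind a round-robin load balancer
worker.list=loadbalancer

worker.jboss1.type=ajp13
worker.jboss1.host=app1.example.com
worker.jboss1.port=8009
worker.jboss1.lbfactor=1

worker.jboss2.type=ajp13
worker.jboss2.host=app2.example.com
worker.jboss2.port=8009
worker.jboss2.lbfactor=1

worker.loadbalancer.type=lb
worker.loadbalancer.balance_workers=jboss1,jboss2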

Now, round about mid-morning, we got a fairly big spike in the number of requests per second coming through, and very quickly the site became unresponsive. Apache reported about half of its workers being used for keepalives, and the remainder waiting on a response from JBoss.

JBoss reported that all of its mod_jk worker threads were busy, but only a very small number were actually involved in servicing requests; most were in a ‘K’ (keepalive) state. The load on the box was very light (only a few % CPU), and testing via the JBoss HTTP interface (which bypasses mod_jk) suggested that there were no problems actually handling requests.

Now, a mod_jk/AJP keepalive, as I understand it, isn't supposed to work like an HTTP keepalive. An Apache worker that's in a keepalive state is reserved for the exclusive use of the client (browser) that's connected to it; if you have too many distinct clients connecting and holding keepalives, you'll run out of httpds. An AJP keepalive, by contrast, is a keepalive between the Apache server and the JBoss server; although it knows the client's IP address, it shouldn't be limited to serving requests from that client. Even if it is, it should recognise when Apache terminates the keepalive connection with the client (after 15 seconds or 100 requests, in our case) and make itself available for other client connections.

However, in our setup that didn't seem to be the case. It seems that once an AJP connection is marked as holding a keepalive to a client IP address, it will not service requests from any other IP address for some relatively long period (not infinite, but much longer than Apache's, anyway). The result is that under load it doesn't take long for all of your worker threads to be tied to particular clients, waiting for the next request.

We solved the problem, rather kludgily, by simply disabling client keepalives in the web server. This makes the process of rendering pages slightly slower (since each request must set up and tear down a TCP connection for every image, stylesheet, and other static resource the page references) but it's not really noticeable. It's had the additional benefit that our Apache server has gone from having about 250 active httpds on average to about 50.
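
In httpd.conf terms, the workaround is just this (the 15-second / 100-request limits shown are the figures quoted above):

# Disable client keepalives: one request per TCP connection
KeepAlive Off

# Previously something like:
#   KeepAlive On
#   KeepAliveTimeout 15
#   MaxKeepAliveRequests 100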

A much better solution (apart from fixing mod_jk), IMO, would be to ditch mod_jk in favour of an HTTP load balancer like haproxy, which doesn't tie backend connections to particular clients. Then we could re-enable keepalives from the web server to the client, enabling fast loads for the 80% of static content, but disable them from the web server to the JBoss server, thus preventing JBoss from holding too many threads which are just sitting idle waiting on a client keepalive. Plus, switching to HTTP would allow us to do funky pipeline things like sticking a Squid cache between web server and JBoss, to further speed things up. And the overhead of TCP connection establishment, whilst it might be significant on a static request for a 2K stylesheet, simply doesn't figure in the time to render a page with 30-odd database queries and a stack of Java code behind it.
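
A rough sketch of what that might look like in haproxy's listen syntax (the names, addresses and ports are hypothetical, and this is untested):

listen jboss-farm 127.0.0.1:8000
    mode http
    balance roundrobin
    option httpclose               # no keepalives towards the backends
    server jboss1 10.0.0.1:8080 check
    server jboss2 10.0.0.2:8080 check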


August 01, 2006

Creeping Statefulness

A while ago, when we were designing Sitebuilder 2, one of our design goals was that the app server should be as stateless as it possibly could be. We wanted to end up in a situation where we could scale the app simply by bringing new servers online, with no need to replicate between them. We also wanted to be able to bring individual servers down for maintenance without users noticing, simply redirecting requests onto the remaining servers.

As far as the viewing of pages is concerned, this has worked out pretty well. There are occasional blips when our version of mod_jk fails to realise that a server has dropped out of the cluster, but they're rare and we could work around them quite easily if need be.

But for editing we haven't quite realised our goals. It started out very well, but we were seduced by a bit of technology that nearly did what we needed, but not quite: Spring Web Flow.

What SWF does (amongst other things) is let you associate an arbitrary bunch of objects with a business process. So if you were, say, uploading and unpacking a zip file onto the server, the process might include three steps: you upload the file, then you choose which files go where, then you're told which ones were successfully uploaded and which weren't.

There's quite a bit of state associated with that process, and SWF does kind of solve the problem of how you could do step 1 on server 1, step 2 on server 2, and step 3 on server 3. It does this by using 'client continuations': all the server-side objects needed for the process are serialized, the resulting bytes are written into a hidden field on the form, and hence re-submitted when the user performs the next step.
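
The continuation trick itself is simple enough. Here's a minimal Ruby sketch of the idea (mine, not SWF's actual implementation; and note that Marshal-ing client-supplied data like this is unsafe unless you also sign it):

require 'base64'

# Serialize the conversation state into a hidden form field...
def continuation_field(state)
  blob = Base64.strict_encode64(Marshal.dump(state))
  %(<input type="hidden" name="_continuation" value="#{blob}"/>)
end

# ...and restore it, on whichever server the next request lands on.
def restore_continuation(params)
  Marshal.load(Base64.strict_decode64(params["_continuation"]))
end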

So far so good. But the first hurdle comes when you've got a lot of server-side state: like, say, a 200MB zip file full of MP3s. If you try serializing that back to the client, you'll have a lot of network IO, plus you'll have to post all of your forms as multiparts, which is a bit bogus.

So, when we serialized our objects, we wrote all the files out to some shared file storage, so that any node in the cluster could pick them up (for purposes of disaster planning, our shared storage never fails ;-) ).

So far so good; now we have clients with almost all of the server-side state they need to continue the process, and the rest of the state is shared amongst all the nodes.

But spotting the files to store on the server is kind of tricky; sometimes they're buried at the bottom of an object graph in hard-to-find places. So some clever chap hit upon the realisation that, if we're relying on a shared file system for some of our state, we might as well rely on it for all of the state. So instead of serializing the objects and sending them back to the client, we serialize them all to disk, and just send the client a pointer to the file on disk. All good?

Well, then along comes the next problem. Someone adds a non-serializable attribute to one of these objects. Of course, it's buried at the bottom of a huge graph of objects whose main job is something completely different to holding conversational state, so no-one spots it until it gets deployed live, and suddenly all kinds of edit operations are throwing NotSerializableExceptions. Great. Sorry, everyone.

So, we write some mildly heroic custom infrastructure that looks through the objects it's about to serialize, spots any non-serializable classes, and calls a special beforeSerialize / afterDeserialize hook to allow the object to convert itself properly, passing it whatever service objects it might need.

Now is it all fixed? Well, no, actually it's not. Because if we release a new version of the code, and the new version changes one of the serialized classes, then anyone who's in the middle of an editing process when we release is going to find that their nicely serialized state is no longer compatible, and is going to be staring at something like local class incompatible: stream classdesc serialVersionUID = -6586187098630577013, local class serialVersionUID = -8144714009347234947. Great.

There has got to be a better solution to this; and I can't help thinking that it probably just involves a form with a big bunch o' hidden fields with all the previously-submitted data in. Oh well, back to the old skool we go…


June 29, 2006

How to look an idiot in 1 easy step

(I debated whether this was one of those things better kept a secret, but in the end I decided it was more fun to share…)

So, we have a little system (based on Nagios) that we use for monitoring all the various servers and applications that we run in webdev. It tracks the status of about 60 assorted web services of one sort or another and, amongst other things, keeps logs of performance.

Now, a couple of weeks ago I noticed that some of our services seemed to have got a bit more 'spiky'. Not by much – about 20 milliseconds or so more latency, and not terribly consistent.

I didn't pay it too much attention – it wasn't enough to be visible to end users, and we often get short periods of lag like this when the network's playing up.

Then today, it suddenly stopped, and the graph went flat again. Curious, I thought. So I rang up a colleague in the network team, and asked if they'd changed anything that might have improved performance for us. They hadn't, but requested a bunch of extra information so they could do some more diagnostics.

I started gathering some more data, and only then noticed that all of our services seemed to have got 20ms faster at almost exactly the same time: 10am. Surely they must have done something to the network at Westwood (where the monitor is) to have affected so many boxes at once?

Then suddenly I had one of those 'oh bugger' moments, as I recalled going into our machine room at about 9:45 that morning, and noticing that someone (probably me) had left a console logged in to the monitoring server, which was now happily spinning away a screensaver. I logged it out at, yes, exactly 10 am.

So all the lag was in fact nothing whatsoever to do with the network; it was the monitoring server trying to task-switch between an OpenGL screensaver and a timed TCP connection. Ooooopsie. Time for a hasty message to the network team to ask them not to look at it any further, and to apologise for wasting their time.

And the moral of this story? There isn't one really. 'Don't run screensavers on your production boxes' is a bit too obvious, isn't it? In my defence, I'd point out that the CPU load the screensaver imposes is tiny (just a couple of percent), but it's a single-CPU box, and the cost of context-switching back and forth between X and the monitoring process is, oh, I don't know, about 20 milliseconds?

sigh…


June 07, 2006

(nearly) All binary searches are broken

Writing about web page http://googleresearch.blogspot.com/2006/06/extra-extra-read-all-about-it-nearly.html

I love this. The standard algorithm for binary searches (and merge sorts, and other divide-and-conquer algorithms) is broken, and has been for approximately the last 60 years, and only in the last couple of years has anyone actually noticed.

Granted it's only broken for humungous arrays, but still, you'd think someone would have spotted it, given that it's taught on day one of pretty much every comp sci degree in the universe. Tells you something about the value of formal proofs, I guess :-)
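
For the record, the bug is the midpoint calculation: with fixed-width integers, mid = (low + high) / 2 overflows and goes negative once the array has more than about 2^30 elements. A sketch of the corrected form in Ruby (whose arbitrary-precision integers can't actually overflow, so this is purely illustrative):

def binary_search(arr, key)
  low, high = 0, arr.length - 1
  while low <= high
    # The safe midpoint; (low + high) / 2 is the broken form in
    # fixed-width-integer languages like C or Java.
    mid = low + (high - low) / 2
    case arr[mid] <=> key
    when -1 then low = mid + 1
    when 1  then high = mid - 1
    else return mid
    end
  end
  nil
end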


May 04, 2006

csv-to-ical in ruby

I've inherited a list of SSL certificates that we maintain for various (about 30-odd) different servers. What's irksome about it is that I need to remember in advance to renew the certificates, and the Excel spreadsheet I've got isn't much good for that. Clearly, something calendar-based would be simpler. Excel will give me a CSV easily enough, but how to get that into iCal / Evolution / Google Calendar / whatever?

Simple. Watch:

require 'csv'
require 'date'
require 'rubygems'
require_gem 'icalendar'

cal = Icalendar::Calendar.new

# Each row: cname, owner, server type, ..., cert ID, ..., expiry (dd/mm/yyyy)
CSV::Reader.parse(File.open("certs.csv")) do |row|
  cname      = row[0]
  owner      = row[1]
  serverType = row[2]
  certID     = row[8]
  expiry     = Date.strptime(row[10], "%d/%m/%Y")

  # One event per certificate, two weeks before it expires
  reminder = Icalendar::Event.new
  reminder.summary     = "#{cname} Certificate expiry"
  reminder.uid         = certID
  reminder.dtstart     = expiry - 14
  reminder.dtend       = expiry - 14
  reminder.description = "#{cname} (#{serverType}) expires #{expiry}\n#{owner}"
  cal.add reminder
end

puts cal.to_ical
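
Run it and stick the output somewhere web-visible (the script name is just whatever you saved it as):

ruby certs2ical.rb > certs.ics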


Then all I need to do is chuck the results on the web somewhere and not only can I subscribe to it, but so can anyone else who wants to know when their certificates expire.

Seems to me that strikes the balance between a write-only perl one-liner and a bazillion lines of Java. Yay for ruby.


April 27, 2006

SOA integration with Flickr and del.icio.us

Writing about web page http://blog.labnotes.org/2006/04/26/soa-integration-with-flickr-and-delicious/


Very Well Made. :-)


March 20, 2006

T2000: Disappointing Sun box

Update: added Sparc 3 numbers for comparison.

I've been doing some benchmarking on a super-spanky new Sun T2000 server, to see whether or not it might make a good replacement for some of our other kit. Alas, the results are not what I wanted to see…
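
For concreteness, the tests were plain ApacheBench runs of this shape (the URL and request count here are placeholders; -c went up to 15 and 30 for the later runs):

ab -n 500 -c 1 http://testbox.example.com/sitebuilder2/some-page.html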

Using ApacheBench with a single thread, to test a Sitebuilder 2 page request, I get the following:
On the production server (4 × dual-core Opterons):

Percentage of the requests served within a certain time (ms)
50% 194
66% 197
75% 200
80% 203
90% 209
95% 221
98% 239
99% 249
100% 249 (longest request)
On the pre-production server (2 × dual-core Opterons):
Percentage of the requests served within a certain time (ms)
50% 191
66% 197
75% 197
80% 202
90% 209
95% 227
98% 233
99% 238
100% 238 (longest request)
On a dual UltraSPARC III box (1.2 GHz):
Percentage of the requests served within a certain time (ms)
50% 435
66% 441
75% 447
80% 450
90% 458
95% 466
98% 616
99% 676
100% 676 (longest request)
On the test box (1 × T1, 8 cores, 4-way CMT):
Percentage of the requests served within a certain time (ms)
50% 471
66% 471
75% 472
80% 477
90% 489
95% 501
98% 531
99% 531
100% 531 (longest request)

I guess this shouldn't really be surprising. The Opterons are optimised for all-out single-threaded speed, and this test is doing pretty much exactly that. (Worth noting in passing that I actually used the Oracle DB from pre-production on the test box, so the test is slightly unfair on the pre-prod box.) The T1 seems to be about 2.5 times slower than the Opteron; it's just a shame that the times are a bit too slow for my app. It's slightly surprising, too, that the Sparc 3 box was virtually as quick as the T1, despite being 4 years old.

With multiple threads, the comparison against the 2-way box becomes a bit more even. 15 concurrent requests is a good average workload for us in production.

Quad opteron – 15 threads

Percentage of the requests served within a certain time (ms)
50% 257
66% 269
75% 278
80% 286
90% 307
95% 323
98% 362
99% 524
100% 733 (longest request)
dual opteron – 15 threads
Percentage of the requests served within a certain time (ms)
50% 668
66% 808
75% 872
80% 927
90% 1052
95% 1161
98% 1293
99% 1400
100% 1697 (longest request)
Sparc 3 – 15 threads
Percentage of the requests served within a certain time (ms)
50% 4762
66% 5747
75% 6157
80% 6443
90% 7169
95% 7727
98% 8239
99% 8617
100% 9556 (longest request)
T1 – 15 threads
Percentage of the requests served within a certain time (ms)
50% 642
66% 675
75% 689
80% 697
90% 718
95% 738
98% 763
99% 812
100% 852 (longest request)

But alas, 600 ms is way above our acceptable threshold for render times for this app. So it seems that this is going to be a box that's good for apps with a high throughput but a relatively low amount of work per request (or a relatively relaxed response-time requirement). The question I now need to ask is: have I got any of those?

Just to round out the picture further: at 30 threads and above, the T2000 starts to really overtake the dual Opteron. (I didn't do a 30-thread comparison against the Sparc 3 box because it would clearly be out of its depth, nor against the production box for fear of breaking it for real users!)

dual opteron – 30 threads:

Percentage of the requests served within a certain time (ms)
50% 1362
66% 1619
75% 1787
80% 1889
90% 2111
95% 2437
98% 2933
99% 5813
100% 11594 (longest request)
T1 - 30 threads
Percentage of the requests served within a certain time (ms)
50% 1157
66% 1183
75% 1200
80% 1209
90% 1293
95% 1424
98% 1779
99% 1965
100% 2760 (longest request)

February 27, 2006

CPAN envy

Writing about web page http://www.cpan.org

Make no mistake, perl is a hideous swamp-creature of a language. But CPAN is just the absolute shiznit when it comes to package download and installation. Forget all this JAR-file dependency nonsense: just pick any arbitrary programming task, dig about a bit on search.cpan.org to find the appropriate package (someone, somewhere, will have already written it), and do 'install Package::WhatIWant'.
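
That is, from the CPAN shell (Package::WhatIWant being the placeholder above):

$ perl -MCPAN -e shell
cpan> install Package::WhatIWant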

With the possible exception of Crypt::* on the Mac (it doesn't seem to be able to compile stuff for darwin), it just works every time. I wish Java had something 1/20th as easy and reliable as this.

Sigh…

