Character encoding, Unicode and UTF–8
Writing about web page http://www.joelonsoftware.com/articles/Unicode.html
When you're dealing with reading data from various sources and then end up doing some processing on it and display it on the web, most of the time you don't worry about character encoding. However, occasionally it comes along and bites you.
I always used to know that there were different character encodings and you could end up not displaying international characters properly if you used the wrong type and so on, but I didn't really know about it in depth. This is where good old Joel comes in. He wrote an article a while back entitled:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets . He does a pretty good job of explaining things.
My specific problem was an international students name was coming out of our directory (NDS) like this H??hner. It turns out they are actually called Hühner. So that one character was being turned into ??. No good. Usually I just say "oh, some character encoding problem" and give up. But sadly I was determined to get to the bottom it. Upon closer inspection, the ?? were an artifact of appear on the web (different encoding again), but in my java code, their name was: H├╝hner. Nice.
Doing an ethereal trace on the traffic to my machine when I queried NDS for this person, I saw that:
48 e2 94 9c e2 95 9d 68 6e 65 72
seemed to represent our users name. This is hex and having a look at some character encoding charts, it turns out that this is UTF-8. Is there an easy way of fiddling about with different encoding in java…not that I can find. So, following the instructions on UTF-8 encodings from here I worked out that in Unicode that UTF-8 sequence is:
0x48, 0x251C, 0x255D, 0x68, 0x6e, 0x65, 0x72
Which does indeed turn into H├╝hner. So, nothing was wrong in my code and it proved that NDS was storing something obsure. Pleasingly, a quick email to our friendly systems team with this evidence and they got it fixed and are now going through the directory trying to fix bad entries and work out where this strange encoding is coming from. Hopefully our international students will soon no longer be seeing their names scrambled :)
Geek talk over.