So Jon Udell, in the midst of demonstrating the increasing maturity of query languages (XPath and XQuery) on XML databases, generates statistics on which of the bloggers he reads most frequently cite books on Amazon.com.
The data make it clear that I'm one of those who mine the zeitgeist, peppering my posts with literary and musical references. That's not news to anyone who reads me, and indeed it's for that reason that I moonlight at that sinister cabal called Blogcritics.
What drew my interest, however, was a little technical quirk. If you look at the results for this blog, you'll notice a link formatted as follows:
Cooking with Rokia TraorÃ©
Thus Traoré begat TraorÃ©.
I wondered: what happened here?
Now, I created my post in a text editor, NoteTab, on Windows. I have the muscle memory, so typing Alt + 0233 for the e-acute symbol (é) is no problem. I find it faster to type that than to recall the alternative HTML entity for that symbol.
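As a small illustration of the two routes to the same character, here is a Python sketch. It assumes the Windows code page in play is Windows-1252 (the usual one for Alt + 0233 on Western-locale Windows); the variable names are hypothetical and nothing here is what NoteTab or Blogger actually runs.

```python
import html
from html.entities import codepoint2name

# Alt + 0233 on Windows enters the character at position 233 (0xE9)
# of the active ANSI code page, assumed here to be Windows-1252.
code_point = 233
e_acute = chr(code_point)  # 'é', U+00E9

# The HTML alternatives: a named entity and a numeric character reference.
named_entity = "&" + codepoint2name[code_point] + ";"  # '&eacute;'
numeric_entity = f"&#{code_point};"                    # '&#233;'

# Both entities decode back to the very same character.
same_char = html.unescape(named_entity) == html.unescape(numeric_entity) == e_acute

# The byte representation depends on the encoding used to save the text.
cp1252_byte = e_acute.encode("cp1252")  # b'\xe9'      (one byte)
utf8_bytes = e_acute.encode("utf-8")    # b'\xc3\xa9'  (two bytes)

print(named_entity, numeric_entity, same_char, cp1252_byte, utf8_bytes)
```

The point of the sketch is that the character is the same however it is typed, but it is one byte in Windows-1252 and two bytes in UTF-8, which is where trouble can start.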
Blogger generates an Atom 0.3 XML feed encoded in UTF-8, and if I inspect the feed, I see the following:
type="text/html">Cooking with Rokia Traoré</title>
In other words, the Alt + 0233 character is preserved in the feed. Good. At least Blogger is out of the loop.
The question then is: how did an escaped Alt + 0233 become Ã© once it reached Jon's XML datastore and was extracted by an XQuery?
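One well-known way to get exactly this pair of characters is for the UTF-8 bytes of é to be read as if they were Windows-1252 (or Latin-1). The Python sketch below reproduces the symptom under that assumption. It does not show where in the chain between Blogger and Jon's query results the misreading happened, and the variable names are hypothetical.

```python
# The title as it appears in the UTF-8 feed.
title = "Cooking with Rokia Traoré"

# In UTF-8, 'é' is stored as the two bytes 0xC3 0xA9.
utf8_bytes = title.encode("utf-8")

# Assumed failure mode: some step decodes those bytes as Windows-1252,
# where 0xC3 is 'Ã' and 0xA9 is '©'.
mangled = utf8_bytes.decode("cp1252")
print(mangled)  # Cooking with Rokia TraorÃ©

# If that is the only damage, reversing the wrong step recovers the text.
repaired = mangled.encode("cp1252").decode("utf-8")
print(repaired)  # Cooking with Rokia Traoré
```

That the round trip restores the original is a hint, not proof: it only shows the symptom is consistent with a single UTF-8-read-as-single-byte mix-up somewhere downstream of the feed.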
Ever since I implemented the HTML export features in Freelance Graphics back in 1997, the bane of my existence in every project has been "special characters". It always comes at the end of the project, once the internationalization testing begins, and there are always three weeks or so of trying to fix issues that ultimately boil down to how "special characters" are treated. I made the mistake of writing up some memos on my experiences, and from there folks at work got the impression that I was on top of such things.
There's even a more recent pattern of errors that arises when one generates fragments of markup that are then aggregated, as in feeds and in portlets generating web pages. Here, you had better hope that the parent or outer container has 'done the right thing' and declared the right character encoding in the right place (HTTP header, meta tag or declaration), because it's too late by the time you are generating your output.
And to think that in this case it was just an é symbol I was curious about. It wasn't even a quote or an ampersand. Would it matter if I were posting in Arabic or Chinese? These are the types of character encoding issues that drive Sam Ruby and Mark Pilgrim (back when he was on the internet) to distraction.
Inquiring minds want to know: who mangled poor Rokia Traoré?
File under: web, technology, i18n, localization, programming, forensics, Rokia Traoré, internationalization, investigation, udell, toli