Monday, February 21, 2005

Udell vs. Traoré?

So Jon Udell, in the midst of demonstrating the increasing maturity of query languages (XPath and XQuery) on xml databases, generates statistics of bloggers he reads who most frequently cite books on

In analyzing the data, it is clear that I'm one of those who mines the zeitgeist, peppering my posts with literary and musical references. That's not news to anyone who reads me and indeed it's for that reason that I moonlight at that sinister cabal called Blogcritics.

What drew my interest however was a little technical quirk. If you look at the results for this blog you'll notice a link formatted as follows.

Cooking with Rokia Traoré

Thus Traoré begat

I wondered: what happened here?

Now I created my post in a text editor, Notetab, on Windows. I have muscle memory so that typing Alt + 0233 for the e-acute symbol (é) is no problem. I find it faster to type that than to recall the alternative html entity for that symbol.

Blogger generates an Atom 0.3 xml feed encoded in UTF-8 and if I inspect the feed, I see the following:
<title mode="escaped" 
type="text/html">Cooking with Rokia Traoré</title>

In other words, the Alt + 0233 character is preserved in the feed. Good. At least Blogger is out of the loop.
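For anyone who wants to repeat that kind of check, here is a minimal Python 3 sketch. The feed fragment is a hand-made stand-in for the real Blogger feed, and the names `SAMPLE_FEED` and `title_code_points` are made up for illustration. It only shows that a UTF-8 encoded title carries é as the two bytes C3 A9 and parses back to the single code point U+00E9.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the feed: one title element, encoded as UTF-8.
SAMPLE_FEED = (
    '<?xml version="1.0" encoding="UTF-8"?>'
    '<title mode="escaped" type="text/html">Cooking with Rokia Traor\u00e9</title>'
).encode("utf-8")


def title_code_points(feed_bytes):
    """Parse the bytes as XML; return the title text and its last code point."""
    root = ET.fromstring(feed_bytes)
    text = root.text
    return text, "U+%04X" % ord(text[-1])


if __name__ == "__main__":
    # On the wire the e-acute is two bytes; after parsing it is one character.
    print(b"\xc3\xa9" in SAMPLE_FEED)
    print(title_code_points(SAMPLE_FEED))
```

If the parsed title ends in U+00E9, the feed itself is fine and the damage happened further downstream.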

The question then is how did an escaped Alt + 0233 become
once it reached into Jon's xml datastore and was extracted by an xquery?

Ever since I implemented the html export features in Freelance Graphics back in 1997, the bane of my existence in every project has been "special characters". It's always at the end of the project once the internationalization testing begins and there's always 3 weeks or so of trying to fix issues that ultimately boil down to how "special characters" are treated. I made the mistake of writing up some memos on my experiences and from there folks at work got the impression that I was on top of such things.

But it's not clear that there's any real answer to encoding and character issues. Sometimes it's just a simple bug; more often, though, it requires forensic investigation and following a data trail. It's all about the confusing interactions between different markup languages whether html or xml, the programming languages used whether C, Java or Javascript (ever try using Javascript data arrays as your transfer format?), the formatting rules of your resource files, the transfer and wire protocols whether http or corba/iiop, configuration issues on server software, by-fiat decisions and bugs in various browsers (I'll just sniff some characters in your data and make a best guess as to the format and silently ignore the specified encoding), operating systems and database programs, to cite just a few of the actors in this area.

There's even a more recent pattern of errors that arises when one generates fragments of markup that are then aggregated, as in feeds and in portlets generating web pages. Here, you had better hope that the parent or outer container has 'done the right thing' and declared the right character encoding in the right place (http header, meta tag or declaration) because it's too late by the time you are generating your output.

And to think that in this case it's just an é symbol I was curious about. It wasn't even a quote or ampersand. Would it matter if I was posting in Arabic or Chinese? These are the types of character encoding issues that drive Sam Ruby and Mark Pilgrim (back when he was on the internet) to distraction.

Inquiring minds want to know: who mangled poor Rokia Traoré?



Anonymous said...

Who mangled him? Me, undoubtedly. I'll look into how and let you know.

Anonymous said...

Oops. The 'anonymous' should have been 'Jon Udell' :-)

Anonymous said...

Hmm. I just ran your Atom feed through the current version of Mark Pilgrim's feedparser. The parsed content, a Python Unicode object, contains:

Thus Traor&#195;&#169; begat


Anonymous said...

Jon, I'm not seeing that.

>>> import feedparser
>>> feedparser.__version__
>>> d = feedparser.parse('')
>>> d.entries[1].content[0].value
u'...Thus Traor\xe9 begat...'

Anonymous said...

> Who mangled him?

"he" is a she. a she with a beautiful voice.

Anonymous said...

Who mangled Jon Udell?

Jon Udell, of course.

The é character is U+00E9 in Unicode. In utf-8, that is encoded into two bytes: 0xC3 0xA9.

All of this was faithfully conveyed to Jon. Jon's page, however, is served as iso-8859-1. In iso-8859-1, each byte corresponds to a character. 0xC3=&#195; and 0xA9=&#169;.

So Jon's page returns garbage in place of the desired é.
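The byte arithmetic is easy to check. A minimal Python 3 sketch (the names `E_ACUTE`, `misread` and `describe` are just for illustration): encode é as utf-8, then read the same two bytes back as iso-8859-1.

```python
E_ACUTE = "\u00e9"  # the e-acute character

# utf-8 encodes U+00E9 as two bytes: C3 A9.
utf8_bytes = E_ACUTE.encode("utf-8")

# Read as iso-8859-1, each byte becomes its own character.
misread = utf8_bytes.decode("iso-8859-1")


def describe(text):
    """List the code points in text, e.g. ['U+00C3', 'U+00A9']."""
    return ["U+%04X" % ord(ch) for ch in text]


if __name__ == "__main__":
    print(utf8_bytes.hex())              # the two utf-8 bytes
    print(describe(misread))             # two separate characters
    print([ord(ch) for ch in misread])   # decimal values: 195 and 169
```

One character goes in, two come out, and their decimal values are exactly the 195 and 169 that show up in the mangled link.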

Anonymous said...

I should have said that not all mysteries are resolved.

If you ask Joe Sixpack what "0xC3 0xA9" looks like in iso-8859-1, he will, after a moment's hesitation, respond "Ã©".

Jon's software is, evidently, more sophisticated. It knows that these '8-bit' characters are not what is desired and escapes them.

How it knows this is still mysterious.
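One way such output could arise, offered purely as a guess and not as a description of Jon's actual software: if the utf-8 bytes are misread as iso-8859-1 and the result is then serialized as plain ASCII, any serializer that replaces non-ASCII characters with numeric character references will produce exactly this. A Python 3 sketch, where the function name `mangle` is hypothetical:

```python
def mangle(text):
    """Reproduce the symptom: utf-8 bytes misread as iso-8859-1,
    then written out as ASCII with numeric character references."""
    # Step 1: the misreading. One character (é) becomes two (U+00C3, U+00A9).
    misread = text.encode("utf-8").decode("iso-8859-1")
    # Step 2: ASCII-only output. Non-ASCII characters become &#NNN; references.
    return misread.encode("ascii", "xmlcharrefreplace").decode("ascii")


if __name__ == "__main__":
    print(mangle("Thus Traor\u00e9 begat"))
```

No sophistication is needed in that scenario. The escaping is just what an ASCII-safe serializer does with any character above 127.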