Martin Probst's weblog

Content sanitation, html5lib and Iñtërnâtiônàlizætiøn

Thursday, November 29, 2007, 16:13

As I wrote, I migrated to a handwritten blog engine mainly because I was unsatisfied with the way WordPress handled my content. So one of the goals was to handle any input HTML and Unicode characters properly.

Unicode

Unicode support turned out to be trickier than expected. I decided that for anything written mostly in a western European language, there is only one encoding and it’s called UTF-8. Debugging Unicode issues can be quite ugly, because it is often hard to find out how something is actually encoded. A useful utility is a hex editor, which shows what the bytes really are. Sadly, that doesn’t help much with MySQL.
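
When no hex editor is at hand, a few lines of Python can show the same information. This is only a minimal sketch and not part of the blog engine; hex_bytes and is_valid_utf8 are hypothetical helper names.

```python
def hex_bytes(data: bytes) -> str:
    """Render bytes the way a hex editor would, e.g. 'c3 9c'."""
    return " ".join(f"{b:02x}" for b in data)


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# The same character has different bytes in different encodings.
print(hex_bytes("Ü".encode("utf-8")))    # c3 9c
print(hex_bytes("Ü".encode("latin-1")))  # dc

# A lone Latin-1 byte is not valid UTF-8, which helps to tell the two apart.
print(is_valid_utf8(b"\xc3\x9c"))  # True
print(is_valid_utf8(b"\xdc"))      # False
```

The validity check is a heuristic: bytes that fail to decode are certainly not UTF-8, while bytes that do decode are only probably UTF-8.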

The first thing is to make sure that really every single part of the tool chain is Unicode aware. There is a nice collection of tips here. In my case, LC_ALL on my server had to be set to en_US.UTF-8, and my MySQL tables had somehow been created as non-Unicode. The original WordPress database had a totally bizarre mix of Unicode and non-Unicode columns in every table.

A very useful command in MySQL is

mysql> show create table <tablename>;
Watch out for the table's DEFAULT CHARSET and any per-column CHARACTER SET settings.

It is also important to run all mysql scripts with the proper character set, because the client appears to default to some Latin charset:

mysql -u … -p --default-character-set=utf8

HTML

I’m preprocessing all data from the outside using html5lib. html5lib will parse anything and produce a DOM tree that is similar to what a browser would create. I added some code to wrap plain text outside of block level elements in <p/> containers.
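
The wrapping step might look roughly like the following. This is not the blog engine's actual code: it works on standard-library ElementTree elements (html5lib can build such trees), it assumes un-namespaced tag names, and BLOCK_TAGS and wrap_loose_text are hypothetical names with an abbreviated tag list.

```python
import xml.etree.ElementTree as ET

# Abbreviated, assumed list of block-level tags; extend as needed.
BLOCK_TAGS = {"p", "div", "ul", "ol", "pre", "blockquote", "table",
              "h1", "h2", "h3", "h4", "h5", "h6"}


def wrap_loose_text(container):
    """Wrap text and inline elements directly under container in <p>."""
    new_children = []
    current = None  # the <p> currently being filled, if any

    def paragraph():
        nonlocal current
        if current is None:
            current = ET.Element("p")
            new_children.append(current)
        return current

    def append_text(text):
        # Skip empty text, and whitespace that would start a new <p>.
        if not text or (current is None and not text.strip()):
            return
        p = paragraph()
        if len(p):
            # ElementTree stores text after a child in that child's tail.
            p[-1].tail = (p[-1].tail or "") + text
        else:
            p.text = (p.text or "") + text

    append_text(container.text)
    container.text = None
    for child in list(container):
        tail, child.tail = child.tail, None
        if child.tag in BLOCK_TAGS:
            current = None  # a block element ends the open paragraph
            new_children.append(child)
        else:
            paragraph().append(child)
        append_text(tail)
    container[:] = new_children
    return container


body = ET.fromstring(
    "<body>Hello <em>world</em>!<div>block</div>tail text</body>")
print(ET.tostring(wrap_loose_text(body), encoding="unicode"))
# <body><p>Hello <em>world</em>!</p><div>block</div><p>tail text</p></body>
```

Only the direct children of the container are handled here; nested containers such as blockquote would need the same treatment applied recursively.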

It works nicely, although it’s quite slow. One caveat: to html5lib, UTF-8 is called ‘utf-8’, not ‘utf8’. You won’t notice your Babylonian problems until a German U-umlaut Ü shows up as the character ‘端’, probably because of some broken auto-detection.
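
Python's own codec registry treats both spellings as aliases, so one way to avoid the problem is to normalize an encoding name before handing it to a stricter library. The sketch below uses only the standard library; canonical_encoding is a hypothetical helper, and the EUC-JP explanation for ‘端’ is only a guess at what the auto-detection picked.

```python
import codecs


def canonical_encoding(name: str) -> str:
    """Map an alias such as 'utf8' to the codec's canonical name."""
    return codecs.lookup(name).name


print(canonical_encoding("utf8"))   # utf-8
print(canonical_encoding("UTF-8"))  # utf-8

# Mojibake: UTF-8 bytes decoded with the wrong codec.
utf8_bytes = "ü".encode("utf-8")    # b'\xc3\xbc'
print(utf8_bytes.decode("cp1252"))  # Ã¼
print(utf8_bytes.decode("euc_jp"))  # 端 (assumed cause of the symptom)
```

Note that it is the lowercase ü whose UTF-8 bytes (c3 bc) read as ‘端’ in EUC-JP; the uppercase Ü encodes to c3 9c instead.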

Anyways, now my database contents look good :-)

