Martin Probst's weblog

XML encoding problems

Wednesday, February 9, 2005, 11:21, 3 comments

Attractive Nuisance contains a link to an interesting slideshow about XML encoding problems, starting with charsets and moving on to attribute order, whitespace, entities, double escaping, etc.

The slide titled "QNames" just reads "don't even get me started". If everyone understands that XML namespaces are broken, why doesn't anyone actually do something about it? The W3C seems to have failed on this issue, with ambiguous URIs, QNames that are difficult to make out, and a really strange approach to namespace scoping.

My problem is not with namespaces in general, but with a specific application, best described by an example. Take a look at

The envelope in the first example declares five namespaces. Only four are ever used, but that is OK. I’d like to focus on the xsd namespace. It is not used in association with the name of any element, or with the name of any attribute; it is only used WITHIN A STRING.

A general purpose XML parser can help by keeping track of namespaces and expanding the prefixes for you. However, it can’t know whether or not the colon within the string is significant. It can only provide a parsed document. This means that not only does the parser need to keep track of namespaces, but the caller does too.

That is indeed another “double-escaping” problem. Though a schema-aware parser should read that and interpret it as an element whose contents are an instance of xsd:string, shouldn’t it?

Essentially a “schema-aware processor” is the bulk of what a web service toolkit is, and I contributed heavily to one of these: Apache Axis.

The short answer to your question is: yes, but a more complete answer is that writing such a schema-aware parser turns out to be a real PITA. Axis is written in Java, and can use a JAXP-compliant SAX parser (like Apache Xerces) which takes care of a lot of namespace administrivia. A good portion of this must be duplicated in Apache Axis in order to handle this one attribute.
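To make the duplicated bookkeeping concrete, here is a rough sketch of what a caller has to do on top of a namespace-aware SAX parser in order to resolve a QName that appears inside an attribute value. It is not how Axis does it; the class name QNameInContent, the method resolveFirstXsiType and the sample document are made up for illustration. The only library pieces used are the standard JAXP/SAX classes, including the NamespaceSupport helper.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.NamespaceSupport;

// Hypothetical helper, for illustration only.
public class QNameInContent {
    static final String XSI = "http://www.w3.org/2001/XMLSchema-instance";

    /** Returns "{namespace-uri}local-name" for the first xsi:type value, or null. */
    public static String resolveFirstXsiType(String xml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        final String[] result = new String[1];

        DefaultHandler handler = new DefaultHandler() {
            // The parser already tracks prefixes for element and attribute
            // names; this second copy exists only for prefixes inside values.
            private final NamespaceSupport ns = new NamespaceSupport();
            private boolean needContext = true;

            @Override
            public void startPrefixMapping(String prefix, String uri) {
                // Mappings are reported before the startElement they belong to.
                if (needContext) {
                    ns.pushContext();
                    needContext = false;
                }
                ns.declarePrefix(prefix, uri);
            }

            @Override
            public void startElement(String uri, String local, String qName,
                                     Attributes atts) {
                if (needContext) {
                    ns.pushContext();
                }
                needContext = true;
                String value = atts.getValue(XSI, "type");
                if (value != null && result[0] == null) {
                    // To the parser the value is just a string; we expand it here.
                    String[] parts =
                        ns.processName(value.trim(), new String[3], false);
                    if (parts != null) { // null means the prefix is undeclared
                        result[0] = "{" + parts[0] + "}" + parts[1];
                    }
                }
            }

            @Override
            public void endElement(String uri, String local, String qName) {
                ns.popContext();
            }
        };

        factory.newSAXParser()
               .parse(new InputSource(new StringReader(xml)), handler);
        return result[0];
    }

    public static void main(String[] args) throws Exception {
        // Made-up sample document, not the envelope discussed above.
        String xml = "<item xmlns:xsi='" + XSI + "' "
                   + "xmlns:xsd='http://www.w3.org/2001/XMLSchema' "
                   + "xsi:type='xsd:string'>hello</item>";
        System.out.println(resolveFirstXsiType(xml));
    }
}
```

The parser resolves the xsi prefix on the attribute name by itself, which is why getValue can be called with the namespace URI. The xsd prefix inside the value is invisible to it, so the handler has to shadow the parser's own scoping rules (push on start, pop on end) just to expand that one string.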