Wednesday, May 30, 2007

Byte Order Mark (BOM) Tales

Recently I encountered a problem that I had never seen before. The system displayed no error, only this weird character sequence: ÿþ. I remember searching the log files for clues, but nothing could be related to these characters. Later I found out that ÿþ is an encoding signature for a file, called a byte order mark or BOM; in this case the file was an XSL transformation file. Generally, it is a particular sequence of bytes at the beginning of a file that indicates the encoding and the byte order. Each encoding has its own BOM, for example:

UTF-8 - EF BB BF - ï»¿

UTF-16LE - FF FE - ÿþ

UTF-16BE - FE FF - þÿ
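As a rough illustration of the list above, here is a minimal Python sketch that checks the first bytes of a file against those three signatures. The function names `detect_bom` and `bom_as_seen_by_legacy_tool` are made up for this example, and the sketch deliberately ignores UTF-32 (whose little-endian BOM also starts with FF FE), so treat it as a starting point rather than a complete detector.

```python
import codecs

# BOM byte sequences from the list above, paired with their encoding names.
BOMS = [
    (codecs.BOM_UTF8, "UTF-8"),         # EF BB BF
    (codecs.BOM_UTF16_LE, "UTF-16LE"),  # FF FE
    (codecs.BOM_UTF16_BE, "UTF-16BE"),  # FE FF
]

def detect_bom(data: bytes):
    """Return the name of the encoding whose BOM starts `data`, or None."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None

def bom_as_seen_by_legacy_tool(data: bytes) -> str:
    """Show how the BOM looks when a tool reads the raw bytes as Latin-1."""
    for bom, _ in BOMS:
        if data.startswith(bom):
            return bom.decode("latin-1")
    return ""

# Hypothetical sample: a UTF-16LE stylesheet saved with its BOM.
sample = codecs.BOM_UTF16_LE + "<xsl:stylesheet/>".encode("utf-16-le")
print(detect_bom(sample))                          # UTF-16LE
print(bom_as_seen_by_legacy_tool(sample) == "ÿþ")  # True
```

Reading the two BOM bytes as Latin-1 is what produces the ÿþ pair that showed up on my screen.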

The W3C specifies that every XML processor must be able to read the UTF-8 and UTF-16 encodings. The XML specification explains that a UTF-16 document must begin with a BOM, which is how the parser tells it apart from UTF-8 (where the BOM is optional), and that the parser must treat the BOM as an encoding signature rather than as content. Other encodings may be supported, but no parser is required to support all of them, or any one in particular, besides UTF-8 and UTF-16.

This encoding signature is not meant to be displayed. Any tool that supports Unicode will understand it and will neither show it to you nor consider it part of the text. Viewing the file in a hex editor, or opening it in a non-Unicode text editor, will show you the characters listed above.

Now I knew what the characters meant, but how could I relate this information to the problem at hand? Read on...

With IE and MSXML, two errors come up very often when a file's actual encoding, its BOM, and its encoding declaration do not agree.

An invalid character was found in text content

The parser found a character in your file that does not match the encoding declaration or the BOM of that file. If you have characters encoded as ISO-8859-1 and then specify UTF-8 in the encoding declaration, the parser will raise this error.
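The mismatch is easy to reproduce outside MSXML. The sketch below uses plain Python decoding, not MSXML itself, and a made-up document: the text is saved as ISO-8859-1 while the declaration claims UTF-8. The helper name `decodes_as` is an assumption for this example.

```python
# Hypothetical document: declares UTF-8 but is saved as ISO-8859-1.
document = '<?xml version="1.0" encoding="UTF-8"?><name>José</name>'
raw = document.encode("iso-8859-1")  # "é" becomes the single byte E9

def decodes_as(data: bytes, encoding: str) -> bool:
    """Return True if `data` is a valid byte sequence in `encoding`."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False

print(decodes_as(raw, "utf-8"))       # False: E9 followed by "<" is not valid UTF-8
print(decodes_as(raw, "iso-8859-1"))  # True: the encoding the file really uses
```

A parser that trusts the declaration hits the same invalid byte, which is roughly what the "invalid character" message is reporting.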

Switch from current encoding to specified encoding not supported

At a basic level this error is almost identical to the previous one; the only difference is that the parser has worked out that the real encoding of the file is different from the one in the encoding declaration. What the error is trying to tell you is that it can't switch from the file's actual encoding to the one you specified in the encoding declaration.

Say a file is saved as UTF-8 but its encoding declaration says UTF-16. Since UTF-16 always uses two-byte code units, the parser knows beforehand that something is wrong with the encoding.
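One way to picture how a parser can know this up front: in a one-byte-per-character file the text `<?` starts with the bytes 3C 3F, while in UTF-16 each of those characters is paired with a 00 byte. The following is only a sketch of that idea, not what MSXML actually does internally; `sniff_codec` and `declaration_conflicts` are names invented for this example.

```python
import re

def sniff_codec(data: bytes) -> str:
    """Guess from the first bytes whether the file uses two-byte units."""
    if data.startswith(b"\xff\xfe") or data.startswith(b"<\x00?\x00"):
        return "utf-16-le"
    if data.startswith(b"\xfe\xff") or data.startswith(b"\x00<\x00?"):
        return "utf-16-be"
    # Any ASCII-compatible encoding; Latin-1 is used only because it never
    # fails to decode, which is enough to read the declaration.
    return "latin-1"

def declaration_conflicts(data: bytes) -> bool:
    """True if the declaration and the byte layout disagree about UTF-16."""
    codec = sniff_codec(data)
    head = data[:200].decode(codec, errors="replace")
    match = re.search(r'encoding=["\']([A-Za-z0-9._-]+)["\']', head)
    if match is None:
        return False  # no declaration, so nothing to contradict
    declared_utf16 = match.group(1).lower().startswith("utf-16")
    return declared_utf16 != codec.startswith("utf-16")

# Hypothetical file: saved as UTF-8, but the declaration says UTF-16.
bad = '<?xml version="1.0" encoding="UTF-16"?><a/>'.encode("utf-8")
good = '<?xml version="1.0" encoding="UTF-8"?><a/>'.encode("utf-8")
print(declaration_conflicts(bad))   # True
print(declaration_conflicts(good))  # False
```

The check only compares "two-byte layout or not" against the declaration; it cannot tell UTF-8 from ISO-8859-1, which is why that case surfaces as the first error instead.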

So in my case it was the first one: characters encoded as ISO-8859-1 with UTF-8 specified in the encoding declaration. Case closed.

Note: If none of this helps, consider reinstalling the version of the MSXML parser that your code targets, because each version has different requirements.
