Unicode

XML was designed from the outset for worldwide use. It therefore builds on the relevant standards: Unicode and ISO/IEC 10646, which allow the letters and characters of all major writing systems to be encoded, and RFC 1766 for identifying languages.

The objective of the Unicode standard is to cover all writing systems of the world. For this purpose, each character is assigned a natural number, its codepoint. A codepoint is usually written in hexadecimal with the prefix U+ to indicate that it is a Unicode codepoint. U+0020, for example, represents the space character. The codepoints up to and including U+00FF correspond to those of the Latin-1 (ISO 8859-1) character set, which is very common in Western Europe. Within that character set, in turn, the characters up to U+007F correspond to the US-ASCII character set. XQuery, too, is based on Unicode.
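
The notation and the correspondence with Latin-1 and US-ASCII can be checked with a short Python sketch. The helper name `codepoint_label` is made up for this illustration; `ord`, `chr` and the `latin-1` and `ascii` codecs are part of the Python standard library.

```python
def codepoint_label(ch: str) -> str:
    """Return the U+XXXX notation for a single character."""
    # ord() yields the codepoint as an integer; at least four hex digits are shown.
    return f"U+{ord(ch):04X}"


print(codepoint_label(" "))  # U+0020
print(codepoint_label("é"))  # U+00E9

# Every Latin-1 byte value 0..255 decodes to the codepoint with the same number.
assert all(bytes([n]).decode("latin-1") == chr(n) for n in range(256))

# Every US-ASCII byte value 0..127 decodes to the codepoint with the same number.
assert all(bytes([n]).decode("ascii") == chr(n) for n in range(128))
```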

However, the assignment of codepoints does not specify how these numbers are represented in a computer. In fact, there are several encodings for Unicode. The two most important ones are described below:

  • UTF-16
    The most obvious method is to encode each character with the same number of bytes. Since the first versions of Unicode were restricted to codepoints up to 65535 (U+FFFF), it made sense to provide two bytes per character. Codepoints above 65535 are encoded as two 2-byte units: Unicode reserves a range of codepoints (U+D800 to U+DFFF) especially for this purpose, the so-called surrogates.
  • UTF-8
    For English texts in particular, UTF-16 doubles the size compared with an encoding in ASCII. UTF-8 is more economical here. It works as follows: codepoints below 128 are encoded in 1 byte, codepoints up to 2047 in 2 bytes, codepoints up to 65535 in 3 bytes, and all others in 4 bytes. As a result, texts that consist only of US-ASCII characters (codepoints below 128) can be processed as Unicode without conversion.
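
The two schemes described in the list can be sketched in a few lines of Python. The function names `utf8_length` and `utf16_units` are hypothetical helpers for this illustration, and the surrogate arithmetic (subtract 0x10000, then split into two 10-bit halves added to 0xD800 and 0xDC00) follows the standard UTF-16 rule rather than anything stated in the text. Each result is cross-checked against Python's built-in codecs.

```python
def utf8_length(cp: int) -> int:
    """Number of bytes UTF-8 needs for codepoint cp, using the ranges above."""
    if cp < 0x80:       # below 128
        return 1
    if cp < 0x800:      # up to 2047
        return 2
    if cp < 0x10000:    # up to 65535
        return 3
    return 4


def utf16_units(cp: int) -> list:
    """16-bit units for codepoint cp: one unit, or a surrogate pair above U+FFFF."""
    if cp < 0x10000:
        return [cp]
    offset = cp - 0x10000
    high = 0xD800 + (offset >> 10)    # upper 10 bits select the first surrogate
    low = 0xDC00 + (offset & 0x3FF)   # lower 10 bits select the second surrogate
    return [high, low]


for ch in ["A", "é", "€", "\U0001D11E"]:
    cp = ord(ch)
    units = utf16_units(cp)
    # Cross-check both helpers against the built-in codecs.
    assert utf8_length(cp) == len(ch.encode("utf-8"))
    assert b"".join(u.to_bytes(2, "big") for u in units) == ch.encode("utf-16-be")
    hex_units = " ".join(f"{u:04X}" for u in units)
    print(f"U+{cp:04X}: UTF-8 {utf8_length(cp)} byte(s), UTF-16 units {hex_units}")
```

The sketch does not reject the surrogate codepoints themselves, which are not valid characters on their own; a real encoder would have to.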

UTF-8 and UTF-16 are the two encodings that every XML processor must recognise. An XML processor may, of course, also recognise further encodings of other character sets.
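
As a small illustration of this requirement, the following sketch serialises the same document once in UTF-8 and once in UTF-16 and parses both with the XML parser from the Python standard library. The helper name `parse_encoded` and the sample element `msg` are assumptions made for this example only.

```python
import xml.etree.ElementTree as ET


def parse_encoded(text: str, encoding: str) -> str:
    """Wrap text in a tiny XML document, encode it, parse it, return the text."""
    doc = f'<?xml version="1.0" encoding="{encoding}"?><msg>{text}</msg>'
    data = doc.encode(encoding)  # Python's UTF-16 codec prepends a byte order mark
    return ET.fromstring(data).text


for encoding in ("UTF-8", "UTF-16"):
    # The parser must yield the same characters regardless of the byte encoding.
    assert parse_encoded("Grüße", encoding) == "Grüße"
```

The byte sequences differ between the two runs, but the parsed character content is identical, which is the point of separating codepoints from their encoding.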

 

Source: "XQuery – Grundlagen und fortgeschrittene Methoden", dpunkt-Verlag, Heidelberg (2004)
