Difference between revisions of "Text Representation"
Revision as of 21:04, 9 April 2013
For most purposes it is quite simple to deal with text, or at least it seems quite simple. But from time to time you run into situations that make you realize that there are some non-trivial complications.
This article tries to explain the non-trivialities.
But let us start with the universal simplicity:
- Text is a sequence of characters.
It is this simple observation that makes most text manipulation quite simple. Staying within a single programming language, on a single machine, for a single user, ... will typically mean that everything just works.
But when transferring text things often break down, because:
- what exactly is a sequence?
- and what exactly is a character?
Sequence
Sequencing is typically done by placing the characters next to each other. The "difficult" part is to determine when the sequence terminates. In a file this can be when the file ends. In other cases there may be a specific fixed number of characters. In Visual Prolog (like in C/C++) strings are normally zero-terminated, meaning that zero is a reserved character which does not occur in normal text and which is then used to indicate string termination, i.e. after the last character in the actual string there is a zero char which indicates that there are no more characters in the string. This representation is used both for 8 bit strings and 16 bit strings.
COM (and Visual Basic) use a string representation that starts with a 32-bit number giving the length of the string, followed by the characters (in COM's BSTR format the number is the length of the character data in bytes). This representation is only used for 16-bit strings. Since the length of the string is known, such strings can contain zero characters even though they are not valid characters. This is sometimes used (if something is possible, somebody will do it).
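The two termination schemes can be compared in a small Python sketch. The function names are hypothetical, the characters are stored as 16-bit little-endian units, and the length prefix counts bytes; these are choices made for the illustration, not a description of any particular runtime.

```python
import struct

def to_zero_terminated(text: str) -> bytes:
    """16-bit units followed by a 16-bit zero terminator."""
    if "\x00" in text:
        raise ValueError("a zero-terminated string cannot contain a zero character")
    return text.encode("utf-16-le") + b"\x00\x00"

def from_zero_terminated(data: bytes) -> str:
    """Read 16-bit units until the first unit that is zero."""
    for i in range(0, len(data) - 1, 2):
        if data[i:i + 2] == b"\x00\x00":
            return data[:i].decode("utf-16-le")
    raise ValueError("no terminator found")

def to_length_prefixed(text: str) -> bytes:
    """32-bit little-endian byte count followed by the 16-bit units."""
    payload = text.encode("utf-16-le")
    return struct.pack("<I", len(payload)) + payload

def from_length_prefixed(data: bytes) -> str:
    """Read the 32-bit byte count, then exactly that many bytes."""
    (length,) = struct.unpack_from("<I", data, 0)
    return data[4:4 + length].decode("utf-16-le")
```

With the length-prefixed form, `from_length_prefixed(to_length_prefixed("a\x00b"))` gives back all three characters, whereas the zero-terminated form has to reject (or silently truncate) such a string.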
Character sets
In the old days memory and storage were expensive and data exchange between different cultures was rare. Today memory and storage are much cheaper (per storage unit) and data exchange between different cultures is much more common.
During this change numerous text representations have existed; some are still in use and others are forgotten.
Character representation can be divided into the following three categories:
- Single-byte character set: Each character is represented as a byte (8 bit number)
- Multibyte character set: Each character is represented as one or more bytes
- Wide character set: Each character is represented as one or more words (16 bit numbers)
Single-byte character sets are a special case of multibyte character sets where the "or more" possibility is not used. The main reason for mentioning single-byte character sets is to introduce the notion of codepages. Codepages are a concept Microsoft uses for handling character sets. A codepage is a map from numbers to the characters they represent; it could for example say that the number 87 represents the character 'W'. Many different codepages exist, to deal with different languages but also for many other reasons (different communities used their own representations).
So a codepage describes the encoding of characters as bytes (and vice versa), and various codepages have been given names and ids (i.e. numbers) so that you can specify which character set you mean.
Microsoft has generalized the idea of codepage names and codepage ids to also identify character set encodings that are not really based on a codepage. I.e. multibyte character sets also have a codepage name and a codepage id even though the encoding is not described by a codepage.
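Microsoft's codepage names and ids cover character set encodings used in many different contexts, and the character sets may have different names in other contexts. Codepage "Windows 1252" (id=1252) is for example nearly the same as iso-8859-1 (the two differ only in the byte range 0x80-0x9F, where Windows 1252 has printable characters and iso-8859-1 has control codes), and web browsers treat the HTML name iso-8859-1 as Windows 1252: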
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
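A codepage can be thought of as a lookup table from byte values to characters. The sketch below uses Python's built-in codecs, where the names `cp1252` and `cp1251` stand for the Windows codepages 1252 and 1251; the helper name `decode_with_codepage` is hypothetical.

```python
def decode_with_codepage(raw: bytes, codepage: str) -> str:
    """Map each byte to the character that the given codepage assigns to it."""
    return raw.decode(codepage)

# The number 87 represents 'W' in codepage 1252.
print(decode_with_codepage(bytes([87]), "cp1252"))

# The same byte value means different characters in different codepages:
# 0xE6 is U+00E6 (ae ligature) in 1252 and U+0436 (Cyrillic zhe) in 1251.
print(decode_with_codepage(b"\xe6", "cp1252"))
print(decode_with_codepage(b"\xe6", "cp1251"))
```

This is why a byte sequence on its own is not enough to transfer text: the receiver also needs to know which codepage the sender used.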
While wide character sets are described above as a general concept, the only relevant usage is for Unicode, which we will discuss in the next section.
Unicode
Unicode defines a character set, but there are several different representations of this character set. Often Unicode is confused with one of its representations. But once you realize what this means you get a much better understanding of it. I hope to contribute to this understanding; pay attention to this statement:
- Unicode is an "abstract" character set with several alternative representations
The Unicode character set is divided into 17 planes, each containing 65,536 (2^16) code points. This gives a total of 17*2^16 = 1,114,112 code points. Each code point could be a character, but some of them are reserved for other special purposes and the vast majority of the code points are currently unspecified (i.e. left for future usage/definition).
If we write the code points in hexadecimal they will range from 0x000000 to 0x10FFFF. I.e. the first two hex-digits give the plane and the next four the code point within that plane.
The representation utf32 represents each code point as a 32-bit number in that range (0-0x10FFFF). It is seldom used for more than one character (because each character will occupy 32 bits, of which 11 are always zero), but for dealing with individual characters it is convenient to have the character as a single number.
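The plane arithmetic can be checked with a few lines of Python. The function name `plane_and_offset` is hypothetical; `ord` returns the code point of a character, which is the same number utf32 stores.

```python
from typing import Tuple

def plane_and_offset(code_point: int) -> Tuple[int, int]:
    """Split a code point into (plane, position within the plane)."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("outside the Unicode range")
    return code_point >> 16, code_point & 0xFFFF

# U+1D11E (a musical clef symbol) lies in plane 1 at position 0xD11E.
clef = "\U0001D11E"
print(plane_and_offset(ord(clef)))            # (1, 53534), i.e. (1, 0xD11E)

# As utf32 (big-endian here) the code point is simply written as 4 bytes.
print(clef.encode("utf-32-be").hex())         # 0001d11e
```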
You may wonder (at least I did at a certain point) why 17 planes? Couldn't 13 or 16 be just as good (10 of the planes are not currently in use)? The reason for this and many other things is: even though "Unicode is an abstract character set with many representations", it is actually designed for one specific wide character representation, i.e. a specific 16-bit character representation. In this representation the most frequently used characters occupy one 16-bit number, and all the rest fit within two 16-bit numbers known as a surrogate pair (or just surrogates). Each half of a surrogate pair carries 10 bits, so the pairs can address 2^20 = 16*2^16 code points, i.e. 16 planes in addition to the first plane of one-word characters, which gives 17.
For this representation the following is the case: When looking at a single 16bit number in a string you can determine:
- whether it is one of the frequent one-word characters
- whether it is the first number of a surrogate pair
- whether it is the second number in a surrogate pair
It can also be a Byte-Order-Mark (BOM), which is also distinct from all the others (or fall into some other special categories).
So Unicode is designed to do most things in 16bit and the rest in 32bits, and at the same time making it possible to immediately fall in sync when you start the interpretation at a random word.
This particular wide character representation is the one that Windows and subsequently Visual Prolog uses.
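The following Python sketch illustrates how a single 16-bit number can be classified and how a surrogate pair maps to a code point. The function names are hypothetical; the ranges 0xD800-0xDBFF (first number of a pair) and 0xDC00-0xDFFF (second number of a pair) are the ones Unicode reserves for surrogates.

```python
from typing import Tuple

def classify_unit(unit: int) -> str:
    """Classify one 16-bit number without looking at its neighbours."""
    if 0xD800 <= unit <= 0xDBFF:
        return "first of surrogate pair"
    if 0xDC00 <= unit <= 0xDFFF:
        return "second of surrogate pair"
    return "one-word character"

def to_surrogates(code_point: int) -> Tuple[int, int]:
    """Split a code point above 0xFFFF into its surrogate pair."""
    value = code_point - 0x10000          # 20 bits remain
    return 0xD800 + (value >> 10), 0xDC00 + (value & 0x3FF)

def from_surrogates(first: int, second: int) -> int:
    """Combine a surrogate pair back into a code point."""
    return 0x10000 + ((first - 0xD800) << 10) + (second - 0xDC00)

first, second = to_surrogates(0x1D11E)
print(hex(first), hex(second))            # 0xd834 0xdd1e
```

Because the three ranges do not overlap, a reader that starts at a random word only has to skip a word classified as "second of surrogate pair" to be in sync again.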
utf16 & utf16BE
Communication between computers is always based on sequences of bytes. So if a computer writes the byte sequence 0x01 0x02 0x03 0x04 any other computer should also read it as 0x01 0x02 0x03 0x04.
Multibyte character sets define characters as one or more bytes, and since they also define the order in which the bytes come, the transfer between computers gives the expected result.
But when transferring wide character sets you need to transfer two bytes for each wide char, and the question is then which one should be transferred first. If you have a Unicode string encoded in the wide character format with surrogate pairs mentioned above, and for each 16-bit number first transmit the least significant byte followed by the most significant byte, then you have the format utf16 (which for example in HTML is called Unicode; it is also known as utf16LE, little-endian). If you on the other hand first transfer the most significant byte and then the least significant byte, then you have the format utf16BE (aka utf16 big-endian).
So in a Windows computer a wide character string is the same as a multibyte string in utf16 format. You should however notice that 8-bit string routines will not work on such strings, because utf16 strings can contain zero-bytes, which in other 8-bit string formats are considered to terminate the string.
When writing a file in utf16 or utf16BE format you can write a byte-order-mark (BOM) at the beginning of the file. The byte order mark is the number 0xFEFF. If it is written in little-endian mode the first byte will be 0xFF and the second will be 0xFE; in big-endian mode they come in the opposite order, so the order of these two bytes shows whether the file is in little-endian or big-endian format. This works because the byte-swapped value 0xFFFE is never a valid character, and 0xFEFF is not used as an ordinary visible character. It is only customary to use a BOM in files, and not for example when transmitting HTML pages from a web server; software that does not expect a BOM may show it as two strange characters.
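A short Python sketch of the two byte orders and of BOM detection follows. The function name `detect_utf16_order` is hypothetical; `codecs.BOM_UTF16_LE` and `codecs.BOM_UTF16_BE` are the standard library's constants for the two byte orders of 0xFEFF. The sketch only looks at the first two bytes and does not try to guess the order when no BOM is present.

```python
import codecs

def detect_utf16_order(data: bytes) -> str:
    """Use a leading byte-order-mark to tell the two utf16 variants apart."""
    if data.startswith(codecs.BOM_UTF16_LE):   # bytes 0xFF 0xFE
        return "little-endian"
    if data.startswith(codecs.BOM_UTF16_BE):   # bytes 0xFE 0xFF
        return "big-endian"
    return "unknown"

little = "Hi".encode("utf-16-le")   # least significant byte first
big = "Hi".encode("utf-16-be")      # most significant byte first
print(little)                       # b'H\x00i\x00'
print(big)                          # b'\x00H\x00i'

print(detect_utf16_order(codecs.BOM_UTF16_LE + little))   # little-endian
print(detect_utf16_order(codecs.BOM_UTF16_BE + big))      # big-endian
```

The zero-bytes visible in both outputs are the reason 8-bit zero-terminated string routines stop too early on such data.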
utf8
utf8 is another interesting multibyte encoding of Unicode. utf8 encodes each code point as 1 to 4 bytes (the original design allowed up to 6 bytes, but code points up to 0x10FFFF never need more than 4), so surrogate pairs in the wide character (utf16) format are first converted into (utf32) code points and then into utf8. When interpreting utf8 from a random byte it is possible to get in sync from the next code point, i.e. it is at most necessary to skip the rest of the current multibyte character before coming in sync.
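Getting in sync works because every byte after the first byte of a multibyte character has the bit pattern 10xxxxxx, which no first byte has. A minimal Python sketch, with hypothetical function names:

```python
def is_continuation(byte: int) -> bool:
    """True for bytes of the form 10xxxxxx, which never start a character."""
    return (byte & 0xC0) == 0x80

def resync(data: bytes, start: int) -> int:
    """Return the index of the first character boundary at or after start."""
    i = start
    while i < len(data) and is_continuation(data[i]):
        i += 1
    return i

data = "a\u20acb".encode("utf-8")     # 'a', the euro sign (3 bytes), 'b'
position = resync(data, 2)            # index 2 is in the middle of the euro sign
print(position)                       # 4
print(data[position:].decode("utf-8"))   # b
```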
utf8 does not contain zero-bytes (except as the encoding of the zero character itself), so it can use zero termination.
Characters in 7-bit ASCII are represented as a single byte in utf8; other characters require longer sequences. So text in western languages is represented quite compactly, whereas some other languages are more compact in utf16.
utf8 files can also contain a byte-order-mark 0xEF 0xBB 0xBF (it is not used to indicate byte order, rather to indicate utf8 format).
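The size difference and the utf8 byte-order-mark can both be observed from Python. The function name `encoded_sizes` is hypothetical; `codecs.BOM_UTF8` and the `utf-8-sig` codec (which drops a leading BOM when decoding) are part of the standard library.

```python
import codecs

def encoded_sizes(text: str) -> dict:
    """Number of bytes the text occupies in utf8 and in utf16 (without BOM)."""
    return {
        "utf8": len(text.encode("utf-8")),
        "utf16": len(text.encode("utf-16-le")),
    }

print(encoded_sizes("hello"))                 # {'utf8': 5, 'utf16': 10}
print(encoded_sizes("\u65e5\u672c\u8a9e"))    # three Japanese characters: {'utf8': 9, 'utf16': 6}

# The utf8 BOM is the code point 0xFEFF encoded as utf8.
print(codecs.BOM_UTF8)                        # b'\xef\xbb\xbf'
print((codecs.BOM_UTF8 + b"hi").decode("utf-8-sig"))   # hi
```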