Difference between revisions of "Unicode"

From wiki.visual-prolog.com

Revision as of 23:49, 3 February 2008

Unicode is an industry standard allowing computers to consistently represent and manipulate text expressed in most of the world's writing systems.

Encoding

Unicode can be encoded in several ways. In Windows (and thus Visual Prolog) the most interesting formats are UTF-8 and UTF-16 (Little Endian). UTF is an acronym for Unicode Transformation Format. UTF-8 is a variable-length encoding built on 8-bit units: standard ASCII characters are kept as single bytes with their usual values, and characters that are rarer (in Western languages) are encoded as sequences of two, three or four bytes.
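The variable length of UTF-8 can be illustrated with a short Python sketch. The helper name `utf8_hex` and the sample characters are chosen here for illustration only; they are not part of Visual Prolog.

```python
def utf8_hex(ch: str) -> str:
    """Return the UTF-8 bytes of ch as space-separated uppercase hex."""
    return " ".join(f"{b:02X}" for b in ch.encode("utf-8"))


# Sample characters (illustrative): one each needing 1, 2, 3 and 4 bytes.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

for ch in samples:
    encoded = utf8_hex(ch)
    n_bytes = len(ch.encode("utf-8"))
    print(f"U+{ord(ch):04X}: {n_bytes} byte(s): {encoded}")
```

The ASCII letter keeps its single-byte value, while the other characters expand to multi-byte sequences.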

In UTF-16 Little Endian, each Unicode character is encoded as one or two 16-bit units; characters outside the Basic Multilingual Plane take two units (a surrogate pair). UTF-16 is used internally in Windows API calls. Little Endian means that the least significant byte of each 16-bit unit is stored before the most significant byte.
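As a rough illustration of the unit count and the byte order, the following Python sketch splits UTF-16 Little Endian data into its 16-bit units. The function name `utf16le_units` is hypothetical and used only for this example.

```python
def utf16le_units(text: str) -> list[int]:
    """Return the 16-bit units of text when encoded as UTF-16 Little Endian."""
    data = text.encode("utf-16-le")
    # Each unit occupies two bytes, least significant byte first.
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]


# "A" (U+0041) fits in one unit; its bytes are stored as 41 00.
print("A".encode("utf-16-le").hex(), [hex(u) for u in utf16le_units("A")])

# U+1F600 lies outside the Basic Multilingual Plane and needs two units.
emoji = "\U0001F600"
print(emoji.encode("utf-16-le").hex(), [hex(u) for u in utf16le_units(emoji)])
```

In both cases the low byte of each unit comes first in the byte stream, which is what distinguishes Little Endian from Big Endian.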

Byte-order mark

A byte-order mark (BOM) is the Unicode character with hex value 0xFEFF.

It is optionally written at the start of Unicode files to clarify the byte order of the file.

Encoding               Representation (hexadecimal)
UTF-16 Big Endian      FE FF
UTF-16 Little Endian   FF FE

It is also optionally written at the start of UTF-8 files to indicate that the UTF-8 format is used. In UTF-8 the BOM is encoded as the three bytes EF BB BF. It can also be used in other Unicode transformation formats, such as UTF-32.
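The byte sequences above can be used to guess the encoding of a file from its first bytes. The following Python sketch is a minimal example under stated assumptions: the function name `detect_bom` is hypothetical, and only the three encodings discussed here are checked (UTF-32, whose Little Endian BOM also begins with FF FE, is deliberately ignored).

```python
import codecs

# BOM byte sequences, using the constants from Python's standard codecs module.
BOMS = [
    (codecs.BOM_UTF8, "utf-8"),          # EF BB BF
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE
]


def detect_bom(data: bytes):
    """Return (encoding name, BOM length), or (None, 0) if no known BOM is found.

    Assumption: UTF-32 is not considered, so UTF-32 Little Endian data
    would be reported as UTF-16 Little Endian.
    """
    for bom, name in BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


# Usage: skip the BOM (if any) before decoding the rest of the data.
raw = codecs.BOM_UTF16_LE + "hello".encode("utf-16-le")
encoding, bom_len = detect_bom(raw)
if encoding is not None:
    print(encoding, raw[bom_len:].decode(encoding))
```

Because the BOM is optional, a result of `(None, 0)` does not mean the data is not Unicode; the caller then has to fall back on a default or on other information about the file.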
