Encodings of Unicode/ISO-10646

Unicode/UCS-2 text is intended to be represented internally (for processing) as 16-bit values. For transmission and external storage a number of byte encodings have been proposed.

Direct Encodings

The simplest is a direct representation as two bytes, most significant first, as specified in ISO-10646, and in the UNICODE-1-1 scheme registered for use with MIME (Multipurpose Internet Mail Extensions).
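
As a minimal sketch of the direct two-byte representation described above, the following Python functions write each 16-bit code as two bytes, most significant first. The function names `encode_ucs2_be` and `decode_ucs2_be` are illustrative only and do not come from ISO-10646 or the MIME registration.

```python
def encode_ucs2_be(text: str) -> bytes:
    """Encode text as two bytes per character, most significant byte first."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp > 0xFFFF:
            # Codes beyond 16 bits cannot be stored directly in UCS-2.
            raise ValueError(f"U+{cp:04X} does not fit in 16 bits")
        out.append(cp >> 8)    # most significant byte
        out.append(cp & 0xFF)  # least significant byte
    return bytes(out)


def decode_ucs2_be(data: bytes) -> str:
    """Decode a big-endian stream of 16-bit codes back into text."""
    if len(data) % 2:
        raise ValueError("expected an even number of bytes")
    return "".join(
        chr((data[i] << 8) | data[i + 1]) for i in range(0, len(data), 2)
    )
```

For example, `encode_ucs2_be("A")` yields the two bytes 00 41.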

A variant (see informative Annex F of ISO-10646) lets the order of the two bytes depend on the machine that produced the text, provided that the stream begins with the character

FEFF
ZERO WIDTH NO-BREAK SPACE (formerly BYTE ORDER MARK)

which establishes the byte order: read with its bytes swapped it would appear as FFFE, which is not a legal code.
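
The byte-order check can be sketched as follows. This is only an illustration under the assumption that the stream always starts with the FEFF character; the names `detect_byte_order` and `decode_with_bom` are hypothetical.

```python
def detect_byte_order(data: bytes) -> str:
    """Return 'big' or 'little' according to the leading FEFF character."""
    if data[:2] == b"\xFE\xFF":
        return "big"     # FEFF read as written: most significant byte first
    if data[:2] == b"\xFF\xFE":
        return "little"  # would read as the illegal FFFE, so bytes are swapped
    raise ValueError("stream does not begin with FEFF")


def decode_with_bom(data: bytes) -> str:
    """Decode 16-bit codes using the byte order given by the first character."""
    order = detect_byte_order(data)
    body = data[2:]  # skip the byte order mark itself
    if len(body) % 2:
        raise ValueError("expected an even number of bytes")
    return "".join(
        chr(int.from_bytes(body[i:i + 2], order))
        for i in range(0, len(body), 2)
    )
```

Both `b"\xFE\xFF\x00A"` and `b"\xFF\xFEA\x00"` decode to the same single character `A`.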

UTF-1

The UCS Transformation Format (UTF, or UTF-1) for ISO-10646 (see informative Annex G of ISO/IEC 10646-1:1993) specifies a variable-length encoding avoiding C0, C1, DEL and SPACE octets, but appears to be dead already.

UTF-8

Another encoding, proposed by X/Open, is variously called UTF-8, FSS-UTF, UTF or UTF-2. It has been proposed as normative Annex P to ISO/IEC 10646. See also Hello World, by Rob Pike and Ken Thompson. Some properties of this encoding:
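
As an illustration that is not taken from the source, the sketch below uses Python's built-in `utf-8` codec to check two widely documented properties of the encoding: ASCII characters are encoded as themselves in a single byte, and the bytes of a multi-byte sequence never fall in the ASCII range. The helper names are hypothetical.

```python
def utf8_lengths(text: str) -> list[int]:
    """Number of bytes UTF-8 uses for each character of text."""
    return [len(ch.encode("utf-8")) for ch in text]


def multibyte_avoids_ascii(text: str) -> bool:
    """True if every byte encoding a non-ASCII character is 0x80 or above."""
    return all(
        b >= 0x80
        for ch in text
        if ord(ch) >= 0x80
        for b in ch.encode("utf-8")
    )


sample = "A\u00e9\u4e2d"                      # Latin A, e-acute, a CJK ideograph
lengths = utf8_lengths(sample)                # one, two and three bytes
ascii_unchanged = "A".encode("utf-8") == b"A"
safe = multibyte_avoids_ascii(sample)
```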

UTF-7

UNICODE-1-1-UTF-7 (UTF-7), an encoding of UCS-2 using only mail-safe bytes, has been registered for use with MIME. Some properties:

Conversion functions for UTF-8 and UTF-7 are available from the Unicode site.
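
As a rough illustration, Python's standard `utf-7` codec can stand in for such conversion functions. The sketch below approximates "mail-safe" as "every byte is below 0x80", which is a simplifying assumption rather than the full definition, and `is_seven_bit` is a hypothetical helper name.

```python
def is_seven_bit(data: bytes) -> bool:
    """True if no byte has its high bit set (a rough proxy for mail-safe)."""
    return all(b < 0x80 for b in data)


text = "Hi \u4e2d\u6587"          # ASCII followed by two CJK ideographs

as_utf7 = text.encode("utf-7")    # non-ASCII runs become '+...-' base64 blocks
as_utf8 = text.encode("utf-8")    # uses bytes with the high bit set

utf7_is_seven_bit = is_seven_bit(as_utf7)
utf8_is_seven_bit = is_seven_bit(as_utf8)
round_trip = as_utf7.decode("utf-7")
```

Here the UTF-7 form stays within 7 bits and decodes back to the original text, while the UTF-8 form of the same text does not stay within 7 bits.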

UTF-16

UTF-16 (formerly UCS-2E) is not a byte encoding, but a scheme for representing certain UCS-4 codes in a UCS-2 stream. (Recall that no UCS-4 codes outside UCS-2 have yet been defined.) It has been proposed as normative Annex O to ISO/IEC 10646. Characters in the range 10000 to 10FFFF are represented by a pair of O-zone codes, the first in the range D800-DBFF and the second in the range DC00-DFFF.
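
The mapping between a code in the range 10000 to 10FFFF and its pair of 16-bit codes can be sketched as below. The arithmetic is the standard UTF-16 surrogate calculation rather than something spelled out in the text above, and the function names are hypothetical.

```python
def to_pair(code: int) -> tuple[int, int]:
    """Split a code in 10000..10FFFF into (first, second) 16-bit codes."""
    if not 0x10000 <= code <= 0x10FFFF:
        raise ValueError(f"{code:X} is outside the range 10000..10FFFF")
    offset = code - 0x10000              # 20-bit value
    first = 0xD800 + (offset >> 10)      # top 10 bits, lands in D800-DBFF
    second = 0xDC00 + (offset & 0x3FF)   # low 10 bits, lands in DC00-DFFF
    return first, second


def from_pair(first: int, second: int) -> int:
    """Recombine a (first, second) pair into the original code."""
    if not 0xD800 <= first <= 0xDBFF:
        raise ValueError("first code must be in D800-DBFF")
    if not 0xDC00 <= second <= 0xDFFF:
        raise ValueError("second code must be in DC00-DFFF")
    return 0x10000 + ((first - 0xD800) << 10) + (second - 0xDC00)
```

For instance, `to_pair(0x10000)` gives `(0xD800, 0xDC00)` and `to_pair(0x10FFFF)` gives `(0xDBFF, 0xDFFF)`, the two ends of the ranges.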


Part of Notes on CJK Character Codes and Encodings.