Encodings of Unicode/ISO-10646

Unicode/UCS-2 text is intended to be represented internally (for processing) as 16-bit values. For transmission and external storage a number of byte encodings have been proposed.

Direct Encodings

The simplest is a direct representation as two bytes, most significant first, as specified in ISO-10646, and in the UNICODE-1-1 scheme registered for use with MIME (Multipurpose Internet Mail Extensions).
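
The direct representation can be sketched in a few lines of Python. This is only an illustration, not code from either specification; the function names encode_ucs2_be and decode_ucs2_be are made up for this example.

```python
def encode_ucs2_be(codes):
    """Encode 16-bit code values as two bytes each, most significant first.

    Hypothetical helper for illustration only.
    """
    out = bytearray()
    for code in codes:
        if not 0 <= code <= 0xFFFF:
            raise ValueError(f"not a 16-bit value: {code:#x}")
        out.append(code >> 8)    # most significant byte first
        out.append(code & 0xFF)  # then the least significant byte
    return bytes(out)


def decode_ucs2_be(data):
    """Inverse of encode_ucs2_be: read byte pairs back into 16-bit values."""
    if len(data) % 2:
        raise ValueError("byte stream has an odd length")
    return [(data[i] << 8) | data[i + 1] for i in range(0, len(data), 2)]
```

For instance, the codes 0041 and 65E5 become the four bytes 00 41 65 E5.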

A variant (see informative Annex F of ISO-10646) allows the order of the two bytes to depend on the machine that produced the text, provided that the stream begins with the character FEFF, which will establish the byte order, since the byte-swapped value FFFE is not a legal code.
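
A reader of such a stream might detect the byte order roughly as follows. This is a minimal sketch: the function name is hypothetical, and falling back to most-significant-first when no signature is present is an assumption made here, not something the annex is quoted as requiring.

```python
def decode_ucs2_with_signature(data):
    """Decode a two-byte-per-character stream, using a leading FEFF
    signature (if any) to choose the byte order.

    Hypothetical helper for illustration only.
    """
    if len(data) % 2:
        raise ValueError("byte stream has an odd length")
    if data[:2] == b"\xFE\xFF":
        order, start = "big", 2      # FEFF read as-is: most significant first
    elif data[:2] == b"\xFF\xFE":
        order, start = "little", 2   # bytes arrive swapped: least significant first
    else:
        order, start = "big", 0      # assumption: no signature means MSB first
    return [int.from_bytes(data[i:i + 2], order)
            for i in range(start, len(data), 2)]
```

The signature itself is dropped from the result, so both byte orders yield the same list of codes.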


The UCS Transformation Format (UTF, or UTF-1) for ISO-10646 (see informative Annex G of ISO/IEC 10646-1:1993) specifies a variable-length encoding avoiding C0, C1, DEL and SPACE octets, but appears to be dead already.


Another encoding, proposed by X/Open, is variously called UTF-8, FSS-UTF, UTF or UTF-2. It has been proposed as normative Annex P to ISO/IEC 10646. See also Hello World, by Rob Pike and Ken Thompson. Some properties of this encoding:
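
As an illustration of how the encoding works for 16-bit values, the sketch below maps one UCS-2 code to one, two or three bytes. The function name utf8_encode_ucs2 is made up for this example, and the longer sequences used for UCS-4 codes beyond 16 bits are deliberately left out.

```python
def utf8_encode_ucs2(code):
    """Encode a single 16-bit code value as 1 to 3 UTF-8 bytes.

    Hypothetical helper for illustration only; codes above FFFF
    are not handled here.
    """
    if not 0 <= code <= 0xFFFF:
        raise ValueError(f"not a 16-bit value: {code:#x}")
    if code < 0x80:
        # 0xxxxxxx: ASCII values are passed through as a single byte
        return bytes([code])
    if code < 0x800:
        # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code >> 6),
                      0x80 | (code & 0x3F)])
    # 1110xxxx 10xxxxxx 10xxxxxx
    return bytes([0xE0 | (code >> 12),
                  0x80 | ((code >> 6) & 0x3F),
                  0x80 | (code & 0x3F)])
```

Under this scheme the code 65E5 becomes the three bytes E6 97 A5, while 0041 stays the single byte 41. Every byte of a multi-byte sequence has its high bit set, so ASCII bytes never appear inside the encoding of a non-ASCII character.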


UNICODE-1-1-UTF-7 (UTF-7), an encoding of UCS-2 using only mail-safe bytes, has been registered for use with MIME. Some properties:

Conversion functions for UTF-8 and UTF-7 are available from the Unicode site.
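
The mail-safe property of UTF-7 can be checked with the "utf-7" codec in Python's standard library. Note the assumption: that codec follows the later RFC 2152 definition, which may differ in detail from the scheme registered as UNICODE-1-1-UTF-7, so this is only an approximation of the behaviour described here. The function name to_mail_safe is hypothetical.

```python
def to_mail_safe(text):
    """Encode text with Python's built-in UTF-7 codec.

    Hypothetical helper for illustration; the codec implements
    RFC 2152 UTF-7, assumed here to be close enough to the
    registered UNICODE-1-1-UTF-7 scheme for a demonstration.
    """
    return text.encode("utf-7")


sample = "Hi \u65e5\u672c\u8a9e"          # ASCII followed by three CJK characters
encoded = to_mail_safe(sample)

# Every output byte is 7-bit, so it survives mail transport unchanged.
is_seven_bit = all(b < 0x80 for b in encoded)

# Decoding restores the original text.
round_trip = encoded.decode("utf-7")
```

ASCII letters pass through unchanged, while the non-ASCII characters are carried in a base64-style run introduced by a plus sign.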


UTF-16 (formerly UCS-2E) is not a byte encoding, but a scheme for representing certain UCS-4 codes in a UCS-2 stream. (Recall that no UCS-4 codes outside UCS-2 have yet been defined.) It has been proposed as normative Annex O to ISO/IEC 10646. Characters in the range 10000 to 10FFFF are represented by a pair of O-zone codes, the first in the range D800-DBFF and the second in the range DC00-DFFF.
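
The arithmetic behind the pairs can be sketched as follows. The function names are invented for this example; the mapping shown is the usual one, in which 10000 is subtracted and the remaining 20 bits are split into two halves of 10 bits each.

```python
def to_surrogate_pair(code):
    """Split a code in the range 10000..10FFFF into two 16-bit codes.

    Hypothetical helper for illustration only.
    """
    if not 0x10000 <= code <= 0x10FFFF:
        raise ValueError(f"outside 10000..10FFFF: {code:#x}")
    offset = code - 0x10000              # 20-bit value, 0..FFFFF
    high = 0xD800 | (offset >> 10)       # top 10 bits, in D800..DBFF
    low = 0xDC00 | (offset & 0x3FF)      # bottom 10 bits, in DC00..DFFF
    return high, low


def from_surrogate_pair(high, low):
    """Recombine a pair produced by to_surrogate_pair."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
```

The two ends of the range map to the pairs D800 DC00 (for 10000) and DBFF DFFF (for 10FFFF), which shows that the 1024 x 1024 possible pairs cover the range exactly.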

Part of Notes on CJK Character Codes and Encodings.