Unicode/ISO-10646

ISO 10646 was originally conceived as approximately a disjoint union of national standards, yielding a 32-bit internal code with compact stateful external encodings. Unicode was a competing standard, aimed at unifying identical characters from different variants of the same script, to give a 16-bit code. They have since been unified: ISO/IEC 10646-1:1993 has two forms, a 31-bit form (UCS-4) and a 16-bit form (UCS-2). UCS-2 is a subset of UCS-4 (plane 0 of group 0), and is code-for-code identical with Unicode 1.1 (though Unicode imposes extra semantics on some characters). The aim is to cover all the (reasonably widely-used) scripts in the world.

ISO/IEC 10646 was adopted as Chinese National Standard GB 13000 in December 1993. A Japanese translation is in the process of being adopted as a Japanese standard, JIS X0221.

Unicode 1.0 is described by the Unicode books. The changes needed to make Unicode identical to ISO 10646 UCS-2, yielding first Unicode 1.0.1 and then Unicode 1.1, are relatively minor and do not affect the hanzi set. They are available as Unicode Technical Report #4. Various tables and appendices describing Unicode 1.1, plus mapping tables to various CJK codes, are available at the Unicode site. See also the Unicode page maintained by Glenn Adams.

UCS-2/Unicode

The overall layout of this unified 16-bit code (also called the Basic Multilingual Plane or BMP of ISO-10646) is:

0000 - 1FFF
A-ZONE: Alphabets: The first 256 codes are identical to ISO 8859-1 (Latin-1), except that the 65 control codes are excluded.
2000 - 2FFF
A-ZONE: Symbols and Punctuation
3000 - 4DFF
A-ZONE: CJK Auxiliary
4E00 - 9FFF
I-ZONE: CJK Unified "Ideographs"
A000 - DFFF
O-ZONE: Reserved for future assignment (but see UTF-16 for a proposed use)
E000 - FFFD
R-ZONE: Restricted use - the codes FFFE and FFFF are excluded.
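
The layout above amounts to a small range table. A minimal sketch of a lookup over it follows; the names `BMP_ZONES` and `bmp_zone` are illustrative assumptions, not part of either standard, and the boundaries are simply copied from the list above.

```python
# Zone boundaries copied from the layout above (inclusive ranges).
# BMP_ZONES and bmp_zone are hypothetical names used for illustration.
BMP_ZONES = [
    (0x0000, 0x1FFF, "A-ZONE: Alphabets"),
    (0x2000, 0x2FFF, "A-ZONE: Symbols and Punctuation"),
    (0x3000, 0x4DFF, "A-ZONE: CJK Auxiliary"),
    (0x4E00, 0x9FFF, "I-ZONE: CJK Unified Ideographs"),
    (0xA000, 0xDFFF, "O-ZONE: Reserved"),
    (0xE000, 0xFFFD, "R-ZONE: Restricted use"),
]


def bmp_zone(code):
    """Return the zone label for a 16-bit code, or None for FFFE/FFFF."""
    if not 0 <= code <= 0xFFFF:
        raise ValueError("not a 16-bit code: %r" % (code,))
    for first, last, label in BMP_ZONES:
        if first <= code <= last:
            return label
    return None  # FFFE and FFFF are excluded from the R-ZONE
```

For example, `bmp_zone(0x4E00)` falls in the I-ZONE, while `bmp_zone(0xFFFE)` returns `None`.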

All characters are also assigned names, but for the "ideographs" these are merely "CJK UNIFIED IDEOGRAPH-4E00" and so on.
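
Because these names are purely algorithmic, they can be generated from the code value. The sketch below assumes the I-ZONE range given in the layout above; `ideograph_name` is a hypothetical helper. The comparison uses Python's `unicodedata` module, which reflects a much later version of Unicode than the one described here.

```python
import unicodedata


def ideograph_name(code):
    """Build the algorithmic name for a code in the I-ZONE (4E00-9FFF)."""
    if not 0x4E00 <= code <= 0x9FFF:
        raise ValueError("not in the I-ZONE: %04X" % code)
    return "CJK UNIFIED IDEOGRAPH-%04X" % code


# The generated name agrees with the name in Python's character database.
assert ideograph_name(0x4E00) == unicodedata.name(chr(0x4E00))
```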

The rest of UCS-4

In the first version of ISO/IEC 10646, groups 60-7F and planes E0-FF of group 0 are reserved for private use, while all other planes are reserved for future standardization. This is expected to change in the forthcoming revision: planes 01-0E of group 0 will become available for definition, planes 0F and 10 of group 0 will become available for private use, while the rest of the code space will be reserved (see UTF-16).
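
A 31-bit UCS-4 code divides into a group (7 bits), plane, row and cell (8 bits each). The following sketch splits a code into those fields and applies the private-use rule of the first version as stated above. The function names are illustrative assumptions, and the check covers only the group and plane rule, not any private-use area inside plane 0.

```python
def ucs4_fields(code):
    """Split a 31-bit UCS-4 code into (group, plane, row, cell)."""
    if not 0 <= code <= 0x7FFFFFFF:
        raise ValueError("not a 31-bit code: %r" % (code,))
    return (
        (code >> 24) & 0x7F,  # group: 00-7F
        (code >> 16) & 0xFF,  # plane: 00-FF
        (code >> 8) & 0xFF,   # row:   00-FF
        code & 0xFF,          # cell:  00-FF
    )


def is_ucs2(code):
    """True if the code lies in plane 0 of group 0, i.e. in UCS-2."""
    group, plane, _, _ = ucs4_fields(code)
    return group == 0 and plane == 0


def in_private_group_or_plane_1993(code):
    """Private-use test per the first version: groups 60-7F,
    or planes E0-FF of group 0."""
    group, plane, _, _ = ucs4_fields(code)
    return 0x60 <= group <= 0x7F or (group == 0 and 0xE0 <= plane <= 0xFF)
```

For instance, `ucs4_fields(0x00004E00)` gives group 0, plane 0, row 4E, cell 00, so that code is also a UCS-2 code.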

Combining characters

These are characters for graphemes such as Semitic and Indic vowels, IPA diacritics, tone marks and Hangul jamo. A combining character is placed after its base character; the rendering process is expected to generate the composite glyph, in some cases by a table lookup. ISO 10646 defines three levels of subset implementation:

level 1:
base characters only.
level 2:
base characters and certain combining characters, roughly those required by Semitic and Indic-derived scripts.
level 3:
all characters.

Many redundant "presentation forms" are also provided (especially in the additions that make up Unicode 1.1), including the tone-letter combinations used in Hanyu pinyin (but not those used in Zhuyin fuhao) and pre-composed Hangul syllable blocks.
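
The relationship between a base-plus-combining sequence and a pre-composed form can be sketched as a table lookup. The table below is a tiny hand-picked sample and `compose` is a hypothetical helper, not an algorithm defined by either standard; `unicodedata.combining` from Python's standard library is used only to recognise combining marks, and it reflects a later version of Unicode than the one described here.

```python
import unicodedata

# Tiny illustrative table: (base, combining mark) -> pre-composed character.
# A real implementation would need a far larger table.
PRECOMPOSED = {
    ("a", "\u0301"): "\u00e1",       # a + COMBINING ACUTE ACCENT
    ("u", "\u0308"): "\u00fc",       # u + COMBINING DIAERESIS
    ("\u00fc", "\u030c"): "\u01da",  # u-diaeresis + COMBINING CARON (pinyin)
}


def compose(text):
    """Replace base + combining pairs by pre-composed forms where the
    table has an entry; leave all other sequences unchanged."""
    out = []
    for ch in text:
        # A non-zero combining class marks a combining character.
        if out and unicodedata.combining(ch):
            key = (out[-1], ch)
            if key in PRECOMPOSED:
                out[-1] = PRECOMPOSED[key]
                continue
        out.append(ch)
    return "".join(out)
```

Sequences without a table entry, such as `"x\u0301"`, pass through untouched, which is the case where a renderer would have to position the mark itself.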


Part of Notes on CJK Character Codes and Encodings.