ISO 10646 was originally conceived as roughly a disjoint union of national standards, yielding a 32-bit internal code with compact stateful external encodings. Unicode was a competing standard that aimed to unify identical characters from different variants of the same script, giving a 16-bit code. The two have since been merged: ISO/IEC 10646-1:1993 has two forms, a 31-bit form (UCS-4) and a 16-bit form (UCS-2). UCS-2 is a subset of UCS-4 (plane 0 of group 0), and is code-for-code identical with Unicode 1.1 (though Unicode imposes extra semantics on some characters). The aim is to cover all the reasonably widely used scripts in the world.

ISO/IEC 10646 was adopted as Chinese National Standard GB 13000 in December 1993. A Japanese translation is in the process of being adopted as a Japanese standard, JIS X0221.

Unicode 1.0 is described by the Unicode books. The changes that make Unicode identical to ISO 10646 UCS-2, yielding first Unicode 1.0.1 and then Unicode 1.1, are relatively minor and do not affect the hanzi set. They are available as Unicode Technical Report #4. Various tables and appendices describing Unicode 1.1, plus mapping tables to various CJK codes, are available at the Unicode site. See also the Unicode page maintained by Glenn Adams.


The overall layout of this unified 16-bit code (also called the Basic Multilingual Plane or BMP of ISO 10646) is:

0000 - 1FFF
A-ZONE: Alphabets: The first 256 codes are identical to ISO 8859-1 (Latin-1), except that the 65 control codes are excluded.
2000 - 2FFF
A-ZONE: Symbols and Punctuation
3000 - 4DFF
A-ZONE: CJK Auxiliary
4E00 - 9FFF
I-ZONE: CJK Unified "Ideographs"
A000 - DFFF
O-ZONE: Reserved for future assignment (but see UTF-16 for a proposed use)
E000 - FFFD
R-ZONE: Restricted use - the codes FFFE and FFFF are excluded.
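
The zone layout above can be expressed as a small lookup. The following is a minimal sketch, not part of either standard; the names BMP_ZONES and bmp_zone are hypothetical, and the ranges are simply those listed above, with the three A-ZONE ranges merged into one.

```python
# Zone boundaries of the BMP as listed above (inclusive ranges).
# The three A-ZONE ranges (0000-1FFF, 2000-2FFF, 3000-4DFF) are merged.
BMP_ZONES = [
    (0x0000, 0x4DFF, "A-ZONE"),
    (0x4E00, 0x9FFF, "I-ZONE"),
    (0xA000, 0xDFFF, "O-ZONE"),
    (0xE000, 0xFFFD, "R-ZONE"),
]


def bmp_zone(code):
    """Return the zone name for a 16-bit code, or None for FFFE and FFFF."""
    if not 0 <= code <= 0xFFFF:
        raise ValueError("not a 16-bit code: %#x" % code)
    for first, last, name in BMP_ZONES:
        if first <= code <= last:
            return name
    return None  # FFFE and FFFF are excluded from the code
```

For example, bmp_zone(0x4E00) returns "I-ZONE", and bmp_zone(0xFFFE) returns None.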

All characters are also assigned names, but for the "ideographs" these are merely "CJK UNIFIED IDEOGRAPH-4E00" and so on.
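
Because these names are derived mechanically from the code, they can be generated rather than stored. A minimal sketch follows; the function name ideograph_name is hypothetical, and it accepts the whole I-ZONE range given above even though not every code in that range is necessarily assigned.

```python
def ideograph_name(code):
    """Build the algorithmic name of a code in the I-ZONE (4E00-9FFF)."""
    if not 0x4E00 <= code <= 0x9FFF:
        raise ValueError("not in the I-ZONE: %04X" % code)
    # The name is a fixed prefix followed by the code as four hex digits.
    return "CJK UNIFIED IDEOGRAPH-%04X" % code
```

For example, ideograph_name(0x4E00) returns "CJK UNIFIED IDEOGRAPH-4E00".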

The rest of UCS-4

In the first version of ISO/IEC 10646, groups 60-7F and planes E0-FF of group 0 are reserved for private use, while all other planes are reserved for future standardization. This is expected to change in the forthcoming revision: planes 01-0E of group 0 will become available for definition, planes 0F and 10 of group 0 will become available for private use, while the rest of the code space will be reserved (see UTF-16).
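
To make the group and plane terminology concrete: a UCS-4 code is four octets, which ISO 10646 calls group, plane, row and cell, with the group limited to 7 bits. The sketch below splits a code into those fields and tests it against the private-use groups and planes of the first version as described above. The function names are hypothetical, and the test deliberately ignores any restricted-use codes inside the BMP itself.

```python
def split_ucs4(code):
    """Split a 31-bit UCS-4 code into (group, plane, row, cell) octets."""
    if not 0 <= code <= 0x7FFFFFFF:
        raise ValueError("not a 31-bit code: %#x" % code)
    group = (code >> 24) & 0x7F   # top octet, only 7 bits used
    plane = (code >> 16) & 0xFF
    row = (code >> 8) & 0xFF
    cell = code & 0xFF
    return group, plane, row, cell


def in_private_group_or_plane(code):
    """True for groups 60-7F, or planes E0-FF of group 0 (first version)."""
    group, plane, _row, _cell = split_ucs4(code)
    if 0x60 <= group <= 0x7F:
        return True
    return group == 0 and 0xE0 <= plane <= 0xFF


def ucs2_to_ucs4(code):
    """A UCS-2 code is the same number in UCS-4: group 0, plane 0."""
    if not 0 <= code <= 0xFFFF:
        raise ValueError("not a 16-bit code: %#x" % code)
    return code
```

For example, split_ucs4(0x00004E00) returns (0, 0, 0x4E, 0x00), showing that a BMP code lies in plane 0 of group 0.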

Combining characters

These are characters for graphemes such as Semitic and Indic vowels, IPA diacritics, tone marks and Hangul jamo. A combining character is placed after its base character; the rendering process is expected to generate the composite glyph, in some cases by a table lookup. ISO 10646 defines three implementation levels:

level 1:
base characters only.
level 2:
base characters and certain combining characters, roughly those required by Semitic and Indic-derived scripts.
level 3:
all characters.
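
The "base character followed by combining characters" ordering can be sketched as follows. This is only an approximation under stated assumptions: it uses Python's unicodedata module, which reflects a much later Unicode version than 1.1, and it treats any character whose general category starts with "M" as combining. That is not the normative ISO 10646 list (Hangul jamo, for instance, are not marks in this sense). The function names are hypothetical.

```python
import unicodedata


def is_mark(ch):
    """Approximate 'combining character' by general category M* (assumption)."""
    return unicodedata.category(ch).startswith("M")


def split_clusters(text):
    """Group each base character with the marks that follow it."""
    clusters = []
    for ch in text:
        if clusters and is_mark(ch):
            clusters[-1] += ch      # attach the mark to the preceding base
        else:
            clusters.append(ch)     # start a new cluster
    return clusters


def fits_level_1(text):
    """Rough level 1 check: the text contains no combining marks at all."""
    return not any(is_mark(ch) for ch in text)
```

For example, split_clusters("e\u0301a") returns ["e\u0301", "a"]: the acute accent stays with the preceding "e", which is the unit a renderer would turn into one composite glyph.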

Many redundant "presentation forms" are also provided (especially in the additions that make up Unicode 1.1), including the tone-letter combinations used in Hanyu pinyin (but not those used in Zhuyin fuhao) and pre-composed Hangul syllable blocks.
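
The redundancy between a pre-composed pinyin letter and the equivalent base-plus-combining sequence can be observed with Python's unicodedata module. Note the assumptions: the normalization forms NFD and NFC used here were defined in later Unicode versions, not in Unicode 1.1, and the module's data is likewise much newer, so this is only an illustration of the equivalence. The helper names are hypothetical.

```python
import unicodedata


def decompose(text):
    """Replace pre-composed characters by base + combining sequences (NFD)."""
    return unicodedata.normalize("NFD", text)


def precompose(text):
    """Replace base + combining sequences by pre-composed characters (NFC)."""
    return unicodedata.normalize("NFC", text)


def code_points(text):
    """List the codes of a string as four-digit hex strings."""
    return ["%04X" % ord(ch) for ch in text]
```

For example, code_points(decompose("\u01CE")) returns ["0061", "030C"]: the pinyin letter "a with caron" is equivalent to "a" followed by a combining caron.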

Part of Notes on CJK Character Codes and Encodings.