S-Code Chinese Computing Project

Chinese Character Sets and Encoding Systems

First things first: I shall explain the encoding systems for Chinese characters. There is practically only one character set and encoding system for the Simplified Chinese characters used in Mainland China and a few other places such as Singapore and Malaysia. However, there are several character sets and encoding systems for the Traditional Chinese characters used in Taiwan and a few other places. The most popular code, which is neither defined nor supported by the government, is called Big-5. The government in Taiwan defined the Chinese National Standard (CNS) codes for information exchange in 1986 and 1992.

  • I have written a few essays on CNS and Big-5 codes:
    1. CNS vs Big-5 and its PostScript version (321 KB) (in Chinese).
    2. CNS encoding system (in Chinese).
    3. Big-5 encoding system (in Chinese).
  • And there are some short documents I copied from ifcss.org.
    1. CNS 11643-1986 (old CNS).
    2. Big-5.
    3. Big-5 vs CNS 11643-1986.
  • Tables of Big-5 characters and codes, grouped by the first byte of the character code (PostScript files).
    1. Symbols (101KB).
    2. Frequently used characters (2.1MB).
    3. Less-frequently used characters (2.9MB).
  • The S-Code Project

    In order to prepare a set of Chinese data processing utilities that can be used with any code, I designed an internal coding system based on the CNS-1992 code. Each character, whether in ASCII, Latin-1, or one of the various Chinese codes, is converted and stored in a four-byte structure, which is in fact an integer. The data processing programs are all written in terms of this internal code. In this way I hope my effort can be spent on the design of data processing rather than on code manipulation.
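    To make the idea of "one character, one four-byte integer" concrete, here is a minimal sketch. The bit layout is an assumption made for illustration only and is not the actual S-Code definition: plane 0 is taken to hold ASCII and Latin-1 byte values, and planes 1 to 7 are taken to hold CNS 11643-1992 characters addressed by row and column (1 to 94 each). The function names are hypothetical.

```python
# Hypothetical layout, NOT the real S-Code definition:
#   bits 16-23: plane (0 = ASCII/Latin-1, 1-7 = CNS 11643-1992 planes)
#   bits  8-15: row    (1-94 for CNS characters, 0 for single bytes)
#   bits  0-7 : column (1-94 for CNS characters, or the single-byte value)


def pack_single_byte(byte_value: int) -> int:
    """Store an ASCII or Latin-1 byte in plane 0 of the internal code."""
    if not 0 <= byte_value <= 0xFF:
        raise ValueError("single-byte value out of range")
    return byte_value


def pack_cns(plane: int, row: int, col: int) -> int:
    """Store a CNS character (plane, row, column) in one integer."""
    if not 1 <= plane <= 7:
        raise ValueError("plane must be 1..7")
    if not (1 <= row <= 94 and 1 <= col <= 94):
        raise ValueError("row and column must be 1..94")
    return (plane << 16) | (row << 8) | col


def unpack(code: int) -> tuple:
    """Split an internal code back into (plane, row, column)."""
    return (code >> 16) & 0xFF, (code >> 8) & 0xFF, code & 0xFF
```

    With such a layout, a processing routine only ever sees integers, for example `unpack(pack_cns(2, 94, 94))` gives back `(2, 94, 94)`, and the conversion to and from external codes is confined to the I/O routines.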

    Indeed, any code could serve as the internal code. The reason that I don't want to create a new code is obvious: let the experts do their job. The reasons that I chose CNS are:

    1. It is, nevertheless, a national standard.
    2. It seems to cover a lot of characters (although I already know there are a few missing characters).

    This internal coding system is called S-Code; it was designed and implemented in late 1995. I have had some second thoughts since then, but since it works I do not want to change it any more. The implementation of S-Code consists of a suite of I/O and conversion programs. The application programs are grouped into the following four levels.

    S-Code definition (in Chinese)
    The design and definition of S-Code.
    User Level (in Chinese)
    The functions that are designed for most users.
    Application Level (in Chinese)
    The lower level functions that might be used by some users to build applications.
    Kernel Level (in Chinese)
    The functions that do most of the real work; most users need not be concerned with them.
    Internal Level (in Chinese)
    The utility functions that are used only by the S-Code implementation; no user needs to be concerned with them.

    Many upper-level I/O functions need a code to specify the source or destination Chinese encoding system. The available systems and their corresponding integers are listed below.

    0  SCODE
    The S-Code.
    5  BIG5
    The Standard (so to speak) Big-5 code.
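    As a rough illustration of how such an integer code might select the encoding in an I/O routine, here is a small sketch. Only the names SCODE and BIG5 and the values 0 and 5 come from the listing above; the `Encoding` class and the `decode_text` helper are hypothetical, and the Big-5 branch relies on Python's built-in `big5` codec rather than on the S-Code conversion programs.

```python
from enum import IntEnum


class Encoding(IntEnum):
    """Integer codes for the encoding systems listed above."""
    SCODE = 0  # the internal S-Code
    BIG5 = 5   # the "standard" Big-5 code


def decode_text(data: bytes, system: int) -> str:
    """Hypothetical helper: decode bytes according to an encoding code."""
    system = Encoding(system)  # raises ValueError for an unknown code
    if system is Encoding.BIG5:
        # Python's standard Big-5 codec stands in for a real converter.
        return data.decode("big5")
    # The external byte form of S-Code is not described here,
    # so this sketch does not attempt to read it.
    raise NotImplementedError("no reader for " + system.name)
```

    A call such as `decode_text(raw_bytes, Encoding.BIG5)` then reads Big-5 input, and adding another encoding system means adding one enum member and one branch.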

    Here are some examples, and also useful utilities, that are written with S-Code.


    Created: Dec 27, 1995
    Last Revised: Jan 14, 1996
    © Copyright 1995, 1996 Wei-Chang Shann

    shann@math.ncu.edu.tw