S-Code Chinese Computing Project

Chinese Character Sets and Encoding Systems

First, let me explain the encoding systems for Chinese characters. In practice there is only one character set and encoding system for the Simplified Chinese characters used in Mainland China and a few other places such as Singapore and Malaysia. For the Traditional Chinese characters used in Taiwan and a few other places, however, there are several character sets and encoding systems. The most popular code, called Big-5, is neither defined nor supported by the government. The government in Taiwan defined the Chinese National Standard (CNS) codes for information exchange in 1986 and 1992.

I have written a few essays on CNS and Big-5 codes:

CNS vs Big-5 and its PostScript version (321 KB) (in Chinese).
CNS encoding system (in Chinese).
Big-5 encoding system (in Chinese).

And there are some short documents I copied from ifcss.org.

Tables of Big-5 characters and codes, grouped by the first byte of each character code (PostScript files):

Symbols (101KB).
Frequently used characters (2.1MB).
Less-frequently used characters (2.9MB).

The S-Code Project

To prepare a set of Chinese data processing utilities that work with any code, I designed an internal coding system based on the CNS-1992 code. Each character, whether in ASCII, Latin-1, or one of the various Chinese codes, is converted and stored in a four-byte structure, in effect an integer. The data processing programs are all written against this internal code. In this way I hope to spend my effort on designing the data processing itself rather than on code manipulation.
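
As an illustration of storing every character as a single integer, here is a minimal Python sketch. The bit layout (the plane, row, and column bytes of a CNS code packed into the low three bytes) and the names pack_cns, unpack_cns, and is_single_byte are assumptions made for this example only; the actual S-Code layout is given in the S-Code definition and may well differ.

```python
# Hypothetical layout, NOT the actual S-Code definition:
#   bits 16-23: CNS plane number (1 and up)
#   bits  8-15: row byte    (0x21-0x7E)
#   bits  0-7 : column byte (0x21-0x7E)
# ASCII and Latin-1 characters keep their own code values (0-255), so they
# can never collide with a packed CNS code, which is always >= 0x10000.

def pack_cns(plane: int, row: int, col: int) -> int:
    """Pack a CNS (plane, row, column) triple into one integer."""
    if not 1 <= plane <= 0xFF:
        raise ValueError("plane out of range")
    if not (0x21 <= row <= 0x7E and 0x21 <= col <= 0x7E):
        raise ValueError("row/column byte out of range")
    return (plane << 16) | (row << 8) | col


def unpack_cns(code: int) -> tuple[int, int, int]:
    """Recover the (plane, row, column) triple from a packed integer."""
    return (code >> 16) & 0xFF, (code >> 8) & 0xFF, code & 0xFF


def is_single_byte(code: int) -> bool:
    """True for ASCII/Latin-1 characters stored as their own value."""
    return 0 <= code <= 0xFF


code = pack_cns(1, 0x44, 0x21)
print(hex(code))          # 0x14421
print(unpack_cns(code))   # (1, 68, 33)
```

With a scheme of this kind, a processing program only ever compares and stores integers, and the conversion to and from external codes stays at the I/O boundary.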

Indeed, any code could serve as the internal code. The reason I did not want to create a new code is obvious: let the experts do their job. I chose CNS for these reasons:

It is, nevertheless, a national standard.
It seems to cover a lot of characters (although I already know there are a few missing characters).

This internal coding system is called S-Code; it was designed and implemented in late 1995. I have had some second thoughts since then, but because it works I do not want to change it any more. The implementation of S-Code consists of a suite of I/O and conversion programs. The application programs are grouped into four levels, listed below after the S-Code definition.

S-Code definition (in Chinese): The design and definition of S-Code.
User Level (in Chinese): The functions that are designed for most users.
Application Level (in Chinese): The lower level functions that might be used by some users to build applications.
Kernel Level (in Chinese): The functions that do most of the real work; most users need not be concerned with them.
Internal Level (in Chinese): The utility functions used only by the S-Code implementation; no user needs to be concerned with them.

Many upper-level I/O functions need a code that specifies the source or destination Chinese encoding system. The available systems and their corresponding integer codes are listed below.

0 SCODE: The S-Code.
5 BIG5: The Standard (so to speak) Big-5 code.
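
To show how such an integer code might select the source encoding, here is a hedged Python sketch. Only the values 0 (SCODE) and 5 (BIG5) come from the list above; the Encoding class, the read_codes function, and the big-endian four-byte storage are assumptions made for illustration and are not the actual S-Code interface. The Big-5 branch only splits the byte stream into characters; a real converter would also need a Big-5 to CNS mapping table, which is not reproduced here.

```python
import struct
from enum import IntEnum


class Encoding(IntEnum):
    """Integer codes for the encoding systems listed above."""
    SCODE = 0
    BIG5 = 5


def read_codes(data: bytes, source: int) -> list[int]:
    """Split a byte string into one integer per character (hypothetical helper)."""
    if source == Encoding.SCODE:
        # Assumption: each character is stored as four bytes, big-endian.
        if len(data) % 4 != 0:
            raise ValueError("S-Code data must be a multiple of 4 bytes")
        return [value for (value,) in struct.iter_unpack(">I", data)]
    if source == Encoding.BIG5:
        codes = []
        i = 0
        while i < len(data):
            lead = data[i]
            if lead < 0x80:
                # A single-byte (ASCII) character.
                codes.append(lead)
                i += 1
            else:
                # Simplification: any byte >= 0x80 starts a two-byte character.
                if i + 1 >= len(data):
                    raise ValueError("truncated Big-5 character")
                codes.append((lead << 8) | data[i + 1])
                i += 2
        return codes
    raise ValueError(f"unsupported encoding code: {source}")


print(read_codes(b"A\xa4\xa4", Encoding.BIG5))   # [65, 42148]
```

Because Encoding is an IntEnum, a caller can pass either Encoding.BIG5 or the plain integer 5.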

Here are some examples, which are also useful utilities, written with S-Code.

A Chinese word count program: zc (zi count).
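
For readers who want a feel for what a character count involves, here is a minimal Python sketch. It is not the zc program and does not use S-Code: the function name count_zi is hypothetical, it relies on Python's built-in big5 codec, and it treats only the basic CJK Unified Ideographs range (U+4E00 to U+9FFF) as Chinese characters, which is an assumption of this example.

```python
def count_zi(data: bytes, encoding: str = "big5") -> int:
    """Count Chinese characters in encoded text (illustrative only)."""
    text = data.decode(encoding)
    # Count only characters in the basic CJK Unified Ideographs block;
    # ASCII letters, spaces, and punctuation are ignored.
    return sum(1 for ch in text if "\u4e00" <= ch <= "\u9fff")


sample = "中文 text 字".encode("big5")
print(count_zi(sample))   # 3
```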

Back to the home page of Wei-Chang Shann.
Connect to the home page of Department of Mathematics, National Central University, Taiwan.

shann@math.ncu.edu.tw