ZC: Zi Count

This short program reads Big-5 text from stdin and counts the Western words, the Chinese characters, and the two-byte-width punctuation marks. It writes the three counts to stdout in that order.

A Western word is any string of ASCII characters (plus a few Latin-1 codes) delimited by ASCII blank space or by a Chinese character. A Chinese character is any Big-5 code, and a two-byte-width punctuation mark is whatever the S-code function s_punct recognizes. Because the second number includes the punctuation marks, subtracting the third number from the second gives a closer estimate of how many Chinese words are in a document.
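The classification described above can be sketched without the S-code library. The helper names below are hypothetical, and the byte ranges are assumptions: lead bytes 0x81 to 0xFE follow common Big-5 extensions such as CP950 (the original standard uses a narrower range), and treating the symbol rows with lead bytes 0xA1 to 0xA3 as punctuation is only a guess at what s_punct does. The sketch also ignores the Latin-1 codes mentioned above, since they overlap the Big-5 lead-byte range.

```c
#include <stdbool.h>

/* Hypothetical helpers; these are NOT the S-code library's s_wd or s_punct. */

/* Assumed Big-5 lead-byte range, including common vendor extensions. */
static bool big5_is_lead(unsigned char b) {
    return b >= 0x81 && b <= 0xFE;
}

/* Assumed Big-5 trail-byte ranges: 0x40-0x7E and 0xA1-0xFE. */
static bool big5_is_trail(unsigned char b) {
    return (b >= 0x40 && b <= 0x7E) || (b >= 0xA1 && b <= 0xFE);
}

/* Rough stand-in for s_punct: treat the symbol rows
   (lead bytes 0xA1-0xA3) as two-byte-width punctuation. */
static bool big5_is_punct_guess(unsigned char lead, unsigned char trail) {
    return lead >= 0xA1 && lead <= 0xA3 && big5_is_trail(trail);
}
```

For instance, the full-width comma is 0xA1 0x41 in Big-5, so big5_is_punct_guess(0xA1, 0x41) is true, while an ideograph such as 0xA4 0x40 is a two-byte character but not punctuation.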

For example, given the input

This is a測驗，and
there is　a全形 space.

The UNIX wc reports 6 words. Note that the space between "is" and "a" on the second line is a two-byte-width space, which wc does not treat as a word separator here. zc instead reports 8 Western words and 6 Chinese characters, 2 of which are two-byte-width punctuation marks (the full-width comma and the full-width space).

A full-sized zc should handle command-line arguments and might accept more than one input file, but the basic operations are all here.

#include <stdio.h>
#include "s_code.h"

#define IN  1  /* inside a word */
#define OUT 0  /* outside a word */

int main(void) {
    int c, np, nw, nz, state;

    state = OUT;
    nw = nz = np = 0;   /* Western words, two-byte characters, punctuation marks */
    while ((c = s_getchar(BIG5)) != EOF) {
        if (s_wd(c) == 2) {             /* two-byte-width character */
            ++nz;
            if (s_punct(c)) ++np;
            state = OUT;                /* a Chinese character ends any Western word */
        }
        else {
            if (s_space(c))
                state = OUT;
            else if (state == OUT) {    /* first character of a new word */
                state = IN;
                ++nw;
            }
        }
    }
    printf("\t %d %d %d\n", nw, nz, np);
    return 0;
}
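
A fuller zc that accepts several input files might be organized roughly as follows. This is only a sketch under stated assumptions: it does not use the S-code library, the names zc_counts, zc_count_stream and zc_count_files are hypothetical, any byte from 0x81 to 0xFE is assumed to start a two-byte character, lead bytes 0xA1 to 0xA3 are assumed to mark punctuation, and only space, tab, newline and carriage return are treated as blank space.

```c
#include <stdio.h>

/* Hypothetical totals record; not part of the S-code library. */
struct zc_counts {
    long words;   /* Western words */
    long zi;      /* two-byte-width characters */
    long punct;   /* two-byte-width punctuation marks */
};

/* Count one stream of Big-5 text and add the results to *t.
   Assumptions: bytes 0x81-0xFE start a two-byte character, and
   lead bytes 0xA1-0xA3 mark punctuation (a guess at s_punct). */
static void zc_count_stream(FILE *fp, struct zc_counts *t) {
    int c, in_word = 0;
    while ((c = getc(fp)) != EOF) {
        if (c >= 0x81 && c <= 0xFE) {
            int trail = getc(fp);
            if (trail == EOF)
                break;              /* truncated character: stop quietly */
            ++t->zi;
            if (c >= 0xA1 && c <= 0xA3)
                ++t->punct;
            in_word = 0;            /* a two-byte character ends a word */
        } else if (c == ' ' || c == '\t' || c == '\n' || c == '\r') {
            in_word = 0;
        } else if (!in_word) {
            in_word = 1;            /* first byte of a new Western word */
            ++t->words;
        }
    }
}

/* Count every named file; with no names, count stdin.
   Returns the number of files that could not be opened. */
static int zc_count_files(int n, char **paths, struct zc_counts *t) {
    int failures = 0;
    if (n == 0) {
        zc_count_stream(stdin, t);
        return 0;
    }
    for (int i = 0; i < n; ++i) {
        FILE *fp = fopen(paths[i], "rb");
        if (fp == NULL) {
            perror(paths[i]);
            ++failures;
            continue;
        }
        zc_count_stream(fp, t);
        fclose(fp);
    }
    return failures;
}
```

A main function could call zc_count_files(argc - 1, argv + 1, &totals) on a zero-initialized struct zc_counts and then print the three totals in the same order as the program above.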

Wei-Chang Shann
Department of Mathematics, National Central University, Taiwan

shann@math.ncu.edu.tw