This short program reads stdin and counts the number of Western words, the number of Chinese characters, and the number of two-byte-width punctuation marks. It writes the three counts, in that order, to stdout.
A Western word is defined to be any string of ASCII characters (plus a few Latin-1 codes) that is delimited by ASCII blank space or by a Chinese character. A Chinese character is defined to be any Big-5 code, and a two-byte-width punctuation mark is whatever the S-code function s_punct reports as one. Since the second number includes the punctuation marks, one can subtract the third number from the second to get a closer estimate of how many Chinese words are in a document.
For example, given the input

This is a測驗,and
there is　a全形 space.

the UNIX wc thinks there are 6 words; note that there is a two-byte-width space between "is" and "a" on the second line. But zc thinks there are 8 Western words and 6 Chinese characters, among which 2 are two-byte-width punctuation marks.
A full-sized zc should check more command-line arguments and may take more than one input file, but the basic operations are here.

#include <stdio.h>
#include "s_code.h"

#define IN  1   /* inside a word */
#define OUT 0   /* outside a word */

int main(void)
{
    int c, np, nw, nz, state;

    state = OUT;
    nw = nz = np = 0;   /* Western words, Chinese characters, punctuation marks */
    while ((c = s_getchar(BIG5)) != EOF) {
        if (s_wd(c) == 2) {         /* two-byte-width code */
            ++nz;
            if (s_punct(c))
                ++np;
            if (state == IN)        /* a Chinese character ends a Western word */
                state = OUT;
        } else {
            if (s_space(c))
                state = OUT;
            else if (state == OUT) {    /* first character of a new word */
                state = IN;
                ++nw;
            }
        }
    }
    printf("\t %d %d %d\n", nw, nz, np);
    return 0;
}

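For readers who do not have s_code.h at hand, the same counting logic can be sketched directly on raw bytes. This is only an illustration, not the S-code implementation: the names zc_counts and zc_count are hypothetical, the lead-byte range 0xA1..0xF9 and trail-byte ranges 0x40..0x7E and 0xA1..0xFE are the standard Big-5 ranges assumed here, and treating lead bytes 0xA1..0xA3 as punctuation is a rough stand-in for s_punct, which may well decide differently.

```c
#include <stddef.h>

/* Hypothetical result type; not part of s_code.h. */
struct zc_counts {
    long words;   /* Western words */
    long zchars;  /* two-byte codes, punctuation included */
    long puncts;  /* two-byte codes in the assumed symbol area */
};

/* Assumption: standard Big-5 lead bytes are 0xA1..0xF9. */
static int is_lead(unsigned char b)
{
    return b >= 0xA1 && b <= 0xF9;
}

/* Assumption: Big-5 trail bytes are 0x40..0x7E or 0xA1..0xFE. */
static int is_trail(unsigned char b)
{
    return (b >= 0x40 && b <= 0x7E) || (b >= 0xA1 && b <= 0xFE);
}

/* ASCII white space only (space, \t, \n, \v, \f, \r), locale independent. */
static int is_space(unsigned char b)
{
    return b == ' ' || (b >= '\t' && b <= '\r');
}

/* Count Western words, two-byte codes and two-byte punctuation in buf. */
struct zc_counts zc_count(const unsigned char *buf, size_t len)
{
    struct zc_counts r = {0, 0, 0};
    int in_word = 0;
    size_t i = 0;

    while (i < len) {
        unsigned char b = buf[i];

        if (is_lead(b) && i + 1 < len && is_trail(buf[i + 1])) {
            ++r.zchars;
            /* Assumption: lead bytes 0xA1..0xA3 hold symbols and punctuation. */
            if (b <= 0xA3)
                ++r.puncts;
            in_word = 0;            /* a two-byte code ends a Western word */
            i += 2;
        } else {
            if (is_space(b)) {
                in_word = 0;
            } else if (!in_word) {  /* first byte of a new Western word */
                in_word = 1;
                ++r.words;
            }
            i += 1;
        }
    }
    return r;
}
```

Under these assumptions, passing the Big-5 bytes of the two-line sample to zc_count (with 0xA141 as the two-byte comma and 0xA140 as the two-byte space) gives words = 8, zchars = 6 and puncts = 2, and zchars - puncts = 4 is the closer estimate of Chinese words mentioned earlier. Working on a buffer rather than on stdin is a design choice made only to keep the sketch easy to test.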
Created: Jan 14, 1996
Last Revised: Jan 14, 1996
© Copyright 1996 Wei-Chang Shann