Chinese Word Counting Made Easy with the Command Line

A few days ago, I wrote an article entitled "Creating Chinese/Japanese word clouds in Python". The article was written for a friend of mine who is learning the language for his research in mathematics and mathematical biology. By writing code from scratch, one can learn about data structures and time complexities: hash tables with their O(1) average and O(n) worst case, balanced binary search trees with O(log n), or whatever. Not that there is anything wrong with that. It just takes some time to think through and write the code.

Instead of writing scripts, I usually get the same output from the GNU/Linux command line tools. As you may know, there is usually more than one way to do the same thing, and it is always a good idea to look for an easier, faster or more efficient one. You do not always need to write your own code to get what you want.

Counting occurrences of Chinese nouns annotated by the Stanford part-of-speech tagger, for instance, can be carried out by typing a single command.

It is a simple combination of five different commands that produces the same result as the "count_word_zh.py" script listed in my previous post.

sed performs basic text transformations.
grep prints lines that contain a match for a pattern.
sort sorts lines.
uniq collapses adjacent duplicate lines (with the -c option, it also prefixes each line with its number of occurrences).
awk is used here to rearrange the order of columns.
LC_ALL=C is an environment variable setting that overrides the localized settings affecting sorting and comparison results.
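
As a minimal sketch of how these five tools might fit together (not necessarily the exact command used for the original post), the function below counts common nouns in a tagged file. It assumes the tagger output is plain text with space-separated tokens of the form word#TAG, that NN marks a common noun, and that GNU sed is available. The name count_nouns and the extra sort by frequency are illustrative choices.

```shell
# Hypothetical helper: count tokens tagged as common nouns (word#NN).
# Assumes space-separated "word#TAG" tokens and GNU sed (for "\n").
count_nouns() {
  LC_ALL=C sed 's/ /\n/g' "$1" |    # put each token on its own line
    LC_ALL=C grep '#NN$' |          # keep only tokens tagged NN
    LC_ALL=C sort |                 # group identical tokens together
    LC_ALL=C uniq -c |              # collapse duplicates, prefix the count
    LC_ALL=C sort -k1,1nr |         # most frequent first
    awk '{ sub(/#NN$/, "", $2); print $2 "\t" $1 }'  # word<TAB>count
}
```

Calling count_nouns tagged.txt (a hypothetical file name) prints one "word<TAB>count" line per noun. Because # and NN are plain ASCII, byte-wise matching under LC_ALL=C does not disturb the multi-byte Chinese characters around them.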

Even though Chinese text is written in non-alphanumeric, multi-byte characters, you can still take advantage of the major functions of UNIX and UNIX-like operating systems.

The command line tools may help you save time not only in coding but also in running programs. If you handle large input files, you may want to run more than one process at the same time. Many hands make light work. There are also powerful tools that enable concurrent, and even parallel, processing.
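
One way to try this, sketched under assumptions rather than taken from the original post, is the -P option of xargs, which runs several worker processes at once. The function below assumes the input has already been split into several tagged files with space-separated word#TAG tokens, that GNU sed is available, and that file names contain no quotes or backslashes. The name count_nouns_parallel, the limit of four workers and the ".nouns" suffix are all illustrative.

```shell
# Hypothetical helper: count '#NN' tokens across several tagged files,
# processing up to four files at the same time.
count_nouns_parallel() {
  # Stage 1: one worker per file (xargs -P 4 runs up to four at once).
  # Each worker writes to its own "<file>.nouns", so outputs never interleave.
  printf '%s\n' "$@" |
    xargs -P 4 -I{} sh -c '
      LC_ALL=C sed "s/ /\n/g" "$1" |
        LC_ALL=C grep "#NN\$" |
        LC_ALL=C sort |
        LC_ALL=C uniq -c > "$1.nouns"
    ' _ {}

  # Stage 2: add up the partial counts and sort by total, highest first.
  for f in "$@"; do
    cat "$f.nouns"
  done |
    awk '{ sub(/#NN$/, "", $2); total[$2] += $1 }
         END { for (w in total) print w "\t" total[w] }' |
    LC_ALL=C sort -t "$(printf '\t')" -k2,2nr
}
```

Writing each partial result to its own file keeps the workers independent; the final awk step only has to add up small per-file tables.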

These are only a few examples of what you can do with the command line. You can actually do a whole lot more without writing a line of code. While I was doing my PhD in Informatics, I was even taught to avoid writing unnecessary code. The single most important thing is to get the results you want easily, quickly and efficiently. The command line tools will be of great help to you in accomplishing this goal.
