Creating Chinese & Japanese Word Clouds in Python

A friend of mine is interested in creating Japanese word clouds from his own tweets. He is a bright young math student from Taiwan who speaks Japanese fluently, as well as English, Taiwanese and (of course) Chinese. He expresses his thoughts in Japanese every now and then, so he has many Japanese tweets in addition to English and Chinese ones. It will surely be fun to make clouds from them. That is how I became interested in making my own clouds from our conversations, which involve English, Japanese, Chinese and other languages.

Making Chinese and Japanese word clouds, by the way, is harder than it looks. Neither language separates its words with spaces, so to make a cloud you first need to detect every single word in the sentences correctly. Here is an example sentence from Wikipedia: 大安區位於中華民國臺北市的中心偏南,是臺北市人口最多的行政區。 (roughly, "Da'an District lies slightly south of the centre of Taipei City and is the city's most populous district"; if your computer shows only blank squares or boxes with an "X" or a question mark inside, please install a Chinese language pack on Windows or the appropriate fonts on Linux). As you can see, there is no white space between the words or around the punctuation marks. But there is no need to be upset. Thanks to state-of-the-art part-of-speech (POS) taggers and morphological analyzers, it is surprisingly easy to split such sentences into separate words. Their output is not always one hundred percent accurate, but it is acceptable most of the time, and almost always so on formal, non-colloquial text.

I am going to write about extracting the Chinese and Japanese sentences in Part 1, about using a POS tagger and a morphological analyzer in Part 2, and finally about making the clouds from word frequencies in Part 3.

Before proceeding further, please set your terminal charset to UTF-8. It is also worth checking the versions of Python, its package manager (pip) and your operating system.
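If you want to see these on your own machine, a small snippet like this will do (a minimal sketch; the exact output of course depends on your setup):

```python
# Print the Python and OS version information mentioned above.
import sys
import platform

print("Python :", sys.version.split()[0])
print("OS     :", platform.platform())
# pip's version can be checked from the shell with: pip --version
```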




Part 1. Extract Chinese/Japanese sentences

Because we constantly switch from one language to another while chatting, our first task is to extract the Chinese or Japanese sentences from our conversations, in which more than three languages are mixed together!

If you plan to make word clouds from sentences written solely in Chinese or Japanese, please skip this part.

I classified the sentences by language using the "guess-language" library.
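The snippet below is a minimal sketch of that step. It assumes the Python 3 port published as guess_language-spirit, whose guess_language() function returns a language code such as "en", "zh" or "ja"; other forks expose the function as guessLanguage(), so adjust the import to match your installed version. The file names are placeholders.

```python
# Split a mixed-language chat log into Chinese and Japanese sentence files.
from guess_language import guess_language  # guess_language-spirit fork

with open("conversation.txt", encoding="utf-8") as src, \
     open("sentences_zh.txt", "w", encoding="utf-8") as zh, \
     open("sentences_ja.txt", "w", encoding="utf-8") as ja:
    for line in src:
        line = line.strip()
        if not line:
            continue
        code = str(guess_language(line))
        if code.startswith("zh"):   # Chinese (some versions return regional variants)
            zh.write(line + "\n")
        elif code == "ja":          # Japanese
            ja.write(line + "\n")
```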

Part 2.1 Segment Chinese Sentences using Stanford Word Segmenter

I chose the Stanford Word Segmenter, a modern word segmenter written in Java. It is very easy to use and works efficiently on Chinese and Arabic text. The software can be downloaded from https://nlp.stanford.edu/software/segmenter.html; please download the latest version and unzip the file.

Then run the "segment.sh" script with the arguments "pku", the input file, "UTF-8" and "0". These specify the training corpus (either the Peking University corpus, pku, or the Penn Chinese Treebank, ctb), the character encoding and the size of the n-best list for each sentence.
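If you prefer to drive the segmenter from Python rather than typing the command by hand, a call along these lines works. The file names are placeholders, and you may need to run it from inside the unzipped segmenter directory so that the script can find its jar files.

```python
# Run segment.sh on the extracted Chinese sentences and capture its output.
# Arguments: corpus (pku or ctb), input file, character encoding, n-best size.
import subprocess

with open("segmented_zh.txt", "w", encoding="utf-8") as out:
    subprocess.run(
        ["./segment.sh", "pku", "sentences_zh.txt", "UTF-8", "0"],
        stdout=out,
        check=True,
    )
```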

The segmenter splits each sentence into its elements; once it has finished, you get the same sentences with the words separated by spaces.

You can now count how often each word is used, right? Before jumping in, I would like you to consider leaving function words out of the count. Function words are grammatical words; some examples in English are auxiliary verbs, determiners and prepositions.

They appear far too often in text and skew the statistics, and on their own they carry little meaning. The Chinese sentences above, for example, contain the possessive particle 的 and the conjunction 及. By removing such words, the cloud becomes more informative, attractive and interesting.

I suggest extracting only (proper) nouns with a POS tagger, so that the resulting word clouds are more explanatory. If you agree, please download the Stanford tagger from https://nlp.stanford.edu/software/tagger.html and run it.

The POS tagger assigns a part of speech to each word. When you examine the output, you will find the part-of-speech label appended to the end of each word. If you are not familiar with the tags, the complete list is available at https://verbs.colorado.edu/chinese/posguide.3rd.ch.pdf.
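A counting script along the following lines does the job. This is only a minimal sketch: the file names are placeholders, and it assumes the tagger joins each word and tag with "#" (the separator used by the Stanford Chinese models; adjust it if your output uses a different one, such as "_").

```python
# count_word_zh.py (sketch): count only the words tagged as nouns.
from collections import Counter

POS_TAGS = ("NN", "NR")     # noun and proper noun
SEPARATOR = "#"             # word/tag separator in the tagger output

counter = Counter()
with open("tagged_zh.txt", encoding="utf-8") as f:
    for line in f:
        for token in line.split():
            word, _, tag = token.rpartition(SEPARATOR)
            if word and tag in POS_TAGS:
                counter[word] += 1

with open("wf_zh.txt", "w", encoding="utf-8") as out:
    for word, freq in counter.most_common():
        out.write("{}\t{}\n".format(word, freq))
```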

With this, it becomes very easy to extract (proper) nouns from the text. If you also want to include verbs, adverbs and adjectives, add "VV", "AD" and "JJ" to the POS_TAGS definition near the top of the script. Then save the script as "count_word_zh.py".

Execute the count_word_zh.py script to produce a word-frequency ranking list (wf_zh.txt).

You can make a Chinese word cloud with this word frequency list file. Please proceed to Part 3. Generating Word Clouds.

Part 2.2 Segment Japanese Sentences using MeCab

Despite all its complex features, such as its mix of writing systems, verb conjugation and null subjects and objects, Japanese can be processed in much the same way as other languages. This is largely due to the amount of available language resources and powerful processing tools: there are a large number of corpora, both annotated and unannotated, and several tokenizers and morphological analyzers to choose from.

MeCab is one of the most widely used tools of its kind. It is used to analyze Japanese sentences not only by researchers and professional programmers but also by hobby coders and analysts with no programming background.

You can install the software with your package manager or download it from https://github.com/taku910/mecab.

If you would like to use the word-segmentation function, set the program's output format to "wakati", short for wakachi-gaki (writing with spaces between the words).
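From Python, the mecab-python3 binding (pip install mecab-python3, plus an installed dictionary) exposes the same option; a minimal sketch:

```python
# Segment a Japanese sentence into space-separated words with MeCab.
import MeCab

tagger = MeCab.Tagger("-Owakati")
print(tagger.parse("すもももももももものうち").strip())
# -> すもも も もも も もも の うち
```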

It is also a good idea to prepare a relatively new dictionary which contains many proper nouns and compound nouns.

Once you have installed the NEologd dictionary, you can use it together with the analyzer; just specify the path to the dictionary when you run MeCab.
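With the mecab-python3 binding this is just the -d option. The path below is only a typical install location for mecab-ipadic-NEologd; replace it with the directory reported by "mecab-config --dicdir" on your machine.

```python
import MeCab

# Point MeCab at the NEologd dictionary directory (path is an example).
tagger = MeCab.Tagger("-d /usr/lib/mecab/dic/mecab-ipadic-neologd")
```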

When you run the program, MeCab outputs the part of speech, base form and reading for each word of the given sentences.
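With the standard IPA dictionary, the output for a word looks roughly like this (surface form, then part of speech, sub-categories, conjugation information, base form, reading and pronunciation; the exact fields depend on the dictionary you use):

```
東京	名詞,固有名詞,地域,一般,*,*,東京,トウキョウ,トーキョー
EOS
```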

All of this information is given in Japanese; nevertheless, you do not need to be fluent in the language to make word clouds. What matters is identifying the part of speech each word belongs to. Here is a short list of the major parts of speech in Japanese, together with the Chinese treebank tags from Part 2.1 that they roughly correspond to; you can find the complete list at http://www.omegawiki.org/Help:Part_of_speech/ja.

Japanese | Romanized    | Definition     | Corresponds to
品詞     | hinshi       | part of speech |
名詞     | meishi       | noun           | NN, NR
代名詞   | daimeishi    | pronoun        | PN
動詞     | doushi       | verb           | VV
形容詞   | keiyoushi    | adjective      | AD, JJ
副詞     | fukushi      | adverb         | AD
接続詞   | setsuzokushi | conjunction    | CC, CS
助詞     | joshi        | particle       |

With this list and the MeCab output (tmp03_ja.txt), you can extract (proper) nouns from your input sentences.
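A counting script for the Japanese side can follow the same pattern as the Chinese one. The sketch below assumes the default MeCab output shown earlier (surface form, a tab, then comma-separated features); apart from tmp03_ja.txt, the file names are placeholders.

```python
# count_word_ja.py (sketch): count only the words whose part of speech
# is listed in POS_TAGS.
from collections import Counter

POS_TAGS = ("名詞",)      # noun; add 動詞, 副詞, 形容詞 here if you want them

counter = Counter()
with open("tmp03_ja.txt", encoding="utf-8") as f:
    for line in f:
        line = line.rstrip("\n")
        if not line or line == "EOS":
            continue
        surface, _, feature = line.partition("\t")
        features = feature.split(",")
        if features[0] in POS_TAGS:
        # if features[1] in POS_TAGS:   # use this check for proper nouns (固有名詞)
            counter[surface] += 1

with open("wf_ja.txt", "w", encoding="utf-8") as out:
    for word, freq in counter.most_common():
        out.write("{}\t{}\n".format(word, freq))
```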

If you want to include verbs, adverbs and adjectives as well, add "動詞" (verb), "副詞" (adverb) and "形容詞" (adjective) to the POS_TAGS definition.

And if you want to extract only proper nouns, replace "名詞" with "固有名詞", the Japanese term for proper noun, in the POS_TAGS definition, and change the filter so that it checks the sub-category field instead of the top-level part of speech.

Then save the script as “count_word_ja.py” and execute it.

Part 3. Generating Word Clouds

Congratulations! You now have almost everything you need to create word clouds in Python, except the WordCloud library. Please install it (for example with "pip install wordcloud").

Then execute a script along the lines of the one below, passing it the input file path (the word-frequency file), the output file path (a PNG image) and the path to a Chinese or Japanese font.
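This is a minimal sketch: the script name, the argument order and the word-tab-count input format simply mirror the counting scripts above, and only the WordCloud calls come from the library itself.

```python
# make_wordcloud.py (sketch): draw a word cloud from a word-frequency file.
# Usage: python make_wordcloud.py <frequency file> <output png> <font path>
import sys
from wordcloud import WordCloud

freq_file, out_png, font_path = sys.argv[1:4]

frequencies = {}
with open(freq_file, encoding="utf-8") as f:
    for line in f:
        word, _, count = line.rstrip("\n").partition("\t")
        if word and count:
            frequencies[word] = int(count)

# A CJK-capable font is essential; otherwise every word is drawn as boxes.
cloud = WordCloud(font_path=font_path, width=800, height=600,
                  background_color="white")
cloud.generate_from_frequencies(frequencies)
cloud.to_file(out_png)
```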

For Chinese:
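Any font that covers Chinese characters will do, for example Noto Sans CJK SC; the font path below is only a placeholder, so point it at a font that is actually installed on your system.

```
python make_wordcloud.py wf_zh.txt cloud_zh.png /path/to/NotoSansCJK-SC-Regular.otf
```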

For Japanese:
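Likewise for the Japanese frequency file, with a Japanese-capable font such as Noto Sans CJK JP or IPA Gothic (again, the path is a placeholder).

```
python make_wordcloud.py wf_ja.txt cloud_ja.png /path/to/NotoSansCJK-JP-Regular.otf
```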

Here are samples of word clouds created using Wikipedia articles.

阿里山國家風景區 (Alishan National Scenic Area), in Chinese

線形分類器 (linear classifier), in Japanese

大安區 (臺北市), Da'an District of Taipei, in Chinese

See also: Chinese Word Counting Made Easy with the Command Line
