Figure out a useful tokenization strategy for CJK languages.
Description
Status | Subtype | Assigned | Task
---|---|---|---
Open | | None | T227094 Update RC Filters for new ORES capacities (July, 2019)
Resolved | | SBisson | T225561 Update ORES thresholds for nlwiki
Open | | None | T223273 Update srwiki thresholds for goodfaith model
Resolved | | SBisson | T225562 Deploy ORES filters for zhwiki
Open | | None | T225563 Deploy ORES filters for jawiki
Resolved | | Halfak | T224484 ORES deployment: Early June
Resolved | | Halfak | T224481 Train/test zhwiki editquality models
Resolved | | Halfak | T223382 Improvements to ORES localization and support
Resolved | | Halfak | T109366 Chinese language utilities
Open | | None | T111178 Generate stopwords for CJK languages
Resolved | | Pavol86 | T111179 Tokenization of "word" things for CJK
Event Timeline
Perhaps consider a Hidden Markov Model implementation. I believe Lucene 3.0 uses this approach for its CJK tokenization.
Know where we could find one of those in python? I suppose we could also build our own if we had a sufficiently comprehensive set of words to learn the transition probabilities from.
At first glance, I would say we could use some treebanks such as https://catalog.ldc.upenn.edu/LDC2013T21 for Chinese, not sure about the others. Alternatively, there's http://cjklib.org/0.3/ which may be worth looking into as a starting point.
Although, thinking about this more, you have to consider that ambiguity of meaning when segmenting "words" can lead to poor information retrieval results.
https://www.hathitrust.org/blogs/large-scale-search/multilingual-issues-part-1-word-segmentation
That being said, I came across an article about using Wikipedia as a resource for doing n-gram mutual information for word segmenting on Chinese. This method could potentially be applied to other languages.
http://www.cs.otago.ac.nz/homepages/andrew/papers/2009-9.pdf
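To make that mutual-information idea concrete, here is a rough sketch (my own illustration, not the paper's method): count character unigrams and adjacent-character bigrams from some corpus, then cut wherever the pointwise mutual information between neighbouring characters is low. The toy corpus and the 0.0 threshold are placeholders.

```python
import math
from collections import Counter

def train_pmi(corpus_sentences):
    """Count character unigrams and adjacent-character bigrams, return a PMI function."""
    unigrams, bigrams = Counter(), Counter()
    for sent in corpus_sentences:
        unigrams.update(sent)
        bigrams.update(zip(sent, sent[1:]))
    total_uni = sum(unigrams.values())
    total_bi = sum(bigrams.values())

    def pmi(a, b):
        # crude add-half smoothing so unseen pairs get a strongly negative score
        p_ab = (bigrams[(a, b)] or 0.5) / total_bi
        p_a = (unigrams[a] or 0.5) / total_uni
        p_b = (unigrams[b] or 0.5) / total_uni
        return math.log(p_ab / (p_a * p_b))

    return pmi

def segment(text, pmi, threshold=0.0):
    """Insert a word boundary wherever adjacent characters attract each other weakly."""
    words, current = [], text[0]
    for a, b in zip(text, text[1:]):
        if pmi(a, b) < threshold:
            words.append(current)
            current = b
        else:
            current += b
    words.append(current)
    return words

# toy usage with a two-sentence "corpus"; a real run would count over Wikipedia text
pmi = train_pmi(["我爱北京天安门", "北京是中国的首都"])
print(segment("我爱北京天安门", pmi))
```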
Slightly unrelated but I thought this was interesting:
http://batterseapower.github.io/pinyin-toolkit/
And it leverages cjklib :)
We can probably use ngrams in hashing vectorization to capture this type of signal. That might be easier than explicitly splitting words. See T128087
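For reference, a sketch of how that character n-gram hashing could look with scikit-learn (parameters are illustrative, not tuned):

```python
from sklearn.feature_extraction.text import HashingVectorizer

# Character n-grams side-step word segmentation: every 1-3 character window becomes a
# feature hashed into a fixed-size space, so no CJK-specific splitting is needed.
vectorizer = HashingVectorizer(analyzer='char', ngram_range=(1, 3),
                               n_features=2 ** 18, alternate_sign=False)

X = vectorizer.transform(["我爱北京天安门",
                          "ホッケーにはデンジャラスプレーの反則がある"])
print(X.shape)   # (2, 262144) sparse matrix
print(X[0].nnz)  # number of non-zero hashed n-gram features in the first text
```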
Then again, splitting words would be good for dictionary lookups.
See related work here: https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Nori_Analyzer_Analysis
I can see that this thread is quite old, so first of all just a couple of notes that we can talk about with Aaron on Thursday's call..
NOTES:
- this post nicely summarizes approaches for CJK tokenization I found in papers/posts/blogs etc. : http://www.solutions.asia/2016/10/japanese-tokenization.html
- as mentioned previously Apache Lucene has a CJK tokenization strategy that could be adapted; they use bigrams (N-Gram approach from previous post?): https://lucene.apache.org/core/8_5_2/analyzers-common/index.html
- we could use some ideas from this post (use of nltk + chargua) : https://stackoverflow.com/questions/43576136/how-can-i-use-python-nltk-to-identify-collocations-among-single-characters
- or ngram parameters minn/maxn from fasttext (if we want to implement it in python-mw) and extract ngrams ("idea in progress"), see links:
- https://fasttext.cc/docs/en/options.html
- here they mention tokenization of some East Asian languages : https://fasttext.cc/docs/en/crawl-vectors.html
- from the fastText post I got to Stanford NLP projects:
- word segmenter for chinese and arabic implemented also in nltk : https://nlp.stanford.edu/software/segmenter.html
- and MY FAVOURITE Stanza project: https://github.com/stanfordnlp/stanza
I am not yet sure exactly how we can use it, but what I like about Stanza is:
- it's recent (paper published 2020)
- it has nice documentation
- it's not only a wrapper for some other C#/Java/etc. code (if I understand it correctly..)
- it has research/scientific background
- from spacy stanza github
The Stanford models achieved top accuracy in the CoNLL 2017 and 2018 shared task, which involves tokenization, part-of-speech tagging, morphological analysis, lemmatization and labelled dependency parsing in 58 languages.
- it always sounds better when you implement something tested at Stanford/MIT/Harvard/etc. than just some commonly used approach :)
Useful links:
- Github: https://github.com/stanfordnlp/stanza
- there is spaCy and spacy-stanza - spaCy is the main project and Stanza plugs into it (I think? not sure yet) https://github.com/explosion/spacy-stanza
- I do not understand yet how they do tokenization for east asian languages, but they have them listed in models : https://stanfordnlp.github.io/stanza/models.html
- arxiv paper related to Stanza : https://arxiv.org/pdf/2003.07082.pdf
Short summary
If the idea is to make only a slight change to the code to improve python-mw and drafttopic, then let's focus on extracting bigrams (a minimal sketch follows below). If the idea is to test a couple of approaches to create an "East Asian languages tokenization strategy", then let's try both, with a focus on understanding how Stanza works and how we can use it.
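A minimal sketch of the bigram option (overlapping character bigrams over CJK runs, roughly what Lucene's CJKAnalyzer does; the character ranges below are deliberately narrow and only for illustration):

```python
import re

# Matches runs of CJK Unified Ideographs, Hiragana and Katakana; a real implementation
# would need much wider character classes.
CJK_RUN = re.compile(r'[\u4E00-\u9FFF\u3040-\u30FF]+')

def cjk_bigrams(text):
    """Overlapping character bigrams for each CJK run, like Lucene's CJK handling."""
    grams = []
    for run in CJK_RUN.findall(text):
        if len(run) == 1:
            grams.append(run)
        else:
            grams.extend(run[i:i + 2] for i in range(len(run) - 1))
    return grams

print(cjk_bigrams("ホッケーには反則がある"))
# ['ホッ', 'ッケ', 'ケー', 'ーに', 'には', 'は反', '反則', '則が', 'があ', 'ある']
```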
@Pavol86 Stanza has poor performance compared to other Chinese word segmentation tools such as Jieba, pkuseg and THULAC. (I still cannot forget how horrible CoreNLP's processing time is for CWS.) HMM should also be the worst choice to consider. There are significant differences between Chinese, Japanese and Korean tokenization strategies, and treating them as one "East Asian languages tokenization strategy" will never work in my opinion.
I think we can start by focusing on how pkuseg works for Chinese word segmentation and how MeCab works for Japanese word segmentation. (I am not sure about Korean, as I do not speak Korean, but Mecab-ko can be a good start.)
@VulpesVulpes825 thank you for your response. Do you know if there are any datasets/dictionaries that are used for benchmarking of tokenization methods?
Japanese:
I found a common interface for 3 popular Japanese tokenizers:
https://github.com/Kensuke-Mitsuzawa/JapaneseTokenizers
It groups MeCab (mentioned by you and in the fastText documentation), JUMAN and KyTea.
I want to compare the performance of the most popular Chinese and Japanese tokenizers, but I need some benchmarking dataset.
Chinese
I found this dataset from 2005 (quite old, but I guess for a language dataset it should not matter that much):
http://sighan.cs.uchicago.edu/bakeoff2005/
I found pkuseg (also mentioned by you), THULAC, jieba and the Stanford word segmenter (used in the fastText documentation). The THULAC authors list the dataset mentioned above.
- pkuseg https://github.com/lancopku/pkuseg-python
- THULAC https://github.com/thunlp/THULAC-Python
- jieba https://github.com/fxsjy/jieba
- Stanford word segmenter https://nlp.stanford.edu/software/segmenter.html
Korean
so far I only found KoNLPy https://konlpy.org/en/latest/
Vietnamese
vietseg https://github.com/manhtai/vietseg
pivi ? https://github.com/thangntt2/pivi
Summary:
Let's focus on Chinese and Japanese. Do you know of any Japanese dataset?
Note
Export from a Jupyter notebook. After I struggled with JapaneseTokenizers I decided to get at least one of them working - the hardest part is finding a free segmented dataset, and now I know why this task is still open :) - I had forgotten that nothing is free in Japan and open source is a very new concept there.. (my experience...)
1. JAPANESE
1.1. mecab
popular and the fastest, with slightly worse accuracy than other state-of-the-art segmenters, but I was able to get it running...
- speed/installation comparisons :
- instructions from https://taku910.github.io/mecab/#install-unix
- MUST BE RUN AS SUPERUSER (su/sudo)!!!
wget -O mecab-0.996.tar.gz "https://drive.google.com/uc?export=download&id=0B4y35FiV1wh7cENtOXlicTFaRUE"
tar zxvf mecab-0.996.tar.gz
cd mecab-0.996 && ./configure && make && make check
sudo make install
1.2. mecab ipadic dictionary
the dictionary is not included in mecab; ipadic is the IPA dictionary (named after Japan's Information-technology Promotion Agency)
wget -O mecab-ipadic-2.7.0-20070801.tar.gz "https://drive.google.com/uc?export=download&id=0B4y35FiV1wh7MWVlSDBCSXZMTXM"
tar zxvf mecab-ipadic-2.7.0-20070801.tar.gz
cd mecab-ipadic-2.7.0-20070801 && ./configure --with-charset=utf8 && make && make check
sudo make install
1.3 testing corpus
there are fees for text corpora; even those that look free need some CD/DVD from some organization, for which you have to register, request it, etc.; BCCWJ seems popular but I spent my Sunday trying to find a way to download it - no way, but I can request a CD/DVD :D :D :D
https://www.jaist.ac.jp/project/NLP_Portal/doc/LR/lr-cat-e.html
https://lionbridge.ai/datasets/japanese-language-text-datasets/
Dataset
Universal Dependency datasets
For studying the structure of sentences in languages - the structure/tree/hierarchy of words (I think); they use a special notation, CoNLL-U.
https://pypi.org/project/conllu/
conda install -c conda-forge conllu
- from spacy post https://github.com/explosion/spaCy/issues/3756 I got to https://github.com/UniversalDependencies/UD_Japanese-GSD
- from https://github.com/megagonlabs/ginza I got to https://github.com/UniversalDependencies/UD_Japanese-BCCWJ
import MeCab
from conllu import parse_incr

jp_train_data_loc = "UD_Japanese-GSD/ja_gsd-ud-train.conllu"
jp_train_data = open(jp_train_data_loc, "r", encoding="utf-8")

senteces_list = []
tokens = []
for tokenlist in parse_incr(jp_train_data):
    # sentences
    senteces_list.append(tokenlist.metadata['text'])
    # token list of lists
    temp = []
    for token_id in range(len(tokenlist)):
        temp.append(tokenlist[token_id]["form"])
    tokens.append(temp)
print(senteces_list[0])
ホッケーにはデンジャラスプレーの反則があるので、膝より上にボールを浮かすことは基本的に反則になるが、その例外の一つがこのスクープである。
print(tokens[0])
['ホッケー', 'に', 'は', 'デンジャラス', 'プレー', 'の', '反則', 'が', 'ある', 'の', 'で', '、', '膝', 'より', '上', 'に', 'ボール', 'を', '浮かす', 'こと', 'は', '基本', '的', 'に', '反則', 'に', 'なる', 'が', '、', 'その', '例外', 'の', '一', 'つ', 'が', 'この', 'スクープ', 'で', 'ある', '。']
# -Owakati (separate into words)
# -Oyomi   (assign readings)
# -Ochasen (ChaSen compatible)
# -Odump   (full information dump)
wakati = MeCab.Tagger("-Owakati")
print(wakati.parse(senteces_list[0]).split())
['ホッケー', 'に', 'は', 'デンジャラスプレー', 'の', '反則', 'が', 'ある', 'ので', '、', '膝', 'より', '上', 'に', 'ボール', 'を', '浮かす', 'こと', 'は', '基本', '的', 'に', '反則', 'に', 'なる', 'が', '、', 'その', '例外', 'の', '一つ', 'が', 'この', 'スクープ', 'で', 'ある', '。']
2. Chinese
2.1. pkuseg
english documentation https://github.com/lancopku/pkuseg-python/blob/master/readme/readme_english.md
pip install pkuseg
2.2. Dataset
Dataset from Chinese word segmentation workshop http://sighan.cs.uchicago.edu/bakeoff2005/ , tar.bz2 is not available.
Training dataset is separated by spaces, see ".. of course spaces will be removed." in http://sighan.cs.uchicago.edu/bakeoff2005/data/instructions.php.html.
wget -O icwb2-data.zip "http://sighan.cs.uchicago.edu/bakeoff2005/data/icwb2-data.zip"
# sudo apt-get install unzip
unzip icwb2-data.zip
import pkuseg

seg = pkuseg.pkuseg()  # load the default model
text = seg.cut("我爱北京天安门")
print(text)
['我', '爱', '北京', '天安门']
dataset_loc = "icwb2-data/training/msr_training.utf8"

f = open(dataset_loc, "r")
tokenized_content = f.read().replace("\n", "").split(" ")
content = "".join(tokenized_content)
print(content[:50])
“人们常说生活是一部教科书,而血与火的战争更是不可多得的教科书,她确实是名副其实的‘我的大学’。“心
print(tokenized_content[:34])
['“', '人们', '常', '说', '生活', '是', '一', '部', '教科书', ',', '而', '血', '与', '火', '的', '战争', '更', '是', '不可多得', '的', '教科书', ',', '她', '确实', '是', '名副其实', '的', '‘', '我', '的', '大学', '’', '。“', '心']
print(seg.cut(content[:50]))
['“', '人们', '常', '说', '生活', '是', '一', '部', '教科书', ',', '而', '血', '与', '火', '的', '战争', '更', '是', '不可多得', '的', '教科书', ',', '她', '确实', '是', '名副其实', '的', '‘', '我', '的', '大学', '’', '。', '“', '心']
3. Korean
3.1. KoNLPy
includes the following popular tokenizers: Hannanum, Kkma, Komoran, Mecab-ko, Okt
pip install konlpy
3.2. Dataset
Sejong corpus
National balanced corpus, ~2GB, used by many git repos and papers
git clone https://github.com/coolengineer/sejong-corpus
make all
make dict
KoNLPy built-in corpus
Documentation p.13-14:
https://konlpy.org/_/downloads/en/latest/pdf/
from konlpy.corpus import kolaw
from konlpy.corpus import kobill
Mecab-ko dictionary
from https://konlpy.org/en/v0.3.0/install/:
wget https://bitbucket.org/eunjeon/mecab-ko-dic/downloads/mecab-ko-dic-1.6.1-20140814.tar.gz
tar zxfv mecab-ko-dic-1.6.1-20140814.tar.gz
cd mecab-ko-dic-1.6.1-20140814
./configure
sudo ldconfig
make
sudo sh -c 'echo "dicdir=/usr/local/lib/mecab/dic/mecab-ko-dic" > /usr/local/etc/mecabrc'
sudo make install
Useful note
From https://nlp.stanford.edu/fsnlp/korean.html :
There are two main components in Korean orthography: the eumjeol and the eojeol. The eumjeol can be thought of as a syllable, consisting of either a hanja (Chinese characters, now used only in Korea: North Korean orthography does not use them) or a hangul syllable (hangul being the name of the Korean alphabet: a hangul is composed of one to three jamo). An eojeol is a sequence of one or more eumjeol, separated by spaces. An eojeol can represent a single inflected lexeme (Korean is quite agglutinative) or several lexemes. The placement of spaces in Korean is often more a matter of style than morphology, and spaces often appear where a pause in speech would be heard. The components of an eojeol, the Korean analog of "morphemes", are called hyung-tae-so.
from konlpy.corpus import kolaw

c = kolaw.open('constitution.txt').read()
print(c[:40])
대한민국헌법 유구한 역사와 전통에 빛나는 우리 대한국민은 3·1운동으로
from konlpy.tag import Mecab

mecab = Mecab()
print(mecab.morphs(c[:40]))
print(mecab.nouns(c[:40]))
['대한민국', '헌법', '유구', '한', '역사', '와', '전통', '에', '빛나', '는', '우리', '대한', '국민', '은', '3', '·', '1', '운동', '으로']
['대한민국', '헌법', '역사', '전통', '우리', '국민', '운동']
4. Evaluation
I do not yet know how to evaluate the word segmentation results (maybe this? https://segeval.readthedocs.io/en/latest/)
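One simple way to score a segmenter against gold data like the bakeoff corpus (where the training file gives the gold tokens) is span-level precision/recall/F1; a sketch of that, under the assumption that gold and prediction cover the same raw string:

```python
def spans(tokens):
    """Convert a token list into a set of (start, end) character spans."""
    out, pos = set(), 0
    for tok in tokens:
        out.add((pos, pos + len(tok)))
        pos += len(tok)
    return out

def score(gold_tokens, pred_tokens):
    """Word-level precision, recall and F1 over matching spans."""
    gold, pred = spans(gold_tokens), spans(pred_tokens)
    correct = len(gold & pred)
    precision = correct / len(pred)
    recall = correct / len(gold)
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1

# toy example: gold from the space-separated bakeoff file, pred from a segmenter
gold = ['我', '爱', '北京', '天安门']
pred = ['我', '爱', '北京', '天安', '门']
print(score(gold, pred))  # (0.6, 0.75, ~0.667)
```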
@zhuyifei1999, I wonder if you'd be able to help us evaluate the quality of the Chinese word segmentation?
@jeena, would you be able to help us evaluate the quality of Japanese word segmentation?
@Pavol86, I think a next step we can do here is to look at our tokenizer (https://github.com/halfak/deltas/blob/master/deltas/tokenizers/wikitext_split.py) and consider how we can implement these word segmenters in an intelligent way. E.g. should we apply word segmentation generally across all wiki tokenization? If we run into Chinese text in, say, English Wikipedia, we might segment its words. If we do that, how much will it slow down the tokenizer on Latin text? Would it increase the memory footprint substantially? It might make more sense to have a context-specific tokenizer that we apply based on the target language.
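To make the "context-specific tokenizer" option concrete, here is a rough sketch (the language-to-segmenter mapping is only a placeholder for discussion): segmenters are loaded lazily and cached per language, so tokenizing Latin-script wikis never pays the CJK start-up or memory cost.

```python
# Lazily build one segmenter per language code and cache it, so that wikis which never
# see CJK text never import or load the heavy models.  Which segmenter goes with which
# language here is an assumption for illustration only.
_SEGMENTERS = {}

def get_segmenter(lang):
    if lang not in _SEGMENTERS:
        if lang == 'zh':
            import jieba
            _SEGMENTERS[lang] = lambda text: list(jieba.cut(text))
        elif lang == 'ja':
            import MeCab
            tagger = MeCab.Tagger('-Owakati')
            _SEGMENTERS[lang] = lambda text: tagger.parse(text).split()
        else:
            _SEGMENTERS[lang] = None  # keep the existing whitespace/regex behaviour
    return _SEGMENTERS[lang]

# usage: only zhwiki/jawiki requests ever trigger a model load
segment = get_segmenter('zh')
if segment is not None:
    print(segment("我爱北京天安门"))
```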
@Pavol86, do you think you could create some samples of content that has been segmented? E.g., I'm imagining ~10 example paragraphs from relevant wikis presented next to the segmented/processed text? Maybe we could set up a good spreadsheet or a wiki page to allow people to make notes.
@Pavol86, @Halfak. Sorry for the late reply. I am a little busy recently.
Chinese
I did a Chinese CWS tool performance comparison recently (two months ago) and I have pasted the results from my report. The dataset is bakeoff2005's Simplified Chinese PKU corpus.
Package | Precision | Recall | F-Measure | Speed |
---|---|---|---|---|
pkuseg | 0.962 | 0.926 | 0.934 | 3.60s |
THULAC | 0.917 | 0.815 | 0.923 | 0.39s |
CoreNLP | 0.901 | 0.894 | 0.897 | 3s++ |
jieba | 0.850 | 0.784 | 0.816 | 0.23s |
Conclusion
For accuracy, choose pkuseg or THULAC. If you want long-term support, choose jieba, as Chinese university projects tend not to stay alive and move on to a different CWS package every few years. What worries me is that because Chinese Wikimedia projects use mixed characters, mixing Simplified and Traditional Chinese together, this may decrease CWS tool accuracy.
Japanese
Corpus
There are two corpora that you can download for free from Kyoto University:
Result
I have copied the results from RNN 言語モデルを用いた日本語形態素解析の実用化 ("Practical application of Japanese morphological analysis using an RNN language model"), an article from Kyoto University from 2016.
Package | Analysis data F-measure | Precision data F-measure |
---|---|---|
MECAB | 97.89 | 97.91 |
JUMAN | 97.99 | 98.00 |
And the speed from one webpage that conducted a comparison between different Japanese CWS tools in 2019:
Package | Time |
---|---|
MeCab | 0.226s |
JUMAN | 3.661s |
JUMAN++(v2) | 6.706s |
Sudachi | 4.119s |
SudachiPy | 74.872s |
Conclusion
For speed, choose MeCab. For long-term support, choose JUMAN++, as its F-measure is similar to MeCab's and it is in continuous development, whereas MeCab's development stopped in 2018.
P.S.
We should apply the general tokenizer for all sites except CJK-language ones, since CJK languages do not use spaces as word boundaries. It is not necessary to have a Chinese CWS tool for only a few sentences on Wikipedia, as sometimes they are just there to reference the original sentence or to display what the object is called in Chinese. I think it may be best to just treat Chinese in English Wikipedia as a whole word block.
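A tiny sketch of that "whole word block" idea for non-CJK wikis (the character class below is intentionally minimal and would need the full ranges used later in this thread):

```python
import re

# Treat any contiguous run of CJK characters as one opaque token and fall back to
# whitespace splitting elsewhere; no segmenter needs to be loaded at all.
WORD_OR_CJK_BLOCK = re.compile(r'[\u4E00-\u9FFF\u3040-\u30FF\uAC00-\uD7AF]+|\S+')

print(WORD_OR_CJK_BLOCK.findall("The dish is called 김치 (kimchi) in Korean."))
# ['The', 'dish', 'is', 'called', '김치', '(kimchi)', 'in', 'Korean.']
```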
@Halfak
Looking at the Japanese result posted by @Pavol86
['ホッケー', 'に', 'は', 'デンジャラスプレー', 'の', '反則', 'が', 'ある', 'ので', '、', '膝', 'より', '上', 'に', 'ボール', 'を', '浮かす', 'こと', 'は', '基本', '的', 'に', '反則', 'に', 'なる', 'が', '、', 'その', '例外', 'の', '一つ', 'が', 'この', 'スクープ', 'で', 'ある', '。']
'に', 'は' (words 1, 2) could each stand alone as a separate word but in this context are one word, には
デンジャラスプレー (word 3) is actually two foreign words, デンジャラス and プレー. I haven't seen any dictionary that lumps them together as one word.
'基本', '的', 'に' (words 19, 20, 21) could each stand alone as a separate word but combined together they are one word, 基本的に.
Note that some punctuation has been included in the tokenized output.
NOTE1: I have some issues with creating a pull request for the deltas package; we should be able to resolve it with Aaron...
NOTE2: this is the application + explanation of the code. I thank @jeena and @VulpesVulpes825 for their ideas; we will check the performance of other dictionaries/tools, but for now I wanted to have a working basic CJK tokenizer..
Tokenizer description as sent for a git pull request
Main changes:
- I added new type "cjk_word"
- I added hangul Unicode symbols (Korean alphabet)
- I added a cjk=True/False flag to the regexp tokenizer
- False - the text is tokenized "almost as before" - based on regexps etc., but CJK words are not tokenized into individual symbols; they are kept as a continuous sequence of symbols (delimited by whitespace etc., just like any other word) and are marked as "cjk_word", just as @VulpesVulpes825 suggested in a note in https://phabricator.wikimedia.org/T111179
- True - everything is done as for False + the text is checked for the counts of CJK, Japanese and Korean symbols; if at least 25% of them are Japanese or Korean symbols then the Japanese/Korean tokenizer is used (depending on which has the higher count) - otherwise the Chinese tokenizer is used. This is needed because simplified and traditional Chinese characters are also used in Japanese and Korean writing. The tokenizer goes through the tokens previously marked as "cjk_word" (from the end to the beginning of the text), segments them, and replaces the original word.
Examples
To test cjk=False, check the results on mixed-language articles - English Wikipedia articles which include Chinese, Japanese or Korean text. Change only the title, e.g. to:
- "China" - Chinese
- "Haiku" - Japanese
- "Kimchi" - Korean
To test cjk=True, tokenize the same articles in the Chinese (zh), Japanese (ja) and Korean (ko) wikis. Adjust the URL for each language and change the title to the following (the same articles as above, only in Chinese, Japanese or Korean):
- "김치" - Korean(ko)
- "中国" - Chinese(zh)
- "俳句" - Japanese(ja)
CJK = TRUE/FALSE test
You can see that when cjk=True the long word at the beginning is segmented.
import mwapi
import deltas
import deltas.tokenizers
import importlib
importlib.reload(deltas.tokenizers)

session = mwapi.Session("https://ja.wikipedia.org")
doc = session.get(action="query", prop="revisions", titles="俳句",
                  rvprop="content", rvslots="main", formatversion=2)
text = doc['query']['pages'][0]['revisions'][0]['slots']['main']['content']
tokenized_text_cjk_false = deltas.tokenizers.wikitext_split.tokenize(text, cjk=False)
tokenized_text_cjk_true = deltas.tokenizers.wikitext_split.tokenize(text, cjk=True)
Sending requests with default User-Agent. Set 'user_agent' on mwapi.Session to quiet this message.
tokenized_text_cjk_false[:20]
[Token('{{', type='dcurly_open'), Token('Otheruses', type='word'), Token('|', type='bar'), Token('|', type='bar'), Token('角川文化振興財団の俳句総合誌', type='cjk_word'), Token('|', type='bar'), Token('俳句', type='cjk_word'), Token(' ', type='whitespace'), Token('(', type='paren_open'), Token('雑誌', type='cjk_word'), Token(')', type='paren_close'), Token('}}', type='dcurly_close'), Token('\n', type='whitespace'), Token('{{', type='dcurly_open'), Token('複数の問題', type='cjk_word'), Token('\n', type='whitespace'), Token('|', type='bar'), Token(' ', type='whitespace'), Token('参照方法', type='cjk_word'), Token(' ', type='whitespace')]
tokenized_text_cjk_true[:20]
[Token('{{', type='dcurly_open'), Token('Otheruses', type='word'), Token('|', type='bar'), Token('|', type='bar'), Token('角川', type='cjk_word'), Token('文化', type='cjk_word'), Token('振興', type='cjk_word'), Token('財団', type='cjk_word'), Token('の', type='cjk_word'), Token('俳句', type='cjk_word'), Token('総合', type='cjk_word'), Token('誌', type='cjk_word'), Token('|', type='bar'), Token('俳句', type='cjk_word'), Token(' ', type='whitespace'), Token('(', type='paren_open'), Token('雑誌', type='cjk_word'), Token(')', type='paren_close'), Token('}}', type='dcurly_close'), Token('\n', type='whitespace')]
Walkthrough the process...
1. Get the text and regexp tokenize it
import mwapi
import deltas
import deltas.tokenizers
import importlib
importlib.reload(deltas.tokenizers)

# example titles for mixed-language articles (en wiki pages which include zh/ja/ko text):
#   "Haiku" - Japanese; "Kimchi" - Korean; "China" - Chinese
# the same titles in the zh/ja/ko wikis: "김치" - Korean (ko), "中国" - Chinese (zh), "俳句" - Japanese (ja)
session = mwapi.Session("https://ja.wikipedia.org")
doc = session.get(action="query", prop="revisions", titles="俳句",
                  rvprop="content", rvslots="main", formatversion=2)
text = doc['query']['pages'][0]['revisions'][0]['slots']['main']['content']
tokenized_text = deltas.tokenizers.wikitext_split.tokenize(text, cjk=False)
Sending requests with default User-Agent. Set 'user_agent' on mwapi.Session to quiet this message.
text[::-1][:500]
"]]ルンャジの学文:yrogetaC[[\n]]形詩:yrogetaC[[\n]]詩:yrogetaC[[\n]]諧俳:yrogetaC[[\n]]くいは*|句俳:yrogetaC[[\n}}くいは:TROSTLUAFED{{\n\n}}能芸統伝の本日{{\n\n')句俳音(ukiaH-otO' ].baL dnuoS s'reenoiP lmth.xedni/baldnuos/pj.reenoip//:ptth[*\n]hcnerF dna nailatI ,hsilgnE ,esenapaJ ni 句俳 ukiaH /ti.iniccipinot.www//:ptth[*\n]ukiaH odraciR /moc.topsgolb.ukiah-odracir//:ptth[*\n]sukiaH hsinapS sukiah/moc.muirotircse.www//:ptth[*\n]ukiaH led onimaC úbmab ed euqsoB /ubmab_ed_euqsob/moc.seiticoeg.se//:ptth[*\n]aciremA fo yteicoS ukiaH ehT /gro.uk"
tokenized_text[::-1][:20]
[Token(']]', type='dbrack_close'), Token('文学のジャンル', type='cjk_word'), Token(':', type='colon'), Token('Category', type='word'), Token('[[', type='dbrack_open'), Token('\n', type='whitespace'), Token(']]', type='dbrack_close'), Token('詩形', type='cjk_word'), Token(':', type='colon'), Token('Category', type='word'), Token('[[', type='dbrack_open'), Token('\n', type='whitespace'), Token(']]', type='dbrack_close'), Token('詩', type='cjk_word'), Token(':', type='colon'), Token('Category', type='word'), Token('[[', type='dbrack_open'), Token('\n', type='whitespace'), Token(']]', type='dbrack_close'), Token('俳諧', type='cjk_word')]
2. Find out which tokenizer you should use
cjk_char = (
    r'[' +
    r'\uAC00-\uD7AF' +  # hangul syllables
    r'\u1100-\u11FF' +  # hangul jamo
    r'\u3130-\u318F' +  # hangul
    r'\uA960-\uA97F' +  # hangul
    r'\uD7B0-\uD7FF' +  # hangul
    r'\u4E00-\u62FF' +  # noqa Unified Ideographs
    r'\u6300-\u77FF' +
    r'\u7800-\u8CFF' +
    r'\u8D00-\u9FCC' +
    r'\u3400-\u4DFF' +  # Unified Ideographs Ext A
    r'\U00020000-\U000215FF' +  # Unified Ideographs Ext. B
    r'\U00021600-\U000230FF' +
    r'\U00023100-\U000245FF' +
    r'\U00024600-\U000260FF' +
    r'\U00026100-\U000275FF' +
    r'\U00027600-\U000290FF' +
    r'\U00029100-\U0002A6DF' +
    r'\uF900-\uFAFF' +  # Compatibility Ideographs
    r'\U0002F800-\U0002FA1F' +  # Compatibility Ideographs Suppl.
    r'\u3041-\u3096' +  # Hiragana
    r'\u30A0-\u30FF' +  # Katakana
    r'\u3400-\u4DB5' +  # Kanji
    r'\u4E00-\u9FCB' +
    r'\uF900-\uFA6A' +
    r'\u2E80-\u2FD5' +  # Kanji radicals
    r'\uFF5F-\uFF9F' +  # Katakana and Punctuation (Half Width)
    r'\u31F0-\u31FF' +  # Miscellaneous Japanese Symbols and Characters
    r'\u3220-\u3243' +
    r'\u3280-\u337F' +
    r']'
)

jap_char = (
    r'[' +
    r'\u3041-\u3096' +  # Hiragana
    r'\u30A0-\u30FF' +  # Katakana
    r'\u3400-\u4DB5' +  # Kanji
    r'\u2E80-\u2FD5' +  # Kanji radicals
    r'\uFF5F-\uFF9F' +  # Katakana and Punctuation (Half Width)
    r'\u31F0-\u31FF' +  # Miscellaneous Japanese Symbols and Characters
    r']'
)

# https://en.wikipedia.org/wiki/Hangul
# https://en.wikipedia.org/wiki/Hangul_Jamo_(Unicode_block)
# https://en.wikipedia.org/wiki/Hangul_Syllables
kor_char = (
    r'[' +
    r'\uAC00-\uD7AF' +  # hangul syllables
    r'\u1100-\u11FF' +  # hangul jamo
    r'\u3130-\u318F' +  # hangul
    r'\uA960-\uA97F' +  # hangul
    r'\uD7B0-\uD7FF' +  # hangul
    r']'
)

CJK_LEXICON = {
    'cjk': cjk_char,
    'japanese': jap_char,
    'korean': kor_char,
}
import re

regex_cjk = re.compile(CJK_LEXICON['cjk'])
regex_japanese = re.compile(CJK_LEXICON['japanese'])
regex_korean = re.compile(CJK_LEXICON['korean'])
# The Haiku article had 49% Japanese characters and the rest were generic "cjk", so I decided on
# the "empirical rule of thumb" that at least 25% of the text must be Korean/Japanese.
# For the Korean Kimchi article, >90% of the symbols were from Hangul.
char_lang_count = {
    'cjk': 1,
    'japanese': 0.75 + len(regex_japanese.findall(text)) / len(regex_cjk.findall(text)),
    'korean': 0.75 + len(regex_korean.findall(text)) / len(regex_cjk.findall(text)),
}
char_lang = max(char_lang_count, key=char_lang_count.get)
char_lang_count
{'cjk': 1, 'japanese': 1.241775763679087, 'korean': 0.75}
3. Winner is Japanese
char_lang
'japanese'
4. We find indices of cjk_words
cjk_word_indices = list(filter(lambda x: tokenized_text[x].type == 'cjk_word', range(len(tokenized_text))))
5. check the output of the tokenizer
# Japanese
import MeCab as jp_mecab

seg = jp_mecab.Tagger("-Owakati")
for i in cjk_word_indices[::-1][:20]:
    temp = seg.parse(tokenized_text[i]).split()
    print(temp)
['文学', 'の', 'ジャンル']
['詩形']
['詩']
['俳諧']
['は', 'いく']
['俳句']
['は', 'いく']
['日本', 'の', '伝統', '芸能']
['音', '俳句']
['俳句']
['日本', '漢', '俳', '学会']
['国際', '俳句', '交流', '協会']
['ジャック', 'たけし', 'の', '英語', '俳句']
['世界', '俳句', '協会']
['滑稽', '俳句', '協会']
['新', '俳句', '人', '連盟']
['俳人', '協会']
['現代', '俳句', '協会']
['日本', '伝統', '俳句', '協会']
['外部', 'リンク']
6. Apply the Token class (as in deltas package) on new segmented cjk_words and replace the previous sequences
class Token(str):
    """
    Constructs a typed sub-string extracted from a text.
    """
    __slots__ = ("type")

    def __new__(cls, content, *args, **kwargs):
        if isinstance(content, cls):
            return content
        else:
            return super().__new__(cls, content)

    def tokens(self):
        """
        Returns an iterator of *self*.  This method reflects the behavior of
        :meth:`deltas.Segment.tokens`
        """
        yield self

    def __init__(self, content, type=None):
        self.type = str(type) if type is not None else None
        """
        An optional value describing the type of token.
        """

    def __repr__(self):
        return "{0}({1}, type={2})" \
               .format(self.__class__.__name__, super().__repr__(), repr(self.type))


token_class = Token
# Japanese with tokenization: segment each cjk_word and splice the pieces back in place
for i in cjk_word_indices[::-1]:
    segmented_cjk_token = seg.parse(tokenized_text[i]).split()
    tokenized_text[i:i+1] = [token_class(word, type="cjk_word")
                             for word in segmented_cjk_token]
tokenized_text[::-1][:20]
[Token(']]', type='dbrack_close'), Token('ジャンル', type='cjk_word'), Token('の', type='cjk_word'), Token('文学', type='cjk_word'), Token(':', type='colon'), Token('Category', type='word'), Token('[[', type='dbrack_open'), Token('\n', type='whitespace'), Token(']]', type='dbrack_close'), Token('詩形', type='cjk_word'), Token(':', type='colon'), Token('Category', type='word'), Token('[[', type='dbrack_open'), Token('\n', type='whitespace'), Token(']]', type='dbrack_close'), Token('詩', type='cjk_word'), Token(':', type='colon'), Token('Category', type='word'), Token('[[', type='dbrack_open'), Token('\n', type='whitespace')]
APPENDIX - Chinese and Korean tokenizers
Chinese
import pkuseg

seg = pkuseg.pkuseg()  # load the default model
for i in cjk_word_indices[::-1][:2]:
    print(seg.cut(tokenized_text[i]))
['亞洲', '分裂', '地', '區'] ['分類']
Korean
In Korean, some symbol combinations result in an empty list; I guess this is because they are not nouns. They can still be separated into morphs - see the result below - but this is too fine a granularity for our purpose, so if the result is empty I just keep the previous Token.
from konlpy.tag import Mecab as ko_mecab

seg = ko_mecab()
seg.nouns("있다")
seg.morphs("있다")
['있', '다']
for i in cjk_word_indices[::-1][:20]:
    temp = seg.nouns(tokenized_text[i])
    if temp == []:
        temp = tokenized_text[i]
    print(temp)
['식품']
['발효']
['분류']
['요리']
['한국']
['분류']
['김치']
['분류']
['통제']
['전거']
['글로벌', '세계', '대백', '사전']
['김치']
['한국', '김치', '절임', '식품', '공업', '협동조합']
['과학']
['김치']
농익은
['캐스트']
['네이버']
['김치']
['김치']
Talking with @Pavol86, it looks like we need to be able to install mecab and the related dictionaries in order to process Japanese and Korean.
These are Pavol's notes:
- CHINESE
- install pkuseg
pip install pkuseg
- JAPANESE
- install mecab
wget -O mecab-0.996.tar.gz "https://drive.google.com/uc?export=download&id=0B4y35FiV1wh7cENtOXlicTFaRUE"
tar zxvf mecab-0.996.tar.gz
cd mecab-0.996 && ./configure && make && make check
sudo make install
- install mecab ipadic dictionary
wget -O mecab-ipadic-2.7.0-20070801.tar.gz "https://drive.google.com/uc?export=download&id=0B4y35FiV1wh7MWVlSDBCSXZMTXM"
tar zxvf mecab-ipadic-2.7.0-20070801.tar.gz
cd mecab-ipadic-2.7.0-20070801 && ./configure --with-charset=utf8 && make && make check
sudo make install
- KOREAN
- install konlpy
pip install konlpy
- install Mecab-ko dictionary
wget https://bitbucket.org/eunjeon/mecab-ko-dic/downloads/mecab-ko-dic-1.6.1-20140814.tar.gz
tar zxfv mecab-ko-dic-1.6.1-20140814.tar.gz
cd mecab-ko-dic-1.6.1-20140814
./configure
sudo ldconfig
make
sudo sh -c 'echo "dicdir=/usr/local/lib/mecab/dic/mecab-ko-dic" > /usr/local/etc/mecabrc'
sudo make install
We're going to need to get these from apt repos somehow rather than compiling them ourselves.
@Halfak I need your feedback on the following. As per our call last week, I did the following:
- make the ch, jp, ko tokenizer decision more explicit in the code
- add "# noqa" to lines that should have >85 chars - as a workaround for flake8/pep8 test
- performance tests: run the code 100-1000x on the same article and compare the performance between the previous/new version and cjk tokenization True/False (a rough timing sketch follows this list)
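A rough sketch of how those repeated runs could be timed (the cjk= flag comes from the branch under discussion, and the saved-article path is hypothetical):

```python
import time
from deltas.tokenizers import wikitext_split

def benchmark(text, cjk, runs=100):
    """Tokenize the same text `runs` times and return the mean seconds per run."""
    start = time.perf_counter()
    for _ in range(runs):
        wikitext_split.tokenize(text, cjk=cjk)
    return (time.perf_counter() - start) / runs

# hypothetical file containing a saved article text
text = open("haiku_article.txt", encoding="utf-8").read()
print("cjk=False:", benchmark(text, cjk=False))
print("cjk=True: ", benchmark(text, cjk=True))
```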
3.1 I tested the performance of the original deltas tokenizer, see the following boxplots - the y-axis marks the type of wiki and type of text (EN wiki with EN text, Chinese wiki with Chinese text, ... EN wiki with Chinese text, etc.)
3.2 I found out that loading the Chinese tokenizer model is a bottleneck, so I tested pkuseg, THULAC and jieba on the Chinese wiki with Chinese text. Jieba is the only tokenizer that needs to be initialized only once and is then kept in memory; pkuseg and THULAC take 2-3s to initialize. Model load time of each tokenizer (including Japanese and Korean; a small measurement sketch follows the table):
language | model | load time (s) | additional installation needed?
---|---|---|---
ch | pkuseg | 2.687 | yes
ch | thulac | 1.878 | no
ch | jieba | 1.358 | no
jp | mecab | 0.0023 | yes
ko | hannanum | 0.0011 | no
ko | kkma | 0.0014 | no
ko | komoran | 1.695 | no
ko | mecab | 0.0013 | yes
ko | okt | 0.0007 | no
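For reference, a wrapper like the one below can produce load times of this kind (this is my reconstruction, not necessarily how the numbers above were measured; jieba.initialize() forces its lazy dictionary load):

```python
import time

def timed_load(label, loader):
    """Time a model-loading callable and return whatever it built."""
    start = time.perf_counter()
    model = loader()
    print(f"{label}: {time.perf_counter() - start:.3f}s")
    return model

import jieba
import pkuseg

timed_load("ch | jieba", jieba.initialize)   # jieba loads its dictionary lazily; force it here
timed_load("ch | pkuseg", pkuseg.pkuseg)     # constructing the segmenter loads the default model
```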
Chinese text tokenization with each tokenizer (on Chinese wiki with Chinese text and EN wiki with some Chinese text):
3.3 I tested the performance of the new deltas tokenizer (which has CJK tokenization) with the new jieba Chinese tokenizer and the cjk flag set to True/False:
NOTES:
- Please consider the scale on the y-axis; I did not unify the scales, as the test of the Chinese tokenizers ranges from 1s to 8s - this would make the other plots pointless
- Jieba has worse accuracy but is much faster + it does not need an additional Mecab dictionary installation - pip install is enough
- We also talked about packaging/deployment of the solution
3.1 Chinese - Jieba needs only pip install, no additional dictionary
3.2 Japanese - every package I found needed additional installation, i.e. they were wrapper methods for MeCab, JUMAN, JUMAN++, KyTea, ..., and the tokenizer itself had to be installed separately
3.3 Korean - we can use built-in methods like Hannanum and Kkma that do not need any additional installation (see page 33 of https://konlpy.org/_/downloads/en/latest/pdf/) - you may see the load times in the table under point 3.2 (a short usage sketch follows)
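A quick sketch of those no-extra-install Korean options via KoNLPy (the sample sentence is just the constitution snippet used earlier in this thread):

```python
from konlpy.tag import Hannanum, Kkma, Okt

sample = "대한민국헌법 유구한 역사와 전통에 빛나는 우리 대한국민은"

# each class wraps a different analyzer; morphs() returns the segmented morphemes
for name, tagger in [("hannanum", Hannanum()), ("kkma", Kkma()), ("okt", Okt())]:
    print(name, tagger.morphs(sample))
```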
@Halfak, please send me feedback: which approach do you want to take? Chinese/Korean are no longer an issue; is it possible to get MeCab or any other Japanese tokenizer installed in production?
Given that we are likely trying to use these segmenters in order to get *signal* and not to translate or do something more exact, I'm a fan of faster, lower accuracy, and easier to install methods. It looks like Japanese will be the most difficult.
FINAL NOTES (hopefully :) ):
Japanese:
- I hadn't tried SudachiPy before, as I saw poor performance stats. It is the only Japanese tokenizer that I was able to get running just by "pip install", without any additional instructions
- SudachiPy model loads quickly:
jp_sudachi model load time: 0.03719305992126465
- SudachiPy has 3 splitting modes A|B|C (https://github.com/WorksApplications/Sudachi ):
Sudachi provides three modes of splitting. In A mode, texts are divided into the shortest units equivalent to the UniDic short unit. In C mode, it extracts named entities. In B mode, into the middle units.
- I decided to use "mode A" - after checking splitting results
- SudachiPy has 3 possibilities for dictionary small | core[default] | full (https://github.com/WorksApplications/SudachiDict):
- Small: includes only the vocabulary of UniDic
- Core: includes basic vocabulary (default)
- Full: includes miscellaneous proper nouns
- there is only a slight difference in the performance of tokenizer with each dict. (small slightly faster than core, etc.), see:
- I recommend using the full dict with split mode A; to download/use the full dict (a short mode-comparison sketch follows the install commands):
pip install sudachidict_full
sudachipy link -t full
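A short sketch of the three split modes, following the SudachiPy README (the example string is arbitrary and the exact splits depend on the linked dictionary):

```python
from sudachipy import dictionary, tokenizer

tokenizer_obj = dictionary.Dictionary().create()  # uses whichever dictionary is linked
text = "国家公務員は働いている"

for mode in (tokenizer.Tokenizer.SplitMode.A,
             tokenizer.Tokenizer.SplitMode.B,
             tokenizer.Tokenizer.SplitMode.C):
    print(mode, [m.surface() for m in tokenizer_obj.tokenize(text, mode)])
```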
Korean:
- I decided to go with Okt (the Twitter tokenizer): (i) fastest model load time, (ii) 2nd best performance after Mecab (which needs additional installation); for a performance comparison see: https://github.com/open-korean-text/open-korean-text
Chinese:
- I decided to use jieba, see my previous post in the thread (a minimal usage sketch follows)
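Since jieba itself never appears in code earlier in the thread, a minimal usage sketch:

```python
import jieba

# jieba loads its dictionary lazily, so the first cut() call (or an explicit
# jieba.initialize()) pays the ~1s load cost from the table above; later calls reuse it.
print(list(jieba.cut("我爱北京天安门")))                 # default (accurate) mode
print(list(jieba.cut("我爱北京天安门", cut_all=True)))   # full mode: all candidate words
```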
PERFORMANCE:
- no additional install needed, "pip install" is enough!
- as I previously mentioned, I ran the test 100x on the following wiki pages:
"https://en.wikipedia.org" : ["The_Doors", "China", "Haiku", "Kimchi"] "https://zh.wikipedia.org" : "中國" ("China" in Chinese) "https://ja.wikipedia.org" : "俳句" ("Haiku" in Japanse) "https://ko.wikipedia.org" : "김치" ("Kimchi" in Korean)
- cjk tokenization turned on (cjk = True)
- cjk tokenization turned off (cjk = False)
I believe we can move forward, do the final adjustments to the code, and get it into production...
side notes:
- maybe some strategies from FB research library LASER could be utilized for wiki:
https://github.com/facebookresearch/LASER
- just when I stopped searching, I found a free Japanese corpus for testing :
@Pavol86. Congratulations on finding a Japanese word segmentation package that does not require additional compiling. After my own testing, I believe SudachiPy's word segmentation performance is good enough. However, I believe going with the full dictionary option and B mode is a better choice, as A mode segments too much. A mode would basically segment the equivalent of "basically" into "basic" and "lly". In the Chinese word segmentation guidelines, this is not acceptable.
Also, is it possible for you to point me to the source code of your tokenization tool? I believe this tokenization tool, with modification, can be used in other projects such as Content Translation.
@VulpesVulpes825 thank you for the recommendation! I do not speak any of the languages, so I am "best guessing" all the way :). The CJK tokenization should end up as part of the deltas library - https://github.com/halfak/deltas . I prepared the code to be merged into deltas (pull request) and I have a call with @Halfak today. I will keep you updated..