Chinese language utilities
Closed, Resolved (Public)

Assigned To

Authored By

	ToAruShiroiNeko
	Aug 17 2015, 8:09 PM

Description

generate TF-IDF badword lists
review and aggregation of badwords/informal words by native speaker
implement revscoring.language.Language (Language utility)

Related Objects

Status	Assigned	Task
Open	None	T227094 Update RC Filters for new ORES capacities (July, 2019)
Resolved	SBisson	T225561 Update ORES thresholds for nlwiki
Open	None	T223273 Update srwiki thresholds for goodfaith model
Resolved	SBisson	T225562 Deploy ORES filters for zhwiki
Open	None	T225563 Deploy ORES filters for jawiki
Resolved	Halfak	T224484 ORES deployment: Early June
Resolved	Halfak	T224481 Train/test zhwiki editquality models
Resolved	Halfak	T223382 Improvements to ORES localization and support
Resolved	Halfak	T109366 Chinese language utilities
Resolved	None	T110841 Chinese language compatibility
Resolved	Ladsgroup	T110964 TF-IDF to determine global stop words
Resolved	Ladsgroup	T109844 Omit the interwikilinks from stop words
Resolved	Pavol86	T111179 Tokenization of "word" things for CJK

Event Timeline

ToAruShiroiNeko lowered the priority of this task from Unbreak Now! to Low.Aug 30 2015, 2:56 PM

ToAruShiroiNeko raised the priority of this task from Low to High.

ToAruShiroiNeko added a subtask: T110841: Chinese language compatibility.

ToAruShiroiNeko added a subtask: T110964: TF-IDF to determine global stop words.Aug 31 2015, 10:56 PM

Halfak moved this task from Backlog to Paused on the Machine-Learning-Team (Active Tasks) board.Sep 17 2015, 10:50 PM

Halfak moved this task from Paused to Backlog on the Machine-Learning-Team (Active Tasks) board.

Halfak moved this task from Backlog to Paused on the Machine-Learning-Team (Active Tasks) board.Sep 18 2015, 5:01 PM

Ladsgroup closed subtask T110964: TF-IDF to determine global stop words as Resolved.Sep 22 2015, 10:33 AM

Halfak moved this task from Paused to Backlog on the Machine-Learning-Team (Active Tasks) board.Sep 25 2015, 4:48 PM

Halfak updated the task description.Oct 30 2015, 5:37 PM

@ToAruShiroiNeko, It looks like JimmyZu did some work on this list, but some of the items don't seem to make sense (e.g. "benjamin"). We also don't have any informals. Thoughts?

Halfak closed subtask T110841: Chinese language compatibility as Resolved.Nov 19 2015, 11:43 PM

Liuxinyu970226 added a subscriber: Shizhao.Dec 9 2015, 8:21 AM

In T109366#1807650, @Halfak wrote:

@ToAruShiroiNeko, It looks like JimmyZu did some work on this list, but some of the items don't seem to make sense (e.g. "benjamin"). We also don't have any informals. Thoughts?

Typo: His name is @jimmyxu

Liuxinyu970226 awarded a token.Dec 18 2015, 2:51 PM

Halfak moved this task from Backlog to Parked on the Machine-Learning-Team (Active Tasks) board.Apr 2 2016, 8:22 PM

Halfak edited projects, added Machine-Learning-Team; removed Machine-Learning-Team (Active Tasks).Apr 3 2016, 8:01 AM

Halfak moved this task from Unsorted to Blocked on community input on the Machine-Learning-Team board.Apr 4 2016, 2:47 PM

Halfak added a project: revscoring.

Halfak lowered the priority of this task from High to Lowest.Jul 6 2016, 3:10 PM

Halfak merged a task: T145663: Add language support for Chinese.Sep 15 2016, 3:00 PM

Stang subscribed.Sep 23 2016, 9:42 AM

I asked about this a week ago, but I didn't know about this task until now… I think the "badwords" and "informal" word lists should be updated with the help of the local community. A notice has been published on our Village Pump.

Shizhao added a parent task: T125033: [DO NOT USE] Chinese Wikimedia projects (tracking) [superseded by #Chinese-Sites].Sep 28 2016, 3:39 AM

Aklapper added a project: Chinese-Sites.Dec 21 2016, 9:58 AM

Aklapper removed a parent task: T125033: [DO NOT USE] Chinese Wikimedia projects (tracking) [superseded by #Chinese-Sites].Dec 21 2016, 10:03 AM

https://meta.wikimedia.org/wiki/Research:Revision_scoring_as_a_service/Word_lists/zh still looks messy. @Cosine02, was there any response on the Village Pump?

In T109366#3058575, @Halfak wrote:

https://meta.wikimedia.org/wiki/Research:Revision_scoring_as_a_service/Word_lists/zh still looks messy. @Cosine02, was there any response on the Village Pump?

I have asked on the zhwp Village Pump: https://zh.wikipedia.org/w/index.php?title=Wikipedia:%E4%BA%92%E5%8A%A9%E5%AE%A2%E6%A0%88/%E6%8A%80%E6%9C%AF&oldid=43402178#ORES.E7.9A.84word_list.E9.9C.80.E8.A6.81.E4.B8.AD.E6.96.87.E7.A4.BE.E7.BE.A4.E7.BB.99.E5.87.BA.E5.8F.8D.E9.A6.88.E5.92.8C.E6.84.8F.E8.A7.81

Thanks @Shizhao! I'm excited to get the basic infra in place for zhwiki :)

It appears that the common word stats are incorrectly segmenting Chinese words into single characters. Have you looked into anything like https://github.com/fxsjy/jieba for proper, pre-trained Chinese word segmentation?

Hi @Arthur2e5. I just started looking into word segmentation now (jieba and https://github.com/isnowfy/snownlp). It seems like this is certainly something that we can work with. @Ladsgroup, I'd like to make some changes to revscoring so that Chinese segmentation is shared between our modeling and the Bad-Words-Detection-System process. Does that make sense to you? I think we might want a special module for CJK in revscoring.languages, but then again, maybe it should be fundamental to revscoring.features.tokenized. If we make it part of the core of tokenized, we get a lot of things for "free", but we'll need to make Chinese, Japanese, and Korean segmenters part of the general requirements. If we make a special module in revscoring.languages, you'd only need to load these libraries for feature sets that use them, but we'd need to re-implement basic diffing and token counting features.
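
To make the difference between per-character splitting and word segmentation concrete, here is a minimal sketch. It uses greedy forward maximum matching against a tiny vocabulary as a stand-in; this is not how jieba or snownlp work internally, and the function names and the vocabulary are hypothetical, chosen only for illustration.

```python
def char_tokens(text):
    """Naive tokenization: every character becomes its own token."""
    return list(text)


def max_match_tokens(text, vocabulary, max_len=4):
    """Greedy forward maximum matching against a known vocabulary.

    At each position, take the longest vocabulary entry that matches;
    fall back to a single character when nothing matches.
    """
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in vocabulary:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Hypothetical toy vocabulary, for illustration only.
VOCAB = {"公司", "信仰", "广告", "宣传"}

print(char_tokens("公司信仰"))              # one token per character
print(max_match_tokens("公司信仰", VOCAB))  # two word-level tokens
```

With per-character tokens, any word-level statistic is computed over fragments; with a segmenter, the same statistic is computed over units a reader would recognize as words.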

Arthur2e5 added a subtask: T111179: Tokenization of "word" things for CJK.Mar 5 2017, 8:04 PM

The current "bad words" list contains a mix of simp/trad versions for the "steal QQ account" stuff. Are these duplicates necessary, or can the module figure it out as mentioned in T110841?

Dexbot's version of the word list seems to have figured out some interesting words and phrases by using TF-IDF: https://meta.wikimedia.org/w/index.php?title=Research:Revision_scoring_as_a_service/Word_lists/zh&oldid=14070179 (But what happened to the WW2 stuff?)
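
For readers unfamiliar with TF-IDF, the following is a minimal sketch of the scoring idea using the textbook formula. It is not Dexbot's actual pipeline; the `tf_idf` helper, the toy corpus, and the placeholder tokens are all assumptions made for illustration.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {token: score} dict per document.

    tf  = count of token in the document / document length
    idf = log(number of documents / number of documents containing token)
    """
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))

    n_docs = len(documents)
    scores = []
    for tokens in documents:
        counts = Counter(tokens)
        scores.append({
            token: (count / len(tokens)) * math.log(n_docs / doc_freq[token])
            for token, count in counts.items()
        })
    return scores


# Hypothetical, pre-tokenized toy corpus (placeholder tokens, not real data).
docs = [
    ["spamword", "the", "spamword", "article"],
    ["the", "article", "history"],
    ["the", "history", "section"],
]
scores = tf_idf(docs)
top = max(scores[0], key=scores[0].get)
print(top)
```

A token that appears in every document gets an idf of zero, so ubiquitous tokens drop out of the ranking, while tokens concentrated in a few documents rise to the top.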

WhitePhosphorus subscribed.Jun 19 2017, 1:06 AM

Restricted Application added a project: artificial-intelligence.Jun 19 2017, 1:06 AM

Halfak mentioned this in T170015: [Workshop] How can I get ORES for my wiki?.Aug 4 2017, 3:44 PM

Chinese badwords: https://github.com/pychen0918/bad-words-chinese
Chinese informal words: https://resources.allsetlearning.com/chinese/grammar/Formal_and_informal_function_words

YFdyh000 subscribed.Dec 8 2017, 4:44 PM

@Halfak status?

Sorry for the late reply. We've been really backed up this quarter. I'm hoping to come back to this modeling work in a couple of weeks. Sorry for the delay!

Back to it! I'm looking through the bad and informal words. I should have a PR soon.

It looks like the informal term "和" appears quite often in the opening paragraph of the featured article for today:

2005年大西洋颶風季是有纪录以来最活跃的大西洋颶風季，至今仍保持着多项纪录。
全季对大范围地区造成毁灭性打击，共导致3,913人死亡，损失数额更创下新纪录，高达1592亿美元。
本季单大型飓风就有7场之多，其中5场在登陆时仍有大型飓风强度，分别是颶風丹尼斯、艾米莉、
卡特里娜、丽塔和威尔玛，大部分人员伤亡和财产损失都是这5场飓风引起。
墨西哥的金塔納羅奧州和尤卡坦州 [...]

Am I reading this right?

@Liuxinyu970226, I'm trying to work out whether our informals detection strategy is just not going to work or if there's just some informal language used in the article I chose to use as a counter-example.

Xiplus subscribed.Oct 12 2018, 9:52 AM

@Halfak You may also try to ask @RazeSoldier

https://labels.wmflabs.org/stats/zhwiki/45

The task is done

In T109366#4715626, @Shizhao wrote:

https://labels.wmflabs.org/stats/zhwiki/45

The task is done

It is awesome, but it's about the data for the damaging model for ORES. This task is about language assets (i.e. badwords and informal words).

The labeled dataset is the hard part! We should be able to gather some badwords/informals more quickly. Can someone address my question from T109366#4308813?

In T109366#4308813, @Halfak wrote:

It looks like the informal term "和" appears quite often in the opening paragraph of the featured article for today:
Am I reading this right?

I think that "和" is a stop word.

I see. It looks like "和 (hé)" appears in the list of informal words here: https://resources.allsetlearning.com/chinese/grammar/Formal_and_informal_function_words

Are there any others that would be appropriate to see in an encyclopedia article that I should exclude from the list?

和（he2）by itself typically means “and”, so please treat it as a stop word as it is. Do not attempt any filtering on it.

Many other formal/informal terms provided in the list are arguably too strict, for example with the use of 所以 (so) /因此 (therefore) among others. Blindly using that list would turn zh.wp into some completely unreadable robotic-academic wordpile.
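
The effect of treating "和" as a stop word can be sketched as follows. This is a simple illustration, not revscoring's matching logic; the candidate list and the `count_matches` helper are hypothetical, and the sample sentence is taken from the featured-article excerpt quoted earlier in the thread.

```python
# Hypothetical candidate list for illustration only.
CANDIDATE_INFORMALS = {"和", "灌水"}
# Words to exclude because they are ordinary function words.
STOP_WORDS = {"和"}

informals = CANDIDATE_INFORMALS - STOP_WORDS


def count_matches(text, words):
    """Count how many times any of the given words occurs in the text."""
    return sum(text.count(word) for word in words)


sample = "卡特里娜、丽塔和威尔玛，大部分人员伤亡和财产损失都是这5场飓风引起。"

print(count_matches(sample, CANDIDATE_INFORMALS))  # counts every 和
print(count_matches(sample, informals))            # 和 no longer counted
```

Without the exclusion, an ordinary encyclopedic sentence registers informal-word hits purely because it contains a conjunction.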

I see. So if I remove those three from the list, then it is otherwise representative of informal language?

Sorry for the late reply.

The present state of written Chinese is that everything we have now is basically derived from the vernacular -- the informal speak -- of circa late 19C-early 20C (or earlier, in the case of old vernacular novels), and there is almost intentionally little difference between what is spoken and what should be written. The vocabulary difference would mostly be some necessary jargon for certain concepts plus a few conventions, and anything beyond that -- especially many (conjunctives!) found in that list -- is occasionally considered superficial and pretentious. "We moved away from writing fake lzh almost a century ago to make it more human, so why would we go back at all?"

In my opinion what matters for Chinese is probably not a list or a classifier of "informal" expression, but something that detects basic bad/abusive words and spam/promotional speak. (Which is exactly the other thing we have been doing!) Informal filters are an analog of a human heuristic for information density that does not work too well in this language. (Transitional phrases that may indicate reasoning may as well be a good one, but on the other hand they also suggest the presence of original research…)

That said, if you do want a list of very informal words to avoid, check out Category:俗语 on zhwp along with the sub-categories. These slangs are basically what you go to urbandictionary for if they were in English -- probably new, usually not encyclopedic. I doubt if anyone is going to use many of these words at all, but that probably serves the purpose of the list as they do tend to raise a red flag for human patrollers.

This is very informative! I think that if "informal" language doesn't make sense for Chinese, let's not worry about that. Let's instead focus on "bad/abusive words and spam/promotional speak" as you suggest. What would be the best way to build a list of such words/phrases?

Tiger3018 subscribed.Jan 31 2019, 6:12 AM

Wang_Qiliang subscribed.Jan 31 2019, 8:25 AM

Taiwania_Justo subscribed.Jan 31 2019, 8:36 AM

94rain subscribed.Jan 31 2019, 1:12 PM

I'm not talented in computer science, but the following things may be helpful.

It would be better to generate a list that combines words and characters than a character-only or word-only list, because the same character can mean different things depending on the adjacent characters (e.g. 广告/宣传, "advertisement", possibly bad? versus 告诉/传导, "to let someone know about something", common?). That said, many characters can also be used on their own (usually with nearly the same meaning).

Also, a very long compound word can be meaningless by itself but useful for detecting spam. E.g. 公司(company)信仰(belief) reads like promotional speak, but 公司 and 信仰 are each common words. Simplified and Traditional Chinese can also differ a lot because of their separate language environments.

So is it possible to use an open source program such as jieba (on GitHub, MIT licensed) to solve the word-splitting problem? Or to draw on other papers about Chinese word segmentation?

About informal expressions: I think a lot of Internet slang and promotional words can be treated as this type, e.g. 要知道灌水网友躺赢跑腿刷粉喵喵

This is really helpful. Thank you. I'm not sure if it would be practical, but with our pattern matching strategy, we can certainly specify the context around a word. What would it take to specify enough relevant context that we could catch spammy sequences like 公司(company)信仰(belief) but not just 信仰(belief)? I think this could be reasonable if there are relatively few patterns around 信仰(belief) that could be considered problematic.

For an example of how we do this for English Spammy phrases, see: https://github.com/wikimedia/revscoring/blob/master/revscoring/languages/english.py#L228 I think the "regular expressions" are somewhat self-explanatory. On the line I linked, we would match "scholars believe" but not "Scholars from UC Berkeley" or "Albert Einstein believes".
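
A minimal sketch of context-sensitive matching along these lines is shown below. The patterns are hypothetical and are not copied from revscoring's english.py; the Chinese pattern uses a lookbehind so that 信仰 (belief) only matches when it directly follows 公司 (company).

```python
import re

# Hypothetical patterns for illustration; not copied from revscoring.
ENGLISH_WEASEL = re.compile(r"\bscholars\s+believe\b", re.IGNORECASE)
# Match 信仰 (belief) only when it directly follows 公司 (company).
CHINESE_PROMO = re.compile(r"(?<=公司)信仰")


def has_match(pattern, text):
    """Return True when the pattern occurs anywhere in the text."""
    return pattern.search(text) is not None


print(has_match(ENGLISH_WEASEL, "Some scholars believe this is true."))
print(has_match(ENGLISH_WEASEL, "Scholars from UC Berkeley disagree."))
print(has_match(CHINESE_PROMO, "我们的公司信仰是客户第一"))
print(has_match(CHINESE_PROMO, "宗教信仰自由"))
```

Note that `\b` is of little use between Chinese characters, since adjacent characters all count as word characters, so the context has to be spelled out explicitly as in the lookbehind above.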

OK with all of that said, I think word-splitting is a good option if we have a good library for it. What would that give us in this instance? It still seems like context is essential to matching effectively.

Yeah, with regexes, recognizing spammy expressions can be much easier. What I was worried about was how to split words while the machine scans the text and produces the data. And would we go from that data to scanning rules, rather than writing scanning rules from scratch?

By the way, happy Chinese new year. Cheers!!

\o/

OK so what would be a good way to get a nice set of example sequences that are problematic? We'll need to set up the regexes for Chinese the same way we set them up for English. I wonder if there is a Manual of Style page like https://en.wikipedia.org/wiki/Wikipedia:Manual_of_Style/Words_to_watch or maybe some abuse filter rules that we could draw from.

If there are no such lists, we could experiment with creating some on a wiki page somewhere. Then I can use that to train ORES and we can see how it works. What do you think?

Great. So should I start writing regexes for the problematic/words-to-watch expressions now? (Or for other expressions, or all of them?)

As for the Manual of Style... Just click the "Languages" tab and go to:

zh:WP:AVOID It seems to be a translation-based page, but it is still useful.
zh:WP:AWW Same as above.
zh:WP:APT Manual of Style.

(Saved for writing regexes)

Tiger3018 raised the priority of this task from Lowest to Medium.Feb 6 2019, 3:31 PM

This looks very useful. It's hard for me to interpret without some command of the language. Machine translation kind of jumbles up what I can get from it.

E.g., it does seem like "據說／據稱／聽說／傳說" is a good sequence to avoid. I'm not quite sure how to break that down though. For example, does "據稱" translate to "It is alleged"?

@Tiger3018 do you have the expertise to make a pass on turning these into regular expressions and example sentences?

據稱/据称 like "It’s reported".

And making some regex rules is an easy thing for me (I think😂), but I need time to do it.

A simple regex rule: click here

(*) I have a question: both Simplified and Traditional Chinese are used to write articles, so should the matching rules include both forms?

e.g. 據稱 - zh-hant / 据称 - zh-hans

I am thinking about using some preprocessing for Hans/Hant -- we can flatten the text to one of the variants (character-by-character; nothing word-based, so it is more predictable) and then perform the match.

Flattening could work if there is a 1:1 matching. It looks like this could work: https://pypi.org/project/zhconv/
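
The flatten-then-match idea can be sketched with the standard library alone. The mapping table below is a tiny hypothetical stand-in for a full conversion table such as the one a library like zhconv would provide; `flatten`, `matches`, and the rule itself are illustrative names, not part of revscoring.

```python
import re

# Tiny hypothetical mapping for illustration; a real table covers
# thousands of characters and needs human review for non-1:1 cases.
HANT_TO_HANS = str.maketrans({
    "據": "据", "稱": "称", "說": "说", "聽": "听", "傳": "传",
})


def flatten(text):
    """Map each traditional character to its simplified form, one at a time."""
    return text.translate(HANT_TO_HANS)


# One rule, written in simplified characters only.
REPORTEDLY = re.compile(r"据称|据说|听说|传说")


def matches(text):
    """Flatten the text to simplified characters, then apply the rule."""
    return REPORTEDLY.search(flatten(text)) is not None


print(flatten("據稱"))
print(matches("據稱他已離開"))
print(matches("据称他已离开"))
```

Because the text is flattened before matching, each rule only needs to be written once, in one variant.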

I think it can help a lot. Most conversions can work in the rules with 1:1 matching, but they need human checking.

I think the best next step is to provide a starting list of regexes for badwords, informals, or words to watch. Then we can run some experiments with zhconv to see if we're doing OK.

A basic list : https://zh.wikipedia.org/wiki/Wikipedia:%E6%A0%87%E6%B3%A8/revscoring_doc

Halfak raised the priority of this task from Medium to High.Feb 19 2019, 10:30 PM

Halfak lowered the priority of this task from High to Medium.

Halfak moved this task from Blocked on community input to Ready to go on the Machine-Learning-Team board.

So what's going on with this issue?

Do you need a PR with Python source on GitHub, a more detailed list (or one with Traditional Chinese support), or something else?

(+) A question from Shizhao: this website can show ORES scores for edits in Chinese languages. Why?

Hi @Tiger3018! I've added this to our backlog. Now that we have the basic list you have provided, we need to do the work to get it integrated into a language file like the ones you see here: https://github.com/wikimedia/revscoring/tree/master/revscoring/languages We're working on other things so we haven't gotten to it yet.

Regarding the wdvd tool you linked to, it looks like it is getting predictions using our wikidata model. Currently the wikidata model doesn't understand any nuances of Chinese, so it's making predictions using very limited information.

Got it, thank you.

Halfak added a parent task: T223382: Improvements to ORES localization and support.May 22 2019, 6:20 PM

I worked with @zhuyifei1999 to develop https://etherpad.wikimedia.org/p/chinese_word_lists and then implemented it in https://github.com/wikimedia/revscoring/pull/438

I trained some damaging and goodfaith models. They are performing... OK. We're getting in the upper 80s for ROC-AUC. I would expect a solid model to be in the mid-90s so there's definitely some more work to do. But, it looks like these models will be *useful*. So I'll get a pull request together.

Halfak added a parent task: T224481: Train/test zhwiki editquality models.May 28 2019, 2:45 PM

Halfak closed this task as Resolved.Jul 2 2019, 2:47 PM

Halfak claimed this task.

Halfak edited projects, added Machine-Learning-Team (Active Tasks); removed Machine-Learning-Team.

Halfak moved this task from Parked to Completed on the Machine-Learning-Team (Active Tasks) board.

Shizhao moved this task from Backlog to Closed on the Chinese-Sites board.Jul 5 2019, 7:16 AM

calbon closed subtask T111179: Tokenization of "word" things for CJK as Resolved.Sep 23 2020, 4:17 PM

Stang unsubscribed.Nov 14 2021, 12:14 AM