- generate TF-IDF badword lists
- review and aggregation of badwords/informal words by native speaker
- implement revscoring.language.Language (Language utility)
| Status | Assigned | Task |
|---|---|---|
| Open | None | T227094 Update RC Filters for new ORES capacities (July, 2019) |
| Resolved | SBisson | T225561 Update ORES thresholds for nlwiki |
| Open | None | T223273 Update srwiki thresholds for goodfaith model |
| Resolved | SBisson | T225562 Deploy ORES filters for zhwiki |
| Open | None | T225563 Deploy ORES filters for jawiki |
| Resolved | Halfak | T224484 ORES deployment: Early June |
| Resolved | Halfak | T224481 Train/test zhwiki editquality models |
| Resolved | Halfak | T223382 Improvements to ORES localization and support |
| Resolved | Halfak | T109366 Chinese language utilities |
| Resolved | None | T110841 Chinese language compatibility |
| Resolved | Ladsgroup | T110964 TF-IDF to determine global stop words |
| Resolved | Ladsgroup | T109844 Omit the interwikilinks from stop words |
| Resolved | Pavol86 | T111179 Tokenization of "word" things for CJK |
- Mentioned In
- T170015: [Workshop] How can I get ORES for my wiki?
- Mentioned Here
- T110841: Chinese language compatibility
@ToAruShiroiNeko, It looks like JimmyZu did some work on this list, but some of the items don't seem to make sense (e.g. "benjamin"). We also don't have any informals. Thoughts?
This was asked a week earlier, but I didn't know about this task until now… I think the "badwords" and "informal" word lists should be updated with the help of the local community. A notice has been published on our Village Pump.
https://meta.wikimedia.org/wiki/Research:Revision_scoring_as_a_service/Word_lists/zh still looks messy. @Cosine02, was there any response on the Village Pump?
It appears that the common-word stats are incorrectly segmenting Chinese words into single characters. Have you looked into anything like https://github.com/fxsjy/jieba for proper, pre-trained Chinese word segmentation?
Hi @Arthur2e5. I just started looking into word segmentation now (jieba and https://github.com/isnowfy/snownlp). It seems like this is certainly something that we can work with.

@Ladsgroup, I'd like to make some changes to revscoring so that Chinese segmentation can be common between our modeling and the Bad-Words-Detection-System process. Does that make sense to you? I think we might want a special module for CJK in revscoring.languages, but then again, maybe it should be fundamental to revscoring.features.tokenized. If we make it part of the core set of tokenized, we get a lot of things for "free", but we'll need to make Chinese, Japanese, and Korean segmenters part of the general requirements. If we make a special module in revscoring.languages, you'd only need to load these libraries for feature sets that use them, but we'll need to re-implement basic diffing and token counting features.
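For readers unfamiliar with how dictionary-based Chinese segmentation works, here is a minimal sketch of the forward-maximum-matching approach. jieba itself is more sophisticated (a prefix-dictionary DAG plus an HMM for out-of-vocabulary words), and the tiny vocabulary below is purely illustrative, not any list used in this task.

```python
# Forward maximum matching: at each position, greedily take the longest
# known word from a dictionary, falling back to a single character.
# This is the simplest dictionary-based segmenter; jieba improves on it
# with a DAG over all dictionary matches and an HMM for unknown words.

VOCAB = {"今天", "天气", "很好", "广告", "公司"}  # illustrative only
MAX_WORD_LEN = max(len(w) for w in VOCAB)

def segment(text):
    """Split text by greedily matching the longest vocabulary word."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; a single character always matches.
        for length in range(min(MAX_WORD_LEN, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in VOCAB:
                tokens.append(candidate)
                i += length
                break
    return tokens

print(segment("今天天气很好"))  # ['今天', '天气', '很好']
```

Note that greedy matching always re-joins to the original string, which makes it easy to keep token counting and diffing consistent with character offsets.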
The current "bad words" list contains a mix of simplified/traditional versions of this "steal QQ account" stuff. Are these duplicates necessary, or can the module figure this out, as mentioned in T110841?
Dexbot's version of the word list seems to have figured out some interesting words and phrases by using TF-IDF: https://meta.wikimedia.org/w/index.php?title=Research:Revision_scoring_as_a_service/Word_lists/zh&oldid=14070179 (But what happened to this WW2 stuff?)
- Chinese badwords: https://github.com/pychen0918/bad-words-chinese
- Chinese informal words: https://resources.allsetlearning.com/chinese/grammar/Formal_and_informal_function_words
Sorry for the late reply. We've been really backed up this quarter. I'm hoping to come back to this modeling work in a couple of weeks. Sorry for the delay!
It looks like the informal term "和" appears quite often in the opening paragraph of the featured article for today:
Am I reading this right?
@Liuxinyu970226, I'm trying to work out whether our informals detection strategy is just not going to work or if there's just some informal language used in the article I chose to use as a counter-example.
It is awesome, but that is about the data for the damaging model for ORES; this task is about language assets (i.e. badwords and informal words).
The labeled dataset is the hard part! We should be able to gather some badwords/informals more quickly. Can someone address my question from T109366#4308813?
I see. It looks like "和 (hé)" appears in the list of informal words here: https://resources.allsetlearning.com/chinese/grammar/Formal_and_informal_function_words
Are there any others that would be appropriate to see in an encyclopedia article that I should exclude from the list?
和 (he2) by itself typically means "and", so please just treat it as a stop word. Do not attempt any filtering on it.
Many other formal/informal terms provided in the list are arguably too strict, for example with the use of 所以 (so) /因此 (therefore) among others. Blindly using that list would turn zh.wp into some completely unreadable robotic-academic wordpile.
I see. So if I remove those three from the list, then it is otherwise representative of informal language?
Sorry for the late reply.
The present state of written Chinese is that everything we have now is basically derived from the vernacular -- the informal speak -- of circa late 19C-early 20C (or earlier, in the case of old vernacular novels), and there is almost intentionally little difference between what is spoken and what should be written. The vocabulary difference would mostly be some necessary jargon for certain concepts plus a few conventions, and anything beyond that -- especially many (conjunctives!) found in that list -- is occasionally considered superficial and pretentious. "We moved away from writing fake lzh almost a century ago to make it more human, so why would we go back at all?"
In my opinion what matters for Chinese is probably not a list or a classifier of "informal" expression, but something that detects basic bad/abusive words and spam/promotional speak. (Which is exactly the other thing we have been doing!) Informal filters are an analog of a human heuristic for information density that does not work too well in this language. (Transitional phrases that may indicate reasoning may as well be a good one, but on the other hand they also suggest the presence of original research…)
That said, if you do want a list of very informal words to avoid, check out Category:俗语 (colloquialisms) on zhwp along with its sub-categories. These slang terms are basically what you would go to Urban Dictionary for if they were in English -- probably new, usually not encyclopedic. I doubt anyone is going to use many of these words at all, but that probably serves the purpose of the list, as they do tend to raise a red flag for human patrollers.
This is very informative! I think that if "informal" language doesn't make sense for Chinese, let's not worry about that. Let's instead focus on "bad/abusive words and spam/promotional speak" as you suggest. What would be the best way to build a list of such words/phrases?
I'm not talented in computer science, but the following things may be helpful.
It's better to generate a word+character list than a character-only or word-only list, because many characters take on different meanings with different adjacent characters (e.g. 广告/宣传 (advertisement/promotion) are bad?, while 告诉/传导 (to tell someone / to transmit) are common?). But there are also many characters that can be used independently (though usually with nearly the same meaning).
Also, an overly long compound word can be meaningless on its own but useful for scanning spam. E.g. 公司 (company) 信仰 (belief) sounds like promotional speak, but 公司 and 信仰 are common words by themselves. Simplified and traditional Chinese can also differ a lot because of their separate language environments.
So is it possible to use an open-source program like jieba (on GitHub, MIT-licensed) to solve the word-splitting problem? Or some other published work on Chinese word segmentation?
About informal expressions: I think a lot of Internet slang and promotional words can be treated as this type, e.g. 要知道 灌水 网友 躺赢 跑腿 刷粉 喵喵.
This is really helpful. Thank you. I'm not sure if it would be practical, but with our pattern matching strategy, we can certainly specify the context around a word. What would it take to specify enough relevant context that we could catch spammy sequences like 公司(company)信仰(belief) but not just 信仰(belief)? I think this could be reasonable if there are relatively few patterns around 信仰(belief) that could be considered problematic.
For an example of how we do this for English Spammy phrases, see: https://github.com/wikimedia/revscoring/blob/master/revscoring/languages/english.py#L228 I think the "regular expressions" are somewhat self-explanatory. On the line I linked, we would match "scholars believe" but not "Scholars from UC Berkeley" or "Albert Einstein believes".
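To illustrate the idea with the 公司信仰 example from the earlier comment, here is a minimal Python sketch of the same context-matching strategy. The phrase list is hypothetical for demonstration, not revscoring's actual Chinese assets.

```python
import re

# Sketch: flag a "spammy" compound only when its full context is present,
# analogous to revscoring's English patterns ("scholars believe" matches,
# plain "believe" does not). Here 公司信仰 (company belief) is flagged,
# but 信仰 (belief) on its own is not. The phrase list is illustrative.
spammy = re.compile(r"公司信仰|本公司产品")

assert spammy.search("这家公司信仰客户至上") is not None  # flagged
assert spammy.search("他的信仰十分坚定") is None          # plain 信仰 passes
```

Because the pattern anchors the flagged word to its surrounding characters, it stays precise as long as there are relatively few problematic contexts per word, which matches the "relatively few patterns around 信仰" condition above.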
OK with all of that said, I think word-splitting is a good option if we have a good library for it. What would that give us in this instance? It still seems like context is essential to matching effectively.
Yeah, with regex, recognizing spammy expressions can be much easier. So I was worried about how to split words while the machine scans the text and produces the data. And we go from that data to scanning rules, not from zero to scanning rules?
By the way, happy Chinese new year. Cheers!!
OK so what would be a good way to get a nice set of example sequences that are problematic? We'll need to set up the regexes for Chinese the same way we set them up for English. I wonder if there is a Manual of Style page like https://en.wikipedia.org/wiki/Wikipedia:Manual_of_Style/Words_to_watch or maybe some abuse filter rules that we could draw from.
If there are no such lists, we could experiment with creating some on a wiki page somewhere. Then I can use that to train ORES and we can see how it works. What do you think?
Great. So should I start writing the problematic/words_to_watch expression regexes now? (Or other expressions, or all of them?)
As for the Manual... just click the "Languages" tab and go to:
- zh:WP:AVOID It seems to be a translation-based article, but still useful.
- zh:WP:AWW Like the above.
- zh:WP:APT Manual of Style.
(Saved for writing regexes)
This looks very useful. It's hard for me to interpret without some command of the language. Machine translation kind of jumbles up what I can get from it.
E.g., it does seem like "據說／據稱／聽說／傳說" is a good sequence to avoid. I'm not quite sure how to break that down though. For example, does "據稱" translate to "It is alleged"?
@Tiger3018 do you have the expertise to make a pass on turning these into regular expressions and examples sentences?
據稱/据称 is like "It's reported".
And making some regex rules is an easy thing for me (I think 😂), but I need time to do it.
A simple regex rule: click here
(*) I have a problem: articles are written in both simplified and traditional Chinese, so should the matching rules include both forms?
e.g. 據稱 - zh-hant / 据称 - zh-hans
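One straightforward way to handle this question, sketched below with Python's re module, is to simply list both script variants as regex alternatives. The word list here is just the examples from this thread.

```python
import re

# Sketch: cover both script variants by listing them as alternatives,
# e.g. 據稱 (traditional) alongside 据称 (simplified). The downside is
# that every rule doubles in size, which is what the flattening
# preprocessing discussed below this comment avoids.
words_to_watch = re.compile(r"據稱|据称|據說|据说")

assert words_to_watch.search("据称该产品十分有效") is not None  # simplified
assert words_to_watch.search("據稱此事屬實") is not None        # traditional
assert words_to_watch.search("公司介绍") is None
```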
I am thinking about using some preprocessing for hans/hant: we can flatten the text to one of the variants (character-by-character; no word-based conversion, so it is more predictable) and then perform the match.
Flattening could work if there is a 1:1 matching. It looks like this could work: https://pypi.org/project/zhconv/
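Here is a minimal sketch of the character-by-character flattening idea, using a tiny hand-written mapping for illustration; in practice zhconv's full conversion tables (e.g. zhconv.convert(text, 'zh-hans')) would replace this toy dictionary.

```python
# Sketch: flatten text character-by-character to simplified script, so one
# set of simplified-script regexes can match both variants. The mapping is
# a tiny illustrative sample; zhconv supplies the full 1:1 tables.
T2S = str.maketrans({"據": "据", "稱": "称", "說": "说", "傳": "传"})

def flatten(text):
    """Map traditional characters to simplified; others pass through."""
    return text.translate(T2S)

assert flatten("據稱") == "据称"   # traditional input is normalized
assert flatten("据称") == "据称"   # simplified input is unchanged
```

Since str.translate only does per-character substitution, this is exactly the "no word-based things, so it is more predictable" behavior described above; it cannot handle the few characters whose conversion depends on the surrounding word, which is why human checking is still needed.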
I think it can help a lot. Most conversions will work with the 1:1 matching rules, but they need human checking.
I think the best next step is to provide a starting list of regexes for badwords, informals, or words to watch. Then we can run some experiments with zhconv to see if we're doing OK.
So what's going on with this issue?
Do you need a PR with Python source on GitHub, a more detailed list, traditional Chinese support, or something else?
(+) A question from Shizhao: this website can show the ORES score for edits in Chinese-language wikis; why?
Hi @Tiger3018! I've added this to our backlog. Now that we have the basic list you have provided, we need to do the work to get it integrated into a language file like the ones you see here: https://github.com/wikimedia/revscoring/tree/master/revscoring/languages We're working on other things so we haven't gotten to it yet.
Regarding the wdvd tool you linked to, it looks like it is getting predictions using our wikidata model. Currently the wikidata model doesn't understand any nuances of Chinese, so it's making predictions using very limited information.
I worked with @zhuyifei1999 to develop https://etherpad.wikimedia.org/p/chinese_word_lists and then implemented it in https://github.com/wikimedia/revscoring/pull/438
I trained some damaging and goodfaith models. They are performing... OK. We're getting in the upper 80s for ROC-AUC. I would expect a solid model to be in the mid-90s so there's definitely some more work to do. But, it looks like these models will be *useful*. So I'll get a pull request together.