- generate TF-IDF badword lists
- review and aggregation of badwords/informal words by a native speaker
- implement revscoring.language.Language (Language utility)
Hi @Arthur2e5. I just started looking into word segmentation (jieba and https://github.com/isnowfy/snownlp). It seems like this is certainly something that we can work with. @Ladsgroup, I'd like to make some changes to revscoring so that Chinese segmentation is shared between our modeling and the Bad-Words-Detection-System process. Does that make sense to you? I think we might want a special module for CJK in revscoring.languages, but then again, maybe it should be fundamental to revscoring.features.tokenized. If we make it part of the core tokenized features, we get a lot of things for "free", but we'll need to make Chinese, Japanese, and Korean segmenters part of the general requirements. If we make a special module in revscoring.languages, you'd only need to load these libraries for feature sets that use them, but we'd need to re-implement basic diffing and token counting features.
The current "bad words" list contains a mix of Simplified and Traditional versions of the "steal QQ account" phrases. Are these duplicates necessary, or can the module handle the variants itself, as mentioned in T110841?
Dexbot's version of the word list seems to have figured out some interesting words and phrases by using TF-IDF: https://meta.wikimedia.org/w/index.php?title=Research:Revision_scoring_as_a_service/Word_lists/zh&oldid=14070179 (But what happened to the WW2 stuff?)
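For anyone following along who has not seen the TF-IDF idea mentioned above, here is a minimal sketch in Python. It is not Dexbot's actual code: the function name `rank_candidate_badwords`, the exact scoring formula, and the toy token lists are assumptions made for illustration. The intent is only to show how tokens that are frequent in reverted edits but rare in kept edits rise to the top, where a native speaker can then review them.

```python
import math
from collections import Counter


def rank_candidate_badwords(reverted_docs, kept_docs, top_n=5):
    """Rank tokens that are frequent in reverted edits but rare in kept ones.

    Both arguments are lists of token lists (hypothetical, pre-segmented input).
    """
    # Term frequency over all reverted (likely damaging) edits.
    tf = Counter(token for doc in reverted_docs for token in doc)
    # Document frequency over kept (likely good) edits.
    df = Counter(token for doc in kept_docs for token in set(doc))
    n_kept = len(kept_docs)
    # Smoothed inverse document frequency; tokens common in good edits score ~0.
    scores = {
        token: count * math.log((1 + n_kept) / (1 + df[token]))
        for token, count in tf.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


reverted = [["盗号", "QQ", "的"], ["盗号", "的"]]
kept = [["的", "历史"], ["的", "和"]]
ranked = rank_candidate_badwords(reverted, kept)
```

With this toy input, a function word like 的 ends up last because it appears in every kept edit, which is the behavior we would want before handing the list to a reviewer.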
It looks like the informal term "和" appears quite often in the opening paragraph of the featured article for today:
Am I reading this right?
I see. It looks like "和 (hé)" appears in the list of informal words here: https://resources.allsetlearning.com/chinese/grammar/Formal_and_informal_function_words
Are there any others that would be appropriate to see in an encyclopedia article that I should exclude from the list?
和（he2）by itself typically means “and”, so please treat it as a stop word as it is. Do not attempt any filtering on it.
Many other formal/informal terms provided in the list are arguably too strict, for example with the use of 所以 (so) /因此 (therefore) among others. Blindly using that list would turn zh.wp into some completely unreadable robotic-academic wordpile.
Sorry for the late reply.
The present state of written Chinese is that everything we have now is basically derived from the vernacular -- the informal speak -- of circa late 19C-early 20C (or earlier, in the case of old vernacular novels), and there is almost intentionally little difference between what is spoken and what should be written. The vocabulary difference would mostly be some necessary jargon for certain concepts plus a few conventions, and anything beyond that -- especially many (conjunctives!) found in that list -- is occasionally considered superficial and pretentious. "We moved away from writing fake lzh almost a century ago to make it more human, so why would we go back at all?"
In my opinion what matters for Chinese is probably not a list or a classifier of "informal" expression, but something that detects basic bad/abusive words and spam/promotional speak. (Which is exactly the other thing we have been doing!) Informal filters are an analog of a human heuristic for information density that does not work too well in this language. (Transitional phrases that may indicate reasoning may as well be a good one, but on the other hand they also suggest the presence of original research…)
That said, if you do want a list of very informal words to avoid, check out Category:俗语 on zhwp along with its sub-categories. These slang terms are basically what you would go to Urban Dictionary for if they were in English -- probably new, usually not encyclopedic. I doubt anyone is going to use many of these words at all, but that probably serves the purpose of the list, as they do tend to raise a red flag for human patrollers.
This is very informative! I think that if "informal" language doesn't make sense for Chinese, let's not worry about that. Let's instead focus on "bad/abusive words and spam/promotional speak" as you suggest. What would be the best way to build a list of such words/phrases?
I'm not talented in computer science, but the following points may be helpful.
It's better to generate a combined word-and-character list than a character-only or word-only list, because the same character with different adjacent characters can take on different meanings (e.g. 广告/宣传, "advertisement/publicity": bad? versus 告诉/传导, "to tell / to convey": common?). There are also many characters that can be used on their own (usually with nearly the same meaning).
Also, an overly long compound can be meaningless as a word but still useful for detecting spam, e.g. 公司(company)信仰(belief) reads like promotional speak, but 公司 and 信仰 on their own are common words. Simplified and Traditional Chinese can also differ a lot because they developed in separate language environments.
So is it possible to use an open source program such as jieba (on GitHub, MIT-licensed) to solve the word-splitting problem? Or are there other papers on Chinese word segmentation that could help?
About informal expressions: I think many Internet slang terms and promotional words can be treated as this type, e.g. 要知道 灌水 网友 躺赢 跑腿 刷粉 喵喵
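To make the word-splitting question above concrete, here is a minimal sketch of dictionary-based forward maximum matching in Python. This is a deliberately simple stand-in, not how jieba works internally, and the `segment` function and the tiny `vocabulary` are assumptions for illustration only. A real library would bring a much larger dictionary and handle unknown words better.

```python
def segment(text, vocabulary, max_len=4):
    """Greedy forward maximum matching over a known-word vocabulary."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in vocabulary:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Hypothetical vocabulary built from the examples in this thread.
vocabulary = {"公司", "信仰", "广告", "告诉"}
tokens = segment("我告诉公司", vocabulary)
```

Once text is split into tokens like this, word lists and counts can operate on 公司 and 信仰 as units instead of on individual characters.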
This is really helpful. Thank you. I'm not sure if it would be practical, but with our pattern matching strategy, we can certainly specify the context around a word. What would it take to specify enough relevant context that we could catch spammy sequences like 公司(company)信仰(belief) but not just 信仰(belief)? I think this could be reasonable if there are relatively few patterns around 信仰(belief) that could be considered problematic.
For an example of how we do this for English spammy phrases, see: https://github.com/wikimedia/revscoring/blob/master/revscoring/languages/english.py#L228 I think the regular expressions are somewhat self-explanatory. On the line I linked, we would match "scholars believe" but not "Scholars from UC Berkeley" or "Albert Einstein believes".
OK with all of that said, I think word-splitting is a good option if we have a good library for it. What would that give us in this instance? It still seems like context is essential to matching effectively.
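Here is a rough sketch of what context-sensitive patterns could look like, using Python's `re` module. Both patterns are hypothetical: they are not copied from english.py, and the Chinese alternatives (公司, 企业) are assumptions that a native speaker would need to confirm. Note that Chinese text has no spaces, so the context has to be spelled out in the pattern itself rather than relying on word boundaries.

```python
import re

# Hypothetical English pattern in the spirit of the "scholars believe" example:
# a vague group noun followed directly by a verb of belief.
english_weasel = re.compile(
    r"\b(?:scholars|experts|critics) (?:believe|say|claim)\b", re.IGNORECASE
)

# Hypothetical Chinese pattern: 信仰 (belief) only counts when a corporate
# noun such as 公司 (company) comes right before it, optionally joined by 的.
chinese_promo = re.compile(r"(?:公司|企业)的?信仰")


def is_spammy(text):
    """Return True if either illustrative pattern matches the text."""
    return bool(english_weasel.search(text) or chinese_promo.search(text))
```

With these patterns, "Many scholars believe this" and "我们的公司信仰是诚信" would match, while "Albert Einstein believes" and "他的宗教信仰" would not.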
Yeah, with regexes, recognizing spammy expressions can be much easier. What I was worried about is how to split words while the machine scans the text and produces the data. And would the scanning rules then be built from that data, rather than from scratch?
By the way, happy Chinese New Year. Cheers!!
OK so what would be a good way to get a nice set of example sequences that are problematic? We'll need to set up the regexes for Chinese the same way we set them up for English. I wonder if there is a Manual of Style page like https://en.wikipedia.org/wiki/Wikipedia:Manual_of_Style/Words_to_watch or maybe some abuse filter rules that we could draw from.
If there are no such lists, we could experiment with creating some on a wiki page somewhere. Then I can use that to train ORES and we can see how it works. What do you think?
This looks very useful. It's hard for me to interpret without some command of the language. Machine translation kind of jumbles up what I can get from it.
E.g., it does seem like "據說／據稱／聽說／傳說" is a good sequence to avoid. I'm not quite sure how to break that down though. For example, does "據稱" translate to "It is alleged"?
@Tiger3018 do you have the expertise to make a pass at turning these into regular expressions and example sentences?
I am thinking about using some preprocessing for Hans/Hant -- we can flatten the text to one of the variants (character by character; no word-based conversion, so it is more predictable) and then perform the match.
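A minimal sketch of that flatten-then-match idea in Python, under some assumptions: the five-character table below is only an illustration (a real Traditional-to-Simplified table would cover thousands of characters), and the names `flatten_to_hans` and `find_hearsay` are made up for this example. The hearsay terms are the ones quoted earlier in the thread.

```python
import re

# Tiny illustrative Traditional -> Simplified table (assumption: a real
# deployment would load a complete character mapping instead).
HANT_TO_HANS = str.maketrans(
    {"據": "据", "說": "说", "稱": "称", "聽": "听", "傳": "传"}
)

# Patterns only need to be written once, in the flattened (Simplified) form.
HEARSAY = re.compile(r"据说|据称|听说|传说")


def flatten_to_hans(text):
    """Map each character independently; no word-level conversion."""
    return text.translate(HANT_TO_HANS)


def find_hearsay(text):
    """Return hearsay terms found after flattening the text to one variant."""
    return HEARSAY.findall(flatten_to_hans(text))
```

Flattening toward Simplified seems like the more predictable direction, since the Traditional-to-Simplified mapping is mostly many-to-one. It would also let the word list drop its duplicated Simplified/Traditional entries.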
Hi @Tiger3018! I've added this to our backlog. Now that we have the basic list you have provided, we need to do the work to get it integrated into a language file like the ones you see here: https://github.com/wikimedia/revscoring/tree/master/revscoring/languages We're working on other things so we haven't gotten to it yet.
Regarding the wdvd tool you linked to, it looks like it is getting predictions using our wikidata model. Currently the wikidata model doesn't understand any nuances of Chinese, so it's making predictions using very limited information.