Add `words_to_watch` to articlequality and draftquality models in ptwiki
Closed, Resolved (Public)

Description

The purpose of this task is to add the words_to_watch feature to the articlequality and draftquality models in ptwiki to help improve model fitness.

Event Timeline

After adding words_to_watch to draftquality we did not achieve any significant fitness improvement. This is evident in the tuning_report diff in this PR: https://github.com/wikimedia/draftquality/pull/39

Hence, we are not going to merge words_to_watch into draftquality.

@He7d3r, I'm surprised this didn't work. I would expect that many vandalism or spam articles would have phrases from words_to_watch in them. What do you think?

That is odd. Does this tuning report reflect only the changes in the ptwiki features, or does it also reflect the other articles added to the dataset, as mentioned at T246667#6067366?

@GoEThe Correct me if I'm mistaken, but I believe a reasonable number of new articles containing vandalism or spam would include expressions such as the words_to_watch mentioned by Halfak. For reference, the expressions are listed at
https://github.com/wikimedia/revscoring/blob/76c737f2998bbba5b5dd942823f43383f1a4b47e/revscoring/languages/portuguese.py#L153-L189

It does include the new articles matched beyond the ER# tags. Could it be that we're not matching the features effectively? Maybe we could generate a sample of articles along with the feature values for the spam and vandalism articles. Maybe there's a bug in the extraction that is hard to see.

It could be. For example, @Darwinius noticed that images loaded from Wikidata are not counted:
https://www.mediawiki.org/wiki/ORES/Issues/Article_quality?diff=3804470
I wouldn't be surprised if there was some obscure problem with feature extraction.

I would imagine that we would, unless people are too loose with the deletion reason and are using that as a catch-all reason to delete something. Could we get a list of most common words in the test articles as a sanity check?

Here's a random sample of 100 articles from the dataset. Columns are: label, rev_id, words_to_watch detected

OK	54564977	['chamada']
OK	54580707	[]
OK	54588900	[]
OK	54591960	['excelentes']
OK	54654115	['famosa', 'famosas']
OK	54878571	['Grande']
OK	54942029	['Grande']
OK	54947718	[]
OK	55073110	[]
OK	55178315	[]
OK	55414624	[]
OK	55479293	[]
OK	55506019	[]
OK	55619553	[]
OK	55757315	[]
OK	55826735	[]
OK	55848369	[]
OK	55933425	[]
OK	56081483	[]
OK	56125324	[]
OK	56176371	['notável', 'Excelente', 'líder', 'grandes', 'grandes', 'grandes']
OK	56314638	[]
OK	56452535	[]
OK	56870206	[]
OK	56971243	[]
OK	57110147	[]
OK	57135536	['grandes']
OK	57144166	[]
OK	57149152	[]
OK	57226988	[]
OK	57265672	[]
OK	57283427	[]
OK	57330851	['Grande', 'famosos', 'Grande', 'Grande', 'Líder', 'Líder', 'Líder', 'Líder', 'Líder', 'Líder', 'Líder', 'Líder', 'Líder', 'Líder', 'Líder', 'chamada', 'suposto', 'chamada', 'afirmou', 'acusado', 'acusado', 'acusado', 'Líder', 'revelou', 'suposto', 'chamados', 'Líder', 'Líder', 'Líder', 'Líder', 'líder', 'líder', 'líder', 'indicou', 'líder', 'líder', 'indicou', 'indicou', 'líder', 'líder', 'líder', 'Grande', 'famosos', 'famosos', 'Líder', 'Grande']
OK	57372881	['chamado']
OK	57424701	['sem dúvida', 'grandes', 'grandes', 'grande', 'Grande', 'grande', 'supostos']
OK	57450520	[]
OK	57823981	[]
spam	54688128	['grandes', 'grande', 'respeitadas']
spam	54731096	[]
spam	54797077	['grande', 'grande']
spam	54879988	[]
spam	54913065	[]
spam	55244330	[]
spam	55262798	[]
spam	55263988	['Grandes', 'grandes']
spam	55421009	[]
spam	55484414	[]
spam	55683016	[]
spam	55851445	['grande']
spam	56008069	[]
spam	56318879	['grande']
spam	56678797	[]
spam	56884328	['grande']
spam	56924708	['grande']
spam	56973202	[]
spam	57026740	[]
spam	57055228	[]
spam	57154747	[]
spam	57281089	[]
spam	57306399	[]
spam	57837386	[]
unsuitable	54590790	['Infelizmente']
unsuitable	54608146	[]
unsuitable	54660892	[]
unsuitable	54719476	['chamado']
unsuitable	54792698	['grande', 'controversa']
unsuitable	54826551	[]
unsuitable	54876051	[]
unsuitable	54963441	['Faleceu']
unsuitable	55148384	[]
unsuitable	55181614	['culto', 'líderes']
unsuitable	55501273	[]
unsuitable	55633156	[]
unsuitable	55643549	[]
unsuitable	55648574	[]
unsuitable	55689341	['faleceu', 'chamado']
unsuitable	55973450	[]
unsuitable	55988404	['Grande']
unsuitable	56034569	['grandes']
unsuitable	56139058	[]
unsuitable	56150859	['grande']
unsuitable	56160368	[]
unsuitable	56320158	[]
unsuitable	56441736	[]
unsuitable	56602603	[]
unsuitable	56635694	['grande']
unsuitable	56677216	[]
unsuitable	56737349	['grande', 'famosa', 'grande']
unsuitable	56742550	[]
unsuitable	56947897	[]
unsuitable	57048833	[]
unsuitable	57130515	[]
unsuitable	57369176	['grande', 'grande']
unsuitable	57396413	['Grande', 'grandes']
unsuitable	57423925	['chamado']
unsuitable	57435870	[]
unsuitable	57473392	[]
unsuitable	57642014	[]
unsuitable	57732795	[]
unsuitable	57740789	['certamente', 'grande']
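
To make the comparison across labels easier, the sample can be tallied per label. The following is a minimal sketch, assuming the rows are tab-separated exactly as pasted above; the function name `match_rates` is only illustrative, and the embedded rows are a handful copied from the sample rather than the full list.

```python
import ast
from collections import defaultdict

# A few rows copied from the sample above (tab-separated: label, rev_id, matches).
SAMPLE = """\
OK\t54564977\t['chamada']
OK\t54580707\t[]
spam\t54688128\t['grandes', 'grande', 'respeitadas']
spam\t54731096\t[]
unsuitable\t54590790\t['Infelizmente']
unsuitable\t54608146\t[]
"""


def match_rates(text):
    """Return {label: (rows_with_matches, total_rows)} for tab-separated rows."""
    counts = defaultdict(lambda: [0, 0])
    for line in text.splitlines():
        if not line.strip():
            continue
        label, _rev_id, matches = line.split("\t", 2)
        detected = ast.literal_eval(matches)  # parse the Python-style list literal
        counts[label][1] += 1
        if detected:
            counts[label][0] += 1
    return {label: tuple(pair) for label, pair in counts.items()}


if __name__ == "__main__":
    for label, (hit, total) in match_rates(SAMPLE).items():
        print(f"{label}: {hit}/{total} revisions with at least one match")
```

Counting the full 100-row sample by hand, I get 10 of 37 OK revisions, 7 of 24 spam revisions and 15 of 39 unsuitable revisions with at least one match, so the match rate does not differ much between the labels in this sample.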

Here are the "feature importances" as reported for @Chtnnh's model:

0.0 feature.ptwiki.revision.category_links
0.0 feature.(ptwiki.revision.category_links / max(wikitext.revision.content_chars, 1))
0.0 feature.ptwiki.revision.cn_templates
0.0 feature.(ptwiki.revision.cn_templates / max(wikitext.revision.content_chars, 1))
0.0 feature.(wikitext.revision.entity_chars / max(wikitext.revision.chars, 1))
0.0 feature.len(<datasource.wikitext.revision.entities>)
0.0 feature.(len(<datasource.wikitext.revision.cjks>) / max(len(<datasource.tokenized(datasource.revision.text)>), 1))
0.0 feature.ptwiki.main_article_templates
0.0 feature.wikitext.revision.cjk_chars
0.0 feature.(len(<datasource.wikitext.revision.entities>) / max(len(<datasource.tokenized(datasource.revision.text)>), 1))
0.0 feature.wikitext.revision.entity_chars
0.0 feature.len(<datasource.wikitext.revision.cjks>)
0.0 feature.(wikitext.revision.cjk_chars / max(wikitext.revision.chars, 1))
0.001 feature.(ptwiki.main_article_templates / max(wikitext.revision.content_chars, 1))
0.001 feature.ptwiki.revision.image_links
0.001 feature.len(<datasource.portuguese.badwords.revision.matches>)
0.002 feature.len(<datasource.portuguese.informals.revision.matches>)
0.002 feature.(len(<datasource.portuguese.informals.revision.matches>) / max(len(<datasource.wikitext.revision.words>), 1))
0.003 feature.ptwiki.revision.cite_templates
0.003 feature.(len(<datasource.portuguese.badwords.revision.matches>) / max(len(<datasource.wikitext.revision.words>), 1))
0.003 feature.len(<datasource.portuguese.words_to_watch.revision.matches>)
0.003 feature.wikitext.revision.headings
0.004 feature.ptwiki.revision.infobox_templates
0.004 feature.max((wikitext.revision.ref_tags - ptwiki.revision.cite_templates), 0)
0.004 feature.(ptwiki.revision.cite_templates / max(wikitext.revision.ref_tags, 1))
0.004 feature.(ptwiki.revision.image_links / max(wikitext.revision.content_chars, 1))
0.006 feature.len(<datasource.wikitext.revision.uppercase_words>)
0.006 feature.(len(<datasource.portuguese.words_to_watch.revision.matches>) / max(len(<datasource.wikitext.revision.words>), 1))
0.007 feature.wikitext.revision.ref_tags
0.007 feature.len(<datasource.wikitext.revision.breaks>)
0.008 feature.wikitext.revision.break_chars
0.008 feature.wikitext.revision.longest_repeated_char
0.008 feature.(ptwiki.revision.cite_templates / max(wikitext.revision.content_chars, 1))
0.008 feature.(max((wikitext.revision.ref_tags - ptwiki.revision.cite_templates), 0) / max(wikitext.revision.content_chars, 1))
0.008 feature.(ptwiki.revision.non_cite_templates / max(wikitext.revision.content_chars, 1))
0.009 feature.(wikitext.revision.headings / max(len(<datasource.tokenized(datasource.revision.text)>), 1))
0.009 feature.wikitext.revision.uppercase_word_chars
0.009 feature.len(<datasource.wikitext.revision.urls>)
0.009 feature.ptwiki.revision.non_cite_templates
0.009 feature.(wikitext.revision.ref_tags / max(len(<datasource.tokenized(datasource.revision.text)>), 1))
0.009 feature.wikitext.revision.tags
0.01 feature.(wikitext.revision.uppercase_word_chars / max(wikitext.revision.chars, 1))
0.01 feature.wikitext.revision.punctuation_chars
0.011 feature.len(<datasource.portuguese.dictionary.revision.non_dict_words>)
0.011 feature.max(<datasource.map(<built-in function len>, <datasource.wikitext.revision.words>)>)
0.011 feature.len(<datasource.wikitext.revision.numbers>)
0.012 feature.wikitext.revision.wikilinks
0.012 feature.len(<datasource.tokenized(datasource.revision.text)>)
0.012 feature.len(<datasource.wikitext.revision.whitespaces>)
0.012 feature.(max(<datasource.map(<built-in function len>, <datasource.wikitext.revision.words>)>) / max(len(<datasource.tokenized(datasource.revision.text)>), 1))
0.012 feature.wikitext.revision.url_chars
0.013 feature.len(<datasource.portuguese.dictionary.revision.dict_words>)
0.013 feature.(max(<datasource.map(<built-in function len>, <datasource.tokenized(datasource.revision.text)>)>) / max(len(<datasource.tokenized(datasource.revision.text)>), 1))
0.013 feature.(wikitext.revision.templates / max(len(<datasource.tokenized(datasource.revision.text)>), 1))
0.013 feature.len(<datasource.wikitext.revision.punctuations>)
0.014 feature.(len(<datasource.wikitext.revision.uppercase_words>) / max(len(<datasource.tokenized(datasource.revision.text)>), 1))
0.014 feature.len(<datasource.wikitext.revision.markups>)
0.014 feature.wikitext.revision.word_chars
0.014 feature.wikitext.revision.external_links
0.015 feature.wikitext.revision.markup_chars
0.015 feature.wikitext.revision.whitespace_chars
0.015 feature.(wikitext.revision.whitespace_chars / max(wikitext.revision.chars, 1))
0.016 feature.(wikitext.revision.external_links / max(len(<datasource.tokenized(datasource.revision.text)>), 1))
0.016 feature.len(<datasource.wikitext.revision.words>)
0.016 feature.(wikitext.revision.break_chars / max(wikitext.revision.chars, 1))
0.016 feature.(wikitext.revision.tags / max(len(<datasource.tokenized(datasource.revision.text)>), 1))
0.016 feature.wikitext.revision.content_chars
0.017 feature.(wikitext.revision.url_chars / max(wikitext.revision.chars, 1))
0.017 feature.(len(<datasource.wikitext.revision.words>) / max(len(<datasource.tokenized(datasource.revision.text)>), 1))
0.017 feature.(len(<datasource.wikitext.revision.numbers>) / max(len(<datasource.tokenized(datasource.revision.text)>), 1))
0.018 feature.(len(<datasource.wikitext.revision.breaks>) / max(len(<datasource.tokenized(datasource.revision.text)>), 1))
0.018 feature.(wikitext.revision.word_chars / max(wikitext.revision.chars, 1))
0.019 feature.wikitext.revision.chars
0.019 feature.(wikitext.revision.markup_chars / max(wikitext.revision.chars, 1))
0.02 feature.(len(<datasource.wikitext.revision.markups>) / max(len(<datasource.tokenized(datasource.revision.text)>), 1))
0.02 feature.(len(<datasource.wikitext.revision.urls>) / max(len(<datasource.tokenized(datasource.revision.text)>), 1))
0.022 feature.(portuguese.stemmed.revision.stems_length / max(wikitext.revision.content_chars, 1))
0.022 feature.(wikitext.revision.content_chars / max(len(<datasource.tokenized(datasource.revision.text)>), 1))
0.022 feature.(wikitext.revision.punctuation_chars / max(wikitext.revision.chars, 1))
0.025 feature.max(<datasource.map(<built-in function len>, <datasource.tokenized(datasource.revision.text)>)>)
0.028 feature.(len(<datasource.wikitext.revision.punctuations>) / max(len(<datasource.tokenized(datasource.revision.text)>), 1))
0.029 feature.(wikitext.revision.longest_repeated_char / max(wikitext.revision.chars, 1))
0.029 feature.(wikitext.revision.wikilinks / max(len(<datasource.tokenized(datasource.revision.text)>), 1))
0.031 feature.(len(<datasource.portuguese.dictionary.revision.dict_words>) / max(len(<datasource.wikitext.revision.words>), 1))
0.036 feature.(len(<datasource.portuguese.dictionary.revision.non_dict_words>) / max(len(<datasource.wikitext.revision.words>), 1))
0.037 feature.wikitext.revision.templates
0.054 feature.(len(<datasource.wikitext.revision.whitespaces>) / max(len(<datasource.tokenized(datasource.revision.text)>), 1))

Looks like words_to_watch gets more importance than badwords and informals!
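
The comparison can be checked by summing the importances per word list. This is a rough sketch, assuming each line has the form "<importance> <feature name>" as in the listing above; the helper names `parse_importances` and `total_importance` are made up for illustration, and only the six relevant lines are copied in.

```python
# Lines copied from the listing above: "<importance> <feature name>".
REPORT = """\
0.001 feature.len(<datasource.portuguese.badwords.revision.matches>)
0.002 feature.len(<datasource.portuguese.informals.revision.matches>)
0.002 feature.(len(<datasource.portuguese.informals.revision.matches>) / max(len(<datasource.wikitext.revision.words>), 1))
0.003 feature.(len(<datasource.portuguese.badwords.revision.matches>) / max(len(<datasource.wikitext.revision.words>), 1))
0.003 feature.len(<datasource.portuguese.words_to_watch.revision.matches>)
0.006 feature.(len(<datasource.portuguese.words_to_watch.revision.matches>) / max(len(<datasource.wikitext.revision.words>), 1))
"""


def parse_importances(text):
    """Split each non-empty line into (importance, feature_name)."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue
        value, name = line.split(" ", 1)
        rows.append((float(value), name))
    return rows


def total_importance(rows, keyword):
    """Sum the importances of all features whose name contains `keyword`."""
    return round(sum(value for value, name in rows if keyword in name), 3)


if __name__ == "__main__":
    rows = parse_importances(REPORT)
    for keyword in ("badwords", "informals", "words_to_watch"):
        print(keyword, total_importance(rows, keyword))
```

With these six lines the totals come to 0.004 for badwords, 0.004 for informals and 0.009 for words_to_watch, which agrees with the observation above, although all three are small compared with the top features in the list.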

See https://github.com/wikimedia/articlequality/pull/122 for another possible explanation for the problem raised above, namely that we might not be matching the features effectively.

I didn't re-extract the features or retrain the model after the change to verify its impact on the metrics, but I believe it might help by improving the dataset quality.

Indeed. When I merged that PR, it had minor positive effects on quality.

It occurred to me that some of these expressions are also used by Salebot¹, with the difference that in the bot config² users assign a score to each word/regex indicating how much it contributes towards classifying an edit as needing to be reverted. This allows it to "ignore" words which are common in good edits, unless there are too many of them.

¹ There is a version of the source code at https://phabricator.wikimedia.org/diffusion/TSVN/browse/gribeco/salebot2/, but I'm not sure if it is the latest version, given that its latest update was in 2014
² See w:pt:User:Salebot/Config and w:fr:User:Salebot/Config
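
To illustrate the weighted approach described above, here is a minimal sketch of per-expression scoring with a threshold. It is not Salebot's code or configuration: the patterns, the weights and the threshold are hypothetical values picked only to show how a common word can be "ignored" unless it appears many times.

```python
import re

# Hypothetical (pattern, weight) pairs, chosen only for illustration; they are
# NOT taken from Salebot's configuration or from revscoring.
WEIGHTED_PATTERNS = [
    (re.compile(r"\bgrandes?\b", re.IGNORECASE), 1),     # common in good edits
    (re.compile(r"\bfamos[oa]s?\b", re.IGNORECASE), 2),
    (re.compile(r"\bsem dúvida\b", re.IGNORECASE), 5),   # rarely neutral
]
THRESHOLD = 6  # hypothetical cut-off


def weighted_score(text):
    """Sum weight * number of matches over all patterns."""
    return sum(
        weight * len(pattern.findall(text))
        for pattern, weight in WEIGHTED_PATTERNS
    )


def is_flagged(text):
    """Flag the text once its weighted score reaches the threshold."""
    return weighted_score(text) >= THRESHOLD


if __name__ == "__main__":
    for text in ("Uma grande cidade", "Sem dúvida a banda mais famosa"):
        print(repr(text), weighted_score(text), is_flagged(text))
```

A single low-weight word stays below the threshold, while one strong expression plus a weaker one, or many repetitions of a weak word, crosses it. A plain match count such as `len(matches)` treats all expressions equally, which may be part of why the feature carries little signal here.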

The task has been completed and the model has shown an improvement in fitness.