The purpose of this task is to add the words_to_watch feature to the articlequality and draftquality models for ptwiki to help improve model fitness.
Description
Status | Assigned | Task
---|---|---
Resolved | Chtnnh | T247847 Proposal (GSoC 2020): Implement articlequality and draftquality model for ptwiki and apply insights to models for bs, uk, hi wikis
Resolved | Chtnnh | T251171 Add `words_to_watch` to articlequality and draftquality models in ptwiki
Event Timeline
After adding words_to_watch to draftquality we did not achieve any significant fitness improvement. This is evident in the tuning_report diff in this PR: https://github.com/wikimedia/draftquality/pull/39
Hence we are not going to merge words_to_watch into draftquality.
@He7d3r, I'm surprised this didn't work. I would expect that many vandalism or spam articles would have phrases from words_to_watch in them. What do you think?
That is odd. Does this tuning report reflect only the changes in the ptwiki features, or does it also include the other articles added to the dataset, as mentioned at T246667#6067366?
@GoEThe Correct me if I'm mistaken, but I believe a reasonable number of new articles containing vandalism or spam would include expressions such as the words_to_watch mentioned by Halfak. For reference, the expressions are listed at
https://github.com/wikimedia/revscoring/blob/76c737f2998bbba5b5dd942823f43383f1a4b47e/revscoring/languages/portuguese.py#L153-L189
It does include the new articles matched beyond the ER# tags. Could it be that we're not matching the features effectively? Maybe we could generate a sample of articles and the feature values for the spam and vandalism articles. Maybe there's a bug in the extraction that is hard to see.
It could be. For example, @Darwinius noticed that images loaded from Wikidata are not counted:
https://www.mediawiki.org/wiki/ORES/Issues/Article_quality?diff=3804470
I wouldn't be surprised if there was some obscure problem with feature extraction.
I would imagine that we would, unless people are too loose with the deletion reason and are using that as a catch-all reason to delete something. Could we get a list of most common words in the test articles as a sanity check?
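A sanity check like the one requested here could be sketched as follows. This is only an illustration using the standard library: the function name `most_common_words`, the `min_length` cut-off, and the idea of passing article texts in as plain strings are assumptions, and loading the actual test articles is not shown.

```python
import re
from collections import Counter

# Matches runs of word characters; in Python 3 this covers accented
# Portuguese letters such as "á" or "ç".
WORD_RE = re.compile(r"\w+")


def most_common_words(texts, n=10, min_length=4):
    """Count lower-cased words across texts and return the n most frequent.

    min_length is a crude, hypothetical way to skip short function words.
    """
    counts = Counter()
    for text in texts:
        counts.update(
            word.lower()
            for word in WORD_RE.findall(text)
            if len(word) >= min_length
        )
    return counts.most_common(n)


# Toy input, not real article text.
print(most_common_words(["Uma grande empresa", "A grande banda famosa"], n=3))
```

Comparing such a list per label (OK, spam, unsuitable) would show whether the deleted articles actually use the vocabulary the feature is looking for.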
Here's a random sample of 100 articles from the dataset. Columns are: label, rev_id, words_to_watch detected
OK 54564977 ['chamada']
OK 54580707 []
OK 54588900 []
OK 54591960 ['excelentes']
OK 54654115 ['famosa', 'famosas']
OK 54878571 ['Grande']
OK 54942029 ['Grande']
OK 54947718 []
OK 55073110 []
OK 55178315 []
OK 55414624 []
OK 55479293 []
OK 55506019 []
OK 55619553 []
OK 55757315 []
OK 55826735 []
OK 55848369 []
OK 55933425 []
OK 56081483 []
OK 56125324 []
OK 56176371 ['notável', 'Excelente', 'líder', 'grandes', 'grandes', 'grandes']
OK 56314638 []
OK 56452535 []
OK 56870206 []
OK 56971243 []
OK 57110147 []
OK 57135536 ['grandes']
OK 57144166 []
OK 57149152 []
OK 57226988 []
OK 57265672 []
OK 57283427 []
OK 57330851 ['Grande', 'famosos', 'Grande', 'Grande', 'Líder', 'Líder', 'Líder', 'Líder', 'Líder', 'Líder', 'Líder', 'Líder', 'Líder', 'Líder', 'Líder', 'chamada', 'suposto', 'chamada', 'afirmou', 'acusado', 'acusado', 'acusado', 'Líder', 'revelou', 'suposto', 'chamados', 'Líder', 'Líder', 'Líder', 'Líder', 'líder', 'líder', 'líder', 'indicou', 'líder', 'líder', 'indicou', 'indicou', 'líder', 'líder', 'líder', 'Grande', 'famosos', 'famosos', 'Líder', 'Grande']
OK 57372881 ['chamado']
OK 57424701 ['sem dúvida', 'grandes', 'grandes', 'grande', 'Grande', 'grande', 'supostos']
OK 57450520 []
OK 57823981 []
spam 54688128 ['grandes', 'grande', 'respeitadas']
spam 54731096 []
spam 54797077 ['grande', 'grande']
spam 54879988 []
spam 54913065 []
spam 55244330 []
spam 55262798 []
spam 55263988 ['Grandes', 'grandes']
spam 55421009 []
spam 55484414 []
spam 55683016 []
spam 55851445 ['grande']
spam 56008069 []
spam 56318879 ['grande']
spam 56678797 []
spam 56884328 ['grande']
spam 56924708 ['grande']
spam 56973202 []
spam 57026740 []
spam 57055228 []
spam 57154747 []
spam 57281089 []
spam 57306399 []
spam 57837386 []
unsuitable 54590790 ['Infelizmente']
unsuitable 54608146 []
unsuitable 54660892 []
unsuitable 54719476 ['chamado']
unsuitable 54792698 ['grande', 'controversa']
unsuitable 54826551 []
unsuitable 54876051 []
unsuitable 54963441 ['Faleceu']
unsuitable 55148384 []
unsuitable 55181614 ['culto', 'líderes']
unsuitable 55501273 []
unsuitable 55633156 []
unsuitable 55643549 []
unsuitable 55648574 []
unsuitable 55689341 ['faleceu', 'chamado']
unsuitable 55973450 []
unsuitable 55988404 ['Grande']
unsuitable 56034569 ['grandes']
unsuitable 56139058 []
unsuitable 56150859 ['grande']
unsuitable 56160368 []
unsuitable 56320158 []
unsuitable 56441736 []
unsuitable 56602603 []
unsuitable 56635694 ['grande']
unsuitable 56677216 []
unsuitable 56737349 ['grande', 'famosa', 'grande']
unsuitable 56742550 []
unsuitable 56947897 []
unsuitable 57048833 []
unsuitable 57130515 []
unsuitable 57369176 ['grande', 'grande']
unsuitable 57396413 ['Grande', 'grandes']
unsuitable 57423925 ['chamado']
unsuitable 57435870 []
unsuitable 57473392 []
unsuitable 57642014 []
unsuitable 57732795 []
unsuitable 57740789 ['certamente', 'grande']
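One way to read a sample like this is to tally, per label, how many articles have at least one match and how many matches they have on average. The sketch below does that for a handful of rows copied from the sample; the function name `match_rates` and the tuple layout are assumptions made for illustration, and the full sample would need to be loaded the same way.

```python
from collections import defaultdict

# A few rows copied from the sample above: (label, rev_id, matches).
SAMPLE = [
    ("OK", 54564977, ["chamada"]),
    ("OK", 54580707, []),
    ("OK", 54654115, ["famosa", "famosas"]),
    ("spam", 54688128, ["grandes", "grande", "respeitadas"]),
    ("spam", 54731096, []),
    ("unsuitable", 54590790, ["Infelizmente"]),
    ("unsuitable", 54608146, []),
]


def match_rates(rows):
    """Return {label: (articles_with_match, total_articles, mean_matches)}."""
    with_match = defaultdict(int)
    total = defaultdict(int)
    n_matches = defaultdict(int)
    for label, _rev_id, matches in rows:
        total[label] += 1
        n_matches[label] += len(matches)
        if matches:
            with_match[label] += 1
    return {
        label: (with_match[label], total[label], n_matches[label] / total[label])
        for label in total
    }


for label, (hit, total, mean) in match_rates(SAMPLE).items():
    print(f"{label}: {hit}/{total} articles with a match, {mean:.2f} per article")
```

If the rates for spam and unsuitable articles are not clearly higher than for OK articles, the feature has little to separate the classes with, regardless of whether extraction works.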
Here are the "feature importances" as reported for @Chtnnh's model:
0.0 feature.ptwiki.revision.category_links
0.0 feature.(ptwiki.revision.category_links / max(wikitext.revision.content_chars, 1))
0.0 feature.ptwiki.revision.cn_templates
0.0 feature.(ptwiki.revision.cn_templates / max(wikitext.revision.content_chars, 1))
0.0 feature.(wikitext.revision.entity_chars / max(wikitext.revision.chars, 1))
0.0 feature.len(<datasource.wikitext.revision.entities>)
0.0 feature.(len(<datasource.wikitext.revision.cjks>) / max(len(<datasource.tokenized(datasource.revision.text)>), 1))
0.0 feature.ptwiki.main_article_templates
0.0 feature.wikitext.revision.cjk_chars
0.0 feature.(len(<datasource.wikitext.revision.entities>) / max(len(<datasource.tokenized(datasource.revision.text)>), 1))
0.0 feature.wikitext.revision.entity_chars
0.0 feature.len(<datasource.wikitext.revision.cjks>)
0.0 feature.(wikitext.revision.cjk_chars / max(wikitext.revision.chars, 1))
0.001 feature.(ptwiki.main_article_templates / max(wikitext.revision.content_chars, 1))
0.001 feature.ptwiki.revision.image_links
0.001 feature.len(<datasource.portuguese.badwords.revision.matches>)
0.002 feature.len(<datasource.portuguese.informals.revision.matches>)
0.002 feature.(len(<datasource.portuguese.informals.revision.matches>) / max(len(<datasource.wikitext.revision.words>), 1))
0.003 feature.ptwiki.revision.cite_templates
0.003 feature.(len(<datasource.portuguese.badwords.revision.matches>) / max(len(<datasource.wikitext.revision.words>), 1))
0.003 feature.len(<datasource.portuguese.words_to_watch.revision.matches>)
0.003 feature.wikitext.revision.headings
0.004 feature.ptwiki.revision.infobox_templates
0.004 feature.max((wikitext.revision.ref_tags - ptwiki.revision.cite_templates), 0)
0.004 feature.(ptwiki.revision.cite_templates / max(wikitext.revision.ref_tags, 1))
0.004 feature.(ptwiki.revision.image_links / max(wikitext.revision.content_chars, 1))
0.006 feature.len(<datasource.wikitext.revision.uppercase_words>)
0.006 feature.(len(<datasource.portuguese.words_to_watch.revision.matches>) / max(len(<datasource.wikitext.revision.words>), 1))
0.007 feature.wikitext.revision.ref_tags
0.007 feature.len(<datasource.wikitext.revision.breaks>)
0.008 feature.wikitext.revision.break_chars
0.008 feature.wikitext.revision.longest_repeated_char
0.008 feature.(ptwiki.revision.cite_templates / max(wikitext.revision.content_chars, 1))
0.008 feature.(max((wikitext.revision.ref_tags - ptwiki.revision.cite_templates), 0) / max(wikitext.revision.content_chars, 1))
0.008 feature.(ptwiki.revision.non_cite_templates / max(wikitext.revision.content_chars, 1))
0.009 feature.(wikitext.revision.headings / max(len(<datasource.tokenized(datasource.revision.text)>), 1))
0.009 feature.wikitext.revision.uppercase_word_chars
0.009 feature.len(<datasource.wikitext.revision.urls>)
0.009 feature.ptwiki.revision.non_cite_templates
0.009 feature.(wikitext.revision.ref_tags / max(len(<datasource.tokenized(datasource.revision.text)>), 1))
0.009 feature.wikitext.revision.tags
0.01 feature.(wikitext.revision.uppercase_word_chars / max(wikitext.revision.chars, 1))
0.01 feature.wikitext.revision.punctuation_chars
0.011 feature.len(<datasource.portuguese.dictionary.revision.non_dict_words>)
0.011 feature.max(<datasource.map(<built-in function len>, <datasource.wikitext.revision.words>)>)
0.011 feature.len(<datasource.wikitext.revision.numbers>)
0.012 feature.wikitext.revision.wikilinks
0.012 feature.len(<datasource.tokenized(datasource.revision.text)>)
0.012 feature.len(<datasource.wikitext.revision.whitespaces>)
0.012 feature.(max(<datasource.map(<built-in function len>, <datasource.wikitext.revision.words>)>) / max(len(<datasource.tokenized(datasource.revision.text)>), 1))
0.012 feature.wikitext.revision.url_chars
0.013 feature.len(<datasource.portuguese.dictionary.revision.dict_words>)
0.013 feature.(max(<datasource.map(<built-in function len>, <datasource.tokenized(datasource.revision.text)>)>) / max(len(<datasource.tokenized(datasource.revision.text)>), 1))
0.013 feature.(wikitext.revision.templates / max(len(<datasource.tokenized(datasource.revision.text)>), 1))
0.013 feature.len(<datasource.wikitext.revision.punctuations>)
0.014 feature.(len(<datasource.wikitext.revision.uppercase_words>) / max(len(<datasource.tokenized(datasource.revision.text)>), 1))
0.014 feature.len(<datasource.wikitext.revision.markups>)
0.014 feature.wikitext.revision.word_chars
0.014 feature.wikitext.revision.external_links
0.015 feature.wikitext.revision.markup_chars
0.015 feature.wikitext.revision.whitespace_chars
0.015 feature.(wikitext.revision.whitespace_chars / max(wikitext.revision.chars, 1))
0.016 feature.(wikitext.revision.external_links / max(len(<datasource.tokenized(datasource.revision.text)>), 1))
0.016 feature.len(<datasource.wikitext.revision.words>)
0.016 feature.(wikitext.revision.break_chars / max(wikitext.revision.chars, 1))
0.016 feature.(wikitext.revision.tags / max(len(<datasource.tokenized(datasource.revision.text)>), 1))
0.016 feature.wikitext.revision.content_chars
0.017 feature.(wikitext.revision.url_chars / max(wikitext.revision.chars, 1))
0.017 feature.(len(<datasource.wikitext.revision.words>) / max(len(<datasource.tokenized(datasource.revision.text)>), 1))
0.017 feature.(len(<datasource.wikitext.revision.numbers>) / max(len(<datasource.tokenized(datasource.revision.text)>), 1))
0.018 feature.(len(<datasource.wikitext.revision.breaks>) / max(len(<datasource.tokenized(datasource.revision.text)>), 1))
0.018 feature.(wikitext.revision.word_chars / max(wikitext.revision.chars, 1))
0.019 feature.wikitext.revision.chars
0.019 feature.(wikitext.revision.markup_chars / max(wikitext.revision.chars, 1))
0.02 feature.(len(<datasource.wikitext.revision.markups>) / max(len(<datasource.tokenized(datasource.revision.text)>), 1))
0.02 feature.(len(<datasource.wikitext.revision.urls>) / max(len(<datasource.tokenized(datasource.revision.text)>), 1))
0.022 feature.(portuguese.stemmed.revision.stems_length / max(wikitext.revision.content_chars, 1))
0.022 feature.(wikitext.revision.content_chars / max(len(<datasource.tokenized(datasource.revision.text)>), 1))
0.022 feature.(wikitext.revision.punctuation_chars / max(wikitext.revision.chars, 1))
0.025 feature.max(<datasource.map(<built-in function len>, <datasource.tokenized(datasource.revision.text)>)>)
0.028 feature.(len(<datasource.wikitext.revision.punctuations>) / max(len(<datasource.tokenized(datasource.revision.text)>), 1))
0.029 feature.(wikitext.revision.longest_repeated_char / max(wikitext.revision.chars, 1))
0.029 feature.(wikitext.revision.wikilinks / max(len(<datasource.tokenized(datasource.revision.text)>), 1))
0.031 feature.(len(<datasource.portuguese.dictionary.revision.dict_words>) / max(len(<datasource.wikitext.revision.words>), 1))
0.036 feature.(len(<datasource.portuguese.dictionary.revision.non_dict_words>) / max(len(<datasource.wikitext.revision.words>), 1))
0.037 feature.wikitext.revision.templates
0.054 feature.(len(<datasource.wikitext.revision.whitespaces>) / max(len(<datasource.tokenized(datasource.revision.text)>), 1))
Looks like words_to_watch gets more importance than badwords and informals!
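The comparison can be checked mechanically by summing the reported importances per word-list group. The sketch below parses the six relevant lines copied from the listing; the helper name `importance_by_group` and the substring-based grouping are assumptions for illustration, not part of revscoring.

```python
# Lines copied from the listing above: "<importance> <feature name>".
LISTING = """\
0.001 feature.len(<datasource.portuguese.badwords.revision.matches>)
0.002 feature.len(<datasource.portuguese.informals.revision.matches>)
0.002 feature.(len(<datasource.portuguese.informals.revision.matches>) / max(len(<datasource.wikitext.revision.words>), 1))
0.003 feature.(len(<datasource.portuguese.badwords.revision.matches>) / max(len(<datasource.wikitext.revision.words>), 1))
0.003 feature.len(<datasource.portuguese.words_to_watch.revision.matches>)
0.006 feature.(len(<datasource.portuguese.words_to_watch.revision.matches>) / max(len(<datasource.wikitext.revision.words>), 1))
"""

GROUPS = ("badwords", "informals", "words_to_watch")


def importance_by_group(listing, groups=GROUPS):
    """Sum the importance of every feature that mentions each word list."""
    totals = {group: 0.0 for group in groups}
    for line in listing.splitlines():
        if not line.strip():
            continue
        value, name = line.split(" ", 1)
        for group in groups:
            if f"portuguese.{group}." in name:
                totals[group] += float(value)
    # Round away floating-point noise from summing three-decimal values.
    return {group: round(total, 6) for group, total in totals.items()}


print(importance_by_group(LISTING))
```

Summed this way, words_to_watch comes to 0.009 against 0.004 each for badwords and informals, which matches the observation above. All three remain small next to the largest single importance in the listing (0.054).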
https://github.com/wikimedia/articlequality/pull/121
Here is the articlequality code for review.
See https://github.com/wikimedia/articlequality/pull/122 for another possible explanation for the problem:
I didn't re-extract the features or retrain the model after the change to verify its impact on the metrics, but I believe it might help by improving the dataset quality.
It occurred to me that some of these expressions are also used by Salebot¹, with the difference that in the bot config² users assign a score to each word/regex indicating how much it contributes towards classifying an edit as needing to be reverted. This allows it to "ignore" words which are common in good edits, unless there are too many of them.
¹ There is a version of the source code at https://phabricator.wikimedia.org/diffusion/TSVN/browse/gribeco/salebot2/, but I'm not sure if it is the latest version, given that its latest update was in 2014
² See w:pt:User:Salebot/Config and w:fr:User:Salebot/Config
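The weighted approach described above can be sketched roughly as follows. Everything here is hypothetical: the patterns, the weights, the threshold, and the function names are invented for illustration and are not taken from the Salebot source or its on-wiki configuration, whose actual format and sign conventions may differ.

```python
import re

# Hypothetical (pattern, weight) pairs. A common word gets a low weight so
# that a single occurrence does not flag an edit by itself.
WEIGHTED_PATTERNS = [
    (re.compile(r"\bgrandes?\b", re.IGNORECASE), 1),
    (re.compile(r"\bexcelentes?\b", re.IGNORECASE), 3),
    (re.compile(r"\bsem dúvida\b", re.IGNORECASE), 5),
]

# Hypothetical cut-off at or above which a text is flagged.
THRESHOLD = 5


def suspicion_score(text, patterns=WEIGHTED_PATTERNS):
    """Sum weight * number of occurrences over all patterns."""
    return sum(
        weight * len(pattern.findall(text)) for pattern, weight in patterns
    )


def needs_review(text, threshold=THRESHOLD):
    """Flag the text once the accumulated score reaches the threshold."""
    return suspicion_score(text) >= threshold


print(needs_review("Uma grande cidade"))              # one low-weight word
print(needs_review("Sem dúvida um excelente líder"))  # two heavier phrases
```

Compared with a plain match count, the weights let frequent but mostly harmless words such as "grande" contribute little unless they pile up, which is the behaviour described for the bot.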