Build draft quality model for ptwikipedia
Closed, Resolved · Public

Description

What tools do editors use to find and delete the worst types of new pages (spam, vandalism, and personal attacks)?

Huggle is one of the main tools that people use on ptwikipedia for RC patrolling. RTRC and FastButtons (https://pt.wikipedia.org/wiki/MediaWiki:Gadget-fastbuttons.js) are also used. There may be others.

Is there a common way that deletion reasons are recorded in the log? For example, English Wikipedia includes a name for each deletion reason in the log (e.g. "WP:CSD#G11" means the article was deleted as "spam").

Yes, Deletereason-dropdown is commonly used: https://pt.wikipedia.org/wiki/MediaWiki:Deletereason-dropdown.
This includes the formal reasons for speedy deletion: https://pt.wikipedia.org/wiki/Wikip%C3%A9dia:Elimina%C3%A7%C3%A3o_r%C3%A1pida#Regras_formais

These include a one-letter prefix indicating the namespace (with some exceptions): A (Article namespace), C (Category namespace), D (Discussion or talk namespace), U (User namespace), P (Predefinition or template namespace), R (redirect) and G (technical deletion). There are also legacy rules that may still be used in some cases; these are indicated with "ER" + the number of the rule.

The numbers indicate the reason. I have translated the relevant ones here:

A1 - Article with gibberish title
A2 - Article with no context
A3 - Article with no content
A4 - Article with no statement regarding notability (people, animals, organizations, web content, events)
A5 - Article with no statement regarding notability (music, books)
A6 - Duplicated content

ER3 - Absurd titles, content moved to another article
ER4 - Absurd titles in an encyclopedia, content moved to another article
ER5 - Recurring content (same content as previously deleted)
ER6 - Spam, publicity, or pamphlet
ER13 - Copyright violations
ER20 - Unsuitable content
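
As a rough illustration of the naming scheme above, here is a minimal sketch that splits a rule code into its scope and number. The `PREFIXES` table and the `parse_rule` helper are hypothetical names introduced for this example, and the sketch ignores the exceptions to the prefix convention mentioned above.

```python
import re

# Prefix letters as described in the task description.
# "ER" marks the legacy rules. This table is an illustrative assumption,
# not an official mapping.
PREFIXES = {
    "A": "article",
    "C": "category",
    "D": "discussion/talk",
    "U": "user",
    "P": "template",
    "R": "redirect",
    "G": "technical",
    "ER": "legacy rule",
}

# "ER" is listed first so that it is tried before the single-letter prefixes.
CODE_RE = re.compile(r"^(ER|[ACDUPRG])(\d+)$")


def parse_rule(code):
    """Split a rule code such as 'A4' or 'ER20' into (scope, number).

    Returns None when the code does not follow the prefix + number pattern.
    """
    match = CODE_RE.match(code.strip().upper())
    if match is None:
        return None
    prefix, number = match.groups()
    return PREFIXES[prefix], int(number)


print(parse_rule("A4"))
print(parse_rule("ER20"))
```

Returning `None` for unrecognized codes keeps the exceptional cases visible instead of silently guessing a scope for them.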

Event Timeline

Restricted Application added a subscriber: Aklapper.

We're mostly looking to catch "spam", "vandalism", and other types of problematic drafts that are obvious from the content itself. In other words, you don't need any subject matter expertise or external references to figure out that something is wrong. This is due to limitations in what the model can be expected to figure out.

So copyright is hard because the only way to know for sure is to find where it was copied from.

Would we be likely to find vandalism under ER20 "Unsuitable content"? It seems like ER6 is the only item that covers spam and advertising. Is that right?

Yes, you would tag vandalism with ER20. ER20 may also be used for spam and advertising.

There are some additional standard deletion reasons that can be used for spam or vandalism, available in the dropdown menu:

  • Spam / propaganda / proselitismo religioso ou político ([[WP:SPAM|detalhes]])
  • Ofensivo / baixo calão
  • Disparate / incompreensível / sem contexto
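
A minimal sketch of how these dropdown reasons could be recognized in a deletion log comment, assuming the reason text appears verbatim in the comment. The label names ("spam", "offensive", "nonsense") and the `label_from_dropdown` helper are hypothetical and only group the three reasons listed above.

```python
# Dropdown reasons quoted in this thread, mapped to illustrative labels.
# The label names are assumptions made for this sketch.
DROPDOWN_LABELS = {
    "Spam / propaganda / proselitismo religioso ou político": "spam",
    "Ofensivo / baixo calão": "offensive",
    "Disparate / incompreensível / sem contexto": "nonsense",
}


def label_from_dropdown(comment):
    """Return the label of the first dropdown reason found in the comment.

    Matching is case-insensitive and by substring, so trailing wiki links
    such as "([[WP:SPAM|detalhes]])" do not prevent a match.
    Returns None when no known reason is present.
    """
    folded = comment.casefold()
    for reason, label in DROPDOWN_LABELS.items():
        if reason.casefold() in folded:
            return label
    return None


example = "Spam / propaganda / proselitismo religioso ou político ([[WP:SPAM|detalhes]])"
print(label_from_dropdown(example))
```
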
Halfak triaged this task as Medium priority. Mar 23 2020, 4:49 PM
Halfak moved this task from Unsorted to New development on the Machine-Learning-Team board.

Some follow up questions:

  1. How many of the above categories are we targeting?
  2. Do we require the model to predict the probable quality of an article from the draft?
  3. Are we only trying to weed out vandalism and other unwanted content through this model?

Clearly defining the outcomes would help accelerate the task.

@GoEThe @Halfak

@Chtnnh, it looks like we want to target a few of these deletion reasons. Essentially, we want to predict vandalism, spam, and other types of clear nonsense. First, we'll need to find the deletions and try to extract the deletion *reasons* from the log of deleted articles.

Understood. That will be our training and testing data and then upon satisfactory performance we can try deploying the model, am I right?

Sounds like a plan.

From https://quarry.wmflabs.org/query/43261 we can see that we're able to track about 19k deleted articles from the past year. About 4.2k of them have a speedy deletion reason (the log comment contains "WP:ER#"). 1582 were deleted with ER20 ("unsuitable") and 770 were deleted with ER6 ("spam"). Do you think it is safe for us to assume that any deletion without either ER20 or ER6 is probably not spam or vandalism?
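
To make the proposed labeling rule concrete, here is a minimal sketch that labels a deletion from its log comment and works out what share of the tracked deletions the two rules cover. It assumes the rule number directly follows the "WP:ER#" marker (optionally repeated as "ER"); the actual comment format has not been checked here, and `label_deletion` is a hypothetical helper.

```python
import re

# Assumption: the rule number follows the literal "WP:ER#" marker,
# for example "WP:ER#20" or "WP:ER#ER20".
ER_RE = re.compile(r"WP:ER#(?:ER)?(\d+)")


def label_deletion(comment):
    """Label a deletion log comment as 'spam', 'unsuitable' or 'other'."""
    numbers = {int(n) for n in ER_RE.findall(comment)}
    if 6 in numbers:
        return "spam"
    if 20 in numbers:
        return "unsuitable"
    return "other"


# Counts quoted in the comment above (19k and 4.2k are approximate).
tracked = 19_000
with_speedy_reason = 4_200
er20, er6 = 1582, 770

targeted = er20 + er6
print(f"targeted deletions: {targeted}")
print(f"share of tracked deletions: {targeted / tracked:.1%}")
print(f"share of speedy deletions: {targeted / with_speedy_reason:.1%}")
```

Under this rule everything outside ER6 and ER20 falls into "other", which is exactly the assumption the question above asks about.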

@Chtnnh, grab the text dataset from /home/halfak/projects/draftquality/datasets/ptwiki.draft_quality.balanced_3k.with_text.json.bz2
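
A minimal sketch for reading such a file, assuming it holds one JSON object per line (the layout was not confirmed in this thread). The `read_observations` helper and any field names are hypothetical.

```python
import bz2
import json


def read_observations(path):
    """Yield one dict per non-empty line of a bz2-compressed JSON-lines file.

    Assumption: the dataset stores one JSON object per line.
    """
    with bz2.open(path, mode="rt", encoding="utf-8") as lines:
        for line in lines:
            line = line.strip()
            if line:
                yield json.loads(line)
```

Reading lazily with a generator avoids loading the whole decompressed text dataset into memory at once.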

If I'm not mistaken, there are also ~700 spam articles that were deleted with the following reason, mentioned by GoEThe in T246667#5940917:

Spam / propaganda / proselitismo religioso ou político

@GoEThe : I see you've installed the version of the script I mentioned at T246667#6079484. Did you have the chance to test it on Special:Newpages? Is it good enough for us to publicize it for other users?

Apparently, you are the only one who installed it so far:
https://global-search.toolforge.org/?q=%22DraftAndArticleQuality.js%22&namespaces=2%2C8
while there are many other users testing the original version:
https://pt.wikipedia.org/w/index.php?sort=relevance&search=insource%3A%22EpochFail%2FArticleQuality.js%22&title=Especial:Pesquisar&profile=advanced&fulltext=1&advancedSearch-current=%7B%7D&ns2=1&ns8=1

@He7d3r, I tried it briefly. I think the script works fine. The symbols are not immediately intuitive, but I guess that is not what matters at this phase. Yes, I think we can publicize it and see what they think about the new draftquality model.

@GoEThe: in case you have any suggestions for better images for this purpose, we can try changing them. @Halfak suggested https://commons.wikimedia.org/wiki/Category:OOUI_icons as a good source of icons we could use.

I was thinking... Would it be better to print the ORES prediction on the page somewhere so that it is easier for people to copy and paste that for misclassification reports?
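
One way to picture the idea: a minimal sketch that turns a score into a single line that is easy to copy into a report. The URL pattern and the `prediction`/`probability` keys reflect my understanding of the ORES v3 scores response and should be treated as assumptions; the class labels in the example are illustrative, and `ores_score_url` and `format_prediction` are hypothetical helpers.

```python
def ores_score_url(rev_id, wiki="ptwiki", model="draftquality"):
    """Build the URL for a single-revision score (assumed ORES v3 pattern)."""
    return f"https://ores.wikimedia.org/v3/scores/{wiki}/{rev_id}/{model}"


def format_prediction(rev_id, score):
    """Format a score dict as one copy-pasteable line.

    Assumption: `score` has a "prediction" string and a "probability"
    mapping from class label to probability.
    """
    # Sort classes from most to least probable for readability.
    ranked = sorted(score["probability"].items(), key=lambda item: -item[1])
    probabilities = ", ".join(f"{label}={p:.2f}" for label, p in ranked)
    return f"rev {rev_id}: prediction={score['prediction']} ({probabilities})"


# Illustrative values only, not real model output.
example_score = {
    "prediction": "OK",
    "probability": {"spam": 0.07, "OK": 0.90, "vandalism": 0.03},
}
print(ores_score_url(123))
print(format_prediction(123, example_score))
```

Including the revision ID and all class probabilities in the line would let a misclassification report be reproduced without re-running the script.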