
Build draft quality model for ptwikipedia
Closed, ResolvedPublic

Description

What tools do editors use to find and delete the worst types of new pages (spam, vandalism, and personal attacks)?

Huggle is one of the main tools that people use on ptwikipedia for RC patrolling. Also RTRC, FastButtons (https://pt.wikipedia.org/wiki/MediaWiki:Gadget-fastbuttons.js). There may be others.

Is there a common way that deletion reasons are recorded in the log? For example, in English Wikipedia, they include names for each deletion reason in the log (e.g "WP:CSD#G11" means the article was deleted as "spam").

Yes, Deletereason-dropdown is commonly used: https://pt.wikipedia.org/wiki/MediaWiki:Deletereason-dropdown.
This includes the formal reasons for speedy deletion: https://pt.wikipedia.org/wiki/Wikip%C3%A9dia:Elimina%C3%A7%C3%A3o_r%C3%A1pida#Regras_formais

These include a one-letter prefix indicating the namespace (with some exceptions): A (Article namespace), C (Category namespace), D (Discussion/talk namespace), U (User namespace), P (Predefinição, the template namespace), R (redirect) and G (general/technical deletion). There are also legacy rules that may still be used in some cases; these are indicated with "ER" + the number of the rule.

The numbers indicate the reason. I translate here the relevant ones:

A1 - Article with gibberish title
A2 - Article with no context
A3 - Article with no content
A4 - Article with no statement regarding notability (people, animals, organizations, web content, events)
A5 - Article with no statement regarding notability (music, books)
A6 - Duplicated content

ER3 - Absurd titles, content moved to another article
ER4 - Absurd titles in an encyclopedia, content moved to another article
ER5 - Recurring content (same content as previously deleted)
ER6 - Spam, publicity, or pamphlet
ER13 - Copyright violations
ER20 - Unsuitable content
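The prefix-plus-number scheme above lends itself to simple extraction from log comments. As a hedged sketch (the exact comment format is an assumption; the regex just looks for codes like "A1" or "ER6"):

```python
import re

# Matches legacy "ER" codes (ER3, ER20, ...) and namespace-prefixed codes
# (A1, C2, U3, ...) as described above. The surrounding comment format
# (e.g. "[[WP:ER#ER6]]: ...") is an assumption for illustration.
CODE_RE = re.compile(r"\b(ER\d+|[ACDUPRG]\d+)\b")

def extract_codes(comment):
    """Return the distinct speedy-deletion codes found in a log comment."""
    return sorted(set(CODE_RE.findall(comment)))

print(extract_codes("[[WP:ER#ER6]] Spam, publicity, or pamphlet"))
```

Grouping log entries by the extracted code would then give per-reason deletion counts.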

Event Timeline

Restricted Application added a subscriber: Aklapper.

We're mostly looking to catch "spam", "vandalism", and other types of problematic drafts that are obvious from the content itself. In other words, you don't need any subject matter expertise or external references to figure out that something is wrong. This is due to limitations in what the model can be expected to figure out.

So copyright is hard because the only way to know for sure is to find where it was copied from.

Would we be likely to find vandalism under ER20 "Unsuitable content"? It seems like ER6 is the only item that covers spam and advertising. Is that right?

Yes, you would tag vandalism with ER20. ER20 may also be used for spam and advertising.

There are some additional standard deletion reasons that can be used for spam or vandalism, available in the dropdown menu:

  • "Spam / propaganda / proselitismo religioso ou político" (spam / propaganda / religious or political proselytism; [[WP:SPAM|detalhes]])
  • "Ofensivo / baixo calão" (offensive / vulgar language)
  • "Disparate / incompreensível / sem contexto" (nonsense / incomprehensible / no context)
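The reasons above could feed a coarse labeling step for training data. A minimal sketch, assuming label names ("spam", "vandalism", "attack", "OK") like those used by the enwiki draftquality model; the specific mapping here is illustrative, not settled:

```python
# Assumed mapping from ptwiki deletion reasons to draft-quality labels.
# ER20 is ambiguous (it also covers spam), so this mapping is a guess.
REASON_TO_LABEL = {
    "ER6": "spam",        # spam, publicity, or pamphlet
    "ER20": "vandalism",  # unsuitable content (may also be spam)
    "Spam / propaganda / proselitismo religioso ou político": "spam",
    "Ofensivo / baixo calão": "attack",
    "Disparate / incompreensível / sem contexto": "vandalism",
}

def label_for(reason):
    """Label a deletion reason; anything unmapped defaults to 'OK'."""
    return REASON_TO_LABEL.get(reason, "OK")

print(label_for("ER6"))
```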
Halfak triaged this task as Medium priority.Mar 23 2020, 4:49 PM
Halfak moved this task from Unorganized to New development on the Machine-Learning-Team board.

Some follow up questions:

  1. How many of the above categories are we targeting?
  2. Do we require the model to predict the probable quality of an article from the draft?
  3. Are we only trying to weed out vandalism and other unwanted content through this model?

Clearly defining the expected outcomes would help accelerate the task.

@GoEThe @Halfak

@Chtnnh, it looks like we want to target a few of these deletion reasons. Essentially, we want to predict vandalism, spam, and other types of clear nonsense. First, we'll need to find the deletions and associate each with the deletion *reasons* recorded in the log of deleted articles.

Understood. That will be our training and testing data, and then, upon satisfactory performance, we can try deploying the model. Am I right?

Sounds like a plan.
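The train/evaluate/deploy loop discussed here can be sketched in miniature, with a trivial keyword classifier standing in for the real draft-quality model (the examples and the classifier are invented for illustration):

```python
def toy_classifier(text):
    """Stand-in model: flags obvious promotional phrasing as spam."""
    return "spam" if "buy now" in text.lower() else "OK"

# In practice, labeled examples would come from the deletion log;
# these are made up.
test_set = [
    ("BUY NOW: cheap watches, great discounts!", "spam"),
    ("Porto is a city in northern Portugal.", "OK"),
    ("The Douro river flows into the Atlantic.", "OK"),
]

correct = sum(toy_classifier(text) == label for text, label in test_set)
print(f"accuracy: {correct / len(test_set):.2f}")
```

Only once the held-out accuracy (and per-class recall, for rare classes like spam) looks satisfactory would deployment make sense.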

From https://quarry.wmflabs.org/query/43261 we can see that we're able to track about 19k deleted articles from the past year. About 4.2k of them have a speedy deletion reason (Has "WP:ER#"). 1582 were deleted with ER20 ("unsuitable") and 770 were deleted with ER6 ("spam"). Do you think it is safe for us to assume that any deletion without either ER20 or ER6 is probably not spam or vandalism?
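Working from the figures quoted above, the proposed assumption implies a fairly imbalanced labeling (the numbers are approximate, as stated):

```python
# Approximate figures from the Quarry query quoted above.
total_deleted = 19_000  # deleted articles in the past year
speedy = 4_200          # with a "WP:ER#" speedy-deletion reason
er20 = 1_582            # "unsuitable"
er6 = 770               # "spam"

# Under the assumption, only ER20/ER6 deletions count as spam/vandalism.
flagged = er20 + er6
print(f"flagged: {flagged} ({flagged / total_deleted:.1%} of deletions)")
```

So roughly one in eight tracked deletions would be labeled spam or vandalism, which argues for rebalancing the training set.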

@Chtnnh, grab the text dataset from /home/halfak/projects/draftquality/datasets/ptwiki.draft_quality.balanced_3k.with_text.json.bz2


If I'm not mistaken, there are also ~700 spam articles that were deleted under the following reason, mentioned by @GoEThe in T246667#5940917:

"Spam / propaganda / proselitismo religioso ou político" (spam / propaganda / religious or political proselytism)

@GoEThe : I see you've installed the version of the script I mentioned at T246667#6079484. Did you have the chance to test it on Special:Newpages? Is it good enough for us to publicize it for other users?

Apparently, you are the only one who installed it so far:
https://global-search.toolforge.org/?q=%22DraftAndArticleQuality.js%22&namespaces=2%2C8
while there are many other users testing the original version:
https://pt.wikipedia.org/w/index.php?sort=relevance&search=insource%3A%22EpochFail%2FArticleQuality.js%22&title=Especial:Pesquisar&profile=advanced&fulltext=1&advancedSearch-current=%7B%7D&ns2=1&ns8=1

@He7d3r, I tried it briefly. I think the script works fine. The symbols are not immediately intuitive, but I guess that is not what matters for this phase. Yes, I think we can publicize it and see what people think about the new draftquality model.

@GoEThe: in case you have any suggestions on better images for this purpose, we can try changing them. @Halfak suggested the https://commons.wikimedia.org/wiki/Category:OOUI_icons as a good source of icons we could use.

I was thinking... Would it be better to print the ORES prediction somewhere on the page, so that it is easier for people to copy and paste it into misclassification reports?
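For reference, the prediction the gadget would print can also be fetched directly. A hedged sketch against the ORES v3 scores endpoint as deployed at the time; treat the exact response layout as an assumption:

```python
import json
from urllib.request import urlopen

def parse_score(doc, revid):
    """Pull prediction and class probabilities out of an ORES v3 response."""
    score = doc["ptwiki"]["scores"][str(revid)]["draftquality"]["score"]
    return score["prediction"], score["probability"]

def draftquality_prediction(revid):
    """Fetch the ptwiki draftquality score for one revision ID."""
    url = (
        "https://ores.wikimedia.org/v3/scores/ptwiki/"
        f"?models=draftquality&revids={revid}"
    )
    with urlopen(url) as resp:
        return parse_score(json.load(resp), revid)
```

Printing the `prediction` string (and perhaps the top probability) next to each entry would give reporters exactly the text to paste.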