Create new labeling campaign for Basque Wikipedia articlequality
Open, NormalPublic

Description

Goal is to increase the number of observations used to train the model. This should, in theory, boost model confidence.

We probably want to use the current ORES model to generate a new stratified sample. If we can join ores_classification to page/revision we can probably get that sample together quickly for pages that have been edited recently.
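A minimal sketch of the stratified sampling step, assuming the join described above has already produced rows of (page title, predicted class). The function name `stratified_sample`, the row layout, and the class labels in the usage lines are illustrative assumptions, not part of ORES or of the actual sampling script.

```python
import random
from collections import defaultdict


def stratified_sample(rows, per_class, seed=0):
    """Draw up to `per_class` page titles for each predicted class.

    `rows` is an iterable of (page_title, predicted_class) pairs. This layout
    is an assumption about what the join would return, not a real schema.
    """
    by_class = defaultdict(list)
    for title, predicted_class in rows:
        by_class[predicted_class].append(title)

    rng = random.Random(seed)  # fixed seed so the sample can be reproduced
    sample = {}
    for predicted_class, titles in by_class.items():
        # A class with fewer pages than requested contributes all of its pages.
        k = min(per_class, len(titles))
        sample[predicted_class] = rng.sample(titles, k)
    return sample


# Hypothetical input: class names and titles are placeholders.
rows = [("Page %d" % i, "class_a") for i in range(5)]
rows += [("Other %d" % i, "class_b") for i in range(2)]
sample = stratified_sample(rows, per_class=3)
print({cls: len(titles) for cls, titles in sample.items()})
```

Capping each stratum at the number of available pages means rare classes simply contribute fewer observations, so the total can fall short of the target.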

Event Timeline

Halfak created this task. Feb 5 2019, 9:11 PM
Restricted Application added a subscriber: Aklapper. Feb 5 2019, 9:11 PM
Restricted Application added a project: artificial-intelligence. Feb 5 2019, 9:11 PM
Halfak triaged this task as Normal priority. Feb 5 2019, 9:11 PM
Halfak moved this task from Untriaged to Maintenance/cleanup on the Scoring-platform-team board.
Halfak added a comment. Feb 5 2019, 9:22 PM

Right now, we have about 50 observations per class. We should aim to supplement with 50 new observations per class. With 6 classes, that means 300 new articles to label.

It would be interesting if the new articles were not municipalities of France, since we have a lot of them and they are all rated as C. Nearly all of them use the template Frantziako udalerri infotaula INSEE.

Since lists are no longer counted as articles, this will also affect the quality. Maybe items currently in the pool that are lists could be excluded from the final observations.

Is there a way we could tell from the title of an article if it is a municipality of France? Similarly, can we tell which articles are a list by the title or will we need to look for a template?

Not by the title. Well... yes, you can download a list from PetScan, but it contains nothing but the names of the municipalities. You can track categories or the use of the template.

About lists: yes, all lists are in the Zerrenda: namespace.
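A rough sketch of the two filters discussed above: excluding lists by the `Zerrenda:` title prefix and excluding French municipalities by template use. The function names and the idea of checking raw wikitext are assumptions for illustration; the pattern match is a crude heuristic and does not follow redirects to the template.

```python
import re

LIST_NAMESPACE_PREFIX = "Zerrenda:"
FRENCH_MUNICIPALITY_TEMPLATE = "Frantziako udalerri infotaula INSEE"


def is_list_page(title):
    """True if the title sits in the list namespace."""
    return title.startswith(LIST_NAMESPACE_PREFIX)


def uses_template(wikitext, template_name):
    """Heuristic check for a `{{Template name ...}}` call in raw wikitext.

    Underscores are treated as spaces and the first letter is matched
    case-insensitively, as MediaWiki does for template names.
    """
    first, rest = template_name[0], template_name[1:]
    pattern = (
        r"\{\{\s*[" + re.escape(first.upper()) + re.escape(first.lower()) + r"]"
        + re.escape(rest) + r"\s*[|}]"
    )
    return re.search(pattern, wikitext.replace("_", " ")) is not None


def keep_for_labeling(title, wikitext):
    """Keep a page unless it is a list or a French municipality stub."""
    if is_list_page(title):
        return False
    return not uses_template(wikitext, FRENCH_MUNICIPALITY_TEMPLATE)
```

Checking the namespace prefix needs only the title, while the template check needs the page text (or a template-links lookup), which is why the two cases are handled separately.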

On Feb 5, 2019, at 11:16 PM, Halfak <no-reply@phabricator.wikimedia.org> wrote:
Halfak added a comment.

Is there a way we could tell from the title of an article if it is a municipality of France? Similarly, can we tell which articles are a list by the title or will we need to look for a template?

TASK DETAIL
https://phabricator.wikimedia.org/T215351

Halfak added a comment. Jun 4 2019, 8:47 PM

I have finally made some progress here and I have a good sample worth labeling. What should I name the labeling campaign? I'm imagining something like "Article quality version 2".

Halfak added a comment. Jun 5 2019, 2:46 PM

I was able to find 172 pages that span the predicted quality spectrum. I think labeling them will boost the model fitness substantially. Regardless, that's the most observations we can manage without getting a lot more clever.

If this doesn't work, I think we should consider building a dataset of pages that ORES tends to get wrong. We can add them as targeted labels to try and teach the model better behavior. But for now, let's try this sample. I just need a name for the labeling campaign.
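As a sketch of that fallback idea, pages the model "gets wrong" could be taken to mean pages whose predicted class differs from a human label. That definition, the function name `find_misclassified`, and the dict-based inputs are assumptions for illustration only.

```python
def find_misclassified(predictions, labels):
    """Return (title, predicted, labeled) for pages where the two disagree.

    `predictions` and `labels` map page title -> quality class. Only titles
    present in both mappings are compared.
    """
    shared_titles = predictions.keys() & labels.keys()
    return sorted(
        (title, predictions[title], labels[title])
        for title in shared_titles
        if predictions[title] != labels[title]
    )
```

The resulting list could then serve as a candidate pool for targeted labeling.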

That sounds great!
I think that "Article quality version 2" is not bad... currently I don't think we need something more catchy, as this is a private task!

Halfak added a comment. Jun 5 2019, 3:45 PM

See https://labels.wmflabs.org/ui/euwiki/. Currently the campaign has an English-language name. We can rename it when I get a good translation.

I have opened it and the first one is a diff of one letter in an article. What should I do with that?

Halfak added a comment. Jun 5 2019, 3:49 PM

Looks like I made a mistake. One sec.

Halfak added a comment. Jun 5 2019, 3:51 PM

OK, it should be fixed.

Could I go ahead with this? I have some time allocated for it in the next few days. Thanks!

Yes. It looks to be ready to go. Let me know if you see any other issues.