
Build article quality model for Dutch Wikipedia
Open, Medium, Public

Description

How do Wikipedians label articles by their quality level?

What levels are there and what processes do they follow when labeling articles for quality?

How do InfoBoxes work? Are they used like on English Wikipedia?

Are there "citation needed" templates? How do they work?

Event Timeline

Halfak created this task. May 19 2019, 9:24 AM
Restricted Application added a project: artificial-intelligence. May 19 2019, 9:24 AM
Restricted Application added a subscriber: Aklapper.
Halfak added a subscriber: RonnieV. May 19 2019, 9:40 AM

I looked into this with @RonnieV and we weren't able to find any documentation about an article quality scale on nlwiki. I think defining article quality in nlwiki terms will be a good first step to building this model. Alternatively, one could translate the English Wikipedia article quality scale for use on nlwiki. See https://en.wikipedia.org/wiki/Wikipedia:Content_assessment#Grades

Halfak added a subscriber: Ciell. May 19 2019, 9:41 AM

@Ciell, maybe you have some ideas for how we could get started here, or on whether an article quality prediction model would be helpful at all.

Harej triaged this task as Lowest priority. Jun 4 2019, 9:22 PM
Harej raised the priority of this task from Lowest to Needs Triage.
Harej triaged this task as Lowest priority. Jun 4 2019, 9:31 PM
Ciell added a comment. Jun 5 2019, 4:54 PM

In 2006 (I know, 13 years ago already) we voted on this quality scale and the idea was declined by the Dutch community (https://nl.wikipedia.org/wiki/Wikipedia:Stemlokaal/Kwalificaties_bij_artikelen). A working group was formed in 2007, but never got far. It does explain the different types of articles we distinguish in the main namespace though (https://nl.wikipedia.org/wiki/Wikipedia:Wikiproject/Kwaliteitsverbetering & https://nl.wikipedia.org/wiki/Wikipedia:Wikiproject/Projectgroep/Kwaliteitsschaal).

We have Beginnetje (stubs), Etalage (excellent), and everything in between (which isn't graded).

Hey! Picking this back up. @Ciell and I determined via chat that we think a classifier that can put articles into three categories would be useful: Stubs, Featured Articles, and everything in between.

  1. How could I gather a sample of Stub articles? I'd like to grab specific revisions of those articles that we know to be stubs.
  2. Same for Excellent Articles.
  3. How could I gather a sample of articles -- or versions of articles -- that are between Stub and Excellent? E.g., could we look for the version of the article where the Stub assessment is removed? Are there articles that we know are not Excellent quality but might represent the wide range between stub and excellent?

Looks like https://nl.wikipedia.org/wiki/Wikipedia:Uitgelicht lists out some Featured articles

Maybe https://nl.wikipedia.org/wiki/Wikipedia:Etalage and https://nl.wikipedia.org/wiki/Wikipedia:Beginnetje are better places to start. I can't seem to find how the articles are tagged though.

Aha! https://nl.wikipedia.org/wiki/Categorie:Wikipedia:Beginnetje_autosport seems to list out autosport stubs. There's a bunch of different stub categories here.

Hey @Halfak and @Ciell,

A stub is called 'beginnetje' in Dutch. A lot of these articles are labelled with 'Sjabloon:Beginnetje' and are in https://nl.wikipedia.org/wiki/Speciaal:VerwijzingenNaarHier/Sjabloon:Beginnetje .
Articles that are considered valuable and good are labelled with 'Sjabloon:Etalage'. These can be found at https://nl.wikipedia.org/wiki/Speciaal:VerwijzingenNaarHier/Sjabloon:Etalage .
We can try to identify some improved articles as well.
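
The two VerwijzingenNaarHier listings above could also be fetched programmatically. The following is a minimal sketch, assuming the standard MediaWiki Action API `list=embeddedin` query (which lists pages that transclude a given template). The helper name `embeddedin_url`, the restriction to the article namespace, and the batch size are illustrative assumptions; the sketch only builds the request URL and does not send it.

```python
from urllib.parse import urlencode

# Action API endpoint for Dutch Wikipedia.
API = "https://nl.wikipedia.org/w/api.php"


def embeddedin_url(template, limit=500, cont=None):
    """Build a query URL listing articles that transclude Sjabloon:<template>.

    `cont` is the `eicontinue` token from a previous response, used to page
    through long result sets.
    """
    params = {
        "action": "query",
        "list": "embeddedin",
        "eititle": f"Sjabloon:{template}",
        "einamespace": 0,  # assumption: only main-namespace articles are wanted
        "eilimit": limit,
        "format": "json",
    }
    if cont:
        params["eicontinue"] = cont
    return f"{API}?{urlencode(params)}"


stub_url = embeddedin_url("Beginnetje")
etalage_url = embeddedin_url("Etalage")
```

Each response would have to be paged with the returned `eicontinue` token until no token is returned, since the stub listing is likely far larger than one batch.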

Ciell added a comment. May 16 2020, 9:12 PM

@RonnieV I explained to @Halfak that indeed we have stubs in the Dutch Wikipedia, but an article is approved as such even when it only has three facts (official rule), and preferably one source and two to three lines of text (unofficial rule).

I am definitely in favor of training the system on an extra class. I once tried something similar with the "Rough Diamonds" (we can easily check the articles on that page to see if they are still close to FA). The (archived) Wikipedia:Lezenswaardig and the (inactive) Project Kwaliteitsverbetering might also be interesting.

If you are looking for articles where the Stub assessment was removed: this project improved 1,000 stubs into 'normal' articles (after fifty improved (weggewerkt) articles the list is deleted, so you'll need to look into the page history).

Interesting! Looking at the points at which stub tags are removed could give us an "improved" class.

So we'd have Stub --> Improved --> Featured.

In theory, we can track removals of the Stub assessment to get this.

I saw a ping from @Ciell that nlwikipedians approve moving forward. I thought I should document it here.

Hey folks! We had a chat about modeling during the Machine Learning Platform's sync meeting today. Essentially, it seems there is interest in developing intermediary categories (between "Beginnetje" and "Etalage"). In the meantime, we can probably build a useful model using the 2 categories and the fabricated third "inbetween" category. Then we could use the predictions from that model to help stratify the "inbetween" category for labeling work.
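
One way to read "stratify the inbetween category for labeling work" is to bin articles by the model's predicted score and draw a few from each bin, so labelers see the whole range rather than mostly the crowded middle. This is a minimal sketch under that assumption; the function name, the number of bins, and the use of a single probability per article are hypothetical and not taken from any existing ORES tooling.

```python
import random
from collections import defaultdict


def stratified_sample(scores, n_bins=5, per_bin=2, seed=0):
    """Pick up to `per_bin` titles from each equal-width score bin.

    scores: dict mapping article title -> predicted probability in [0, 1]
            (hypothetical model output for the higher-quality class).
    Returns a dict mapping bin index -> list of sampled titles.
    """
    rng = random.Random(seed)
    bins = defaultdict(list)
    for title, p in scores.items():
        # A score of exactly 1.0 is folded into the top bin.
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append(title)

    sample = {}
    for idx in sorted(bins):
        titles = sorted(bins[idx])  # sort first so the seed gives a stable result
        rng.shuffle(titles)
        sample[idx] = titles[:per_bin]
    return sample
```

Empty bins are simply absent from the result, which also shows where the current model sees no articles at all.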

So, I think the next step here is to build an extractor that can gather examples of "Beginnetje" and "Etalage". We'll also want to gather examples of articles moving out of "Beginnetje" and we can call them "inbetween" or maybe some Dutch translation for now.

Regarding the extractor, I can see that Etalage articles have templates in the *article* that look like {{Etalage|42311910|2014|10|20}}. It looks like Beginnetje articles have a template that looks like {{Beginnetje|wetenschap & technologie|2018|03|30}}. So I think we'll want to scan for the appearance of those templates.

Hey folks, Thanks for the nice meeting and your information.
The template name is 'Etalage', 42311910 is the revision number that got approved and then follow year, month and day of the decision to recognise this article as 'Etalage'.
The template name is 'Beginnetje'. It is followed by a category (1 out of a fixed list of 46), and then follow year, month and day of the decision.
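
The two template layouts described above can be pulled apart with a small parser. This is a minimal sketch using regular expressions, assuming the templates always carry the parameters in the order given (revision number or category, then year, month, day). The function names are illustrative, and templates written without the date parameters would return `None` here.

```python
import re
from datetime import date

# {{Etalage|<approved revision id>|<year>|<month>|<day>}}
ETALAGE_RE = re.compile(
    r"\{\{\s*[Ee]talage\s*\|\s*(\d+)\s*\|\s*(\d{4})\s*\|\s*(\d{1,2})\s*\|\s*(\d{1,2})\s*\}\}"
)
# {{Beginnetje|<category>|<year>|<month>|<day>}}
BEGINNETJE_RE = re.compile(
    r"\{\{\s*[Bb]eginnetje\s*\|\s*([^|{}]+?)\s*\|\s*(\d{4})\s*\|\s*(\d{1,2})\s*\|\s*(\d{1,2})\s*\}\}"
)


def parse_etalage(wikitext):
    """Return the approved revision id and decision date, or None if absent."""
    m = ETALAGE_RE.search(wikitext)
    if m is None:
        return None
    rev_id, year, month, day = m.groups()
    return {"rev_id": int(rev_id), "date": date(int(year), int(month), int(day))}


def parse_beginnetje(wikitext):
    """Return the stub category and decision date, or None if absent."""
    m = BEGINNETJE_RE.search(wikitext)
    if m is None:
        return None
    category, year, month, day = m.groups()
    return {"category": category, "date": date(int(year), int(month), int(day))}
```

For the Etalage case, the parsed revision number is useful on its own: it points at the exact revision that was approved, which is the version a model should be trained on.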

We will have a look at some articles that might qualify for a level below 'Etalage'. We are thinking of articles that just didn't win one of the writing contests.

Do you have handy tools to find the removal of the 'Beginnetje' template (or maybe the 'Etalage' template) from the history of articles?

We have a strategy for scanning the history of articles and looking for the inclusion and removal of templates. Right now I think we have what we need to scan for the introduction of these templates. We'll need to write some new code to find the removal of the {{Beginnetje}} template but that shouldn't be too complicated.
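
To illustrate what finding the removal could look like, here is a minimal sketch that walks an article's revisions from oldest to newest and reports each revision where the template was present in the previous revision and absent in the current one. It is not the existing extractor code; the function name and the `(rev_id, wikitext)` input shape are assumptions made for the example.

```python
import re

# Matches {{Beginnetje|...}} as well as a bare {{Beginnetje}}.
BEGINNETJE = re.compile(r"\{\{\s*[Bb]eginnetje\s*[|}]")


def find_template_removals(revisions, pattern=BEGINNETJE):
    """Return the ids of revisions in which the template disappears.

    revisions: iterable of (rev_id, wikitext) pairs, ordered oldest first.
    """
    removals = []
    previously_present = False
    for rev_id, text in revisions:
        present = bool(pattern.search(text or ""))
        if previously_present and not present:
            removals.append(rev_id)
        previously_present = present
    return removals


history = [
    (1, "Korte tekst."),
    (2, "Korte tekst. {{Beginnetje|sport|2018|03|30}}"),
    (3, "Iets meer tekst. {{Beginnetje|sport|2018|03|30}}"),
    (4, "Een flink uitgebreid artikel."),
]
removed_at = find_template_removals(history)  # [4]
```

A removal caused by page blanking or vandalism would also be reported, so a real extractor would probably need to discard removals that are reverted shortly afterwards.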

We will have a look at some articles that might qualify for a level below 'Etalage'. We are thinking of articles that just didn't win one of the writing contests.

I have an idea. How about we use this scale:

  1. Beginnetje -- Tagged with {{Beginnetje|...}}
  2. No longer Beginnetje -- Revision where {{Beginnetje|...}} was removed.
  3. Almost Etalage -- Examples from the writing contest that look good but aren't tagged with {{Etalage|...}}
  4. Etalage -- Tagged with {{Etalage|...}}

There could be some overlap or gaps between 2 and 3, but I bet the model might make better sense of the range.
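
The four-level scale above can be written down as a labeling rule for training examples. This is a minimal sketch; the precedence between the checks (for example, that a writing-contest article outranks a plain stub-tag removal) is an assumption made for illustration and was not decided in this thread, and all names are hypothetical.

```python
from enum import IntEnum


class Quality(IntEnum):
    BEGINNETJE = 1
    NO_LONGER_BEGINNETJE = 2
    ALMOST_ETALAGE = 3
    ETALAGE = 4


def label_revision(has_beginnetje, has_etalage, beginnetje_just_removed, in_contest_list):
    """Map observations about one revision to a quality label, or None.

    None means the revision falls outside the four labeled classes and
    should not be used as a training example.
    """
    if has_etalage:
        return Quality.ETALAGE
    if has_beginnetje:
        return Quality.BEGINNETJE
    if in_contest_list:  # assumed to outrank a plain stub-tag removal
        return Quality.ALMOST_ETALAGE
    if beginnetje_just_removed:
        return Quality.NO_LONGER_BEGINNETJE
    return None
```

Using an ordered enum keeps the levels comparable, which matters if the model is later evaluated on how far off a prediction is rather than only on exact matches.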

I bet there is a big gap between 2 and 3, and most articles will be in there. But if ORES could help identify articles which belong in one of these four categories, I'd be happy if the remainder is in that gap. ORES could then, later on, help categorise the articles from the in-between group and might identify candidates for the four categories.

Ciell and I will work on a list of articles in category 3.

Ciell added a comment. May 30 2020, 3:13 PM

Hi Ronnie,

Happy to pick this up together.
You have apparently already talked with Aaron about the how and what of the 4th category: shall we schedule a moment together to make a start on this?

Kind regards,
Ciell

On Wed 27 May 2020 at 22:31, RonnieV wrote:

I bet there is a big gap between 2 and 3, and most articles will be in there. But if ORES could help identify articles which belong in one of these four categories, I'd be happy if the remainder is in that gap. ORES could then, later on, help categorise the articles from the in-between group and might identify candidates for the four categories.

Ciell and I will work on a list of articles in category 3.


Ciell added a comment. Jun 6 2020, 7:00 PM

Hi all,

We propose to have 5 quality levels.
I created a Wikipedia page on the Dutch Wikipedia, so Dutch Wikipedians can follow and comment.

https://nl.wikipedia.org/wiki/Wikipedia:ORES/Article_quality

Halfak added a comment. Jun 9 2020, 3:22 PM

Thanks for the notes! I've added this to our sync meeting today.

Halfak raised the priority of this task from Lowest to Medium. Jun 23 2020, 4:27 PM