Build article quality model for Ukrainian Wikipedia
Open, Medium, Public

Description

How do Wikipedians label articles by their quality level? What levels are there and what processes do they follow when labeling articles for quality?
There is a grading scheme with the following levels: IV → III → II → I → Good article → Featured article/list. Afaik, all wikiprojects use this same scheme.
The process for an article to achieve Good or Featured status is the same for any topic. An article has to be nominated and voted for.
All other articles have a chance to be graded only if i) they are in the scope of a wikiproject and ii) the wikiproject is alive and works on grading. 124K articles have a template corresponding to en:Template:WPBannerMeta transcluded on the talk page; not all of them are graded.
Template:Stub and its derivatives, used in the main namespace, roughly mean "this article is not complete"; they can be placed not only on level IV articles but also on level III and even level II articles if an article is generally lacking information.

How do Infoboxes work? Are they used like on English Wikipedia?
Editors are encouraged to use infoboxes. The most used ones tend to take a lot of info from Wikidata (e.g. many biographies, especially those translated from other languages, use Infobox person with no parameters at all). Infoboxes that are used less often, or that are specific to narrow topics with data not present in Wikidata, are more likely to be filled in when used in an article.

Are there "citation needed" templates? How do they work?
Template:Citation needed is the most used of the inline dispute templates; Template:Unreferenced and Template:Refimprove are the most used of the non-inline templates.

Event Timeline

Ata created this task. Apr 30 2020, 8:22 PM
Restricted Application added a project: artificial-intelligence. Apr 30 2020, 8:22 PM
Restricted Application added subscribers: Liuxinyu970226, Base, Aklapper.
Halfak added a subscriber: Halfak. Apr 30 2020, 8:52 PM

Thanks @Ata for filing this. I have a few follow-up questions:

What do the infobox template names look like? We'll need to try to identify them from the text of the article. In English Wikipedia, all Infobox templates start with "Infobox". Is it the same in Ukrainian Wikipedia?

Does Ukrainian Wikipedia use any "Main article" templates -- e.g. when summarizing a full article in the section of another article?

I see templates that look like this on talk pages: {{Стаття проекту Військова техніка|важливість=найвища|рівень=ДС}}

Are all WikiProject templates prefixed with "Стаття проекту"? Will all of the templates use a quality level parameter named "рівень"?

Ata added a comment. Edited Apr 30 2020, 10:10 PM

What do the infobox template names look like? … In English Wikipedia, all Infobox templates start with "Infobox". Is it the same in Ukrainian Wikipedia?

No. Initially there was a prefix Картка: for Infoboxes, but it was deprecated in some templates in order to shorten the titles.
I see several ways of identifying the infoboxes (from a user's side):

I see no way of judging whether a template is an infobox just by its name; there has to be a comparison with the list of template names believed to be infoboxes.
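A minimal sketch of that list-based comparison, for illustration only. The contents of `KNOWN_INFOBOXES` are placeholders (a real list would have to be compiled on-wiki), and treating any name that still carries the legacy `Картка:` prefix as an infobox is an assumption. The only MediaWiki behavior relied on is that underscores equal spaces and the first letter of a title is case-insensitive.

```python
# Hypothetical, hand-maintained list of template names believed to be infoboxes.
# "Infobox person" is mentioned in this task; a real list would be much longer.
KNOWN_INFOBOXES = {"Infobox person"}

# Legacy prefix mentioned above; some templates still carry it (assumption:
# any template that does is an infobox).
LEGACY_PREFIX = "Картка:"


def normalize_name(name: str) -> str:
    """Normalize a template name the way MediaWiki treats titles."""
    name = " ".join(name.replace("_", " ").split())
    return name[:1].upper() + name[1:]


def is_infobox(name: str, known=KNOWN_INFOBOXES) -> bool:
    """Return True if the template name is on the list or has the legacy prefix."""
    normalized = normalize_name(name)
    return normalized.startswith(LEGACY_PREFIX) or normalized in known
```

Usage: `is_infobox("infobox_person")` is true with the placeholder list, while `is_infobox("Main")` is false.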

Does Ukrainian Wikipedia use any "Main article" templates -- e.g. when summarizing a full article in the section of another article?

Yes, there are https://uk.wikipedia.org/wiki/template:Main and template:Докладніше (the latter used both for main ns and in categories).

Are all WikiProject templates prefixed with "Стаття проекту"?

Not all of them; there are some that start with "Вікіпроект" and "Проект". Right now the official spelling of the word is in transition from "проект" to "проєкт", and currently both spellings are used in ukwiki without a system, which means there are redirects for templates, too.

Will all of the templates use a quality level parameter named "рівень"?

Some use "class" as a synonym. I haven't encountered other names for this parameter.

Chtnnh added a subscriber: Chtnnh. May 1 2020, 3:19 PM
Halfak triaged this task as Medium priority. May 4 2020, 4:59 PM
Halfak moved this task from Untriaged to New development on the Machine Learning Platform board.

Hello @Ata! So the approach I am thinking of here is to solve this task in three steps:

  1. Build extractor for ukwiki to help get training data
  2. Build feature list to train model on
  3. Build model and iterate until satisfactory

I need your help with step 1: to build the extractor, we need to know what a template looks like for an article whose quality has been assessed.

Ata added a comment. May 26 2020, 5:38 PM

@Chtnnh Do you mean a WikiProject template? Just one example? Here is an assessed article with a WikiProject template on its talk page.
Or is it about the Infobox template?

It looks like any template starting with "Стаття проекту", "Вікіпроект", or "Проект" could be a quality labeling template. You could scan for templates with those prefixes and then look for a "рівень" or "class" parameter containing one of the valid quality labels (IV, III, II, I, ДС, ВС). @Chtnnh, that should help get you started on the extractor.

@Ata, is there anything important we might miss with this strategy? E.g. are there synonyms for the quality labels, or are there other names for the parameters we should look for?

Ata added a comment. Edited May 27 2020, 6:09 PM

Yes, almost all templates starting with Стаття проекту, Стаття проєкту, Вікіпроект, Вікіпроєкт, Проект or Проєкт (spelling differences) are labeling templates.
Of those that are not, some are used in ns:0, ns:4 and ns:14 but not in ns:1 (I guess they won't interfere here), and a few are in ns:1 and belong to the old Wikipedia Education Program (namely templates Проект:ВікіСтудія, Проект:КНУ, Проект:КПІ, Проект:ЛНМА, Проект:НАУ, Проект:НДУ імені Миколи Гоголя, Проект:Переяслав).

No other names for parameters, just "рівень" or "class".
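The scanning strategy discussed here can be sketched roughly as follows. This is an illustrative regex-based sketch, not the extractor that was actually built: the function name `extract_levels` is hypothetical, and the pattern only handles templates without nested templates inside them, which is an assumption about how the talk-page banners are written.

```python
import re

# Prefixes and parameter names as listed in this thread.
PREFIXES = (
    "Стаття проекту", "Стаття проєкту",
    "Вікіпроект", "Вікіпроєкт",
    "Проект", "Проєкт",
)
LEVEL_PARAMS = ("рівень", "class")

# Matches {{Name|param=value|...}} with no nested templates (simplification).
TEMPLATE_RE = re.compile(r"\{\{([^{}|]+)((?:\|[^{}]*)?)\}\}")


def extract_levels(talk_wikitext: str):
    """Return (template name, raw level value) pairs found on a talk page."""
    results = []
    for match in TEMPLATE_RE.finditer(talk_wikitext):
        name = " ".join(match.group(1).replace("_", " ").split())
        name = name[:1].upper() + name[1:]
        if not name.startswith(PREFIXES):
            continue
        # group(2) starts with "|", so the first split element is empty.
        for part in match.group(2).split("|")[1:]:
            key, sep, value = part.partition("=")
            if sep and key.strip() in LEVEL_PARAMS and value.strip():
                results.append((name, value.strip()))
                break
    return results
```

Templates that match a prefix but carry no "рівень" or "class" parameter, such as the Education Program templates listed above, produce no result in this sketch. The raw value still has to be mapped to a canonical label.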

There are synonyms for the quality labels (first column just for explanation):

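| (featured article) | ВС | вс | Вибрана стаття | вибрана стаття |
| (featured list) | ВСП | всп | Вибраний список | |
| (good article) | ДС | дс | Добра стаття | добра стаття |
| (I) | I | 1 | | |
| (II) | II | 2 | | |
| (III) | III | 3 | | |
| (IV) | IV | 4 | Stub | stub |
| (list) | Список | список | | |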
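For illustration, the synonyms above could be folded into a lookup table like the one below. Choosing the first spelling in each row as the canonical label is an assumption made for this sketch, and the names `LABEL_SYNONYMS` and `normalize_label` are hypothetical.

```python
# Canonical label -> every spelling listed in the table above.
LABEL_SYNONYMS = {
    "ВС": ("ВС", "вс", "Вибрана стаття", "вибрана стаття"),
    "ВСП": ("ВСП", "всп", "Вибраний список"),
    "ДС": ("ДС", "дс", "Добра стаття", "добра стаття"),
    "I": ("I", "1"),
    "II": ("II", "2"),
    "III": ("III", "3"),
    "IV": ("IV", "4", "Stub", "stub"),
    "Список": ("Список", "список"),
}

# Reverse index: any listed spelling -> canonical label.
CANONICAL = {
    synonym: canonical
    for canonical, synonyms in LABEL_SYNONYMS.items()
    for synonym in synonyms
}


def normalize_label(raw: str):
    """Map a raw parameter value to its canonical label, or None if unknown."""
    return CANONICAL.get(raw.strip())
```

Unknown values return `None`, so they can be counted and reviewed rather than silently mapped to a class.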

Fantastic. Thanks for the notes @Ata! This is an amazing reference.

Thank you so much for your support @Ata. We have constructed an initial version of the extractor and are going to have a run on it hopefully by the end of today.

We will reach out to you here when we need anything else. Let's get this task resolved as soon as possible! 😃

When we run the extractor and count the number of instances of each class, we get the following output:

  677 I
 3692 II
23001 III
 9848 IV
  150 ВС
  359 ДС

It seems like there are too few instances of classes I, ДС and ВС. Can you think of any reason why this is the case? With your help we can figure out whether the extractor needs debugging or whether this really is the number of articles with each template class.

Ata added a comment. Edited May 30 2020, 7:51 PM

There are 223 Featured Articles and 739 Good Articles in ukwiki as of today, and not all of them have a project template on a talk page. I cannot say right away whether the numbers you got are exactly true but they do seem plausible. (The lack of article class recognition is a known issue in the wiki.)
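As a quick plausibility check, the extracted counts can be set against these totals. The sketch below only does the division. It assumes the extractor output and the on-wiki totals are close enough in time to compare, which the thread does not confirm.

```python
# Counts reported by the extractor run earlier in this thread.
extracted = {"ВС": 150, "ДС": 359}
# Totals on ukwiki as stated in the comment above.
totals = {"ВС": 223, "ДС": 739}


def coverage(extracted, totals):
    """Share of articles in each class that the extractor found a label for."""
    return {label: extracted[label] / totals[label] for label in totals}


shares = coverage(extracted, totals)
for label, share in shares.items():
    print(f"{label}: {extracted[label]} of {totals[label]} ({share:.1%})")
```

Roughly two thirds of Featured Articles and just under half of Good Articles are covered, which is consistent with not every such article having a project template on its talk page.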

Correct me if I'm wrong, but ВС is the featured article class, right? If so, then the numbers may be worth trusting and we can go ahead with building the feature lists for the model.

Ata added a comment. May 30 2020, 8:31 PM

Yes, ВС = Featured Article, ДС = Good Article.

Chtnnh added a comment. Jun 1 2020, 3:28 PM

Great, thanks @Ata
We are going to move ahead with these numbers to build an initial iteration of the model and then get your feedback on that.

I am starting work on the feature list today. I will let you know if I need anything!

Hey @Chtnnh! Any progress we can build on?

I see no way of judging whether a template is an infobox just by its name; there has to be a comparison with the list of template names believed to be infoboxes.

@Ata, can you give us an idea of what this list would look like? We need to find all the infoboxes in a page.

Hey @Chtnnh! Any progress we can build on?

No @Halfak nothing yet 😢

@Chtnnh, as we'd discussed, there are some good heuristics for this, e.g. counting the number of templates in the first section and using that as a feature. We don't need an infobox feature at all in order to have a model that works. It'd just work better with that feature. We should be able to move forward.
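A rough sketch of that heuristic, assuming the "first section" means everything before the first `==`-style heading and that counting `{{` openers is a good enough proxy for the number of templates. The function names are hypothetical and this is not the feature code from the articlequality repository.

```python
import re

# A wikitext heading line such as "== Історія ==" or "===Sub===".
HEADING_RE = re.compile(r"^={2,}.+?={2,}[ \t]*$", re.MULTILINE)


def lead_section(wikitext: str) -> str:
    """Return the text before the first section heading (the whole text if none)."""
    match = HEADING_RE.search(wikitext)
    return wikitext if match is None else wikitext[:match.start()]


def count_lead_templates(wikitext: str) -> int:
    """Count template openers in the lead section.

    Nested templates are counted individually, and parser functions such as
    {{#if:...}} are counted too, so this is only an approximation.
    """
    return lead_section(wikitext).count("{{")
```

An article that opens with an infobox will usually score at least 1 here, which is why the count can stand in for an explicit infobox feature.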

Initial model is merged. https://github.com/wikimedia/articlequality/pull/140

We're not getting very good fitness, but this will be good for a test. We can get this in a deployment soon and @Chtnnh can help @Ata test the model and give feedback.

Hello @Ata! Glad to inform you that the model has been deployed to beta (https://ores-beta.wmflabs.org/v3/scores/ukwiki/)

The community can play around with the model and give us their feedback. We are working on deploying the model to production as soon as possible.

One example is the revision https://uk.wikipedia.org/w/index.php?oldid=29010411, which is of the featured article quality class and has the following prediction: https://ores-beta.wmflabs.org/v3/scores/ukwiki/29010411/articlequality

The same format applies to any revision you want to test.

(Please note that \u0412\u0421 corresponds to Featured article quality class and \u0414\u0421 corresponds to Good article quality class)
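For anyone who wants to script this, here is a small sketch. The URL pattern is taken from the example above. The nested response layout (`wiki` → `scores` → revision ID → model → `score`) is an assumption about the v3 API and should be checked against a real response. The `SAMPLE` payload uses made-up placeholder probabilities, not real model output, and exists only to show that a JSON parser turns the `\u` escapes back into Cyrillic.

```python
import json

ORES_BETA = "https://ores-beta.wmflabs.org/v3/scores"


def score_url(rev_id, wiki="ukwiki", model="articlequality", base=ORES_BETA):
    """Build the scoring URL for one revision, following the pattern above."""
    return f"{base}/{wiki}/{rev_id}/{model}"


def read_prediction(response_text, rev_id, wiki="ukwiki", model="articlequality"):
    """Pull (prediction, probabilities) out of a response body (assumed layout)."""
    data = json.loads(response_text)
    score = data[wiki]["scores"][str(rev_id)][model]["score"]
    return score["prediction"], score["probability"]


# Placeholder payload with invented numbers, for illustration only.
SAMPLE = (
    r'{"ukwiki": {"scores": {"29010411": {"articlequality": {"score": '
    r'{"prediction": "\u0412\u0421", '
    r'"probability": {"\u0412\u0421": 0.5, "\u0414\u0421": 0.5}}}}}}}'
)

prediction, probability = read_prediction(SAMPLE, 29010411)
print(score_url(29010411))
print(prediction, sorted(probability))
```

The response body can be fetched with any HTTP client; `json.loads` handles the escape decoding, so no manual conversion of `\u0412\u0421` is needed.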

Hoping to hear from you!

Ata added a comment. Jul 24 2020, 11:29 AM

All right! I'm taking this to the village pump for people to see.

Great! Feel free to share any feedback here or on IRC (chtnnh on Freenode).
Also, we have a date for the production deployment: sometime later next week.

Ata added a comment. Edited Jul 28 2020, 8:34 AM

Users Mr.Rosewater and BogdanShevchenko pointed to these articles:

and are asking whether it is ok for probabilities to go down like this.
User Rar also asked if it is possible to list levels in order, either descending or ascending. Since IV=stub, current order is a bit confusing.

User Rar also asked if it is possible to list levels in order, either descending or ascending. Since IV=stub, current order is a bit confusing.

Sorry, that is actually not possible, as JSON does not preserve order and lists them in alphabetical order.
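Whatever order the service returns, a consumer of the scores can reorder the classes itself. A minimal sketch, assuming the six class labels used by the extractor and the ascending order from the task description (IV → III → II → I → ДС → ВС); the function name is hypothetical.

```python
# Ascending quality order, following the grading scheme in the task description.
QUALITY_ORDER = ["IV", "III", "II", "I", "ДС", "ВС"]


def ordered_probabilities(probability, order=QUALITY_ORDER):
    """Return (label, probability) pairs sorted by quality level.

    Labels missing from `order` are kept and placed at the end.
    """
    rank = {label: index for index, label in enumerate(order)}
    return sorted(probability.items(), key=lambda item: rank.get(item[0], len(order)))
```

Passing `order=QUALITY_ORDER[::-1]` gives the descending variant.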

and are asking whether it is ok for probabilities to go down like this.

Let me have a look and ask @Halfak as I am not sure about this myself.

Ata added a comment. Sep 14 2020, 4:05 PM

Can I help with something at this stage? I have not received other feedback from users since my last comment.

I just checked a couple of those articles, and the rises and falls in predicted quality tend to correspond with additions and removals of content. E.g., it looks like Комптонівське розсіювання goes up and back down in quality around substantial content deletions.

We should be able to list the levels in order. Where are they listed out of order?