Page MenuHomePhabricator

Build article quality model for Dutch Wikipedia
Open, MediumPublic

Description

How do Wikipedians label articles by their quality level?

What levels are there and what processes do they follow when labeling articles for quality?

How do InfoBoxes work? Are they used like on English Wikipedia?

Are there "citation needed" templates? How do they work?

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

I just started working with @Psingh07's dataset.

$ cat nlwiki-20201101-E_and_D.json | wc -l
39492
$ cat nlwiki-20201101-E_and_D.json | json2tsv label | sort | uniq -c
    551 D
  38941 E

~40k labels, mostly "E" -- adding the stub template.

$ cat nlwiki-20201101-E_and_D.json | grep -P '"username": ".*[bB]ot.*' | json2tsv username | sort | uniq -c
   1237 BotMultichill
     38 BotteHarry
      1 Erwin85TBot
  36520 LymaBot
      4 Robotje
    306 RobotMichiel1972
    167 RomaineBot

We've got some very active bots in this dataset.

 34 2005
 11 2006
 24 2007
  3 2008
 36 2009
149 2010
253 2011
 39 2012
 45 2013
394 2014
 59 2015
 42 2016
 32 2017
 41 2018
 31 2019
 26 2020

There we some bursts of human editor templates in 2014.

$ cat nlwiki-20201101-E_and_D.json | grep -P '"username": ".*[bB]ot.*' | json2tsv timestamp wp10 | cut -d- -f1 | sort | uniq -c
   1241 2009
    377 2010
     65 2011
      1 2012
  36581 2014
      4 2015
      1 2016
      1 2017
      1 2019
      1 2020

Looks like the bots are mostly inactive since 2014.

I extended the query described in https://quarry.wmflabs.org/query/47900

$ cat nlwiki-20210116-A_B_and_C.json | json2tsv wp10 | sort | uniq -c 
    364 A
    105 B
    500 C

It looks like we'll need to get more clever to find more "B" class articles. No matter, we can probably at least start experimenting with this.

$ cat datasets/nlwiki-20201101.balanced_sample.json | json2tsv wp10 | sort | uniq -c
    100 A
    100 B
    100 C
    100 D
    100 E

Let's start with a balanced dataset.

You can see the data I generated here: https://raw.githubusercontent.com/wikimedia/nlwiki_articlequality/master/datasets/nlwiki-20201101.balanced_sample.json

Next steps, we'll want to share this sample with dutch Wikipedians to make sure it reflects their expectations and we'll also want to try building a model with this dataset.

Hi @Halfak,, @Psingh07 and others,

Thanks for these big steps forward. I looked at the generated dataset. The E-category is swamped with items from 'Beginnetje biologie'. I know, that is by far the biggest group amongst our beginnetjes. But almost al these articles are botimports and considered as beginnetje merely for being created by a bot. Would it be possible to get a new selection for the E-group whilst eliminating 'Beginnetje biologie'? That would make this group far more interesting to look at.

I also came across two articles in user space 2. This can be rescued articles from which the removal of the beginnetje-template is just done because it was moved out of the main space and not having to do anything with the quality of the article at the specific moment.

Instructions for the people who are going to look at these specific revisions, which might be far from the current state of the article.

Hi, I second that.

Referring to our notes from December 17th, please exclude bots from the following users from the dumps:

As discussed we plan to add username for analysis, to identify bots making the edits
Excluding stubs ('beginnetjes') in 5.json by username, there were ~3-4 bots creating articles.

  • Joopwikibot
  • RobotMichiel1972
  • Kvdrgeus

https://etherpad.wikimedia.org/p/dutch_articlequality

Thanks for your time and attention today

I'm looking at the new dataset (2021-1-21) (first line: {"timestamp": "2018-12-04T09:52:55Z", "rev_id": 52732712, "wp10": "D", "page_title": "Achthophora chabacana", "username": "Vitalfranz", "page_namespace": 0}

I see versions in the B-set that are redirects, like rev: 39297089, 17071877, I see pages outside the main space, like rev:50356068, user (talk) pages, like rev:48867116, 56505834, 58020315, 55627603
Length of the C-class articles (3-5kB) is fine to start with, lets get the system working. In the final model, the length should not be a major value to rate articles.
Links from previous versions of https://nl.wikipedia.org/w/index.php?title=Wikipedia:Etalage/Aanmelding_kandidaten/Aanmeldingen to Gebruiker... or User... should all be excluded
E-class still has to many biology-stubs. Would it be possible to exclude these completely {{Beginnetje|biologie , like in https://nl.wikipedia.org/wiki/Oberea_yaoshana

Would it be hard to exclude all redirects, all pages outside namespace 0 and all revisions (E-class and if possible D-class (removal of)) with {{Beginnetje|biologie for a new set?

Excluding redirects can be reached by requiring a minimal length of (say) 250 or 1000B. For A and B, 1000 should be no problem, for E this might be too long.

I've updated https://quarry.wmflabs.org/query/47900 to exclude redirects and limit results to main namespace pages.

I've updated our bot filter to filter *out* bots rather than limiting the dataset to bot activities. Woops.

@Psingh07 and I collaborated on a script that gathers failed "A" class nominations. We were able to gather an additional 257 examples. So by adding that to our dataset, I can boost the size 3.5x.

The results of all of this work can be seen in the repo along with a new balanced dataset: https://github.com/wikimedia/nlwiki_articlequality/blob/master/datasets/nlwiki-20201101.balanced_sample.json

I just created an initial model with basic features. See https://github.com/wikimedia/articlequality/pull/162

It looks like we're getting very decent fitness. I think we can do a lot more with targeting features of articles. E.g. does nlwiki use infoboxes? Are there cleanup templates we can look for that would suggest quality issues? Anyway, we can leave that for follow-up work.

We'll still need to get this deployed before it can be used. I am talking to @calbon next week so that might let us put a date on when this will be available for experimentation.

It looks like we're getting very decent fitness. I think we can do a lot more with targeting features of articles. E.g. does nlwiki use infoboxes? Are there cleanup templates we can look for that would suggest quality issues? Anyway, we can leave that for follow-up work.

We're not fans of adding a lot of cleanup templates.
Most common are Wikify (to mark an entiry article needs clean up, but no need to remove it), Twijfel (there is doubt about the article) and Bron? (citation needed), and Wanneer? (when did the claim happen?)
And yes, we do use infoboxes see https://nl.wikipedia.org/wiki/Categorie:Wikipedia:Sjablonen_infobox and subcategories.

Thank you, @Mbch331! Very helpful.

@Psingh07, I think this is a good next step for your work. See https://github.com/wikimedia/articlequality/pull/162/files#diff-ec69525b50983e5104647bfb8bec473b4c8c236d5d167f93a164d4150e1955bf This is the initial feature set. It's very minimal. In comparison, check out the one for English Wikipedia: https://github.com/wikimedia/articlequality/blob/master/articlequality/feature_lists/enwiki.py It will give you an idea of how we can use the templates that Mbch331 references above.

Our pattern will essentially be this:

  1. Work with folks in this thread to learn good ways to track quality-related stuff in articles
  2. Implement as features
  3. Run make nlwiki_models and make nlwiki_tuning_reports as necessary to check for fitness changes.
  4. Submit a new PR with better models. (Goto 1)

The mantel of this task has shifted to new volunteer Paritosh. @Psingh07

Hey folks!
@Halfak and the whole team is very excited and delighted that we were able to get a deployment done for the articlequality model. If you wish to see ORES scored on pages, copy this to your own common.js page following this link. Let us know how it works and if it aligns with your expectations and also if you were happy with the performance or not. We're available to make improvements to the model and to optimize it to your expectations and improve on any shortcomings, feel free to link us to the problem anyone faces and we will do our best to help.

Thank you so much @Psingh07 , this looks great already!

I can see the rating under the article title, and the colours in the page history. Nice!
Would it be possible to display the colours that are shown in the page history in the rating on the article page as well? (similar as on en:)

@Halfak This means we can start testing soon right? Could we maybe exchange ideas on how to set this up on-wiki, so I/we can start preparing for this?

Sure! We can even use local templates. Would you be interested in creating templates with badges/colors you like for the prediction that appears on the top of the page?

Also, it seems like we should translate: "ORES predicted quality". What might that look like in dutch?

I am not great at creating templates. Do you maybe have a link to the code on en-wp?

I think we would translate the header to "Kwaliteitsinschatting door ORES"

Thank you, @Psingh07, it does look great indeed!

I agree with @Ciell on the label name: "Kwaliteitsinschatting door ORES".
I started a template some time ago, at https://nl.wikipedia.org/wiki/Sjabloon:ORES_schaal, but that does not seem to do the job. What extra info, link, .... do we need? It seems to me the template should be called from somewhere and that does not seem to be the case. I guess I was inspired by a template on en.wikipedia, but I do not seem to have stored the source (and can't find it back either). With some helpful hints, I will see if I can fix it.

I updated the language for the tool so that it should show the dutch version now.

Re. templates, you can see what English Wikipedia has here: https://en.wikipedia.org/w/index.php?title=User:EpochFail/Sandbox&oldid=1023709064

You might consider re-using the icons from enwiki in nlwiki. I made a start of that here: https://nl.wikipedia.org/wiki/Gebruiker:EpochFail/Kladblok

I also added an example usage of {{ORES schaal}} to that page so you can see how it works. If you have the ArticleQuality.js user script enabled, it should add a predicted quality level to the link.

Thank you very much for these examples!
I have had the code for the templates restored (thanks, @Ciell). The template works, I do have to fix some links (not just pointing to 'Categorie:A-klasse artikelen' for classes A through E), but to the etalage-artikelen (FA-articles) and Beginnetjes (stubs) I guess I should point this from the Class template.
Do I have to call the Class template somewhere to have the outcome shown together with the ORES rating, or should @Psingh07 do this in his part of the code?

I've added the class template to the user script code. It looks like the output of the script (on the top of the page) looks right now.

Thanks, @Halfak , that did it!
I will redirect the categories later this (long) weekend, or think of another solution (maybe a descriptive page of the rating). As far as I see in the English Wikipedia, the articles are placed in categories based on a manual edit in the talk page using the vital article template. I think the Dutch Wikipedia is not ready (yet?) for a massive manual rating of articles.
It would be great, but I estimate also a heavy action, to rate all articles on a regular basis using ORES. But lets first see what we can do with the current information and run some user testings.

Thank you for the colours Ronnie, I like them!

I would not be a fan of giving the articles a place in the normal category-tree yet, but yesterday I created Categorie:Wikipedia ORES.

We think we could test with (hidden) subcategories in this category, but I would prefer to wait for the results of the first labelling tests to refine and see community response. (should be about 3-4 weeks, max)

I created the translation of the way we compiled the scale: https://nl.wikipedia.org/wiki/Wikipedia:ORES/Kwaliteitsschaal_voor_artikelen

And updated the test-page: https://nl.wikipedia.org/wiki/Wikipedia:Labels/Kwaliteit_van_artikelen. People are invited to add the code to their common.js an share the mistakes in the wild. I shared that we expect to start testing in June.

Would it be possible to exclude certain pages, like for instance our main page? https://nl.wikipedia.org/wiki/Hoofdpagina

Hi @Ciell , I did not mean to create 'normal' categories (although Ecritures created categories like https://nl.wikipedia.org/wiki/Categorie:A-Class_articles for A-C), but pages like the explainatory https://nl.wikipedia.org/wiki/Wikipedia:ORES/Kwaliteitsschaal_voor_artikelen for all five levels. Maybe using some of the suggestions Halfak made at https://nl.wikipedia.org/wiki/Gebruiker:EpochFail/Kladblok (https://nl.wikipedia.org/wiki/Wikipedia:Ruwe_diamanten for B). Having five red links does not feel very satisfying.

For the main page, it consists of comment, a template calls, a category and a second template call. Maybe pages without own content can be excluded?

With this edit A links to Wikipedia:Etalage, B to Wikipedia:Ruwe diamanten, E to Wikipedia:Beginnetje and C and D to Wikipedia:ORES/Kwaliteitsschaal voor artikelen. We can easily change it to other pages, if more relevant pages exist.

Yeah! We can exclude the Main page. I wonder if there is a good way to identify if we're loading the script on the main page in a wiki/language independent way. In the meantime, I'll look into making a special case for nlwiki.

One other thought, do you want it to load on userspace pages? E.g. User:<name>/Sandbox or the equivalent. Anywhere else on the wiki that people draft articles other than mainspace?

Main page seems to be protected on Wikipedia's in surrounding languages (is it in all?) Is protection a reason not to classify the page? I think this is too tight. Permanent protection could be better. [:da:]nnish and [🇩🇪]eutsch|german have hardly any content in the code, just like the Dutch. The {;fr:]rench, [:en:]glish and [:pt:]|Portuguese do have some content in the code of the page. Just going for 'no content' does not seem to be enough, but excluding pages without content might help to save similar pages from being rated. Is it easy (and not too heavy) to head for Wikidata:Q5296? Is it too limited to our ecosystem?

On the Dutch Wikipedia, we don't have a draft namespace (yet), so people can start writing anywhere in their own namespace User:<name>/Kladblok is the Dutch personal sandbox, but User:<name>/Art_Gensler could be a draft too. I do have some pages in User:RonnieV/... which are not intended to be published (nor classified), but it could help people to have their content above D-level before publishing. User:<name> should not be classified, neither anything in Talk_User:-namespace.

Hi there,

Thanks very much for the efforts this far.

As we plan to have people test the outcomes, the question rises when articles get which qualification. See [https://nl.wikipedia.org/wiki/Overleg_Wikipedia:Labels/Kwaliteit_van_artikelen#D_en_C] for some discussion.
https://nl.wikipedia.org/wiki/Gebruiker:EpochFail/ArticleQuality.js seems to make a difference on the integer values 1-5, but I might be misinterpreting this. It does not seem to turn a decimal number into an integer, so this might be just the five levels we agreed upon and not actually be the boundries between the different qualification.
https://meta.wikimedia.org/w/index.php?title=User:EpochFail/ArticleQuality-system.js seems to contain more logistics. I see (from line 259) a calculation of the weightedSum, by summing probas and their respective weights. The weights are given to the function as part of the options and the probas as part of the given score. I guess the options are language specific (could differ between the Dutch, the English and the Portuguese Wikipedia, but not between different pages of a specific language), and the probas are calculated somewhere, based on the content of the page.
Which parts do contribute to a score? Can we find somewhere how the different probas of a specific page are calculated? (E.g. counting the number of references or the number of outgoing links) What are the weights of different options?
Can this information be published somewhere?

I said above that weights would probably be just language specific. I can imagine through that a namespace can make a difference as well. On the Dutch Wikipedia we are not fond of having articles out of the main namespace show up in regular categories. We also do not like mainspace articles pointing towards drafts and other texts outside the main space. I can imagine that articles outside the main space, if we would rate them and these two options do earn points, they might get some justification to compensate for not being able to score on these two options, either by ignoring this scoring option in the calculated weight and in the outcome or by pretending at least one category would be provided.

I realise that black box testing is a valid way of testing (does article A deserve a higher expectation than article B) and is very important in this phase. Code aware testing does detract attention from this part and bring it more to testing whether the code does actually do an accurate counting of (e.g.) the categories, or does a right summation of the different weights and probas. I am pretty sure these latter tests are already performed in the other (14?) Wikipedias using ORES predictions, so should not get to much attention.

Clarification would be highly appreciated.

Sorry for the late response. The holiday weekend in the US (memorial day) had me out of my usual flow.

For a discussion of the numerical values associated with scores, see https://dl.acm.org/doi/pdf/10.1145/3125433.3125475 page 5. Essentially, this measure gives you a finer estimate of the center of the probability distribution across classes. Usually, you should see that the predicted class (the most likely class) aligns well with the "weighted sum" numerical value. This value can help you distinguish between a low C and a high C article.

For the features that go into a score, the best way to view that is to add ?features to a request to ORES. E.g., https://ores.wikimedia.org/v3/scores/nlwiki/123125/articlequality?features This will list out the raw values from which the model learns and makes predictions. As far as weights, that's more difficult to inspect. The underlying model (a GradientBoosting classifier) builds a sort of decision tree like structure that doesn't explicitly assign a weight (or coefficient) to each feature. However, we can inspect the "feature importance" which will give you a sense for how much each feature ultimately affects predictions.

I can imagine that articles outside the main space, if we would rate them and these two options do earn points, they might get some justification to compensate for not being able to score on these two options, either by ignoring this scoring option in the calculated weight and in the outcome or by pretending at least one category would be provided.

I'm not sure I'm following this. Are you suggesting that you'd like different weights for the "weight sum" calculation outside of main space? Or maybe you want predictions to operate differently outside of main space?

For the formal evaluation of the model, you can add ?model_info to the URL for a score to see the fitness statistics. e.g. https://ores.wikimedia.org/v3/scores/nlwiki/123125/articlequality?model_info This is a bit of a strange request because it is also generating a score. If you just want the fitness statistics and other info, you can do something like this: https://ores.wikimedia.org/v3/scores/nlwiki/?models=articlequality&model_info

Re. black box vs. code aware testing, I agree that, for folks who want to dig in, talking about the code can help. But I think it is far more informative to talk about what data we're training on and how that was generated than looking at the internals of the model itself. E.g., C class is strictly defined as an article with a certain number of chars. That's weird. It may work in practice, but we should look out for articles that exist in the area between D and B for issues. That said, the feature importances analysis I talked about earlier can be quite informative -- e.g., when a feature gets an unexpectedly high or low weight, we can assume something strange is happening. I'll get that together soon and post it here.

Halfak, thanks for you elaborated answer. I will dive into the log after a good night of sleep.

A quick reaction for now.
123125 in the second and third link seems to stand for a revision id (the first version of the article about the movie Gia).

When I read the features (and understand them well), an article does get a higher rating when it has categories (feature.revision.category_links). Articles out of the main space (for instance drafts) should (local policy) not be in a normal category. The article will get a better score as soon as categories are added after moving it to the main space. At least something to be aware of.
At the Dutch Wikipedia, there is no bonus in getting an article in as many categories as possible. I would even say: more than five, six categories seems to be a bummer, not a bonus. It would be good to see this somewhere in process.

Here's the importance table. The higher the importance score, the more important the value is to the prediction. It turns out that the count of category links is the least important feature of the set. Overall length of the article, the amount of content with references, and the proportion of content that is referenced are the dominant features.

One thing you'll note here is that many features are expressed as something like feature_1 / feature_2. But you don't see that in the output of ORES when providing ?features as an argument. This is by design. We simplify the features to the base values when reporting them in ORES. But the model will use weighting, scaling, and division to try to gather more "signal" from the base feature values.

FeatureImportance
feature.revision.category_links0.005365923298399904
feature.(wikitext.revision.external_links / max(wikitext.revision.content_chars, 1))0.006630187213437086
feature.(revision.image_links / max(wikitext.revision.content_chars, 1))0.007117261348315173
feature.(wikitext.revision.headings_by_level(3) / max(wikitext.revision.content_chars, 1))0.007159772210726818
feature.wikitext.revision.external_links0.009000087563517686
feature.(len(<datasource.dutch.dictionary.revision.dict_words>) / max(len(<datasource.wikitext.revision.words>), 1))0.009558784392400578
feature.nlwiki.revision.cn_templates0.011404022152760856
feature.(wikitext.revision.wikilinks / max(wikitext.revision.content_chars, 1))0.015171176477007836
feature.revision.image_links0.019675883178346495
feature.(dutch.stemmed.revision.stems_length / max(wikitext.revision.content_chars, 1))0.02008636288117727
feature.(wikitext.revision.headings_by_level(2) / max(wikitext.revision.content_chars, 1))0.022257976284638322
feature.wikitext.revision.headings_by_level(2)0.02632808748125348
feature.wikitext.revision.headings_by_level(3)0.028044456500147065
feature.(revision.category_links / max(wikitext.revision.content_chars, 1))0.028815256064704614
feature.(wikitext.revision.ref_tags / max(wikitext.revision.content_chars, 1))0.029551519878959952
feature.wikitext.revision.wikilinks0.0476863450079613
feature.wikitext.revision.ref_tags0.052012990748393446
feature.(enwiki.revision.paragraphs_without_refs_total_length / max(wikitext.revision.content_chars, 1))0.05597827983284357
feature.(nlwiki.revision.cn_templates / max(wikitext.revision.content_chars, 1))0.07324425132632534
feature.wikitext.revision.content_chars0.09022370577873072
feature.len(<datasource.dutch.dictionary.revision.dict_words>)0.09814909788518943
feature.dutch.stemmed.revision.stems_length0.10363425462142278
feature.enwiki.revision.paragraphs_without_refs_total_length0.10625534037086251
feature.wikitext.revision.chars0.12664897750247775

Thanks for the great meeting we had today!

Could the output of https://ores.wikimedia.org/v3/scores/nlwiki/123125/articlequality be enriched with a link to the actual examined version [https://nl.wikipedia.org/w/index.php?oldid=123125] and have the exact score besides just the qualification?

The output of https://ores.wikimedia.org/v3/scores/nlwiki/123125/articlequality is pure JSON and links are not possible in this data format.

and have the exact score besides just the qualification?

I'm not sure what you mean here. Would you like to see the raw probability distribution rendered through JS on that version of the article?

Hi @Halfak,

Would it then be possible to show the link, like https://nl.wikipedia.org/w/index.php?oldid=123125 ? That would make it easier for firsttimers to get there. Copy & paste is common usage.
And it would be great if the JSON-output would be something like
+ score
+ + prediction "D"
+ + rating 2.7
(A more meaningful word than rating would be fine).

I see. You're asking to include the "weighted sum" measure in the JSON output?

Yes, that is the right name of the value.

Hey!
@Ciell I have added the samples of all the classes on @Halfak Sandbox.
As discussed it would be great if you could move it to a suitable page in order to get reviews from members of the community on the predictions. Members could simply add their notes on particular articles where "Laat hier je notities achter" is mentioned.

Thank @Psingh07 and @Halfak!
I shared it with the community tonight.

I am just looking at the table @Halfak gave on June 3. One of the parameters say 'enwiki' in stead of 'nlwiki' (feature.enwiki.revision.paragraphs_without_refs_total_length). Could this be a reason for strange results? It is used twice in the calculation.

Notes from 2021-07-15:

  • To Do: How to do non destructive git revert in revscoring
  • Use revscoring 10.0 and rebuild all models. Use deltas unofficially to rebuild all models and talk about the stats.

We are waiting for this bug to be fixed, before we can move forward again.

We're unblocked with new work. We have new code ready for modeling/testing that improved unsourced content detection.

We also have done some of the background work to improve the detection of articles that contain an over-abundance of bullet points. I'll be getting that implemented and tested as part of the next cycle of work.

Interesting suggestion in our Wikipedia:De Kroeg to turn around A-E classes and make A the minimum starters class, and the E level the highest.
If our standards would go up over time, we could just move on to an extra F-class that would supersede the now highest E-class, instead of creating less obvious 'A+', 'A++' and 'A+++' classes.

I already commented in WP:DK. I think it is more likely to get classes between the current classes than above (or below) the current classes. Also, as all classifications of all revisions are calculated at the moment they are shown using the model of that moment, there will be no (stored) changes to articles needed to change a current A classification to a B classification (or whatever). When we change the conditions an article has to comply to to get a certain classification, that will just be fine.

American school classifications have an A as excellent and a E as definitively too poor, Dutch swimming certifications start with an A, so for both directions there are arguments.

It should be OK to change the meaning of the current classes over time too. One nice thing about using an ML model to supplement quality assessment is that it is easy to propagate changes like that. E.g. if we adjust the definition of a quality classes, we just need to review our training data (50-75 articles per quality class) to fix the labels and retrain.

In a related wiki (English Wikipedia), the definitions of classes have changed quite dramatically over time. For example, back in 2006, you could have a B class article with no citations! Now, you can't even have a stub without citations. So when we build models, we use more recent data to learn from and we apply today's criteria historically to examine the development of the encyclopedia. For the most part it makes sense because we're always looking at quality through today's eyes. The scale and our quality model should as well.

I should say, this pattern of retraining also works for between classes too.

Sorry. One final thought. We could make the quality classes non-ordinal. E.g. call the lowest class Beginnetje and the highest class Etalage, and develop common sense names for the classes in between. That way, order may be plainly apparent and in between classes would require a common sense name as well--rather than something like "B-" or "C+".

The interesting thing about that idea is that the lowest class and highest class likely 1) the easiest for a machine to predict and 2) least likely to change over time. So the top and bottom classes become firm anchors, between which is a slightly more harder to predict and ambiguous in-between classes.

I like the idea of class names above class labels. Some suggestions:
Etalage
Zeer goed
Goed
Redelijk
Beginnetje

ORES-discussion is taking off in the Village Pump/De Kroeg.

Interesting observation: ORES score drops when wiki-markup is added to an article and further plummets to a lower class when fields are added to the infobox, only to restore to the original score when two sentences of text are added.
The fields added to the infobox in the last edit linked were left empty, but should these edits decrease the score in such a way?

I like the idea of class names above class labels. Some suggestions:
Etalage
Zeer goed
Goed
Redelijk
Beginnetje

I think I would actually prefer not to use the same names as we are used to when doing manual scoring of an article. People could get confused "Is this an ORES assessment, or a human one?"
With just A-E (or E-A) it is clear that this is an AI assessment, and the Etalage or a Beginnetje is unambiguously a human assessment.

The interesting thing about that idea is that the lowest class and highest class likely 1) the easiest for a machine to predict and 2) least likely to change over time. So the top and bottom classes become firm anchors, between which is a slightly more harder to predict and ambiguous in-between classes.

Actually, over the years the top classes have changed already. Back in 2005 the Dutch article on Prince was considered FA ("Etalage"/A-class) on the Dutch Wikipedia, but this status was removed again in 2016 because we concluded that references and sources actually should weigh heavier than we valued them before. The article from 2016 had 5 general sources at the end of the article, and only a few inline citations of medium quality sources: this was not considered enough anymore.

Requirements for a specific class can change over time, but there will always be a top class (and a bottom class). The requirements for these classes are usually the easiest to formulate and the easiest to find out whether an article (revision) meets these. That a top class article according to the requirements of 2005 does not fit the requirements of 2016 (nor 2021) is not a problem. It will move to the next lower class or maybe even drop two classes. The distinctions between a top-C and a bottom-B article can be much less clear.

The ORES-code has been turned into a script to make the function more accessible for users to try.
(instead of adding the code to their personal common.js manually, they can now just tick the box for the quality scale in their settings.)

I added a wikitext.revision.list_items feature to revscoring for tracking articles that are in outline form (as opposed to prose). See https://github.com/wikimedia/revscoring/pull/506

Once this is merged, I'll use this and other improvements to re-generate the models. Then we can use those models to consider a new labeling campaign based on the new quality criteria.

Still waiting on a review/merge. In the meantime, @Psingh07 is working on gathering new labeled data from the reviewing work folks did on the wiki pages.

I was able to gather 64 new labels from the wiki. Most of them were E class, but we did get some B, C and D -- which are hard to differentiate.

new_manual_labels.png (557×889 px, 13 KB)

I think we'll need some more examples of C-class articles in order to train a new classifier that more closely matches the newly developed quality scale. But we can probably work with the other observations we have for now.

I think a good next step is to (1) train a model including this new data while we load up a labeling campaign in parallel. We can probably focus that labeling work on B, C, and D class.

This sounds like a good plan to me Aaron!

^ New version of the model using updated features and manually extracted labels.

Next step is to extract 75 probable C class articles. Load that into wikilabels. And require 3 labels.

Adds the nlwiki article quality scale form to Wikilabels: https://github.com/wikimedia/wikilabels-wmflabs-deploy/pull/53

Generates a sample of 100 articles from a recent wiki dump: https://github.com/wikimedia/articlequality/pull/169

I settled on sampling 25 probably D class, 50 probable C class, and 25 probable B class. I think this will be important to gather examples that fall out of the super-easy to learn char window we specified for C class previously. It will likely result in a bit more work, but that will also help us get a sharper prediction for D and B class too.

Once these are reviewed (only the wikilabels one is a real blocker), I can get the data loaded and we can start labeling.

I'm running into some issues with the wikilabels updates. Looks like some of our deployment code has gotten old and crusty (versions have changed and backwards compatibility dropped). So I'm working on that.

I was able to get the campaign loaded! See https://labels.wmflabs.org/ui/nlwiki/

This will require 3 labels by different people for each article. We'll be able to examine the disagreements after all of the labeling work is completed.

Send out the communication about the labeling campaign to the Dutch community just now.

four articles out of the 300 still need to be labeled, but both sets that are still in Wikipedia:Labels give an "$2" error (see screenshot).

Screenshot 2021-10-10 at 20-48-51 Labeling gadget.png (895×1 px, 75 KB)

Those last 4 must be checked out to someone in a workset. I think they were returned in the meantime because I was just able to check them out in a workset. I skiped them all so they should be available again.

We completed the labeling campaign and I produced a report of articles where the labelers disagreed here: https://nl.wikipedia.org/wiki/Gebruiker:EpochFail/Kladblok

It looks like there hasn't been any progress on that. @Ciell, has there been any progress on making sense of the citation criteria for D, and C class?

I will pick this up again next week, sorry for being absent in this!
Indeed there wasn't a lot of response last time, but I hope my presentation at the WikiConNL this weekend on ML at Wikipedia will give us new inspiration and talking points.

Great! I think once we settle this, the next steps will be obvious and (hopefully) will require less investment from Dutch Wikipedians.

There has not been much discussion, but everybody seems to agree on loosening the requirements for sources in the articles, which to the ones that answered to my ping was indeed the reason to give so many articles a lower score than ORES initially predicted in the former model.

Response to the outcome of the labeling campaign: https://nl.wikipedia.org/wiki/Overleg_gebruiker:EpochFail/Kladblok#Reacties
Quality scale: https://nl.wikipedia.org/wiki/Wikipedia:ORES/Kwaliteitsschaal_voor_artikelen

Do you think we could apply the new criteria to list of articles I have in my Sandbox? https://nl.wikipedia.org/wiki/Gebruiker:EpochFail/Kladblok

Ahh. That was more of an ask to Dutch Wikipedians to help choose what label those articles should ultimately have.

I could, however just assume that the max label is the right one if you think that makes sense.

No: they say something they labelled an E now, would for instance become a C becomes without the strict sourcing requirement the article would classify for the C. So this does really make a huge difference.

I don't think I can get the 11 participants to agree to 100 new labels in a discussion, but asking them to join in a new label campaign might be off putting as well.

If folks aren't interested in doing more labeling, it sounds like the best approach would be to just take the max label then from the set and see how well we can do with that.

E.g. Battle Mountain (52460837) has the labels: D, D, E.

If I just took the max label, I'd train ORES to recognize this article as a "D" class article.

Similarly Ulster Grand Prix 1960 (52652958) has labels: A, C, C.

If I take the max label, I'd train ORES to recognize this as an "A" class article.

final label column is now filled by one user, others are checking and leaving comments. I expect we can take this forward next Thursday.

Fantastic! I'll work to get something together before Thursday so we might be able to review then.

See discussion here about a new iteration of the model. https://nl.wikipedia.org/w/index.php?title=Overleg_gebruiker:EpochFail/Kladblok&oldid=60538637#Hodge_podge_of_data_and_building_a_new_ORES_model

Gist is that we're working with new data. It decreased the number of observations, but the observations are much more aligned with what people *mean* when they say "C-class". Thanks to @ACraze, the change is merged. That means I can start preparing for a deployment. That will involve a number of small PRs in order to bring all of the repos up to the new version of revscoring. (2.11.x)

I'll start a new task and link it here -- probably on Sunday some time PST.

I have 3 pull requests open that add version compatibility with revscoring 2.11 in prep for a deployment patchset.

Articlequality was already updated along with merging the new nlwiki model.

Once these are merged, I'll submit a patchset for the deployment configuration that will pull down the updated model. Then that will be ready to get deployed.

Woops! Almost forgot that I'd need to update the packages for the deployment as well. See also https://gerrit.wikimedia.org/r/c/research/ores/wheels/+/748390

@ACraze, thanks for your review of the model repo updates. Can you also look at the patchset linked above in T223782#7579930?

Here it is again for convenience: https://gerrit.wikimedia.org/r/c/research/ores/wheels/+/748390

Thanks @ACraze! It looks like I no longer have permission to manually mirror changes into the gerrit model repos. See https://wikitech.wikimedia.org/wiki/ORES/Deployment#Updating_model_repositories

Can you either restore my rights or perform these operations for me? It looks like the git lfs push gerrit master operation won't do everything we need anymore because basic mirroring is also not working anymore. So you'll need to be able to run git push gerrit master (note I dropped the "lfs") too in order to get the changes. This needs to be done for all of the model repos:

  • editquality
  • articlequality
  • draftquality
  • drafttopic

FYI, here is the config change. https://gerrit.wikimedia.org/r/c/mediawiki/services/ores/deploy/+/755731 It is still a (WIP) while we wait on the model repo code to be manually mirrored.

Hi all,
@Ciell told me yesterday about the new model that is implemented. She suggested to implement links like https://ores.wikimedia.org/v3/scores/nlwiki/60941895 which could be helpful to get even more support for ORES. I had a look at magic words (https://en.wikipedia.org/wiki/Help:Magic_words), but might have overlooked REVISIONID or REVISIONNUMBER. Is anyone aware of a magic word that would give that number? If that does not exist, would it be hard to get that implemented?
Is there Lua-alternative to get the revision id of the currently viewed page (the page we want to show the most likely scale on)?

Thanks in advance

Hi,

Could the homepage, redirects and disambiguation pages please be excluded from the use of the quality scale on nlwp?
It is causing confusion for the Dutch users (1, 2)

  • Redirects use the magic word #Redirect
  • Disambiguation pages use {{tl|dp}} and {{tl|dpintro}} and are in [[Categorie:Wikipedia:Doorverwijspagina]]
  • Homepage is under [[Hoofdpagina]] ([[Home]] is used as a redirect here)

@Psingh07: Per emails from Sep18 and Oct20 and https://www.mediawiki.org/wiki/Bug_management/Assignee_cleanup , I am resetting the assignee of this task because there has not been progress lately (please correct me if I am wrong!). Resetting the assignee avoids the impression that somebody is already working on this task. It also allows others to potentially work towards fixing this task. Please claim this task again when you plan to work on it (via Add Action...Assign / Claim in the dropdown menu) - it would be welcome. Thanks for your understanding!