
Build article quality model for Dutch Wikipedia
Open, Medium, Public


How do Wikipedians label articles by their quality level?

What levels are there and what processes do they follow when labeling articles for quality?

How do InfoBoxes work? Are they used like on English Wikipedia?

Are there "citation needed" templates? How do they work?

Event Timeline


I bet there is a big gap between 2 and 3, and most articles will be in there. But if ORES could help identify articles which belong in one of these four categories, I'd be happy if the remainder is in that gap. ORES could then, later on, help categorise the articles from the in-between group and might identify candidates for the four categories.

Ciell and I will work on a list of articles in category 3.

Hi Ronnie,

Great that we'll pick this up together.
You apparently spoke with Aaron about the how and what for the 4th
category: shall we plan a moment together to make a start on this?

Kind regards,

On Wed 27 May 2020 at 22:31, RonnieV <> wrote:


Hi all,

We propose to have 5 quality levels.
I created a Wikipedia page on the Dutch Wikipedia, so Dutch Wikipedians can follow and comment.

Thanks for the notes! I've added this to our sync meeting today.

Halfak raised the priority of this task from Lowest to Medium. Jun 23 2020, 4:27 PM

Hi Aaron and team,

Very curious about how things are developing with the quality model. Are there any questions, do you need more or new input?
I'd love an update on how things are going.


Hey Ciell! Right now with Aaron leaving and other things we are down to a very very small team. In a few weeks we will have more people to start to look at this.

I just looked at this with Ciell. In addition to looking for articles in the "Rough Diamonds" page, we can also look for articles that appear in a level 3 header on this page: but do not appear in the category of featured articles here:

All new articles are being checked on the Dutch Wikipedia. If they have never been templated they are D or higher.

Maybe we can use a heuristic of everything between 3-5k bytes, that might help us narrow down to C.

Here's the process I propose:

  1. A-level articles: Get all the A-level articles from the category
  2. B-level articles
    1. Gather the articles from
    2. Also, use the A-level articles to find likely B-level articles by scanning the history of and excluding anything known to be A.
  3. C-level article: Randomly sample articles with between 3 and 5k bytes
  4. D & E-level articles: Do a scan looking for the introduction of {{Beginnetje}} and removal of the template
    1. Tag versions that introduce the template with E-level
    2. Tag versions that remove the template with D-level
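Steps 4.1-4.2 above could be sketched roughly as follows (a minimal sketch in Python; the function name, the regex, and the fabricated history are all illustrative, not the actual script):

```python
import re

# Matches {{Beginnetje}} and parameterized forms like {{Beginnetje|biologie}}.
BEGINNETJE_RE = re.compile(r"\{\{\s*[Bb]eginnetje\s*(?:\||\}\})")

def label_template_transitions(revisions):
    """Given (rev_id, wikitext) pairs in chronological order, yield
    (rev_id, label): "E" when {{Beginnetje}} is introduced, "D" when
    it is removed."""
    previously_present = False
    for rev_id, text in revisions:
        present = bool(BEGINNETJE_RE.search(text or ""))
        if present and not previously_present:
            yield rev_id, "E"   # template introduced: stub level
        elif previously_present and not present:
            yield rev_id, "D"   # template removed: just above stub
        previously_present = present

# Example over a fabricated three-revision history:
history = [
    (101, "Kort artikel. {{Beginnetje|biologie}}"),
    (102, "Iets langer artikel. {{Beginnetje|biologie}}"),
    (103, "Uitgebreid artikel met bronnen."),
]
print(list(label_template_transitions(history)))  # → [(101, 'E'), (103, 'D')]
```

In a real run the revision texts would come from iterating an XML dump; only the transition edits get labels, not every revision that happens to carry the template.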

This query gets all of the articles in the A-level category:


If I understand correctly, we need the 'articlequality', 'articletopic', and 'wp10' to be switched on, so we can play around and try things with different templates and the .js.
Do we have to wait for the first modelling-work to be complete in order to make things work?

Hey! Yep, the modeling work needs to be completed first.

I created this gist to show how to extract the stub class from XML dumps for nlwiki:

So I have been trying to run the script @Halfak has linked in his previous comment on ores-misc-1. The output is slightly confusing.

Since the script @Halfak wrote was just to give us a rough idea, there was some debugging I had to do. But after debugging and testing the script about half a dozen times, it produced no output when run on the XML dump. That can mean one of two things: either there is something wrong with the script, or there are no class E and D samples in the dump. For everyone's reference, I am attaching the script here:

Just giving a status update. I will keep trying to fix whatever it is that is causing the problem. Any help is appreciated.

I have been looking at this script. It needs two more lines at the end:

if __name__ == "__main__":
    main()

Without these lines, it does not do anything. But after adding these lines (and installing docopt and mwxml), I would expect it to run.

Running it with py nlwiki-20201101-pages-meta-history1.xml-p1p134538.bz2
just gives a usage message. It looks like the call to docopt.docopt(__doc__) throws an error. I couldn't figure out why (yet).

Switching to a smaller file (lawiki) gives the same problem:
py lawiki-20201101-pages-meta-history.xml.bz2

Can @Halfak clarify which dump he intended to use? Looking at [] I see a lot of different dumps. As the script seems to look for multiple versions of a page in the same file, I intend to look at nlwiki-20201101-pages-meta-history1.xml-p1p134538.7z or nlwiki-20201101-pages-meta-history1.xml-p1p134538.bz2. Should we run the script for the six files, either giving all six of them as parameters or one after another? Should we unzip the file (1.3 GB in .7z, 6.2 GB in .bz2) to the unpacked version of 150 GB?

Wait, I dropped docopt and now it seems to run, with some changes. Some attribute errors are thrown, so I see it is working, but I have no clue how long it will take, nor whether all revisions are refused or just some. I'll upload the changed script.

My version, to run the first file, is at
Unfortunately, it will only give a result after processing the whole file, 150 GB of (unzipped) data.
The attribute errors seem to be thrown for hidden versions. That seems like a good reason to exclude such pages from the zipped version.

Running the script (seems to) take(s) a lot of time. I ran it against the nlwiki-20201101-pages-meta-history2.xml-p134539p484052 file for page numbers up to 136000, and got a result in JSON. I added it to the gist mentioned in the previous message, so we can see what result can be obtained.
@Chtnnh, will this be enough for you to run it, see what the outcome is and do something with that?

@RonnieV we are importing the main() function in an executable file called utility and running that using
./utility /mnt/data/xmldatadumps/public/nlwikimedia/latest/nlwikimedia-latest-pages-meta-history.xml.bz2 --processes=8 --output=../output.json
so the __name__ == "__main__" is not an issue.

@Halfak has already specified what dump to run the script on, which is mentioned in the command above.

As far as the attribute errors are concerned, I have added a try/except block to handle them as well, so there's no issue there. There is still no output from the script, so something is still missing.

You can check out the latest version of the file from

Any help is appreciated!

@Chtnnh It's fine that you are running it from a utility script. The __name__ == "__main__" won't be an issue then, but it won't hurt either, and it makes it possible to run the script without using utility.
Your source file seems to be somewhere on a central computer. That's fine; I'm running the script on my home computer. In the public dumps there are six files for groups of pages, with a combined (bz2-packed) size of approx. 36 GB. Can you confirm that size for your /mnt/data/xmldatadumps/public/nlwikimedia/latest/nlwikimedia-latest-pages-meta-history.xml.bz2 file, to make sure we are using the same source? pages-meta-history.xml sounds good to me.

The script immediately creates an output file, but will not write to it until after reading the whole input. That might take quite some time. A quick, incomplete result can be obtained by adding

if page.id > 1000:
    break

in the for page in dump: loop. This should give a JSON file with results in less than a minute.

My computer is now running against the six bz2 files (two at a time, to prevent the CPU from being completely busy). I added some lines to give some output on the screen while running:

if page.id % 1000 == 1:
    print(page.id, datetime.datetime.utcnow(), page.title)

(don't forget to import datetime). It sometimes skips, as the x001st page does not exist any more, but it gives some information.

Adding

if page.namespace == 0:

around the revisions loop saves the script from reading pages in other namespaces. As beginnetjes should be in the article namespace, other namespaces can be ignored (this saves reading all those revisions and might prevent some false positives).

My computer has read the history1 file; the results have been made available at
The other five will follow once finished. The script currently seems to stall while processing the second file, somewhere after page 336001 (it took one and a half hours to get there). It did continue later on, after taking more than 45 minutes for just 1000 pages. The third file is still being processed.

Does this help you?

The results from files 4 and 6 have been added. These gave .json files of 1.9 and 2.3 MB. The results from history5 are a bit too big to add: a JSON file of 27.4 MB is hard to add to GitHub.
History5 contains pages added between December 2010 and February 2014.

I don't understand why Chtnnh doesn't manage to run the script; I hope yesterday's hints help him.
If not, the files from the five other parts probably give enough pages and changes to get the initial version running.

This last remark reminds me of something I wanted to add: pages are numbered in the order in which they were created (except for the first few thousand, as there has been a restart which lost track of the original ordering). As file history5 ends with pages created in February 2014, the more recent opinions on when a page is a 'Beginnetje' are likely to be found in the newer pages, so focusing on history6 might be wise. There will be removals of 'Beginnetje' from older pages, contained in history1..5, but that might be fewer pages. Revision 40371396 is an edit of 14 February 2014, 45656960 was made on 1 January 2016 and 50643818 on 1 January 2018.

I just started working with @Psingh07's dataset.

$ cat nlwiki-20201101-E_and_D.json | wc -l
$ cat nlwiki-20201101-E_and_D.json | json2tsv label | sort | uniq -c
    551 D
  38941 E

~40k labels, mostly "E" -- adding the stub template.

$ cat nlwiki-20201101-E_and_D.json | grep -P '"username": ".*[bB]ot.*' | json2tsv username | sort | uniq -c
   1237 BotMultichill
     38 BotteHarry
      1 Erwin85TBot
  36520 LymaBot
      4 Robotje
    306 RobotMichiel1972
    167 RomaineBot

We've got some very active bots in this dataset.

 34 2005
 11 2006
 24 2007
  3 2008
 36 2009
149 2010
253 2011
 39 2012
 45 2013
394 2014
 59 2015
 42 2016
 32 2017
 41 2018
 31 2019
 26 2020

There were some bursts of human editor templates in 2014.

$ cat nlwiki-20201101-E_and_D.json | grep -P '"username": ".*[bB]ot.*' | json2tsv timestamp wp10 | cut -d- -f1 | sort | uniq -c
   1241 2009
    377 2010
     65 2011
      1 2012
  36581 2014
      4 2015
      1 2016
      1 2017
      1 2019
      1 2020

Looks like the bots are mostly inactive since 2014.
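The grep-based bot filter used above could be mirrored in Python along these lines (a sketch; the "username" field name follows the JSON samples shown in this thread, the function name and example records are illustrative):

```python
import json
import re

# Same heuristic as the grep -P '"username": ".*[bB]ot.*' pipeline above.
BOT_NAME_RE = re.compile(r"[bB]ot")

def split_by_bot(json_lines):
    """Split newline-delimited JSON label records into (bot, human) lists,
    based on whether the username contains 'bot' or 'Bot'."""
    bots, humans = [], []
    for line in json_lines:
        record = json.loads(line)
        target = bots if BOT_NAME_RE.search(record.get("username", "")) else humans
        target.append(record)
    return bots, humans

lines = [
    '{"username": "LymaBot", "label": "E"}',
    '{"username": "Vitalfranz", "label": "D"}',
]
bots, humans = split_by_bot(lines)
print(len(bots), len(humans))  # → 1 1
```

Note that, like the grep, this would also catch human accounts that merely contain "bot" in their name (e.g. Robotje is flagged here), so a curated bot list may still be needed.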

I extended the query described in

$ cat nlwiki-20210116-A_B_and_C.json | json2tsv wp10 | sort | uniq -c 
    364 A
    105 B
    500 C

It looks like we'll need to get more clever to find more "B" class articles. No matter, we can probably at least start experimenting with this.

$ cat datasets/nlwiki-20201101.balanced_sample.json | json2tsv wp10 | sort | uniq -c
    100 A
    100 B
    100 C
    100 D
    100 E

Let's start with a balanced dataset.
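The balanced sampling step might look roughly like this (a sketch, assuming records carry a 'wp10' label field as in the datasets above; the function name and seed handling are illustrative):

```python
import random
from collections import defaultdict

def balanced_sample(records, n_per_class, seed=0):
    """Draw up to n_per_class records for each 'wp10' label, so every
    quality class is equally represented in the training set."""
    by_label = defaultdict(list)
    for record in records:
        by_label[record["wp10"]].append(record)
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    sample = []
    for label in sorted(by_label):
        pool = by_label[label]
        sample.extend(rng.sample(pool, min(n_per_class, len(pool))))
    return sample
```

Classes with fewer than n_per_class examples (like "B" above) would simply contribute everything they have, which is why finding more "B" articles matters.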

You can see the data I generated here:

Next steps, we'll want to share this sample with dutch Wikipedians to make sure it reflects their expectations and we'll also want to try building a model with this dataset.

Hi @Halfak, @Psingh07 and others,

Thanks for these big steps forward. I looked at the generated dataset. The E-category is swamped with items from 'Beginnetje biologie'. I know, that is by far the biggest group amongst our beginnetjes. But almost all these articles are bot imports, considered a beginnetje merely for being created by a bot. Would it be possible to get a new selection for the E-group while eliminating 'Beginnetje biologie'? That would make this group far more interesting to look at.

I also came across two articles in user space. These can be rescued articles where the removal of the beginnetje template was done only because the page was moved out of the main space, having nothing to do with the quality of the article at that specific moment.

We'll also need instructions for the people who are going to look at these specific revisions, which might be far from the current state of the article.

Hi, I second that.

Referring to our notes from December 17th, please exclude bots from the following users from the dumps:

As discussed, we plan to add the username for analysis, to identify bots making the edits.
Excluding stubs ('beginnetjes') in 5.json by username, there were ~3-4 bots creating articles:

  • Joopwikibot
  • RobotMichiel1972
  • Kvdrgeus

Thanks for your time and attention today

I'm looking at the new dataset (2021-1-21). (First line: {"timestamp": "2018-12-04T09:52:55Z", "rev_id": 52732712, "wp10": "D", "page_title": "Achthophora chabacana", "username": "Vitalfranz", "page_namespace": 0})

I see versions in the B-set that are redirects, like revs 39297089 and 17071877. I see pages outside the main space, like rev 50356068, and user (talk) pages, like revs 48867116, 56505834, 58020315, 55627603.
The length of the C-class articles (3-5 kB) is fine to start with; let's get the system working. In the final model, length should not be a major factor in rating articles.
Links from previous versions of to Gebruiker... or User... should all be excluded.
E-class still has too many biology stubs. Would it be possible to exclude these ({{Beginnetje|biologie ) completely, like in

Would it be hard to exclude all redirects, all pages outside namespace 0, and all revisions (E-class and, if possible, D-class (removal of)) with {{Beginnetje|biologie for a new set?

Excluding redirects can be achieved by requiring a minimal length of (say) 250 or 1000 bytes. For A and B, 1000 should be no problem; for E this might be too long.
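Taken together, the requested filters (main namespace only, no redirects, no biology stubs, minimum length) could be sketched as follows (a sketch; the 250-byte default follows the suggestion above, and the function name, regexes, and the Dutch redirect keyword #DOORVERWIJZING are assumptions for illustration):

```python
import re

# Redirect markers: English keyword plus the assumed Dutch equivalent.
REDIRECT_RE = re.compile(r"^\s*#\s*(REDIRECT|DOORVERWIJZING)", re.IGNORECASE)
# Biology stubs, e.g. {{Beginnetje|biologie}}.
BIO_STUB_RE = re.compile(r"\{\{\s*[Bb]eginnetje\s*\|\s*biologie", re.IGNORECASE)

def keep_revision(namespace, text, min_bytes=250):
    """Decide whether a revision should stay in the labeled dataset."""
    if namespace != 0:
        return False                 # main namespace only
    if text is None or len(text.encode("utf-8")) < min_bytes:
        return False                 # too short; also drops most redirects
    if REDIRECT_RE.match(text):
        return False                 # explicit redirect page
    if BIO_STUB_RE.search(text):
        return False                 # bot-imported biology stub
    return True
```

The byte-length check and the explicit redirect check overlap on purpose: long redirects exist, and short non-redirect stubs would otherwise slip through.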

I've updated to exclude redirects and limit results to main namespace pages.

I've updated our bot filter to filter *out* bots rather than limiting the dataset to bot activities. Woops.

@Psingh07 and I collaborated on a script that gathers failed "A" class nominations. We were able to gather an additional 257 examples. So by adding that to our dataset, I can boost the size 3.5x.

The results of all of this work can be seen in the repo along with a new balanced dataset:

I just created an initial model with basic features. See

It looks like we're getting very decent fitness. I think we can do a lot more with targeting features of articles. E.g. does nlwiki use infoboxes? Are there cleanup templates we can look for that would suggest quality issues? Anyway, we can leave that for follow-up work.

We'll still need to get this deployed before it can be used. I am talking to @calbon next week so that might let us put a date on when this will be available for experimentation.

It looks like we're getting very decent fitness. I think we can do a lot more with targeting features of articles. E.g. does nlwiki use infoboxes? Are there cleanup templates we can look for that would suggest quality issues? Anyway, we can leave that for follow-up work.

We're not fans of adding a lot of cleanup templates.
The most common are Wikify (to mark that an entire article needs cleanup, without needing to remove it), Twijfel (there is doubt about the article), Bron? (citation needed), and Wanneer? (when did the claim happen?).
And yes, we do use infoboxes; see and its subcategories.

Thank you, @Mbch331! Very helpful.

@Psingh07, I think this is a good next step for your work. See This is the initial feature set. It's very minimal. In comparison, check out the one for English Wikipedia: It will give you an idea of how we can use the templates that Mbch331 references above.

Our pattern will essentially be this:

  1. Work with folks in this thread to learn good ways to track quality-related stuff in articles
  2. Implement as features
  3. Run make nlwiki_models and make nlwiki_tuning_reports as necessary to check for fitness changes.
  4. Submit a new PR with better models. (Goto 1)

The mantle of this task has shifted to new volunteer Paritosh. @Psingh07

Hey folks!
@Halfak and the whole team are very excited and delighted that we were able to get a deployment done for the articlequality model. If you wish to see ORES scores on pages, copy this to your own common.js page following this link. Let us know how it works: whether it aligns with your expectations and whether you are happy with the performance. We're available to make improvements to the model, to optimize it to your expectations, and to address any shortcomings; feel free to link us to any problem you face and we will do our best to help.

Thank you so much @Psingh07 , this looks great already!

I can see the rating under the article title, and the colours in the page history. Nice!
Would it be possible to display the colours that are shown in the page history in the rating on the article page as well? (similar to en:)

@Halfak This means we can start testing soon right? Could we maybe exchange ideas on how to set this up on-wiki, so I/we can start preparing for this?

Sure! We can even use local templates. Would you be interested in creating templates with badges/colors you like for the prediction that appears on the top of the page?

Also, it seems like we should translate "ORES predicted quality". What might that look like in Dutch?

I am not great at creating templates. Do you maybe have a link to the code on en-wp?

I think we would translate the header to "Kwaliteitsinschatting door ORES"

Thank you, @Psingh07, it does look great indeed!

I agree with @Ciell on the label name: "Kwaliteitsinschatting door ORES".
I started a template some time ago, at, but it does not seem to do the job. What extra info, link, .... do we need? It seems to me the template should be called from somewhere, and that does not seem to be the case. I guess I was inspired by a template on en.wikipedia, but I do not seem to have stored the source (and can't find it again either). With some helpful hints, I will see if I can fix it.

I updated the language for the tool so that it should show the Dutch version now.

Re. templates, you can see what English Wikipedia has here:

You might consider re-using the icons from enwiki in nlwiki. I made a start of that here:

I also added an example usage of {{ORES schaal}} to that page so you can see how it works. If you have the ArticleQuality.js user script enabled, it should add a predicted quality level to the link.

Thank you very much for these examples!
I have had the code for the templates restored (thanks, @Ciell). The template works. I do have to fix some links: not just pointing to 'Categorie:A-klasse artikelen' for classes A through E, but also to the etalage-artikelen (featured articles) and beginnetjes (stubs). I guess I should point to these from the Class template.
Do I have to call the Class template somewhere to have the outcome shown together with the ORES rating, or should @Psingh07 do this in his part of the code?

I've added the class template to the user script code. It looks like the output of the script (on the top of the page) looks right now.

Thanks, @Halfak , that did it!
I will redirect the categories later this (long) weekend, or think of another solution (maybe a descriptive page for the rating). As far as I can see on the English Wikipedia, articles are placed in categories based on a manual edit on the talk page using the vital article template. I think the Dutch Wikipedia is not ready (yet?) for massive manual rating of articles.
It would be great, but I estimate also a heavy operation, to rate all articles on a regular basis using ORES. But let's first see what we can do with the current information and run some user tests.

Thank you for the colours Ronnie, I like them!

I would not be a fan of giving the articles a place in the normal category-tree yet, but yesterday I created Categorie:Wikipedia ORES.

We think we could test with (hidden) subcategories in this category, but I would prefer to wait for the results of the first labelling tests to refine and see community response. (should be about 3-4 weeks, max)

I created the translation of the way we compiled the scale:

And I updated the test page: people are invited to add the code to their common.js and share the mistakes in the wild. I shared that we expect to start testing in June.

Would it be possible to exclude certain pages, like for instance our main page?

Hi @Ciell, I did not mean to create 'normal' categories (although Ecritures created categories like for A-C), but pages like the explanatory page for all five levels. Maybe using some of the suggestions Halfak made at ( for B). Having five red links does not feel very satisfying.

As for the main page, it consists of a comment, a template call, a category, and a second template call. Maybe pages without content of their own can be excluded?

With this edit A links to Wikipedia:Etalage, B to Wikipedia:Ruwe diamanten, E to Wikipedia:Beginnetje and C and D to Wikipedia:ORES/Kwaliteitsschaal voor artikelen. We can easily change it to other pages, if more relevant pages exist.

Yeah! We can exclude the Main page. I wonder if there is a good way to identify if we're loading the script on the main page in a wiki/language independent way. In the meantime, I'll look into making a special case for nlwiki.

One other thought, do you want it to load on userspace pages? E.g. User:<name>/Sandbox or the equivalent. Anywhere else on the wiki that people draft articles other than mainspace?

The main page seems to be protected on the Wikipedias in surrounding languages (is it in all?). Is protection a reason not to classify the page? I think this is too tight; permanent protection could be better. The Danish and German main pages have hardly any content in the code, just like the Dutch one. The French, English, and Portuguese ones do have some content in the code of the page. Just going for 'no content' does not seem to be enough, but excluding pages without content might help to save similar pages from being rated. Is it easy (and not too heavy) to check against Wikidata:Q5296? Is that too limited to our ecosystem?

On the Dutch Wikipedia, we don't have a draft namespace (yet), so people can start writing anywhere in their own namespace. User:<name>/Kladblok is the Dutch personal sandbox, but User:<name>/Art_Gensler could be a draft too. I do have some pages in User:RonnieV/... which are not intended to be published (nor classified), but classification could help people get their content above D-level before publishing. User:<name> should not be classified, nor anything in the User_talk: namespace.

Hi there,

Thanks very much for the efforts this far.

As we plan to have people test the outcomes, the question arises when articles get which qualification. See [] for some discussion. seems to differentiate on the integer values 1-5, but I might be misinterpreting this. It does not seem to turn a decimal number into an integer, so this might just be the five levels we agreed upon and not the actual boundaries between the different qualifications. seems to contain more of the logic. I see (from line 259) a calculation of the weightedSum, by summing probas and their respective weights. The weights are given to the function as part of the options and the probas as part of the given score. I guess the options are language specific (they could differ between the Dutch, the English, and the Portuguese Wikipedia, but not between different pages of a specific language), and the probas are calculated somewhere, based on the content of the page.
Which parts contribute to a score? Can we find somewhere how the different probas of a specific page are calculated? (E.g. counting the number of references or the number of outgoing links.) What are the weights of the different options?
Can this information be published somewhere?

I said above that weights would probably be just language specific. I can imagine, though, that the namespace could make a difference as well. On the Dutch Wikipedia we are not fond of having articles outside the main namespace show up in regular categories. We also do not like mainspace articles pointing towards drafts and other texts outside the main space. I can imagine that articles outside the main space, if we were to rate them and these two options earn points, might need some compensation for not being able to score on these two options: either by ignoring this scoring option in the calculated weight and in the outcome, or by pretending at least one category was provided.

I realise that black-box testing is a valid way of testing (does article A deserve a higher rating than article B?) and is very important in this phase. Code-aware testing detracts attention from this and shifts it towards testing whether the code actually does an accurate count of (e.g.) the categories, or a correct summation of the different weights and probas. I am pretty sure these latter tests have already been performed in the other (14?) Wikipedias using ORES predictions, so they should not get too much attention.

Clarification would be highly appreciated.

Sorry for the late response. The holiday weekend in the US (memorial day) had me out of my usual flow.

For a discussion of the numerical values associated with scores, see page 5. Essentially, this measure gives you a finer estimate of the center of the probability distribution across classes. Usually, you should see that the predicted class (the most likely class) aligns well with the "weighted sum" numerical value. This value can help you distinguish between a low C and a high C article.
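As a sketch of how such a weighted sum collapses the class probabilities into one number (the weight assignment E=1 through A=5 is an assumption for illustration, not necessarily what ORES uses):

```python
def weighted_sum(probabilities, weights):
    """Collapse a class-probability distribution into a single number,
    as described for the 'weighted sum' measure."""
    return sum(probabilities[cls] * weights[cls] for cls in probabilities)

# Hypothetical weights: E=1 (stub) up to A=5 (best).
WEIGHTS = {"E": 1, "D": 2, "C": 3, "B": 4, "A": 5}

# A hypothetical prediction: most likely class is C...
probas = {"A": 0.05, "B": 0.10, "C": 0.60, "D": 0.20, "E": 0.05}
print(round(weighted_sum(probas, WEIGHTS), 2))  # → 2.9
```

Here the predicted class is C, but the weighted sum of 2.9 (below C's nominal value of 3) signals a "low C": the probability mass leaning towards D pulls the number down.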

For the features that go into a score, the best way to view that is to add ?features to a request to ORES. E.g., This will list out the raw values from which the model learns and makes predictions. As far as weights, that's more difficult to inspect. The underlying model (a GradientBoosting classifier) builds a sort of decision tree like structure that doesn't explicitly assign a weight (or coefficient) to each feature. However, we can inspect the "feature importance" which will give you a sense for how much each feature ultimately affects predictions.

I can imagine that articles outside the main space, if we would rate them and these two options do earn points, they might get some justification to compensate for not being able to score on these two options, either by ignoring this scoring option in the calculated weight and in the outcome or by pretending at least one category would be provided.

I'm not sure I'm following this. Are you suggesting that you'd like different weights for the "weight sum" calculation outside of main space? Or maybe you want predictions to operate differently outside of main space?

For the formal evaluation of the model, you can add ?model_info to the URL for a score to see the fitness statistics. e.g. This is a bit of a strange request because it is also generating a score. If you just want the fitness statistics and other info, you can do something like this:

Re. black box vs. code aware testing, I agree that, for folks who want to dig in, talking about the code can help. But I think it is far more informative to talk about what data we're training on and how that was generated than looking at the internals of the model itself. E.g., C class is strictly defined as an article with a certain number of chars. That's weird. It may work in practice, but we should look out for articles that exist in the area between D and B for issues. That said, the feature importances analysis I talked about earlier can be quite informative -- e.g., when a feature gets an unexpectedly high or low weight, we can assume something strange is happening. I'll get that together soon and post it here.

Halfak, thanks for your elaborate answer. I will dive into the log after a good night of sleep.

A quick reaction for now.
123125 in the second and third link seems to stand for a revision id (the first version of the article about the movie Gia).

When I read the features (and understand them well), an article gets a higher rating when it has categories (feature.revision.category_links). Articles outside the main space (for instance drafts) should, per local policy, not be in a normal category. The article will get a better score as soon as categories are added after moving it to the main space. At least something to be aware of.
On the Dutch Wikipedia, there is no bonus for getting an article into as many categories as possible. I would even say: more than five or six categories is a drawback, not a bonus. It would be good to see this reflected somewhere in the process.

Here's the importance table. The higher the importance score, the more important the value is to the prediction. It turns out that the count of category links is the least important feature of the set. Overall length of the article, the amount of content with references, and the proportion of content that is referenced are the dominant features.

One thing you'll note here is that many features are expressed as something like feature_1 / feature_2. But you don't see that in the output of ORES when providing ?features as an argument. This is by design. We simplify the features to the base values when reporting them in ORES. But the model will use weighting, scaling, and division to try to gather more "signal" from the base feature values.

feature.(wikitext.revision.external_links / max(wikitext.revision.content_chars, 1))    0.006630187213437086
feature.(revision.image_links / max(wikitext.revision.content_chars, 1))    0.007117261348315173
feature.(wikitext.revision.headings_by_level(3) / max(wikitext.revision.content_chars, 1))    0.007159772210726818
feature.(len(<datasource.dutch.dictionary.revision.dict_words>) / max(len(<datasource.wikitext.revision.words>), 1))    0.009558784392400578
feature.(wikitext.revision.wikilinks / max(wikitext.revision.content_chars, 1))    0.015171176477007836
feature.(dutch.stemmed.revision.stems_length / max(wikitext.revision.content_chars, 1))    0.02008636288117727
feature.(wikitext.revision.headings_by_level(2) / max(wikitext.revision.content_chars, 1))    0.022257976284638322
feature.(revision.category_links / max(wikitext.revision.content_chars, 1))    0.028815256064704614
feature.(wikitext.revision.ref_tags / max(wikitext.revision.content_chars, 1))    0.029551519878959952
feature.(enwiki.revision.paragraphs_without_refs_total_length / max(wikitext.revision.content_chars, 1))    0.05597827983284357
feature.(nlwiki.revision.cn_templates / max(wikitext.revision.content_chars, 1))    0.07324425132632534
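The "feature_1 / max(feature_2, 1)" pattern from the table can be reproduced from the base values that ?features returns, e.g. (a sketch; the selection of features and the output key names are illustrative):

```python
def ratio_features(base):
    """Derive normalized ratio features from base feature values,
    mirroring the 'feature_1 / max(feature_2, 1)' pattern in the table.
    The max(..., 1) guard avoids division by zero on empty articles."""
    chars = max(base["wikitext.revision.content_chars"], 1)
    return {
        "ref_tags_per_char": base["wikitext.revision.ref_tags"] / chars,
        "wikilinks_per_char": base["wikitext.revision.wikilinks"] / chars,
        "category_links_per_char": base["revision.category_links"] / chars,
    }

# Hypothetical base values as they might come back from a ?features request:
base = {
    "wikitext.revision.content_chars": 1000,
    "wikitext.revision.ref_tags": 10,
    "wikitext.revision.wikilinks": 50,
    "revision.category_links": 5,
}
print(ratio_features(base))
```

Normalizing by article length is what lets the model compare a short, well-referenced article against a long, thinly-referenced one.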

Thanks for the great meeting we had today!

Could the output of be enriched with a link to the actual examined version [] and include the exact score besides just the qualification?

The output of is pure JSON and links are not possible in this data format.

and have the exact score besides just the qualification?

I'm not sure what you mean here. Would you like to see the raw probability distribution rendered through JS on that version of the article?

Hi @Halfak,

Would it then be possible to show the link, like ? That would make it easier for first-timers to get there. Copy & paste is common usage.
And it would be great if the JSON output were something like:

"score": {
    "prediction": "D",
    "rating": 2.7
}

(A more meaningful word than rating would be fine.)

I see. You're asking to include the "weighted sum" measure in the JSON output?

Yes, that is the right name of the value.
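A hypothetical sketch of how such a weighted sum could be computed from the model's probability distribution. The class values and probabilities below are made up for illustration, and the actual ORES output format may differ:

```python
# Illustrative sketch of the "weighted sum" measure discussed above:
# map each ordinal class to a number and take the expectation over the
# model's probability distribution. All values here are hypothetical.

CLASS_VALUES = {"E": 1, "D": 2, "C": 3, "B": 4, "A": 5}

def weighted_sum(probabilities):
    """Expected class value under the predicted probability distribution."""
    return sum(CLASS_VALUES[c] * p for c, p in probabilities.items())

probs = {"E": 0.05, "D": 0.45, "C": 0.35, "B": 0.10, "A": 0.05}
prediction = max(probs, key=probs.get)  # "D"
score = weighted_sum(probs)             # 0.05 + 0.90 + 1.05 + 0.40 + 0.25 = 2.65

print({"prediction": prediction, "weighted_sum": round(score, 2)})
```

An article that is confidently D would score near 2.0; probability mass spread toward C pulls the weighted sum up toward 3.0, which is what makes this measure finer-grained than the prediction alone.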

@Ciell I have added the samples of all the classes to @Halfak's sandbox.
As discussed, it would be great if you could move it to a suitable page in order to get reviews from community members on the predictions. Members can simply add their notes on particular articles where "Laat hier je notities achter" ("Leave your notes here") is mentioned.

Thanks @Psingh07 and @Halfak!
I shared it with the community tonight.

I am just looking at the table @Halfak gave on June 3. One of the parameters says 'enwiki' instead of 'nlwiki' (feature.enwiki.revision.paragraphs_without_refs_total_length). Could this be a reason for strange results? It is used twice in the calculation.

Notes from 2021-07-15:

  • To Do: How to do a non-destructive git revert in revscoring
  • Use revscoring 10.0 and rebuild all models. Use deltas unofficially to rebuild all models and talk about the stats.

We are waiting for this bug to be fixed, before we can move forward again.

We're unblocked with new work. We have new code ready for modeling/testing that improved unsourced content detection.

We also have done some of the background work to improve the detection of articles that contain an over-abundance of bullet points. I'll be getting that implemented and tested as part of the next cycle of work.

Interesting suggestion in our Wikipedia:De Kroeg to reverse the A-E classes, making A the minimum starter class and E the highest.
If our standards go up over time, we could then just add an extra F class superseding the now-highest E class, instead of creating less obvious 'A+', 'A++' and 'A+++' classes.

I already commented in WP:DK. I think it is more likely that we will want classes between the current classes than above (or below) them. Also, since classifications of all revisions are calculated at the moment they are shown, using the model of that moment, no (stored) changes to articles will be needed to move a current A classification to a B classification (or whatever). When we change the conditions an article has to meet to get a certain classification, that will just work.

American school grades have an A as excellent and an E as definitively too poor, while Dutch swimming certificates start with an A, so there are arguments for both directions.

It should be OK to change the meaning of the current classes over time too. One nice thing about using an ML model to supplement quality assessment is that it is easy to propagate changes like that. E.g. if we adjust the definition of a quality classes, we just need to review our training data (50-75 articles per quality class) to fix the labels and retrain.

In a related wiki (English Wikipedia), the definitions of classes have changed quite dramatically over time. For example, back in 2006, you could have a B class article with no citations! Now, you can't even have a stub without citations. So when we build models, we use more recent data to learn from and we apply today's criteria historically to examine the development of the encyclopedia. For the most part it makes sense because we're always looking at quality through today's eyes. The scale and our quality model should as well.

I should say, this pattern of retraining also works for between classes too.

Sorry, one final thought. We could make the quality classes non-ordinal. E.g. call the lowest class Beginnetje and the highest class Etalage, and develop common-sense names for the classes in between. That way, order may be plainly apparent, and in-between classes would require a common-sense name as well, rather than something like "B-" or "C+".

The interesting thing about that idea is that the lowest and highest classes are likely 1) the easiest for a machine to predict and 2) the least likely to change over time. So the top and bottom classes become firm anchors, between which sit the somewhat harder to predict and more ambiguous in-between classes.

I like the idea of class names above class labels. Some suggestions:
Zeer goed ("Very good")

ORES-discussion is taking off in the Village Pump/De Kroeg.

Interesting observation: the ORES score drops when wiki-markup is added to an article, and plummets further to a lower class when fields are added to the infobox, only to return to the original score when two sentences of text are added.
The fields added to the infobox in the last edit linked were left empty; should these edits decrease the score in such a way?

I like the idea of class names above class labels. Some suggestions:
Zeer goed ("Very good")

I think I would actually prefer not to use the same names as we are used to when manually scoring an article. People could get confused: "Is this an ORES assessment, or a human one?"
With just A-E (or E-A) it is clear that this is an AI assessment, while an Etalage or Beginnetje label is unambiguously a human assessment.

The interesting thing about that idea is that the lowest and highest classes are likely 1) the easiest for a machine to predict and 2) the least likely to change over time. So the top and bottom classes become firm anchors, between which sit the somewhat harder to predict and more ambiguous in-between classes.

Actually, over the years the top classes have changed already. Back in 2005, the Dutch article on Prince was considered FA ("Etalage"/A-class) on the Dutch Wikipedia, but this status was removed again in 2016 because we concluded that references and sources should actually weigh more heavily than we had valued them before. The 2016 article had 5 general sources at the end and only a few inline citations of medium-quality sources: this was no longer considered enough.

Requirements for a specific class can change over time, but there will always be a top class (and a bottom class). The requirements for these classes are usually the easiest to formulate, and it is easiest to check whether an article (revision) meets them. That a top-class article by the requirements of 2005 does not meet the requirements of 2016 (nor 2021) is not a problem: it will move to the next lower class, or maybe even drop two classes. The distinction between a top-C and a bottom-B article can be much less clear.

The ORES-code has been turned into a script to make the function more accessible for users to try.
(instead of adding the code to their personal common.js manually, they can now just tick the box for the quality scale in their settings.)

I added a wikitext.revision.list_items feature to revscoring for tracking articles that are in outline form (as opposed to prose). See

Once this is merged, I'll use this and other improvements to re-generate the models. Then we can use those models to consider a new labeling campaign based on the new quality criteria.
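A rough sketch of what counting wikitext list items could look like; the actual revscoring implementation of wikitext.revision.list_items may differ, and the regex below is an assumption:

```python
import re

# Hypothetical sketch of a wikitext list-item counter for detecting
# articles in outline form. MediaWiki list items start a line with
# *, #, : or ; markers.

LIST_ITEM = re.compile(r"^[*#:;]+", re.MULTILINE)

def count_list_items(wikitext):
    """Count lines that begin with a wikitext list marker."""
    return len(LIST_ITEM.findall(wikitext))

text = """Intro sentence.
* first bullet
* second bullet
# numbered item
Plain prose line."""

print(count_list_items(text))  # 3
```

Dividing such a count by content length, as with the other density features, would flag articles that are mostly bullet points rather than prose.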

Still waiting on a review/merge. In the meantime, @Psingh07 is working on gathering new labeled data from the reviewing work folks did on the wiki pages.

I was able to gather 64 new labels from the wiki. Most of them were E class, but we did get some B, C and D, which are hard to differentiate.

new_manual_labels.png (557×889 px, 13 KB)

I think we'll need some more examples of C-class articles in order to train a new classifier that more closely matches the newly developed quality scale. But we can probably work with the other observations we have for now.

I think a good next step is to (1) train a model including this new data while we load up a labeling campaign in parallel. We can probably focus that labeling work on B, C, and D class.

This sounds like a good plan to me Aaron!

^ New version of the model using updated features and manually extracted labels.

Next step is to extract 75 probable C class articles. Load that into wikilabels. And require 3 labels.

Adds the nlwiki article quality scale form to Wikilabels:

Generates a sample of 100 articles from a recent wiki dump:

I settled on sampling 25 probable D class, 50 probable C class, and 25 probable B class. I think this will be important for gathering examples that fall outside the super-easy-to-learn character-count window we specified for C class previously. It will likely mean a bit more work, but it will also help us get sharper predictions for D and B class.
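The 25/50/25 sampling described above could be sketched like this; the article buckets and names below are made-up placeholders, not the actual dump-processing code:

```python
import random

# Illustrative sketch of drawing the 25/50/25 labeling sample from
# articles bucketed by the model's predicted class. Data is fabricated.

random.seed(42)  # reproducible sample
predicted = {
    "D": [f"D-article-{i}" for i in range(200)],
    "C": [f"C-article-{i}" for i in range(400)],
    "B": [f"B-article-{i}" for i in range(150)],
}
quota = {"D": 25, "C": 50, "B": 25}

# Draw each class's quota without replacement, then pool them.
sample = [a for cls, n in quota.items() for a in random.sample(predicted[cls], n)]

print(len(sample))  # 100
```

Oversampling the probable C bucket relative to D and B matches the goal stated above: most of the new labeling effort lands on the hard-to-separate middle classes.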

Once these are reviewed (only the wikilabels one is a real blocker), I can get the data loaded and we can start labeling.

I'm running into some issues with the wikilabels updates. Looks like some of our deployment code has gotten old and crusty (versions have changed and backwards compatibility dropped). So I'm working on that.

I was able to get the campaign loaded! See

This will require 3 labels by different people for each article. We'll be able to examine the disagreements after all of the labeling work is completed.

Sent out the communication about the labeling campaign to the Dutch community just now.

Four articles out of the 300 still need to be labeled, but both sets that are still in Wikipedia:Labels give an "$2" error (see screenshot).

Screenshot 2021-10-10 at 20-48-51 Labeling gadget.png (895×1 px, 75 KB)

Those last 4 must have been checked out to someone in a workset. I think they were returned in the meantime, because I was just able to check them out in a workset. I skipped them all, so they should be available again.