Article quality campaign for Persian Wikipedia
Closed, ResolvedPublic


On Persian Wikipedia we follow English Wikipedia's wp10 quality model. The problem is that there are not enough people to assess the quality of every article, so only stub, good, and featured articles are reliably labeled. Many articles should be categorized as "B", "C", or "Start", but they haven't been categorized.

So here's my suggestion: do what we did with the edit quality campaign. Get a 20K sample, autolabel stub, featured, and good articles, and ask users to label what's left. I think we should start with 20K because we have lots of stub articles that can be filtered out easily.

Event Timeline

Restricted Application added a subscriber: Aklapper.
Halfak subscribed.

How do InfoBoxes work? Are they used like on English Wikipedia?

Are there "citation needed" templates? How do they work?

Re labeling, we can do a stratified sampling strategy where we select N articles above a certain length and N articles in the middle.

We should start with an analysis of what is currently labeled and then do some stratified sampling to fill in the gaps. We might be able to cut down the labeling work substantially -- e.g. just 1 or 2k labels.

We have both "infobox foo" and "جعبه اطلاعات .+?", so they work the same way. Yes, there are citation needed templates, and they work like on English Wikipedia.

Halfak triaged this task as Medium priority. Oct 9 2017, 9:05 PM

This is a random sample:

I analyzed the distribution of page sizes and it's obviously logarithmic. This is their distribution on a natural-log scale:

image.png (890×1 px, 32 KB)

I checked this in a lot of detail. My conclusion: set the page-size threshold at 4.6KB, autolabel pages that are smaller than the threshold and are tagged as stubs (it's a simple regex), and leave the rest (pages that are either bigger than 4.6KB OR don't have a stub tag) for users to label. Users would then need to label around 40% of the actual data, with the highest accuracy possible around that size.
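A minimal sketch of the autolabeling rule described above, for illustration only. The stub regex used on fawiki is not shown in this thread, so `STUB_RE` below is a hypothetical stand-in, and reading "4.6KB" as 4600 bytes is also an assumption.

```python
import re

# "4.6KB" from the comment above, read here as 4600 bytes (assumption).
SIZE_THRESHOLD_BYTES = 4600

# Hypothetical stub-template pattern; the real fawiki regex is not given in the thread.
STUB_RE = re.compile(r"\{\{[^{}|]*stub\s*\}\}", re.IGNORECASE)


def triage(wikitext):
    """Return "stub" if the page can be autolabeled, else None (needs a human label)."""
    size = len(wikitext.encode("utf-8"))
    if size < SIZE_THRESHOLD_BYTES and STUB_RE.search(wikitext):
        return "stub"
    # Either too big OR missing a stub tag: leave it for the labeling campaign.
    return None
```

Pages for which `triage` returns None are the ones that would go to the labeling campaign.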

Looks good. I think you should *attempt* to sample 250 observations for each quality class that fawikipedians use and we'll see how the labels fall. We can always supplement by using an initial model to do a second sampling for classes that are lacking observations.

@Ladsgroup and I discussed this in IRC. I agree that we'll need to do some labeling. I think it would be good to analyze the length of the articles that are labeled and then sample from the apparent length distributions to arrive at a sample of edits. This sample of edits should, theoretically, bring each target quality class up to 250 observations. Then we'll have Wikipedians label the most recent version of the sampled articles and go from there. If we've done our job well, the articles will fall mostly into the classes we expect and we'll have a mostly balanced dataset. It's not critical that the dataset is perfectly balanced -- just that we have a representative # of examples for each class.

I looked into the good and featured articles in Persian Wikipedia and the wiki has 140 GAs and 126 FAs in total. I'm pretty sure we can't find more than 10 articles that are explicitly labeled as B, C, or start article.

I think the next step is to analyze the length distribution of articles and compare that to the distribution of GA and FA articles. Based on that, we should be able to do a stratified sample and then a Wiki labels campaign to fill in the gaps.

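I did some analysis of the distribution of page sizes over three groups: (1) featured articles, (2) good articles, and (3) a random sample of 20K edits in fawiki. I also included results for the three "rule of thumb" checks (the share of values within one, two, and three standard deviations of the mean) to see how closely each distribution looks like a normal distribution: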

Good articles:
Mean: 86025.56428571428, Std: 53642.678362370425
First thumb (perfect distribution: 0.6827):0.7285714285714285
Second thumb (perfect distribution: 0.9545):0.9428571428571428
Third thumb (perfect distribution: 0.9973):0.9928571428571429

Good articles (logarithmic scale, natural base):
Mean: 11.17610422170933, Std: 0.6193441303194193
First thumb (perfect distribution: 0.6827):0.65
Second thumb (perfect distribution: 0.9545):0.9785714285714285
Third thumb (perfect distribution: 0.9973):1.0

Featured articles:
Mean: 105901.1507936508, Std: 54877.92924200001
First thumb (perfect distribution: 0.6827):0.7063492063492064
Second thumb (perfect distribution: 0.9545):0.9444444444444444
Third thumb (perfect distribution: 0.9973):1.0

Featured articles (logarithmic scale, natural base):
Mean: 11.358386020536, Std: 0.8259085763280916
First thumb (perfect distribution: 0.6827):0.8492063492063492
Second thumb (perfect distribution: 0.9545):0.9523809523809523
Third thumb (perfect distribution: 0.9973):0.9603174603174603

All edits:
Mean: 10767.023771394255, Std: 24060.237578560784
First thumb (perfect distribution: 0.6827):0.9326393754378941
Second thumb (perfect distribution: 0.9545):0.9658192373135822
Third thumb (perfect distribution: 0.9973):0.9779801821639476

All edits (logarithmic scale, natural base):
Mean: 8.40046899030434, Std: 1.1659457920429064
First thumb (perfect distribution: 0.6827):0.7597337603843459
Second thumb (perfect distribution: 0.9545):0.9313382043839455
Third thumb (perfect distribution: 0.9973):0.9915924331898709
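The rule-of-thumb checks reported above can be reproduced with a short sketch like the following. It is not the script that produced these numbers; the function names are made up for illustration, and using the population standard deviation is an assumption.

```python
import math
import statistics


def three_sigma_fractions(values):
    """Return (mean, std, [share within 1, 2, 3 standard deviations of the mean])."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population std; the original choice is unknown
    fractions = []
    for k in (1, 2, 3):
        within = sum(1 for v in values if abs(v - mean) <= k * std)
        fractions.append(within / len(values))
    return mean, std, fractions


def log_scale(values):
    """Natural log of each positive page size, for the log-scale comparison."""
    return [math.log(v) for v in values if v > 0]
```

For a normal distribution the three shares approach 0.6827, 0.9545, and 0.9973, which is what the "perfect distribution" figures above refer to.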

OK so it looks like we have the following strata:


I'm working on a qualitative look at each of the orders of magnitude to see what kind of articles are there.


  • 10-100 bytes: Articles consist of a single sentence or are not actually articles at all -- just some sort of cleanup template. A couple of short lists or disambiguation pages.
  • 100-1000 bytes: Stubs. Solid stubs with a couple of sections.
  • 1k-10k bytes: Stubs with infoboxes and *maybe* start-class articles.
  • 10k-100k bytes: C to B class. Maybe a GA in there, but a random sample generally returns 10-15k length. Might want to add a stratum for 50-100k
  • 100k-1m bytes: Everything else!

So I think I want to propose the following strata based on what I saw.

  • 500-5k
  • 5k-10k
  • 10k-50k
  • 50k-100k
  • 100k-200k, 200k-300k, 300k-

For each of these, we should sample 20 revisions for the pilot. So we'll have 100 versions labeled. Once we have those, we can re-assess the strata ranges.
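As a rough illustration of the pilot sampling, here is a sketch that buckets pages by length and draws a fixed number from each stratum. It is not the query used for the campaign. Reading "k" as 1000 bytes and treating the last bullet as a single open-ended stratum (so that 5 strata × 20 gives 100 revisions) are assumptions; `stratum_of` and `stratified_sample` are hypothetical names.

```python
import random

# Strata bounds in bytes (low inclusive, high exclusive); None means open-ended.
STRATA = [
    (500, 5_000),
    (5_000, 10_000),
    (10_000, 50_000),
    (50_000, 100_000),
    (100_000, None),
]


def stratum_of(length, strata):
    """Return the (low, high) stratum containing length, or None if it fits nowhere."""
    for low, high in strata:
        if length >= low and (high is None or length < high):
            return (low, high)
    return None


def stratified_sample(pages, strata, per_stratum, seed=0):
    """pages: iterable of (page_id, length_in_bytes). Returns {stratum: [page_id, ...]}."""
    rng = random.Random(seed)
    buckets = {s: [] for s in strata}
    for page_id, length in pages:
        s = stratum_of(length, strata)
        if s is not None:
            buckets[s].append(page_id)
    # Take everything when a stratum holds fewer pages than requested.
    return {s: rng.sample(ids, min(per_stratum, len(ids))) for s, ids in buckets.items()}
```

Calling `stratified_sample(pages, STRATA, 20)` would give up to 20 page IDs per stratum; the bounds can be adjusted once the pilot labels come back.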

Here's the query to generate the dataset:

@Ladsgroup, can you put together a "feature_list" module for fawiki? Note that you'll need to put together regular expressions for matching infoboxes and citation templates.
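To make the request concrete, here is a hedged sketch of what such regular expressions might look like. The infobox names come from the earlier comment in this thread ("infobox foo" and "جعبه اطلاعات .+?"). The citation pattern only covers the English "citation needed" name, because the local template names are not given here. This is plain Python and not the actual "feature_list" module; `count_matches` is a hypothetical helper.

```python
import re

# Template openers for infoboxes, English and Persian names as quoted in the thread.
INFOBOX_RE = re.compile(r"\{\{\s*(?:infobox|جعبه اطلاعات)\b", re.IGNORECASE)

# English template name only; fawiki-specific names would need to be added (assumption).
CITATION_NEEDED_RE = re.compile(
    r"\{\{\s*citation needed\s*(?:\||\}\})", re.IGNORECASE
)


def count_matches(pattern, wikitext):
    """Count non-overlapping matches of a compiled pattern in the wikitext."""
    return len(pattern.findall(wikitext))
```

Counts like these could then feed a quality model as per-article features.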

It looks great. I just finished a workset. I just need to mark the old campaign as inactive so it doesn't confuse people. Tomorrow, I will write up an announcement asking people to contribute.

This is for the features:
Now we need to run another campaign I think \o/

OK, for this campaign/sampling strategy, we got:

  • 24 stubs
  • 21 start
  • 22 c
  • 29 b
  • 0 ga
  • 3 fa

It seems that we have failed at pulling in a decent number of "ga" and "fa" observations. Maybe we should just be sampling those labels from the wiki. It depends on how many labels we can gather.

Looks like we can get 289 articles that are tagged either GA or FA via (FA: 128, GA: 161)

I think we should run a follow-up labeling campaign that will get us ~150 total observations in each class. Basically, let's do the exact same sampling strategy we did above, but target 125 observations per stratum.
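For reference, the gap between the pilot counts listed above and the ~150 target can be worked out with a few lines. This is only arithmetic on the numbers already in this thread; `shortfall` is a hypothetical helper name.

```python
# Pilot label counts as reported earlier in this thread.
PILOT_COUNTS = {"stub": 24, "start": 21, "c": 22, "b": 29, "ga": 0, "fa": 3}
TARGET_PER_CLASS = 150


def shortfall(counts, target):
    """Return how many more observations each class needs to reach the target."""
    return {label: max(0, target - n) for label, n in counts.items()}
```

The stub through B classes each need roughly 120 to 130 more labels, which is in line with the 125 per stratum proposed above, while GA and FA need nearly the full 150.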

Here's a query that does that:

I've updated the query for GA/FA based on @Ladsgroup's notes. It sounded like this was otherwise good to go.

@Ladsgroup, can you give me a campaign name for labeling the final dataset? It'll have 572 pages to label.

Announced it in Persian Wikipedia. Hopefully we will have it done soon.

awight moved this task from Completed to Parked on the Machine-Learning-Team (Active Tasks) board.
awight subscribed.

We're waiting on another iteration of the labeling campaign; there was an error in the first one.

45% done! @Ladsgroup, do you think you could give this another push?

I set a reminder to do a workset every day. Nothing special. It'll be done in a week or two.