
Generate recent article quality scores for English Wikipedia
Closed, ResolvedPublic

Description

  • Develop a score processor that can operate on a last-revision article XML dump
  • Run the processor and obtain scores

Include columns:

  • page_id
  • title
  • revision_id
  • prediction (Stub, Start, C, B, GA, FA)
  • weighted_sum (a weighted sum of the class probabilities that allows us to make inter-class measurements)

Re. weighted_sum, see https://github.com/halfak/quality-bias/blob/master/score_revisions.py#L88 This strategy seems to produce stable measurements.
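As a rough sketch of that strategy (the class weights here are assumed, Stub = 0 through FA = 5; the linked score_revisions.py is the canonical implementation):

# Sketch only: collapse per-class probabilities into one ordinal number.
# The weights are assumed (Stub = 0 ... FA = 5); see score_revisions.py#L88.
WEIGHTS = {'Stub': 0, 'Start': 1, 'C': 2, 'B': 3, 'GA': 4, 'FA': 5}

def weighted_sum(probabilities):
    """Return the expected class index for a dict of class probabilities."""
    return sum(WEIGHTS[cls] * p for cls, p in probabilities.items())

# Example: mostly "Start" with some "C" probability mass -> 1.28
weighted_sum({'Stub': 0.10, 'Start': 0.60, 'C': 0.25,
              'B': 0.03, 'GA': 0.01, 'FA': 0.01})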

Download link: https://datasets.wikimedia.org/public-datasets/enwiki/article_quality/wp10-scores-enwiki-20160820.tsv.bz2

Event Timeline

Halfak triaged this task as High priority. Jul 5 2016, 2:30 PM

https://github.com/wiki-ai/wikiclass/pull/25/files
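For context, the general shape of such a dump processor looks roughly like the sketch below. This is not the PR code: it assumes the mwxml library for dump parsing, and score_text() is a hypothetical stand-in for the wikiclass model call.

# Sketch of a last-revision dump score processor writing the TSV columns above.
# Assumes mwxml; score_text() is a hypothetical placeholder for the model call.
import bz2
import mwxml

def score_text(text):
    """Hypothetical: return (prediction, weighted_sum) for an article's wikitext."""
    raise NotImplementedError

def process_dump(dump_path, out):
    out.write("page_id\ttitle\trevision_id\tprediction\tweighted_sum\n")
    dump = mwxml.Dump.from_file(bz2.open(dump_path, "rt"))
    for page in dump:
        for revision in page:  # a last-revision dump carries one revision per page
            prediction, w_sum = score_text(revision.text or "")
            out.write("{0}\t{1}\t{2}\t{3}\t{4}\n".format(
                page.id, page.title, revision.id, prediction, w_sum))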

I want to run it on ores-compute against https://dumps.wikimedia.org/enwiki/20160820/, but the instance ran out of space after downloading 35% of the articles dump (12 GB). If we clean up this instance, I can run the script to generate all the scores.

enwiki-20160901-pages-articles.xml.bz2 is 6.0 GB

Use the /srv mount.

$ df -h /srv
Filesystem                          Size  Used Avail Use% Mounted on
/dev/mapper/vd-second--local--disk  139G  8.5G  123G   7% /srv

Score generation is being run on stat1003. Results are in /home/ladsgroup/wp10-scores-enwiki-20160820.tsv.bz2. ETA: one day.

Okay, the results are ready; I'm looking for a place to put the dump.

Looks great -- except that it still has the integer mapping of the predicted class. We can address that in some future work.

I explained in the PR that I did it for storage and performance reasons, both for the database and for the dump file. In the database, storing 'Stub' (and with varchar it gets a little bigger still) takes much more space than storing 0, and that difference gets multiplied by 2 million rows. Querying on an int is also much faster than on a varchar.

It looks like there is a problem. There are only 664k lines in the file, and then bzip2 detects an error.

$ bzcat wp10-scores-enwiki-20160820.tsv.bz2 | wc

bzcat: Compressed file ends unexpectedly;
	perhaps it is corrupted?  *Possible* reason follows.
bzcat: Success
	Input file = wp10-scores-enwiki-20160820.tsv.bz2, output file = (stdout)

It is possible that the compressed file(s) have become corrupted.
You can use the -tvv option to test integrity of such files.

You can use the `bzip2recover' program to attempt to recover
data from undamaged sections of corrupted files.

 664787 4353788 37785000

Re. the integer mapping, that's a fine point, but (1) that breaks the spec of the card and (2), the optimization seems premature. After all, we could always use an enum field when loading it into the database. See http://dev.mysql.com/doc/refman/5.7/en/enum.html

> It looks like there is a problem. There are only 664k lines in the file, and then bzip2 detects an error.

I recovered it, then decompressed and recompressed it. It works as expected: 5,220,219 lines, all of which except the first one start with a number. Uploading it right now.
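A quick way to double-check the recompressed file, in addition to the bzip2 -tvv integrity test mentioned above, is to stream-decompress it and count lines. A sketch using Python's bz2 module:

# Sketch: stream-decompress the .bz2 and count lines; expect 5,220,219.
import bz2

def count_lines(path):
    with bz2.open(path, "rt") as f:
        return sum(1 for _ in f)

count_lines("wp10-scores-enwiki-20160820.tsv.bz2")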

> Re. the integer mapping, that's a fine point, but (1) that breaks the spec of the card and (2), the optimization seems premature. After all, we could always use an enum field when loading it into the database. See http://dev.mysql.com/doc/refman/5.7/en/enum.html

In case of using ENUM, if we want to add a new class, we need to do a schema change. Correct me if I'm wrong.

>> Re. the integer mapping, that's a fine point, but (1) that breaks the spec of the card and (2), the optimization seems premature. After all, we could always use an enum field when loading it into the database. See http://dev.mysql.com/doc/refman/5.7/en/enum.html
>
> In case of using ENUM, if we want to add a new class, we need to do a schema change. Correct me if I'm wrong.

Yes, that's correct. Don't use ENUM unless you know what you're doing, and even then, don't use ENUM because there are other issues with it too. categorylinks.cl_type is an ENUM and it made me hate ENUMs forever, because of how MySQL behaves when you do things like ORDER BY cl_type or cl_type > 'X'.

I recommend using either numbers, or maybe short strings, but using short strings requires a schema change (once) at this point because the table has been created and populated already. If you're concerned about not being able to interpret the data from the DB without looking at the string<->number mapping in the config, then 1) welcome to MediaWiki namespaces :P and 2) you could consider adding a table that maps names to numerical IDs.
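Related: since the dump file currently carries the integer predictions, consumers need the reverse lookup somewhere. A minimal sketch (the 0-5 ordering below is assumed, not confirmed in this task):

# Sketch: map integer predictions in the (decompressed) TSV back to class labels.
# The index order is assumed (Stub = 0 ... FA = 5).
import csv

CLASSES = ['Stub', 'Start', 'C', 'B', 'GA', 'FA']

def read_scores(tsv_path):
    with open(tsv_path, newline='') as f:
        for row in csv.DictReader(f, delimiter='\t'):
            row['prediction'] = CLASSES[int(row['prediction'])]
            yield row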

FYI, @EBernhardson & @TJones. This dataset is ready for review. We generated it with the 20160820 XML dump on a single CPU core overnight. I think we shouldn't have much trouble keeping a dataset like this up to date. Let us know if you have any concerns about the format.

Thanks, @Ladsgroup & @Halfak! @dcausse should also be quite interested.

From our dataset:

predicted    count
Stub         2,385,285
Start        1,708,221
C              774,957
B              172,844
GA             145,058
FA              33,858

From https://en.wikipedia.org/wiki/Wikipedia:Version_1.0_Editorial_Team/Statistics

assessment    count
unassessed      509,462
Stub          2,893,923
Start         1,428,340
C               230,204
B               108,892
GA               27,118
FA                5,820

So glad to finally have these predictions. 5x more good articles than manually assessed. The AQ dataset is going to be a goldmine.

(145058 + 33858) / (27118 + 5820) = 5.43 X as many GA+ articles