Generate monthly article quality dataset
Closed, Resolved · Public
Assigned To

Authored By

	Halfak
	Sep 14 2016, 3:33 PM

Description

enwiki
frwiki
ruwiki

Related Objects

Mentioned In: T146718: [Discuss] Hosting the monthly article quality dataset on labsDB
T146284: Generate a monthly pageviews dataset

Event Timeline

Halfak created this task. Sep 14 2016, 3:33 PM

Restricted Application added a subscriber: Aklapper. Sep 14 2016, 3:33 PM

https://github.com/wiki-ai/wikiclass/pull/28

Checking in. I wish that I hadn't turned on verbose mode (which is way too verbose) for our monthly article quality extraction process. Without it, I'd be able to look at the INFO log lines to see how we're progressing on processing dump files.
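One way to keep progress lines readable even when verbose output is on is to send them through a dedicated logger with its own handler. The following is only a minimal sketch using Python's standard `logging` module. The logger name `extractor.progress` and the function `make_progress_logger` are hypothetical and are not taken from the extraction code.

```python
import io
import logging


def make_progress_logger(stream, name="extractor.progress"):
    """Return a logger that writes only INFO-and-above records to `stream`.

    The logger name is a hypothetical placeholder, not one used by the
    actual extraction process.
    """
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    # Do not pass records up to the root logger, so progress lines stay
    # separate from whatever verbose output the root handlers produce.
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    # Replace any handlers left from an earlier call with the same name.
    logger.handlers = [handler]
    return logger


if __name__ == "__main__":
    buffer = io.StringIO()
    progress = make_progress_logger(buffer)
    progress.debug("per-revision detail")     # dropped: below INFO
    progress.info("finished one dump file")   # kept
    print(buffer.getvalue(), end="")
```

In a real run the stream would more likely be an open log file than an in-memory buffer, so progress could be followed with `tail -f` while the verbose output goes elsewhere.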

Right now, I can only say that we've got 291M article quality assessments. We might end up with 360M if my conservative estimate is about right.

Halfak mentioned this in T146284: Generate a monthly pageviews dataset. Sep 21 2016, 3:26 PM

Stat1003 got a reboot, so I'm trying to pick up where I left off.

English is done. See https://datasets.wikimedia.org/public-datasets/enwiki/article_quality/enwiki-20160801.wp10.monthly.tsv.bz2
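For anyone who wants to check a downloaded copy of a file like this, here is a minimal sketch that streams a bz2-compressed TSV and counts its rows without decompressing it to disk. It assumes the first line is a header row, which this task does not confirm, and it makes no assumption about what the columns are. The function name `count_rows` is hypothetical.

```python
import bz2
import csv


def count_rows(path):
    """Return (header, row_count) for a bz2-compressed TSV file.

    Assumption: the first line of the file is a header row. If it is
    not, add 1 to the count and treat `header` as the first data row.
    """
    with bz2.open(path, mode="rt", encoding="utf-8", newline="") as f:
        # QUOTE_NONE treats quote characters as ordinary text.
        reader = csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
        header = next(reader, None)
        # Iterate lazily so the whole file never has to fit in memory.
        row_count = sum(1 for _ in reader)
    return header, row_count
```

Usage would look like `header, n = count_rows("enwiki-20160801.wp10.monthly.tsv.bz2")` on a local copy of the file.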

I just started up the frwiki extractor.

Stub on figshare: https://figshare.com/account/projects/16182/articles/3859800

Halfak mentioned this in T146718: [Discuss] Hosting the monthly article quality dataset on labsDB. Sep 26 2016, 11:24 PM

I just uploaded the cleaned and compressed enwiki dataset to figshare.

frwiki is up to 66,833,544 article/month assessments.

OK. Done with French. Starting up Russian.

frwiki dataset uploaded to https://figshare.com/articles/Monthly_Wikipedia_article_quality_predictions/3859800

All datasets are here: https://datasets.wikimedia.org/public-datasets/all/wp10/20160801/

I'm traveling so it's hard to upload to figshare. I'll do that upload when I'm on a better connection.

Halfak closed this task as Resolved. Oct 11 2016, 11:51 PM
