
Make sure Wikidata entity dump scripts run for only about 1-2 hours
Closed, Resolved · Public

Description

Make sure the Wikidata entity dump scripts run for only a limited time per invocation, by using the new features introduced in T177550: Allow dumping only entities in a certain page_id range in each maintenance script run.

Steps:

  • Find out how many entities/pages we need to include, so that the scripts run for about 1-2 hours. -> 400,000 per run (initially)
  • Wait for at least 1.31.0-wmf.28 (better: 1.31.0-wmf.29, so that we're safe from branch rollbacks) to be deployed.
  • Adapt the JSON dump bash scripts for this.
  • Adapt the RDF dump bash scripts for this; also use --part-id as needed (T185589). A sketch of such a call follows this list.
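
For the RDF dump scripts, a batched call might look roughly like the following (an illustration only, assuming the page-id range options from T177550 and the --part-id option from T185589; the exact flags and values used in production may differ):

php repo/maintenance/dumpRdf.php --wiki wikidatawiki --first-page-id 1 --last-page-id 400000 --part-id 0

Here the page-id range would advance by a fixed step on each run, with --part-id identifying the part being produced.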

Event Timeline

hoo triaged this task as High priority. · Mar 23 2018, 12:59 PM
hoo created this task.
  • 20180402 JSON dump: Each shard dumped about 7.70m entities in (very roughly) 40h.
  • 20180326 JSON dump: Each shard dumped about 7.65m entities in (very roughly) 35h.
  • 20180319 JSON dump: Each shard dumped about 7.63m entities in (very roughly) 34h.
  • 20180312 JSON dump: Each shard dumped about 7.57m entities in (very roughly) 40.5h.
  • 20180305 JSON dump: Each shard dumped about 7.57m entities in (very roughly) 50.5h.
  • 20180402 TTL dump: Each shard dumped about 8.01m entities in (very roughly) 49h.
  • 20180326 TTL dump: Each shard dumped about 7.95m entities in (very roughly) 35h.
  • 20180319 TTL dump: Each shard dumped about 7.91m entities in (very roughly) 45h.
  • 20180405 truthy-nt dump: Each shard dumped about 8.04m entities in (very roughly) 60h.
  • 20180328 truthy-nt dump: Each shard dumped about 7.96m entities in (very roughly) 47h.
  • 20180323 truthy-nt dump: Each shard dumped about 7.93m entities in (very roughly) 47h.

Velocity (taken as the average of the three runs listed above for each dump type):

  • JSON: 214k entities/hour
  • TTL: 189k entities/hour
  • truthy-nt: 157k entities/hour

Based on this, I suggest always running roughly 400k page ids per script run (considering that some page ids may be missing, some pages are in other namespaces, …).
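
As a rough sanity check against the velocities above (pessimistically assuming a 400k page-id range yields close to 400k entities; in practice it yields fewer, for the reasons just given):

  • JSON: 400,000 / 214,000 ≈ 1.9 h per batch
  • TTL: 400,000 / 189,000 ≈ 2.1 h per batch
  • truthy-nt: 400,000 / 157,000 ≈ 2.5 h per batch

So real batches should land near the 1-2 hour target.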

Looks like a good first estimate to me. Remember these things can always be tweaked later.

Per request, here's how to get the max page id: a boring Python script that uses the standard dumps mechanisms (config file processing, digging out the DB username and password) to get the value: P6974

I just noticed that we could also use:
php maintenance/sql.php --wiki wikidatawiki --json --query 'SELECT MAX(page_id) AS max_page_id FROM page' | grep max_page_id | grep -oP '\d+'

That's perhaps simpler for getting just this one piece of information.
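
For illustration only (the variable names below are placeholders, not taken from the actual dump scripts), the value could be captured in the wrapper script and used to work out how many batches are needed:

# Maximum page id on wikidatawiki, via the maintenance/sql.php call above.
max_page_id=$(php maintenance/sql.php --wiki wikidatawiki --json --query 'SELECT MAX(page_id) AS max_page_id FROM page' | grep max_page_id | grep -oP '\d+')

# Number of batches, rounded up; each batch covers 400,000 page ids per shard.
batches=$(( (max_page_id + 400000 * shards - 1) / (400000 * shards) ))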

The dump script calls will basically look like this soon:

php repo/maintenance/dumpJson.php --wiki wikidatawiki  --first-page-id `expr $i \* 400000 \* $shards + 1` --last-page-id `expr \( $i + 1 \) \* 400000 \* $shards`
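
A minimal sketch of the surrounding batch loop, purely for illustration (it assumes the max_page_id and shards variables from above; the $(( )) arithmetic is equivalent to the expr calls; shard selection and output handling are left out):

# Keep running 400,000-page-id-per-shard batches until the whole page-id range is covered.
i=0
while [ $(( i * 400000 * shards )) -lt "$max_page_id" ]; do
    php repo/maintenance/dumpJson.php --wiki wikidatawiki \
        --first-page-id $(( i * 400000 * shards + 1 )) \
        --last-page-id $(( (i + 1) * 400000 * shards ))
    i=$(( i + 1 ))
done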

Change 425926 had a related patch set uploaded (by Hoo man; owner: Hoo man):
[operations/puppet@production] [WIP] Wikidata JSON dump: Only dump batches of ~400,000 pages at once

https://gerrit.wikimedia.org/r/425926

Change 425926 merged by ArielGlenn:
[operations/puppet@production] Wikidata JSON dump: Only dump batches of ~400,000 pages at once

https://gerrit.wikimedia.org/r/425926

Change 430395 had a related patch set uploaded (by Hoo man; owner: Hoo man):
[operations/puppet@production] Wikidata entity dumps: Move generic parts into functions

https://gerrit.wikimedia.org/r/430395

Change 430585 had a related patch set uploaded (by Hoo man; owner: Hoo man):
[operations/puppet@production] Create RDF dumps in batches, not all at once

https://gerrit.wikimedia.org/r/430585

Change 430395 merged by ArielGlenn:
[operations/puppet@production] Wikidata entity dumps: Move generic parts into functions

https://gerrit.wikimedia.org/r/430395

Change 430585 merged by ArielGlenn:
[operations/puppet@production] Create RDF dumps in batches, not all at once

https://gerrit.wikimedia.org/r/430585

hoo removed a project: Patch-For-Review.
hoo updated the task description.
hoo moved this task from In Progress to Done on the Wikidata-Ministry-Of-Magic board.