
Make sure Wikidata entity dump scripts run for only about 1-2 hours
Closed, Resolved · Public

Description

Make sure Wikidata entity dump scripts run for only a limited time, by using the new features introduced in T177550: Allow dumping only entities in a certain page_id range in each maintenance script run (days: 1).

Steps:

  • Find out how many entities/pages we need to include, so that the scripts run for about 1-2 hours. -> 400,000 per run (initially)
  • Wait for at least 1.31.0-wmf.28, better 1.31.0-wmf.29 (so that we're safe from branch rollbacks) to be deployed.
  • Adapt the JSON dump bash-scripts for this.
  • Adapt the RDF dump bash-scripts for this, also using --part-id as needed (T185589); a rough sketch of such a call follows below.
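
As a rough illustration of the last two steps (my sketch, not the actual script change; the script name and the numbers are placeholder assumptions based on the flags referenced in T177550 and T185589), a single RDF batch run might end up looking roughly like:

php repo/maintenance/dumpRdf.php --wiki wikidatawiki --part-id 3 --first-page-id 1200001 --last-page-id 1600000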

Event Timeline

hoo triaged this task as High priority. · Mar 23 2018, 12:59 PM
hoo created this task.
Restricted Application added a subscriber: Aklapper. · Mar 23 2018, 12:59 PM
hoo updated the task description. · Apr 2 2018, 1:34 PM
hoo added a comment (edited). · Apr 10 2018, 8:27 AM
  • 20180402 JSON dump: Each shard dumped about 7.70m entities in (very roughly) 40h.
  • 20180326 JSON dump: Each shard dumped about 7.65m entities in (very roughly) 35h.
  • 20180319 JSON dump: Each shard dumped about 7.63m entities in (very roughly) 34h.
  • 20180312 JSON dump: Each shard dumped about 7.57m entities in (very roughly) 40.5h.
  • 20180305 JSON dump: Each shard dumped about 7.57m entities in (very roughly) 50.5h.
hoo added a comment. · Apr 10 2018, 8:35 AM
  • 20180402 TTL dump: Each shard dumped about 8.01m entities in (very roughly) 49h.
  • 20180326 TTL dump: Each shard dumped about 7.95m entities in (very roughly) 35h.
  • 20180319 TTL dump: Each shard dumped about 7.91m entities in (very roughly) 45h.
hoo added a comment (edited). · Apr 10 2018, 8:41 AM
  • 20180405 truthy-nt dump: Each shard dumped about 8.04m entities in (very roughly) 60h.
  • 20180328 truthy-nt dump: Each shard dumped about 7.96m entities in (very roughly) 47h.
  • 20180323 truthy-nt dump: Each shard dumped about 7.93m entities in (very roughly) 47h.
hoo added a comment (edited). · Apr 10 2018, 8:48 AM

Velocity (taken as average from the three runs listed above):

  • JSON: 214k entities/hour
  • TTL: 189k entities/hour
  • truthy-nt: 157k entities/hour

Based on this, I suggest always running roughly 400k page ids per script run (considering that some page ids may be missing, some pages are in other namespaces, …).
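
As a quick cross-check of that suggestion (my arithmetic, not part of the task; it assumes every page id in a batch corresponds to an entity, which overestimates the real work), dividing a 400k batch by the velocities above gives roughly 1.8 to 2.5 hours per run:

batch=400000
for rate in 214000 189000 157000; do  # JSON, TTL, truthy-nt entities/hour
    echo "scale=1; $batch / $rate" | bc  # prints 1.8, 2.1 and 2.5 (hours)
done

Since some page ids are missing and some pages sit in other namespaces, real runs should land comfortably inside the 1-2 hour target.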

hoo claimed this task. · Apr 10 2018, 8:49 AM
hoo moved this task from Tasks to In Progress on the Wikidata-Ministry-Of-Magic board.

Looks like a good first estimate to me. Remember these things can always be tweaked later.

hoo updated the task description. · Apr 10 2018, 9:59 AM

Per request, here's how to get the max page id: P6974, a boring Python script that uses the standard dumps mechanisms (config file processing, digging out the db username and password) to retrieve the value.

hoo added a comment. · Apr 10 2018, 11:01 AM

I just noticed that we could also use:
php maintenance/sql.php --wiki wikidatawiki --json --query 'SELECT MAX(page_id) AS max_page_id FROM page' | grep max_page_id | grep -oP '\d+'

That's maybe simpler for just getting this one bit of information.
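
For illustration only (my sketch; BATCH_SIZE, MAX_PAGE_ID and NUM_BATCHES are made-up names, not anything the dump scripts actually use), the result of that one-liner could feed a ceiling division to work out how many 400,000-page-id batches cover the whole wiki:

BATCH_SIZE=400000
MAX_PAGE_ID=$(php maintenance/sql.php --wiki wikidatawiki --json \
    --query 'SELECT MAX(page_id) AS max_page_id FROM page' | grep max_page_id | grep -oP '\d+')
NUM_BATCHES=$(( (MAX_PAGE_ID + BATCH_SIZE - 1) / BATCH_SIZE ))  # ceiling division
echo "$NUM_BATCHES batches of $BATCH_SIZE page ids"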

hoo added a comment. · Apr 10 2018, 11:29 AM

The dump script calls will basically look like this soon:

php repo/maintenance/dumpJson.php --wiki wikidatawiki  --first-page-id `expr $i \* 400000 \* $shards + 1` --last-page-id `expr \( $i + 1 \) \* 400000 \* $shards`
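
To sketch how such calls might be driven in a loop (an illustration, not the merged puppet change; the shard count and max page id below are placeholder assumptions):

shards=6              # assumption: number of parallel dump shards
max_page_id=55000000  # assumption: in practice obtained e.g. via maintenance/sql.php as above
i=0
while [ $(( i * 400000 * shards )) -lt "$max_page_id" ]; do
    first=$(( i * 400000 * shards + 1 ))
    last=$(( (i + 1) * 400000 * shards ))
    php repo/maintenance/dumpJson.php --wiki wikidatawiki --first-page-id "$first" --last-page-id "$last"
    i=$(( i + 1 ))
done

Each iteration spans 400000 * $shards page ids, i.e. about 400,000 per shard, which at the velocities measured above keeps each maintenance-script run within the 1-2 hour target.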
hoo updated the task description. · Apr 12 2018, 9:06 AM

Change 425926 had a related patch set uploaded (by Hoo man; owner: Hoo man):
[operations/puppet@production] [WIP] Wikidata JSON dump: Only dump batches of ~400,000 pages at once

https://gerrit.wikimedia.org/r/425926

Change 425926 merged by ArielGlenn:
[operations/puppet@production] Wikidata JSON dump: Only dump batches of ~400,000 pages at once

https://gerrit.wikimedia.org/r/425926

hoo updated the task description. · Apr 30 2018, 9:57 AM
hoo moved this task from Blocked to In Progress on the Wikidata-Ministry-Of-Magic board.

Change 430395 had a related patch set uploaded (by Hoo man; owner: Hoo man):
[operations/puppet@production] Wikidata entity dumps: Move generic parts into functions

https://gerrit.wikimedia.org/r/430395

Change 430585 had a related patch set uploaded (by Hoo man; owner: Hoo man):
[operations/puppet@production] Create RDF dumps in batches, not all at once

https://gerrit.wikimedia.org/r/430585

Change 430395 merged by ArielGlenn:
[operations/puppet@production] Wikidata entity dumps: Move generic parts into functions

https://gerrit.wikimedia.org/r/430395

Change 430585 merged by ArielGlenn:
[operations/puppet@production] Create RDF dumps in batches, not all at once

https://gerrit.wikimedia.org/r/430585

hoo closed this task as Resolved. · May 3 2018, 1:20 PM
hoo removed a project: Patch-For-Review.
hoo updated the task description.
hoo moved this task from In Progress to Done on the Wikidata-Ministry-Of-Magic board.