Allow dumping only entities in a certain page_id range in each maintenance script run (days: 1)
Closed, Resolved · Public

Description

We should only dump up to N entities in each maintenance script run, and then start a new dumper instance at that offset.

This has several benefits:

  1. If a script fails, we only need to redo the last batch of N entities and not the whole thing.
  2. With some grace time, we can react cleanly when a DB or similar goes down or changes (and even without grace time, point 1 helps here).
  3. All shards will be about equally fast, because each will switch DB replicas and external storage replicas throughout the dump, so picking a slower one at some point has less effect.
  4. Memory leaks and other problems of long-running PHP processes with MediaWiki don't bite us as hard.

I suggest picking N so that a dumper runs for about 15-30 minutes before exiting and handing over to the next runner.
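
A minimal sketch of the batching idea described above, in Python rather than the PHP of the actual maintenance script. The names `run_in_batches` and `run_batch` are hypothetical: `run_batch(offset, limit)` stands in for one dumper instance, returns the number of entities it dumped, and raises `RuntimeError` on failure. Only the failed batch is retried, which is the point of benefit 1.

```python
from typing import Callable


def run_in_batches(
    run_batch: Callable[[int, int], int],
    batch_size: int,
    max_retries: int = 2,
) -> int:
    """Run batches until one comes back short; return the total dumped.

    run_batch is a hypothetical stand-in for starting one dumper
    instance at the given offset with the given limit.
    """
    offset = 0
    while True:
        for attempt in range(max_retries + 1):
            try:
                dumped = run_batch(offset, batch_size)
                break
            except RuntimeError:
                # Retry only this batch; earlier batches are already done.
                if attempt == max_retries:
                    raise
        offset += dumped
        if dumped < batch_size:
            # A short (or empty) batch means there is nothing left.
            return offset
```

A real runner would also have to discard any partial output of a failed batch before retrying it.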

One problem that needs solving, or that we at least need to be aware of: if a new Wikibase version gets deployed mid-dump, the serialization format might not be consistent within a single dump.

Event Timeline

hoo created this task. Oct 5 2017, 10:24 PM
Restricted Application added a subscriber: Aklapper. Oct 5 2017, 10:24 PM
hoo updated the task description. Oct 16 2017, 9:53 AM

How is this coming along?

hoo added a comment. Jan 16 2018, 4:35 PM

How is this coming along?

No work has been done on this yet, but it shouldn't be too hard.

At the very least we will need a --continue parameter for the dump maintenance script. We will also need to find a way to stop… keep dumping in batches of, say, 100,000 until the batches are empty? How would we know where to continue from? Grep this from the logs, or do we want a special "state file" that holds the entity id last dumped by that shard?
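
The "state file" option could look roughly like the sketch below. Everything here is an assumption for illustration: the file layout, the `last_dumped` key, and the function names are hypothetical and not taken from Wikibase. Each shard would have its own file, and the file is replaced atomically so that a crash mid-write does not leave a truncated state behind.

```python
import json
from pathlib import Path
from typing import Optional


def load_last_dumped(state_file: Path) -> Optional[str]:
    """Return the entity id last dumped by this shard, or None on a fresh start."""
    if not state_file.exists():
        return None
    return json.loads(state_file.read_text())["last_dumped"]


def save_last_dumped(state_file: Path, entity_id: str) -> None:
    """Record the last dumped entity id (hypothetical single-key JSON layout)."""
    tmp = state_file.with_name(state_file.name + ".tmp")
    tmp.write_text(json.dumps({"last_dumped": entity_id}))
    # Rename over the old file so readers never see a half-written state.
    tmp.replace(state_file)
```

Compared with grepping the logs, this keeps the continuation point in one well-defined place per shard, at the cost of one more file to clean up between dump runs.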

Can you pass the script a first and last entity id and get around the problem that way?

hoo added a comment. Jan 22 2018, 1:56 AM

Can you pass the script a first and last entity id and get around the problem that way?

I looked into this, but I don't think it is viable: we page by page id internally, and a higher entity id usually, but not always, implies a higher page id. Because of this, I chose to implement this using offsets.
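
A toy illustration of why an entity id range does not fit paging by page id. The rows are invented for this example, and `naive_entity_range` is a hypothetical implementation, not how the dumper works: it walks rows in page id order and stops at the first entity id above the requested maximum.

```python
# (page_id, numeric entity id) pairs, in page_id order as a pager would see them.
# Invented data: entity 5 got a lower page id than entities 3 and 4.
ROWS = [(10, 1), (11, 2), (12, 5), (13, 3), (14, 4), (15, 6)]


def naive_entity_range(rows, first, last):
    """Walk in page_id order and stop at the first entity id above `last`."""
    out = []
    for _page_id, entity_id in rows:
        if entity_id > last:
            break
        if entity_id >= first:
            out.append(entity_id)
    return out


def page_range(rows, first_page, last_page):
    """Select by the same key the pager is ordered by."""
    return [e for p, e in rows if first_page <= p <= last_page]


missed = naive_entity_range(ROWS, 1, 4)   # stops early at entity 5
part_a = page_range(ROWS, 10, 12)
part_b = page_range(ROWS, 13, 15)
```

With this data the naive entity id walk returns only entities 1 and 2 and never reaches 3 and 4, while two adjacent page id ranges cover every row exactly once.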

Change 405739 had a related patch set uploaded (by Hoo man; owner: Hoo man):
[mediawiki/extensions/Wikibase@master] Allow continuing Wikibase entity dumps

https://gerrit.wikimedia.org/r/405739

Change 405739 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@master] Allow continuing Wikibase entity dumps

https://gerrit.wikimedia.org/r/405739

WMDE-leszek renamed this task from Only dump up to N entities in each maintenance script run to Only dump up to N entities in each maintenance script run (days: 1). Jan 23 2018, 4:53 PM
hoo added a comment. Jan 23 2018, 6:09 PM

Potential blocker for doing this for the RDF dumps: T185589: Repeating blank node ids in Wikidata entity RDF dumps.

Lydia_Pintscher closed this task as Resolved. Jan 30 2018, 3:04 PM
hoo reopened this task as Open. Mar 15 2018, 10:19 PM

@ArielGlenn approached me about this, and we agreed that rather than saying we want to dump up to N entities, it would be better to specify a page_id range: dump all pages in the range [n, m] that match the current shard number.
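
A rough sketch of the page_id range approach, under stated assumptions. The function names are hypothetical, and the shard rule is passed in as a predicate because the thread does not say how a page is matched to a shard; the modulo rule in the usage note is only a placeholder.

```python
from typing import Callable, Iterable, Iterator, List, Tuple


def page_id_ranges(first: int, last: int, size: int) -> Iterator[Tuple[int, int]]:
    """Split [first, last] into consecutive inclusive ranges of at most `size` ids."""
    start = first
    while start <= last:
        end = min(start + size - 1, last)
        yield (start, end)
        start = end + 1


def pages_for_run(
    page_ids: Iterable[int],
    first: int,
    last: int,
    in_shard: Callable[[int], bool],
) -> List[int]:
    """Pages one script run would handle: inside [first, last] and in this shard."""
    return [p for p in page_ids if first <= p <= last and in_shard(p)]
```

For example, `list(page_id_ranges(1, 10, 4))` yields the ranges (1, 4), (5, 8) and (9, 10), and `pages_for_run(range(1, 11), 5, 8, lambda p: p % 2 == 0)` keeps pages 6 and 8. Because the ranges are fixed up front, a failed run can be repeated with the same [n, m] without tracking any offset.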

hoo renamed this task from Only dump up to N entities in each maintenance script run (days: 1) to Only dump up entities in a certain page_id range in each maintenance script run (days: 1). Mar 15 2018, 10:21 PM
hoo removed a project: Patch-For-Review.

Change 420629 had a related patch set uploaded (by Hoo man; owner: Hoo man):
[mediawiki/extensions/Wikibase@master] DumpEntities: Allow dumping a specific range of page ids

https://gerrit.wikimedia.org/r/420629

Change 420629 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@master] DumpEntities: Allow dumping a specific range of page ids

https://gerrit.wikimedia.org/r/420629

hoo closed this task as Resolved. Mar 23 2018, 12:46 PM
hoo claimed this task.
hoo renamed this task from Only dump up entities in a certain page_id range in each maintenance script run (days: 1) to Allow dumping only entities in a certain page_id range in each maintenance script run (days: 1). Mar 23 2018, 12:50 PM
hoo removed a project: Patch-For-Review.