Allow dumping only entities in a certain page_id range in each maintenance script run (days: 1)
Closed, Resolved, Public

Description

We should only dump up to N entities in each maintenance script run, and then start a new dumper instance at that offset.

This has several benefits:

1. If a script fails, we only need to redo the last batch of N entities, not the whole dump.
2. We can (given some grace time) react gracefully if a DB etc. goes down or changes (and even if no grace time is given, point 1 helps here).
3. All shards will be about equally fast, because they will switch DB replicas and external storage replicas throughout, so picking a slower one at some point doesn't have as much effect.
4. Memory leaks and other problems of long-running PHP processes with MediaWiki don't bite us as hard.
…

I suggest picking N so that a dumper runs for about 15-30 minutes before exiting and handing over to the next runner.
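As a rough illustration of how N could be derived from a target runtime, here is a minimal sketch. The throughput figure is a made-up placeholder, not a measurement of the real dumpers, and `batch_size_for_runtime` is a hypothetical helper, not part of Wikibase.

```python
import math


def batch_size_for_runtime(entities_per_second, target_minutes):
    """Return N so that one run lasts roughly target_minutes at the given rate."""
    if entities_per_second <= 0 or target_minutes <= 0:
        raise ValueError("rate and target runtime must be positive")
    return math.floor(entities_per_second * target_minutes * 60)


# Hypothetical throughput of 100 entities per second, for illustration only.
low = batch_size_for_runtime(100, 15)   # N for a 15 minute run
high = batch_size_for_runtime(100, 30)  # N for a 30 minute run
```

The real rate would have to be measured per shard, and it will vary with the replicas in use, so N can only hit the 15-30 minute window approximately.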

One problem that needs solving, or that we at least need to be aware of: if a new Wikibase version gets deployed mid-dump, the serialization format might not be consistent within a single dump.

Details

	Subject	Repo	Branch	Lines +/-
	DumpEntities: Allow dumping a specific range of page ids	mediawiki/extensions/Wikibase	master	+84 -7
	Allow continuing Wikibase entity dumps	mediawiki/extensions/Wikibase	master	+63 -12


Related Objects

Status	Assigned	Task
Open	None	T88728 Improve Wikimedia dumping infrastructure
Open	None	T88991 improve Wikidata dumps [tracking]
Resolved	hoo	T177550 Allow dumping only entities in a certain page_id range in each maintenance script run (days: 1)

Event Timeline

hoo created this task. Oct 5 2017, 10:24 PM

Restricted Application added a subscriber: Aklapper. Oct 5 2017, 10:24 PM

hoo mentioned this in T177486: [Tracking] Wikidata entity dumpers need to cope with the immense Wikidata growth recently. Oct 5 2017, 10:25 PM

hoo updated the task description. Oct 16 2017, 9:53 AM

Liuxinyu970226 subscribed. Oct 26 2017, 3:42 AM

How is this coming along?

In T177550#3901012, @ArielGlenn wrote:

How is this coming along?

No work has been done on this yet, but it shouldn't be too hard.

At the very least we will need a --continue parameter for the dump maintenance script. We will also need to find a way to stop… keep dumping in batches of, say, 100,000 until the batches are empty? How would we know where to continue from? Grep this from the logs, or do we want a special "state file" for this that holds the entity id last dumped by that shard?
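To make the "state file" idea concrete, here is a minimal sketch in Python rather than the PHP of the real maintenance script. All names (`dump_one_batch`, `make_fake_source`, the JSON layout with `last_id`) are hypothetical, and the in-memory source stands in for the database. Each call to `dump_one_batch` models one short-lived dumper run that resumes from the state file; the outer loop stops once a batch comes back empty.

```python
import bisect
import json
import os
import tempfile


def load_last_id(state_path):
    """Return the last dumped id from the state file, or None on a first run."""
    try:
        with open(state_path, encoding="utf-8") as f:
            return json.load(f)["last_id"]
    except FileNotFoundError:
        return None


def save_last_id(state_path, last_id):
    """Write to a temp file and rename it, so readers never see a partial file."""
    tmp_path = state_path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as f:
        json.dump({"last_id": last_id}, f)
    os.replace(tmp_path, state_path)


def dump_one_batch(fetch_batch, state_path, batch_size, sink):
    """Model one dumper run: resume after the stored id, dump one batch, record progress."""
    last_id = load_last_id(state_path)
    batch = fetch_batch(last_id, batch_size)
    if batch:
        sink.extend(batch)
        save_last_id(state_path, batch[-1])
    return len(batch)


def make_fake_source(ids):
    """Hypothetical stand-in for the database: ids are served in ascending order."""
    ordered = sorted(ids)

    def fetch_batch(after_id, limit):
        start = 0 if after_id is None else bisect.bisect_right(ordered, after_id)
        return ordered[start:start + limit]

    return fetch_batch


with tempfile.TemporaryDirectory() as tmp_dir:
    state_path = os.path.join(tmp_dir, "shard0.state.json")
    fetch_batch = make_fake_source(range(1, 11))
    dumped = []
    runs = 0
    # Keep starting new "runs" until a run finds nothing left to dump.
    while dump_one_batch(fetch_batch, state_path, 4, dumped) > 0:
        runs += 1
    final_state = load_last_id(state_path)
```

In this sketch the state is saved only after a batch has been written, so a crash in between means the last batch is redone rather than skipped.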

Can you pass the script a first and last entity id and get around the problem that way?

In T177550#3904087, @ArielGlenn wrote:

Can you pass the script a first and last entity id and get around the problem that way?

I looked into this, but I don't think it is viable: we use the page id for paging internally, and a higher entity id doesn't imply a higher page id (it does most of the time, but not always). Because of this I chose to implement this using offsets.
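A small sketch of why entity id bounds do not combine well with paging by page id. The rows below are invented for illustration (the real mapping lives in the database), and `scan_until_last_entity` is a hypothetical naive pager, not the Wikibase one.

```python
# Hypothetical (page_id, entity_id) rows; Q7 got a higher page id than Q9.
rows = [(101, "Q5"), (102, "Q6"), (103, "Q9"), (104, "Q7"), (105, "Q10")]


def numeric(entity_id):
    """Numeric part of an id such as 'Q7'."""
    return int(entity_id[1:])


def scan_until_last_entity(rows, first, last):
    """Naive approach: walk in page id order and stop at the first entity id above `last`."""
    result = []
    for _, entity_id in sorted(rows):
        n = numeric(entity_id)
        if n > last:
            break
        if n >= first:
            result.append(entity_id)
    return result


# What we actually want for the entity id range Q5..Q8.
wanted = [e for _, e in sorted(rows) if 5 <= numeric(e) <= 8]
# What the naive page-ordered scan returns for the same bounds.
got = scan_until_last_entity(rows, 5, 8)
missed = [e for e in wanted if e not in got]
```

Here the scan stops at Q9 (page 103) and never reaches Q7 (page 104), so Q7 is silently missing from the batch.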

Change 405739 had a related patch set uploaded (by Hoo man; owner: Hoo man):
[mediawiki/extensions/Wikibase@master] Allow continuing Wikibase entity dumps

https://gerrit.wikimedia.org/r/405739

gerritbot added a project: Patch-For-Review. Jan 22 2018, 5:47 PM

hoo added a project: Wikidata-Sprint-2018-01-17. Jan 22 2018, 5:51 PM

hoo moved this task from Backlog to Review on the Wikidata-Sprint-2018-01-17 board. Jan 22 2018, 6:35 PM

Lucas_Werkmeister_WMDE subscribed. Jan 23 2018, 11:57 AM

Change 405739 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@master] Allow continuing Wikibase entity dumps

https://gerrit.wikimedia.org/r/405739

ReleaseTaggerBot added a project: MW-1.31-release-notes (WMF-deploy-2018-02-06 (1.31.0-wmf.20)). Jan 23 2018, 4:00 PM

WMDE-leszek renamed this task from Only dump up to N entities in each maintenance script run to Only dump up to N entities in each maintenance script run (days: 1). Jan 23 2018, 4:53 PM

Ladsgroup moved this task from Review to Done on the Wikidata-Sprint-2018-01-17 board. Jan 23 2018, 6:07 PM

Potential blocker for doing this for the RDF dumps: T185589: Repeating blank node ids in Wikidata entity RDF dumps.

Lydia_Pintscher moved this task from incoming to in progress on the Wikidata board. Jan 30 2018, 2:33 PM

Lydia_Pintscher closed this task as Resolved. Jan 30 2018, 3:04 PM

Liuxinyu970226 unsubscribed. Feb 1 2018, 11:54 AM

@ArielGlenn approached me about this, and we agreed that it would be better not to say we want to dump up to N entities, but rather to specify a page_id range. That is, we say that we want to dump all pages in the range [n, m] that match the current shard number.
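A minimal sketch of the page_id range idea. It assumes that a page belongs to a shard when `page_id % sharding_factor == shard`; that rule, and the helper names `split_ranges` and `page_ids_for_run`, are assumptions made for illustration and may not match what the maintenance script does.

```python
def split_ranges(first_page_id, max_page_id, range_size):
    """Yield consecutive inclusive (n, m) page id ranges covering first..max."""
    if range_size <= 0:
        raise ValueError("range_size must be positive")
    n = first_page_id
    while n <= max_page_id:
        m = min(n + range_size - 1, max_page_id)
        yield (n, m)
        n = m + 1


def page_ids_for_run(n, m, shard, sharding_factor):
    """Page ids in [n, m] handled by this shard (assumed modulo-based sharding)."""
    if not 0 <= shard < sharding_factor:
        raise ValueError("shard must satisfy 0 <= shard < sharding_factor")
    return [p for p in range(n, m + 1) if p % sharding_factor == shard]


ranges = list(split_ranges(1, 10, 4))
shard_one = page_ids_for_run(1, 10, shard=1, sharding_factor=4)

# Every page id should be picked up by exactly one (range, shard) combination.
covered = sorted(
    p
    for (n, m) in ranges
    for shard in range(4)
    for p in page_ids_for_run(n, m, shard, 4)
)
```

Because the ranges are fixed up front, a failed run can be repeated with the same [n, m] without any stored state, and not every page id in a range needs to exist or hold an entity.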

hoo renamed this task from Only dump up to N entities in each maintenance script run (days: 1) to Only dump up entities in a certain page_id range in each maintenance script run (days: 1). Mar 15 2018, 10:21 PM

hoo removed a project: Patch-For-Review.

hoo added a project: Wikidata-Ministry-Of-Magic. Mar 16 2018, 11:53 AM

hoo moved this task from Tasks to In Progress on the Wikidata-Ministry-Of-Magic board.

Change 420629 had a related patch set uploaded (by Hoo man; owner: Hoo man):
[mediawiki/extensions/Wikibase@master] DumpEntities: Allow dumping a specific range of page ids

https://gerrit.wikimedia.org/r/420629

gerritbot added a project: Patch-For-Review. Mar 20 2018, 2:03 AM

hoo moved this task from In Progress to Needs Review on the Wikidata-Ministry-Of-Magic board. Mar 20 2018, 2:11 AM

RazShuty moved this task from Needs Review to In Progress on the Wikidata-Ministry-Of-Magic board. Mar 21 2018, 11:04 AM

hoo moved this task from In Progress to Needs Review on the Wikidata-Ministry-Of-Magic board. Mar 22 2018, 2:22 AM

Ladsgroup moved this task from Needs Review to Done on the Wikidata-Ministry-Of-Magic board. Mar 23 2018, 11:24 AM

Change 420629 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@master] DumpEntities: Allow dumping a specific range of page ids

https://gerrit.wikimedia.org/r/420629

hoo closed this task as Resolved. Mar 23 2018, 12:46 PM

hoo claimed this task.

hoo renamed this task from Only dump up entities in a certain page_id range in each maintenance script run (days: 1) to Allow dumping only entities in a certain page_id range in each maintenance script run (days: 1). Mar 23 2018, 12:50 PM

hoo removed a project: Patch-For-Review.

hoo mentioned this in T190513: Make sure Wikidata entity dump scripts run for only about 1-2hours. Mar 23 2018, 12:59 PM

ReleaseTaggerBot edited projects, added MW-1.31-release-notes (WMF-deploy-2018-03-27 (1.31.0-wmf.27)); removed MW-1.31-release-notes (WMF-deploy-2018-02-06 (1.31.0-wmf.20)). Mar 23 2018, 1:00 PM

ArielGlenn moved this task from Backlog to Done on the Datasets-General-or-Unknown board. Jul 2 2018, 12:38 PM
