
EPIC: Reduce the time needed to do the initial WDQS import
Open, Medium, Public


As an operator of WDQS I want to be able to reload data in a reasonable time, so that I can react to potential issues in a timely manner.

At this point, the full data import takes more than a week, which means it is difficult to reload data when needed (issues with synchronization, reimaging of servers, ...).

Tracking task to collect all the efforts made in this direction.

Times of the past imports so that we can see the improvements:

| start | end | dump | node | munge time | import time | initial lag | time to catch up |
| ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- |
| 2019-12-04 | 2020-01-01 | wikidata-20191202-all-BETA.ttl.bz2 | wdqs1010 | 22.85h[1] | 191h (8 days) | 2 weeks | 358h (15 days) |
| 2020-03-10 | ?? | wikidata-20200302-all-BETA.ttl.bz2 | wdqs1010 | 19h 20' | ?? | ?? | ?? |
| 2020-09-09 | ?? | wikidata-20200824-all-BETA.ttl.bz2 | wdqs1009 | 25h 30' | (start 2020-09-10T11:00) ?? | ?? | ?? |
| 2021-10-01 | 2021-10-14 | wikidata-20210927-all-BETA.ttl.bz2 | wdqs2008 | 27h 00' | 11 days | 2.2 weeks | 22h[2] |
| 2021-10-01 | 2021-10-14 | wikidata-20210927-all-BETA.ttl.bz2 | wdqs1009 | 29h 00' | 12 days | 2.34 weeks | 24h 20'[2] |

[1] munge times improved to 12.18hours in T238002
[2] using the streaming updater
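
The table mixes hours, days and weeks, which makes rows hard to compare at a glance. A minimal sketch for normalizing those values to hours follows; the helper name `to_hours` and the set of accepted notations are assumptions made for illustration, not part of any WDQS tooling.

```python
import re

# Hours per unit for the duration notations that appear in the table.
_UNIT_HOURS = {"h": 1.0, "day": 24.0, "days": 24.0, "week": 168.0, "weeks": 168.0}

# A number, a unit, and optionally trailing minutes (e.g. "19h 20'" or "24h20").
_DURATION = re.compile(r"(\d+(?:\.\d+)?)\s*(h|days?|weeks?)\s*(?:(\d+)'?)?")


def to_hours(text):
    """Convert a table cell such as "19h 20'", "11days" or "2.2weeks" to hours.

    Returns None when the cell holds no usable duration (for example "??").
    """
    cleaned = re.sub(r"\[\d+\]", "", text)      # drop footnote markers like [1]
    cleaned = re.sub(r"\(.*?\)", "", cleaned)   # drop parentheticals like (8days)
    match = _DURATION.fullmatch(cleaned.strip())
    if match is None:
        return None
    value, unit, minutes = match.groups()
    if minutes is not None and unit != "h":
        return None                             # trailing minutes only make sense after hours
    hours = float(value) * _UNIT_HOURS[unit]
    if minutes is not None:
        hours += int(minutes) / 60
    return hours
```

With this helper, `to_hours("358h (15days)")` gives 358.0 and `to_hours("22h[2]")` gives 22.0, so the catch-up times of the 2019 and 2021 imports can be compared directly in one unit.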

Various efforts:

Event Timeline

Restricted Application added a subscriber: Aklapper.
dcausse updated the task description.
Gehel added a subtask: Unknown Object (Task). Mar 18 2020, 1:12 PM
Gehel removed a subtask: Unknown Object (Task). Apr 20 2020, 3:11 PM
Gehel triaged this task as High priority. Sep 15 2020, 7:41 AM
Gehel lowered the priority of this task from High to Medium. Sep 30 2020, 1:52 PM
Gehel added a subscriber: Gehel.

With our new streaming updater, the constraints are going to change. Let's revisit once the streaming updater is ready.

Gehel raised the priority of this task from Medium to High. Feb 19 2021, 10:04 AM
Gehel lowered the priority of this task from High to Medium. Jun 10 2021, 2:35 PM

@Gehel I see the comment above to review this after the streaming updater is launched. Is this still valid?