
EPIC: Reduce the time needed to do the initial WDQS import
Open, Medium, Public

Description

As an operator of WDQS I want to be able to reload data in a reasonable time, so that I can react to potential issues in a timely manner.

At this point, the full data import takes more than a week, which means it is difficult to reload data when needed (issues with synchronization, reimaging of servers, ...).

Tracking task to collect all the efforts made in this direction.

Times of the past imports so that we can see the improvements:

| start | end | dump | node | munge time | import time | initial lag | time to catchup | triples imported from dumps |
| ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- |
| 2019-12-04 | 2020-01-01 | wikidata-20191202-all-BETA.ttl.bz2 | wdqs1010 | 22.85h[1] | 191h (8days) | 2 weeks | 358h (15days) | |
| 2020-03-10 | ?? | wikidata-20200302-all-BETA.ttl.bz2 | wdqs1010 | 19h 20' | ?? | ?? | ?? | |
| 2020-09-09 | ?? | wikidata-20200824-all-BETA.ttl.bz2 | wdqs1009 | 25h 30' | (start 2020-09-10T11:00) ?? | ?? | ?? | |
| 2021-10-01 | 2021-10-14 | wikidata-20210927-all-BETA.ttl.bz2 | wdqs2008 | 27h 00' | 11days | 2.2weeks | 22h[2] | |
| 2021-10-01 | 2021-10-14 | wikidata-20210927-all-BETA.ttl.bz2 | wdqs1009 | 29h 00' | 12days | 2.34weeks | 24h20[2] | |
| 2023-01-06 | 2023-01-20 | wikidata-20230102-all-BETA.ttl.bz2 | wdqs1009 | 31h 15' | failed 14 days later on wikidump-000001040 | failed | failed | |
| 2023-02-02 | 2023-02-20 | wikidata-20230130-all-BETA.ttl.bz2 + wikidata-20230203-lexemes-BETA.ttl.bz2 | wdqs1010 | 32h 00' | 13days + 18h | 4.5weeks[3] | 50h | 14.5B |
| 2023-10-25 | 2023-11-20 | wikidata-20231016-all-BETA.ttl.bz2 + wikidata-20231013-lexemes-BETA.ttl.bz2 | wdqs1022 | 3d | 22d, 13h | N/A | N/A | 15,320,277,615 |
| 2024-06-13 | 2024-06-22 | hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/full/20240603/ | wdqs2023[4] | 0[5] | 7days + 15h | 2.9weeks | 31h | 15.9B |

[1] Munge time improved to 12.18 hours in T238002.
[2] Using the streaming updater.
[3] A mistake was made when setting up the Kafka offsets manually; it should have been only 17 days to catch up.
[4] CPU performance governor enabled (T336443).
[5] Dumps are pre-munged in HDFS.
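
The import times above are recorded in mixed notations ("191h (8days)", "13days + 18h", "22d, 13h"), which makes them hard to compare at a glance. The following is a minimal sketch, not part of the WDQS tooling, that normalizes the import time column to hours. The function name `duration_to_hours` and the dictionary layout are hypothetical; the values are copied from the table, and rows with unknown or failed imports are left out.

```python
import re

# Hours per unit, covering the spellings used in the import time column.
_UNIT_HOURS = {"h": 1.0, "d": 24.0, "day": 24.0, "days": 24.0,
               "week": 168.0, "weeks": 168.0}

# One "<number><unit>" token, e.g. "13days", "18h", "22d".
_TOKEN = re.compile(r"(\d+(?:\.\d+)?)\s*(weeks?|days?|d|h)(?![a-z])", re.IGNORECASE)


def duration_to_hours(text):
    """Convert a duration such as '13days + 18h' to hours (hypothetical helper).

    Parenthesized restatements such as '(8days)' and footnote markers such as
    '[2]' are dropped first so that the same duration is not counted twice.
    Returns None when no duration token is found (for example 'N/A' or '??').
    """
    cleaned = re.sub(r"\([^)]*\)|\[\d+\]", " ", text)
    tokens = _TOKEN.findall(cleaned)
    if not tokens:
        return None
    return sum(float(number) * _UNIT_HOURS[unit.lower()] for number, unit in tokens)


# Import times copied from the table, keyed by (start date, node).
import_times = {
    ("2019-12-04", "wdqs1010"): "191h (8days)",
    ("2021-10-01", "wdqs2008"): "11days",
    ("2021-10-01", "wdqs1009"): "12days",
    ("2023-02-02", "wdqs1010"): "13days + 18h",
    ("2023-10-25", "wdqs1022"): "22d, 13h",
    ("2024-06-13", "wdqs2023"): "7days + 15h",
}

for (start, node), raw in import_times.items():
    print(f"{start} {node}: {duration_to_hours(raw):.0f}h")
```

The sketch only sums day, week, and hour tokens; minute notations that appear in other columns (such as "19h 20'") are outside its scope.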

H/W:

| host | CPU | mem | disk |
| ----- | ----- | ----- | ----- |
| wdqs1010, wdqs1009 | 2x E5-2620 v4 @ 2.10GHz | 128GB | 4x SSD 800GB in raid0 (md) |
| wdqs2008 | 2x Silver 4215 @ 2.50GHz | 128GB | 4x SSD 960GB in raid10 (md) |
| wdqs1022 | 2x Silver 4314 @ 2.40GHz | 128GB | 4x SSD 1.92TB in raid0 (md) spec |
| wdqs2023 | 2x Silver 4314 CPU @ 2.40GHz | 128GB | 4x SSD 1.92TB in raid0 (md) |

Various efforts:

Event Timeline

Restricted Application added a subscriber: Aklapper.
dcausse updated the task description.
Gehel added a subtask: Unknown Object (Task). Mar 18 2020, 1:12 PM
Gehel removed a subtask: Unknown Object (Task). Apr 20 2020, 3:11 PM
Gehel triaged this task as High priority. Sep 15 2020, 7:41 AM
Gehel lowered the priority of this task from High to Medium. Sep 30 2020, 1:52 PM
Gehel subscribed.

With our new streaming updater, the constraints are going to change. Let's revisit once the streaming updater is ready.

Gehel raised the priority of this task from Medium to High. Feb 19 2021, 10:04 AM
Gehel lowered the priority of this task from High to Medium. Jun 10 2021, 2:35 PM

@Gehel I see the comment above to review this after the streaming updater is launched. Is this still valid?

dcausse updated the task description.
dcausse updated the task description.