As an operator of WDQS, I want to be able to reload data in a reasonable time, so that I can react to potential issues in a timely manner.
At this point, the full data import takes more than a week, which makes it difficult to reload data when needed (issues with synchronization, reimaging of servers, ...).
This is a tracking task to collect all the efforts made in this direction.
Timings of past imports, so that we can see the improvements:
| start | end | dump | node | munge time | import time | initial lag | time to catch up | triples imported from dump |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2019-12-04 | 2020-01-01 | wikidata-20191202-all-BETA.ttl.bz2 | wdqs1010 | 22.85h [1] | 191h (8 days) | 2 weeks | 358h (15 days) | |
| 2020-03-10 | ?? | wikidata-20200302-all-BETA.ttl.bz2 | wdqs1010 | 19h 20' | ?? | ?? | ?? | |
| 2020-09-09 | ?? | wikidata-20200824-all-BETA.ttl.bz2 | wdqs1009 | 25h 30' | ?? (started 2020-09-10T11:00) | ?? | ?? | |
| 2021-10-01 | 2021-10-14 | wikidata-20210927-all-BETA.ttl.bz2 | wdqs2008 | 27h 00' | 11 days | 2.2 weeks | 22h [2] | |
| 2021-10-01 | 2021-10-14 | wikidata-20210927-all-BETA.ttl.bz2 | wdqs1009 | 29h 00' | 12 days | 2.34 weeks | 24h 20' [2] | |
| 2023-01-06 | 2023-01-20 | wikidata-20230102-all-BETA.ttl.bz2 | wdqs1009 | 31h 15' | failed after 14 days on wikidump-000001040 | failed | failed | |
| 2023-02-02 | 2023-02-20 | wikidata-20230130-all-BETA.ttl.bz2 + wikidata-20230203-lexemes-BETA.ttl.bz2 | wdqs1010 | 32h 00' | 13 days + 18h | 4.5 weeks [3] | 50h | 14.5B |
| 2023-10-25 | 2023-11-20 | wikidata-20231016-all-BETA.ttl.bz2 + wikidata-20231013-lexemes-BETA.ttl.bz2 | wdqs1022 | 3 days | 22 days + 13h | N/A | N/A | 15,320,277,615 |
| 2024-06-13 | 2024-06-22 | hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/full/20240603/ | wdqs2023 [4] | 0 [5] | 7 days + 15h | 2.9 weeks | 31h | 15.9B |
[1] Munge times improved to 12.18 hours in T238002.
[2] Using the streaming updater.
[3] A mistake was made setting up Kafka offsets manually; catching up should have taken only 17 days.
[4] CPU performance governor enabled, see T336443.
[5] Dumps are pre-munged in HDFS.
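To make the runs easier to compare, the raw figures above can be normalized to triples per second. A quick sketch (the triple counts and durations are taken straight from the table; everything else is just unit conversion):

```python
# Rough throughput comparison of the last two full imports listed above.
# Figures come straight from the table; durations are converted to seconds.

def triples_per_second(triples, days, hours):
    """Average import throughput over the whole load."""
    seconds = (days * 24 + hours) * 3600
    return triples / seconds

# 2023-10 run on wdqs1022: 15,320,277,615 triples imported in 22 days + 13h
r2023 = triples_per_second(15_320_277_615, 22, 13)

# 2024-06 run on wdqs2023 (pre-munged HDFS dump): ~15.9B triples in 7 days + 15h
r2024 = triples_per_second(15_900_000_000, 7, 15)

print(f"2023 import: ~{r2023:,.0f} triples/s")  # ~7,866 triples/s
print(f"2024 import: ~{r2024:,.0f} triples/s")  # ~24,134 triples/s
print(f"speedup: ~{r2024 / r2023:.1f}x")        # ~3.1x
```

So skipping the on-host munge (pre-munging in HDFS) plus the newer hardware roughly tripled the effective import throughput.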
H/W:
| host | CPU | mem | disk |
| --- | --- | --- | --- |
| wdqs1010, wdqs1009 | 2x E5-2620 v4 @ 2.10GHz | 128 GB | 4x SSD 800 GB in raid0 (md) |
| wdqs2008 | 2x Silver 4215 @ 2.50GHz | 128 GB | 4x SSD 960 GB in raid10 (md) |
| wdqs1022 | 2x Silver 4314 @ 2.40GHz | 128 GB | 4x SSD 1.92 TB in raid0 (md) |
| wdqs2023 | 2x Silver 4314 @ 2.40GHz | 128 GB | 4x SSD 1.92 TB in raid0 (md) |
Various efforts:
- https://gerrit.wikimedia.org/r/c/wikidata/query/rdf/+/532373 (TODO: create a dedicated task for it with some explanation of the strategy)