As an operator of WDQS, I want to be able to reload data in a reasonable time so that I can react to potential issues in a timely manner.
At this point, the full data import takes more than a week, which means it is difficult to reload data when needed (issues with synchronization, reimaging of servers, ...).
This is a tracking task to collect all the efforts made in this direction.
Times of past imports, so that we can see the improvements:
| start | end | dump | node | munge time | import time | initial lag | time to catch up |
| ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- |
| 2019-12-04 | 2020-01-01 | wikidata-20191202-all-BETA.ttl.bz2 | wdqs1010 | 22.85h[1] | 191h (8 days) | 2 weeks | 358h (15 days) |
| 2020-03-10 | ?? | wikidata-20200302-all-BETA.ttl.bz2 | wdqs1010 | 19h 20' | ?? | ?? | ?? |
| 2020-09-09 | ?? | wikidata-20200824-all-BETA.ttl.bz2 | wdqs1009 | 25h 30' | (start 2020-09-10T11:00) ?? | ?? | ?? |
[1] Munge time improved to 12.18 hours in T238002.
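
As a rough cross-check of the only complete row in the table (the 2019-12-04 import), the sketch below converts each phase from hours to days and sums them. It assumes the munge, import and catch-up phases ran strictly one after another, which this task does not state, and the names `PHASES_HOURS`, `hours_to_days` and `total_hours` are hypothetical helpers made up for illustration.

```python
# Durations copied from the 2019-12-04 row of the table, in hours.
PHASES_HOURS = {
    "munge": 22.85,
    "import": 191.0,
    "catch-up": 358.0,
}


def hours_to_days(hours: float) -> float:
    """Convert a duration in hours to days."""
    return hours / 24.0


def total_hours(phases: dict[str, float]) -> float:
    """Sum the per-phase durations (assumption: phases are sequential)."""
    return sum(phases.values())


if __name__ == "__main__":
    # Print each phase in hours and days, then the overall total.
    for name, hours in PHASES_HOURS.items():
        print(f"{name}: {hours:.2f}h = {hours_to_days(hours):.1f} days")
    total = total_hours(PHASES_HOURS)
    print(f"total: {total:.2f}h = {hours_to_days(total):.1f} days")
```

Under that assumption the three phases add up to a bit more than three weeks, which fits within the four weeks between the start and end dates of that row.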
Various efforts:
- https://gerrit.wikimedia.org/r/c/wikidata/query/rdf/+/532373 (TODO: create a dedicated task for it with some explanation of the strategy)