**THIS IS A DRAFT**
The purpose of this task is to document observed Wikidata dump import characteristics on different hardware, using //Blazegraph//, the graph database employed by the Wikidata Query Service. A second goal is to understand how this performance compares with an import into //Amazon Neptune//, the managed graph database offering from Amazon Web Services ("AWS"). Systap, the company behind Blazegraph, was brought into Amazon several years ago.
**SUMMARY**
- A current state-of-the-art cloud compute instance approaches the performance characteristics of a 2018 gaming desktop and a 2019 MacBook Pro.
- NVMe drives confer a speed advantage for Blazegraph imports relative to SATA SSDs.
- CPU clock speed confers a speed advantage for Blazegraph imports.
- Amazon Neptune is capable of importing a full February 2024 Wikidata dump dramatically faster than a standalone cloud compute instance or a later-generation bare metal server in our data center. Instead of 15-25 days (bare metal in one of our data centers), it takes approximately 63 hours (2.625 days).
- There are cost considerations.
**NARRATIVE**
As noted in {T336443}, full Wikidata imports are slower on newer servers provisioned with some later-generation processors than on some older servers with older-generation processors. The latest full import appeared to take 25 days.
It's important to address this speed challenge to ensure that newer hardware can repopulate Blazegraph quickly in the event of a system failure. Faster imports also let us conduct experiments on graph splitting approaches more quickly. We have been targeting a load time below 10 days, which is achievable for graphs of about 7.6B triples but which we believe will likely be difficult to achieve for graphs of 10B triples or more on the newer nodes. Finally, in the event of a catastrophic system failure, we want to understand whether the current state of the art, as exhibited for example by Amazon Neptune, can provide adequate import capabilities.
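To make the 10-day target concrete, the following back-of-the-envelope sketch converts it into an average sustained load rate. It assumes a constant triples-per-second rate over the whole import, which real Blazegraph imports may not satisfy, and the function names are illustrative rather than part of any existing tooling. Only the 7.6B and 10B triple counts and the 10-day target come from the text above.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def required_rate(triples: float, days: float) -> float:
    """Average triples/second needed to load `triples` within `days`."""
    return triples / (days * SECONDS_PER_DAY)


def days_at_rate(triples: float, rate: float) -> float:
    """Days needed to load `triples` at a constant `rate` (triples/second)."""
    return triples / (rate * SECONDS_PER_DAY)


TARGET_DAYS = 10  # load-time target mentioned in the narrative

# Average rate needed to hit the target for each graph size.
rate_7_6b = required_rate(7.6e9, TARGET_DAYS)
rate_10b = required_rate(10e9, TARGET_DAYS)

# Assumption: a node that only just loads 7.6B triples in 10 days
# keeps the same average rate when loading a 10B-triple graph.
days_10b = days_at_rate(10e9, rate_7_6b)

print(f"7.6B triples in {TARGET_DAYS} days: {rate_7_6b:,.0f} triples/s")
print(f"10B triples in {TARGET_DAYS} days:  {rate_10b:,.0f} triples/s")
print(f"10B triples at the 7.6B rate: {days_10b:.1f} days")
```

Under these assumptions, a node that just meets the target for 7.6B triples would need roughly 13 days for 10B triples, so holding the 10-day target at that size requires about a third more sustained throughput.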
The following prior art is useful for understanding similar analyses:
- https://addshore.com/2021/02/testing-wdqs-blazegraph-data-load-performance/
- https://hal.science/hal-03132794/document
- https://wikidataworkshop.github.io/2022/papers/Wikidata_Workshop_2022_paper_4558.pdf
More to be ported from text files, Etherpad, etc.