As of today, the data-reload cookbook performs several tasks on the wdqs host being reloaded:
- copy the dumps from the snapshot machines to a local folder
- munge
- import into blazegraph
It would be interesting to have a more flexible process that sources its data from hdfs/hive directly (or indirectly via swift?) so that we could reuse the data computed by some jobs running in hadoop (munging, graph splitting).
It is not yet clear how to achieve this precisely, but the goal would be a set of tools that, given a wdqs host, a target blazegraph port/namespace, and a hive partition for the source data, can schedule a data-reload.
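As a rough illustration of that interface (parameter names and defaults are assumptions, not a final design, and the real cookbook would follow the existing spicerack cookbook conventions), the entry point could look like:

```python
# Hypothetical argument set for the new data-reload cookbook.
# Names and defaults are illustrative only.
import argparse


def argument_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        description="Reload a wdqs host from triples stored in a hive partition")
    parser.add_argument("host", help="wdqs host to reload")
    parser.add_argument("--blazegraph-port", type=int, default=9999,
                        help="port of the blazegraph instance to load into")
    parser.add_argument("--namespace", default="wdq",
                        help="blazegraph namespace to load into")
    parser.add_argument("--hive-partition",
                        help="hive table/partition holding the pre-munged triples")
    return parser
```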
Design:
The data-reload cookbook will be adapted to do the following steps:
- From a stat machine: generate (or re-use) a .nt dataset in hdfs using org.wikidata.query.rdf.spark.transform.structureddata.dumps.NTripleGenerator on an existing hive table containing triples
- From a stat machine: copy the generated partitions to a local folder (possibly renaming them to match what loadData.sh expects)
- Transfer the files to the destination wdqs host (see the copy/transfer sketch after this list)
- Run the data-reload skipping the munge step
- Determine the kafka start timestamp by sending a SPARQL query (see the query sketch after this list)
- Restart the updater
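A minimal sketch of the copy/rename/transfer steps, assuming the partition is written as part files that need to be renamed to the sequential naming loadData.sh expects (the hdfs path, file naming pattern, and destination directory are all assumptions to be confirmed; only `hdfs dfs -get` and `rsync` are relied upon):

```python
# Sketch: fetch the generated partition from hdfs, rename for loadData.sh,
# and ship it to the wdqs host. All paths and names are placeholders.
import subprocess
from pathlib import Path

HDFS_PATH = "hdfs:///user/analytics/wikidata/nt_dump/snapshot=2021-03-01"  # hypothetical
LOCAL_DIR = Path("/srv/nt-dump")                                           # hypothetical
WDQS_HOST = "wdqs1004.eqiad.wmnet"                                         # hypothetical


def fetch_partition() -> None:
    """Copy the generated partition files from hdfs to a local folder."""
    LOCAL_DIR.mkdir(parents=True, exist_ok=True)
    subprocess.run(["hdfs", "dfs", "-get", HDFS_PATH + "/*", str(LOCAL_DIR)], check=True)


def rename_for_loaddata() -> None:
    """Rename part files to sequentially numbered files; the exact pattern
    loadData.sh expects must be checked against the script itself."""
    for i, part in enumerate(sorted(LOCAL_DIR.glob("part-*"))):
        part.rename(LOCAL_DIR / f"wikidump-{i:09d}.ttl.gz")


def transfer_to_host() -> None:
    """Ship the renamed files to the destination wdqs host."""
    subprocess.run(
        ["rsync", "-av", f"{LOCAL_DIR}/", f"{WDQS_HOST}:/srv/wdqs/munged/"],
        check=True)
```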
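For the kafka start timestamp, a sketch of the SPARQL probe against the freshly loaded namespace. It assumes the loaded triples carry the dump date as schema:dateModified on wikibase:Dump (as written by the munger and normally used by the updater to pick its start time); the endpoint URL, port, and namespace are also placeholders:

```python
# Sketch: read the dump timestamp from the loaded namespace so the updater
# can be started from the matching kafka offset. The wikibase:Dump /
# schema:dateModified triple and the endpoint URL are assumptions.
import requests

SPARQL_ENDPOINT = "http://localhost:9999/bigdata/namespace/wdq/sparql"  # hypothetical

QUERY = """
PREFIX wikibase: <http://wikiba.se/ontology#>
PREFIX schema: <http://schema.org/>
SELECT ?date WHERE { wikibase:Dump schema:dateModified ?date } ORDER BY ?date LIMIT 1
"""


def dump_timestamp() -> str:
    """Return the oldest dateModified recorded for the dump node."""
    resp = requests.get(
        SPARQL_ENDPOINT,
        params={"query": QUERY},
        headers={"Accept": "application/sparql-results+json"},
        timeout=60,
    )
    resp.raise_for_status()
    bindings = resp.json()["results"]["bindings"]
    return bindings[0]["date"]["value"]
```

Taking the oldest date is the conservative choice: starting the updater slightly earlier than needed only causes redundant updates, while starting too late would lose edits.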
AC:
- the system is designed and this phab task updated
- a wdqs host can be loaded using the triples stored in a hive partition
- the load process can be resumed if it fails (unless blazegraph has corrupted its journal)
- the loading time must be shorter than the classic data-reload (roughly one day less, since the munge step is skipped)