
Design the bulk ingestion/indexing pipeline for the new triplestore
Open, Needs Triage, Public · 5 Estimated Story Points

Description

The WDP team will consult with the DPE team to get feedback on the best way to set up the bulk data ingestion we need.

We need this for a few reasons:

  • reindexing should improve the performance of the backend
  • automatic reconciliation: occasionally update events don't make it all the way through the Kafka streaming-update flow and need to go through a fairly complex reconciliation flow to eventually be reflected in the triplestore; bulk data loading could be an easier way to reach the same state
    • The current setup here works; it is just more complex than we may need.
  • manual reconciliation: it also happens occasionally that updates never even hit the Kafka flow; in this case, we need to reconcile them manually after a user notices. This is not a good user experience, and the problem would be solved entirely by bulk data reloads.

We currently have a data pipeline that generates an indexable dataset. There is, however, no automated way to move the data from HDFS to the WDQS nodes where the triplestore lives; today an SRE has to do this manually (sketched below).
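
For reference, a rough Python sketch of what that manual transfer-and-load step amounts to. The HDFS path, local directory, and loader command are placeholders, not the actual production names; pinning those down is part of this design work.

```
# Rough sketch of the manual transfer-and-load step an SRE performs today.
# All paths and the loader command below are illustrative placeholders.
import subprocess

HDFS_DATASET = "/wmf/data/discovery/wdqs/rdf_dump"  # placeholder HDFS export path
LOCAL_DIR = "/srv/wdqs/import"                      # placeholder local dir on the WDQS node


def transfer_and_load() -> None:
    # Pull the pre-generated dataset off HDFS onto the node's local disk.
    subprocess.run(["hdfs", "dfs", "-get", HDFS_DATASET, LOCAL_DIR], check=True)
    # Hand the files to the triplestore's bulk loader (illustrative command).
    subprocess.run(["/srv/wdqs/loadData.sh", "-d", LOCAL_DIR], check=True)


if __name__ == "__main__":
    transfer_and_load()
```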

Ideally we would be able to do this via Airflow, but can we do the whole flow there? (depool a node -> move the data -> reindex -> repool the node)
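
A minimal sketch of what such an Airflow DAG could look like, assuming hypothetical helper scripts for depooling/repooling and for the transfer and reload steps (depool-wdqs.sh, transfer-from-hdfs.sh, reload-triplestore.sh, repool-wdqs.sh); the real tooling and how Airflow would invoke it on the WDQS hosts are exactly what the consultation needs to settle.

```
# Sketch only: the task names mirror the four steps above; the bash commands
# and the host parameter are hypothetical placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="wdqs_bulk_reload",
    start_date=datetime(2024, 4, 1),
    schedule=None,                 # triggered manually or by the upstream dump-generation job
    catchup=False,
    params={"host": "wdqs1004"},   # placeholder node name
) as dag:
    # Take the node out of the load balancer pool so it stops serving queries
    # while the triplestore is rebuilt.
    depool = BashOperator(
        task_id="depool_node",
        bash_command="depool-wdqs.sh {{ params.host }}",
    )

    # Copy the pre-generated dataset from HDFS to local disk on the WDQS node.
    move_data = BashOperator(
        task_id="move_data",
        bash_command="transfer-from-hdfs.sh {{ params.host }}",
    )

    # Reload/reindex the triplestore from the transferred dataset.
    reindex = BashOperator(
        task_id="reindex",
        bash_command="reload-triplestore.sh {{ params.host }}",
    )

    # Put the node back into rotation once the reload has succeeded.
    repool = BashOperator(
        task_id="repool_node",
        bash_command="repool-wdqs.sh {{ params.host }}",
    )

    depool >> move_data >> reindex >> repool
```

The task split simply mirrors the steps named above; whether depooling/repooling can be driven from Airflow at all is one of the open questions for the DPE consultation.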

We are consulting with DPE later in April. This task tracks the design that will come out of this meeting. Implementation will be covered by follow-up tasks.