
Write a client that consumes the RDF update stream from https://stream.wikimedia.org/ and updates a triple store
Open, Needs Triage, Public

Description

In the wdqs code we consume the RDF update stream from kafka using the KafkaStreamConsumer class. A similar implementation should be written to work on top of HTTP EventStreams.
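HTTP EventStreams exposes Kafka topics over Server-Sent Events, so the consumer's first job is parsing `id:` and `data:` fields out of the SSE byte stream. A minimal sketch of that parsing step, with a hypothetical sample payload (the topic name and event fields are illustrative, not taken from the actual stream):

```python
import json

def parse_sse(stream_text):
    """Parse a Server-Sent Events payload into (event_id, data) tuples.

    EventStreams emits `id:` fields encoding per-partition positions and
    `data:` fields carrying the event JSON; a blank line ends an event.
    """
    events = []
    event_id, data_lines = None, []
    for line in stream_text.splitlines():
        if line.startswith("id:"):
            event_id = json.loads(line[3:].strip())
        elif line.startswith("data:"):
            data_lines.append(line[5:].strip())
        elif line == "" and data_lines:
            events.append((event_id, json.loads("\n".join(data_lines))))
            event_id, data_lines = None, []
    return events

# Hypothetical sample event, for illustration only.
sample = (
    'id: [{"topic":"eqiad.rdf-streaming-updater.mutation",'
    '"partition":0,"timestamp":1650000000000}]\n'
    'data: {"meta":{"dt":"2022-04-15T00:00:00Z"},"entity":"Q42"}\n'
    '\n'
)
events = parse_sse(sample)
```

The `id` payload is what the client would persist as its offset and replay via the `Last-Event-ID` request header when resuming.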

The features it must provide are:

  • implement StreamConsumer
  • offset handling and persistence (which is provided when consuming directly from kafka)
    • it knows what to do on the first run (it can infer the initial offset, possibly by querying the triple store itself with select (min(?date) as ?start) { wikibase:Dump schema:dateModified ?date } LIMIT 1)
    • it knows how to resume operations
  • Adapt the existing main entry point, or add a new one, to run the consumer based on a set of parameters
  • Use the same batching/compression technique (see PatchAccumulator)
  • ideally, populate the same set of metrics
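For the first-run case above, EventStreams supports a `since` query parameter, so the dump's schema:dateModified can be turned directly into a starting position. A sketch under those assumptions (the stream URL is hypothetical; the safety margin is an arbitrary choice to avoid missing events near the dump cut-off):

```python
from datetime import datetime, timedelta

# Hypothetical stream endpoint; the real stream name must be confirmed.
STREAM_URL = "https://stream.wikimedia.org/v2/stream/rdf-streaming-updater.mutation"

def initial_stream_url(dump_date_iso, lag_safety=timedelta(hours=2)):
    """Derive the first-run position: start from the dump's
    schema:dateModified minus a safety margin, using the EventStreams
    `since` query parameter instead of a stored offset."""
    start = datetime.fromisoformat(dump_date_iso.replace("Z", "+00:00")) - lag_safety
    return f"{STREAM_URL}?since={start.strftime('%Y-%m-%dT%H:%M:%SZ')}"

url = initial_stream_url("2022-04-15T00:00:00Z")
```

On subsequent runs the client would skip `since` and resume from its persisted `Last-Event-ID` instead.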

AC:

  • a triple store compatible with SPARQL 1.1 Update operations and loaded with a munged wikidata dump can be updated outside of the WMF infrastructure using HTTP EventStreams.
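Since the AC targets plain SPARQL 1.1 Update, the batching step can be sketched as a minimal analogue of PatchAccumulator (the real class's fields and behavior may differ): coalesce consecutive diffs, cancel out triples added and then removed within the same batch, and emit one Update request for the whole batch.

```python
class PatchAccumulator:
    """Minimal, hypothetical analogue of wdqs's PatchAccumulator:
    merge many events' diffs into a single SPARQL 1.1 Update."""

    def __init__(self):
        self.deletes, self.inserts = [], []

    def accumulate(self, removed, added):
        for t in removed:
            if t in self.inserts:
                # Added then removed within the batch: cancels out.
                self.inserts.remove(t)
            else:
                self.deletes.append(t)
        self.inserts.extend(added)

    def to_sparql_update(self):
        parts = []
        if self.deletes:
            parts.append("DELETE DATA { %s }" % " . ".join(self.deletes))
        if self.inserts:
            parts.append("INSERT DATA { %s }" % " . ".join(self.inserts))
        return " ;\n".join(parts)

acc = PatchAccumulator()
acc.accumulate([], ["<a> <p> <b>"])
acc.accumulate(["<a> <p> <b>"], ["<a> <p> <c>"])
update = acc.to_sparql_update()
```

Posting the rendered string to any standards-compliant SPARQL endpoint is what keeps the client usable outside the WMF infrastructure.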

Event Timeline

Good point. IIRC we wrap and support this API in Python.

If feasible, I'd lean towards decoupling this consumer from wikimedia-event-utilities to facilitate adoption outside of WMF.