
Expose rdf-streaming-updater.mutation content through EventStreams
Open, Medium, Public

Description

As a consumer of the wikidata content I want to be able to have access to the same RDF data the WMF WDQS servers use to perform their live updates so that I can keep my own replica of the wikidata query service (or another RDF store) up to date more easily.

A solution might be to use the EventStreams service.

Note on the stream:
It was decided to go fully active/active for the Flink application powering the WDQS updater, which means the complete stream of changes is available in both topics:

  • eqiad.rdf-streaming-updater.mutation
  • codfw.rdf-streaming-updater.mutation

This is slightly different from what we currently see in our topic topology, where getting a complete view of the data requires consuming both eqiad.topic and codfw.topic. Here you must consume only one.
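
For illustration, consuming one of these topics through EventStreams could look roughly like the sketch below: a minimal Java example using the JDK's built-in HttpClient to read the server-sent events. The stream name in the URL is an assumption, since the stream is not exposed yet and the final name may differ.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.stream.Stream;

public class MutationStreamTail {
    public static void main(String[] args) throws Exception {
        // Hypothetical stream name: the mutation stream is not exposed
        // through EventStreams yet, so this URL is an assumption.
        URI uri = URI.create(
            "https://stream.wikimedia.org/v2/stream/eqiad.rdf-streaming-updater.mutation");

        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(uri)
            .header("Accept", "text/event-stream")
            .build();

        // EventStreams speaks SSE: each event arrives as a "data: {...}" line.
        HttpResponse<Stream<String>> response =
            client.send(request, HttpResponse.BodyHandlers.ofLines());
        response.body()
            .filter(line -> line.startsWith("data: "))
            .map(line -> line.substring("data: ".length()))
            .forEach(System.out::println); // one JSON mutation event per line
    }
}
```

Note that, per the above, a consumer would pick exactly one of the two stream variants rather than merging them.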

AC:

  • RDF data is exposed through EventStreams
  • A Java client is offered for third parties to use with stores compatible with SPARQL 1.1 Update operations (see the sketch after this list)
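
As a sketch of what the second criterion could look like: SPARQL 1.1 Update is a plain HTTP protocol, so a minimal client only needs to POST the update text to the store's update endpoint with the right content type. The endpoint URL and the update string below are illustrative, not part of any agreed design.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class SparqlUpdateClient {
    private final HttpClient client = HttpClient.newHttpClient();
    private final URI endpoint;

    public SparqlUpdateClient(URI endpoint) {
        this.endpoint = endpoint;
    }

    /** POSTs one update request, per the SPARQL 1.1 Protocol. */
    public void update(String sparqlUpdate) throws Exception {
        HttpRequest request = HttpRequest.newBuilder(endpoint)
            .header("Content-Type", "application/sparql-update")
            .POST(HttpRequest.BodyPublishers.ofString(sparqlUpdate))
            .build();
        HttpResponse<String> response =
            client.send(request, HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() / 100 != 2) {
            throw new RuntimeException("Update failed: " + response.body());
        }
    }

    public static void main(String[] args) throws Exception {
        // Illustrative endpoint: a local Blazegraph namespace as used by WDQS.
        SparqlUpdateClient c = new SparqlUpdateClient(
            URI.create("http://localhost:9999/bigdata/namespace/wdq/sparql"));
        // A mutation event would be translated into DELETE DATA / INSERT DATA
        // blocks; this single statement is just a placeholder.
        c.update("INSERT DATA { <http://www.wikidata.org/entity/Q42> "
            + "<http://schema.org/version> 123 . }");
    }
}
```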

Event Timeline

dcausse added a project: EventStreams.
dcausse updated the task description.
dcausse added a subscriber: Ottomata.

Here you must consume only one.

Should we expose both?

We'll need to declare these streams in EventStreamConfig, likely as two distinct streams, each overriding the list of topics that make up the stream.

Then we can edit the EventStreams helmfile configs to expose those streams.

Oh, we'll also want to create an event schema and add it to the schema repo.
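
To make that concrete, the declarations might take roughly the following shape. This is a hypothetical sketch only: the actual EventStreamConfig and helmfile keys differ, and the stream names here are placeholders.

```yaml
# Hypothetical sketch; real EventStreamConfig / helmfile keys differ.
# Each datacenter's topic is declared as its own distinct stream, so a
# consumer picks exactly one rather than merging eqiad + codfw.
streams:
  rdf-streaming-updater.mutation.eqiad:
    schema_title: rdf_streaming_updater/mutation
    topics:
      - eqiad.rdf-streaming-updater.mutation
  rdf-streaming-updater.mutation.codfw:
    schema_title: rdf_streaming_updater/mutation
    topics:
      - codfw.rdf-streaming-updater.mutation
```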

MPhamWMF moved this task from Incoming to Scaling on the Wikidata-Query-Service board.
odimitrijevic lowered the priority of this task from High to Medium. Oct 25 2021, 4:03 PM
odimitrijevic moved this task from Incoming to Event Platform on the Analytics board.

Here you must consume only one.

Actually, this is curious. These are really distinct streams. We probably should have named them differently. We can't just pick eqiad.rdf-streaming-updater.mutation when the user connects to EventStreams in eqiad, because the next time around they might connect to EventStreams in codfw, depending on how they are routed.

So, I think either we expose only one of these topics, OR, we expose them both but as different distinct streams, and the user has to pick one.

(Oh, past me said this already... :p)

Thanks @Gehel for catching the duplicate.

Suggested edits for this (merged) issue:

  • The merged issue should be for making public the data WDQS uses for live updates. A specific implementation such as "...using EventStreams" could be part of the discussion, or even a child task (trying that out as a way to make the data public).
  • @MPhamWMF set T330521 as high priority, and previously set this issue as high priority. If there is disagreement about that, it would be helpful to write a few words about it. Lowering the barrier to mirroring is critical for those maintaining live mirrors, and less so for those only working on WD and WDQS core. But committing to make this work could open avenues to share WDQS load, which would affect core performance and disaster planning (the context in which this came up recently).

I just looked into https://github.com/wikimedia/wikidata-query-rdf, which provides a tool runUpdate.sh. When I run it for a Blazegraph instance with exactly one triple of the form <http://www.wikidata.org> <http://schema.org/dateModified> "2024-02-11T05:42Z"^^xsd:dateTime, it will continuously update the instance with all changes since that date. I have two questions:

  1. Which API is this script (or rather the underlying Update.java) calling to get the updates since a particular date?
  2. Apparently, this API is public. From an earlier communication, I got the impression that there is no public API for this yet. What did I misunderstand?
  1. Which API is this script (or rather the underlying Update.java) calling to get the updates since a particular date?

It is retrieving updates from https://wikidata.org/wiki/Special:RecentChanges

  2. Apparently, this API is public. From an earlier communication, I got the impression that there is no public API for this yet. What did I misunderstand?

runUpdate.sh does not use the modern streaming updater discussed in this task. It synchronizes with Wikidata's live recent changes.

@Harej Thanks for the quick reply, James! Are you saying that the script is scraping + parsing https://www.wikidata.org/wiki/Special:RecentChanges to obtain the triples to be added and deleted? Or is there a different way to access that page, which gives you the added and deleted triples in a more machine-friendly format?

AFAIK: The “legacy” updater queries the recent changes via the API, then gets the RDF for the edited entities from Special:EntityData, and compares that to the live data in the query service to determine which triples need to be added and removed. (This includes some logic to clean up “orphaned” nodes like statements, references or full values.) This should work for any Wikibase, but it’s inefficient, which is why we no longer use it in production.
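
For the record, the raw requests that legacy flow builds on look roughly like the sketch below. This is simplified: the real Update.java pages through results, tracks a continuation point, batches entities, and diffs the fetched RDF against the store; Q42 is just a placeholder entity.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class LegacyUpdateFlow {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();

        // Step 1: ask the MediaWiki API for recent changes (machine-readable,
        // unlike the rendered Special:RecentChanges page).
        String rcUrl = "https://www.wikidata.org/w/api.php"
            + "?action=query&list=recentchanges"
            + "&rcprop=title%7Ctimestamp&rclimit=10&format=json";
        String changes = client.send(
            HttpRequest.newBuilder(URI.create(rcUrl)).build(),
            HttpResponse.BodyHandlers.ofString()).body();
        System.out.println(changes);

        // Step 2: for each changed entity, fetch its current RDF from
        // Special:EntityData, then diff it against the triples in the store.
        String entityRdf = client.send(
            HttpRequest.newBuilder(URI.create(
                "https://www.wikidata.org/wiki/Special:EntityData/Q42.ttl")).build(),
            HttpResponse.BodyHandlers.ofString()).body();
        System.out.println(entityRdf.substring(0, Math.min(200, entityRdf.length())));
    }
}
```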