We've had a number of cases where wdqs-updater was either lagging because of load on blazegraph, or causing issues on its own that affected blazegraph or at least the shared servers. A number of operations done by the updater could be shared between servers, which would reduce both the processing power needed and the load on other services.
At a high level, the updater process is:
- get a stream of wikidata changes (either from the Recent Changes API or by filtering Kafka events)
- deduplicate those events over a period of time
- enrich them with the actual data changed by querying the Wikidata API
- batch the enriched changes and apply them to blazegraph
All this is a fairly standard event sourcing pattern.
The event stream is the same for all servers, so steps 1), 2) and 3) could be shared; they don't have any direct dependency on blazegraph. Step 4) needs to be done for each wdqs blazegraph instance.
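To make the shared part more concrete, here is a rough sketch of the deduplication and batching steps. It is only an illustration under my own assumptions: `ChangeEvent`, `deduplicate`, `batch`, the fixed time window and the "keep the highest revision per entity" rule are all hypothetical and are not how the current updater is implemented.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Iterator, List, Optional


@dataclass(frozen=True)
class ChangeEvent:
    """Hypothetical shape of a change event (not the real updater schema)."""
    entity_id: str
    revision: int
    timestamp: float  # seconds


def deduplicate(events: Iterable[ChangeEvent], window: float) -> Iterator[ChangeEvent]:
    """Within each time window, keep only the highest revision seen per entity."""
    pending: Dict[str, ChangeEvent] = {}
    window_start: Optional[float] = None
    for ev in events:
        if window_start is None:
            window_start = ev.timestamp
        if ev.timestamp - window_start >= window:
            # The window is over: emit what we kept, oldest first.
            yield from sorted(pending.values(), key=lambda e: e.timestamp)
            pending = {}
            window_start = ev.timestamp
        current = pending.get(ev.entity_id)
        if current is None or ev.revision > current.revision:
            pending[ev.entity_id] = ev
    # Flush the last, possibly partial, window.
    yield from sorted(pending.values(), key=lambda e: e.timestamp)


def batch(items: Iterable[ChangeEvent], size: int) -> Iterator[List[ChangeEvent]]:
    """Group items into lists of at most `size` elements."""
    current: List[ChangeEvent] = []
    for item in items:
        current.append(item)
        if len(current) == size:
            yield current
            current = []
    if current:
        yield current


events = [
    ChangeEvent("Q1", 1, 0.0),
    ChangeEvent("Q2", 5, 1.0),
    ChangeEvent("Q1", 2, 2.0),   # supersedes Q1 revision 1 in the same window
    ChangeEvent("Q1", 3, 20.0),  # falls into the next window
]
batches = list(batch(deduplicate(events, window=10.0), size=2))
```

The enrichment step (fetching the actual data from the Wikidata API) would sit between `deduplicate` and `batch`; it is left out here because it depends on the API client being used.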
Some constraints:
- we need to be able to replay events over some period of time (~2 weeks): during a data load, data is loaded from a wikidata dump, and the updater process is then used to catch up on events occurring after the dump
- some level of ordering is required
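As a rough illustration of these two constraints, the sketch below replays a retained log from a given timestamp and drops out-of-order revisions on the consumer side. Everything in it is an assumption on my part: the `(timestamp, entity_id, revision)` tuple, the `replay` and `apply_in_order` helpers, and especially the idea that per-entity ordering by revision is the level of ordering we need.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

# Hypothetical event shape: (timestamp, entity_id, revision)
Event = Tuple[float, str, int]


def replay(log: Iterable[Event], since: float) -> Iterator[Event]:
    """Yield retained events with timestamp >= since, oldest first."""
    return (ev for ev in sorted(log) if ev[0] >= since)


def apply_in_order(events: Iterable[Event], applied: Dict[str, int]) -> List[Event]:
    """Accept an event only if its revision is newer than the last one applied
    for that entity; `applied` maps entity_id to the last applied revision."""
    accepted: List[Event] = []
    for ts, entity, rev in events:
        if rev > applied.get(entity, -1):
            applied[entity] = rev
            accepted.append((ts, entity, rev))
    return accepted


log = [
    (5.0, "Q1", 2),
    (1.0, "Q1", 1),  # older than the dump, skipped by the replay
    (7.0, "Q2", 9),
    (6.0, "Q1", 1),  # stale revision arriving late, dropped by the consumer
]
state: Dict[str, int] = {}
caught_up = apply_in_order(replay(log, since=5.0), state)
```

With Kafka, the replay part would map to seeking to the offset matching the dump timestamp, provided the topic retention covers the ~2 weeks mentioned above.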
It looks like k8s would be a reasonable place to run such a service. A single instance of the service would be needed as some shared state is required for deduplication and ordering. After step 3), events could be sent to another kafka topic. Step 4) would be a simplified updater, running on each wdqs node.
I'm probably missing a few things, so feedback on the proposal is welcome!