Page MenuHomePhabricator

wdqs updater should be better isolated from blazegraph and common workload should be shared between servers
Open, NormalPublic


We've had a number of cases where wdqs-updater was either lagging because of load on blazegraph or causing issues on its own, affecting blazegraph, or at least the shared servers. A number of operations done by updater could be shared between servers, thus reducing the processing power needed and reducing the load on other services.

At high level, the updater process is:

  1. get a stream of wikidata changes (either from Recent Changes API or by filtering Kafka events)
  2. deduplicate those events over a period of time
  3. enrich them with the actual data changed by querying Wikidata API
  4. batching the enriched changes to apply them to blazegraph

All this is a fairly standard event sourcing pattern.

The event stream is the same for all servers, so step 1), 2) and 3) could be shared, and they don't have any direct dependency on blazegraph. Step 4) needs to be done for each wdqs blazegraph instance.

Additional constraints:

  • we need to be able to replay events over some period of time (~2 weeks) during data load, data is loaded from a wikidata dump, and then updater process is used to catch up on event occurring after the dump
  • some level of ordering is required

It looks like k8s would be a reasonable place to run such a service. A single instance of the service would be needed as some shared state is required for deduplication and ordering. After step 3), events could be sent to another kafka topic. Step 4) would be a simplified updater, running on each wdqs node.

I'm probably missing a few things, feedback on the proposal is welcomed!

Event Timeline

Gehel created this task.Oct 24 2018, 9:41 AM
Gehel triaged this task as High priority.
Restricted Application added a project: Wikidata. · View Herald TranscriptOct 24 2018, 9:41 AM
Restricted Application added a subscriber: Aklapper. · View Herald Transcript
Smalyshev lowered the priority of this task from High to Normal.Oct 24 2018, 6:03 PM

Huh this is a big one. I've thought about it a bunch lately and here's roughly what I've got:

There are several way we can save some work on updating. I will list them here all though some of them more practical than others.

A. Blazegraph master-slave replication. Updater works on one node and updates are propagated on DB level to other nodes. This is possible in theory (Blazegraph has infrastructure to do that) but would require a lot of work as Blazegraph does not have protocols to do that.

B. Filter Kafka stream to exclude junk messages and provide "clean" update stream. This should not be very hard to do, but we should not put too many hopes into this one, as deduplication capabilities here are limited because we never know which timestamp which client starts with, so it's hard to make any serious deduplication in streaming mode. Basically, if events E1 and E2 happen for same ID within time T, deduplicating between them means holding issuing E1 for time T, which creates delay of T in the stream. Since we can not have long delay in the stream, T must be short. If we could when we've got E2 go back and revoke E1, then it could work better (though then the clients won't get E1's entity updated until E2 time but we can live with that since such client is behind anyway) but I don't think kafka allows to do such things.

Ideas are welcome here. If we saw whole stream at once we could probably save a bit of work, but that's not how updaters are working, and I am not sure how to create any real benefit from it. We could maybe have some cache of ID->last revision to quickly filter stale updates, but I am not sure we'd be saving a lot here, as we already have such filter against the database.
We could save downloading updates for other wikis that we don't care about, but my feeling is that does not change matters substantially.
This would also require running another service, with all dependencies and single point-of-failure scenarios that come from that.

C. We could cache data downloads from Wikidata more actively. Right now each poller basically fetches data un-cached. This is because we want to have the latest one. But if other host already fetched the latest one and it's in cache, we're not benefitting from it then. We could probably try to use some kind of "cache only if the content is not older than this timestamp" but I am not sure our Varnish knows how fresh Wikidata data is. We could probably try to create proper headers when sending RDF data and then try to use them when reading. This would require a lot of careful matching and will be a nightmare to debug if something goes wrong.

D. Fetching RDF data while pre-processing the Kafka stream is possible, but currently we check against the DB and not fetch the data for items that are already in the DB with latest version, and do not fetch twice for duplicates. However, since we can not substantially deduplicate in a generic stream (see above), this means a lot of data fetched and stored for active items. Again, if we had some kind of smart storage, we could probably make it so that the RDF data would be stored once-per-ID and updated on consequent fetches, but this already sounds as reinventing Varnish. Here some ideas may be welcome too.

As we can see, it's still kinda vaporous and vague and could use some work to define what can work and what can't.

Gehel added a comment.Oct 25 2018, 7:09 PM

There are 3 issues here, and maybe they should be addressed on different tickets:

  1. isolating updater from blazegraph: this is about reducing the interactions between the 2 components to what is essential, increasing robustness and simplifying investigation into any failure
  2. sharing update workload between different servers: at the moment, update workload is growing linearly with the number of blazegraph servers, both in term of internal workload and in term of induced workload on other components (kafka, wikidata, ...)
  3. optimizing the updater process itself

I think that 1) and 2) have a somewhat strong coupling and should be discussed together, 3) is a fairly different optimization which probably deserves its own ticket.

Addshore moved this task from incoming to monitoring on the Wikidata board.Oct 31 2018, 5:38 PM