
Enable the query service updater to optionally use RecentChangesLinked for RC input
Open, NormalPublic

Description

Some people want to run their own query service, linked to their own Wikibase and updated from its recent changes, while also including a subset of Wikidata items in their query service and keeping that subset up to date with the items on Wikidata.org.

RecentChangesLinked could be a method of checking for subsets of recent changes, rather than changes to all items. Users who want to do this could create a page on Wikidata containing links to the items they are interested in and have the updater poll that page.
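A minimal sketch of the idea, in Python: given the set of items linked from such a tracking page, keep only the recent changes that touch those items. The `tracked` set and the change dicts are illustrative assumptions, not the real updater's data model.

```python
# Hypothetical sketch: keep only recent changes that touch items linked
# from a user-maintained tracking page, mimicking what a
# RecentChangesLinked-based input for the updater would do.
# The field names ("title", "revid") are illustrative, not a real API.

def filter_changes(changes, tracked_items):
    """Return only the changes whose title is in the tracked set."""
    return [c for c in changes if c["title"] in tracked_items]

tracked = {"Q42", "Q64"}  # items linked from the tracking page
changes = [
    {"title": "Q42", "revid": 1001},
    {"title": "Q1",  "revid": 1002},  # not tracked, dropped
    {"title": "Q64", "revid": 1003},
]
print([c["title"] for c in filter_changes(changes, tracked)])  # → ['Q42', 'Q64']
```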

As well as the change needed to the updater for a new target, this would mean two updaters running, each with a different source. I imagine there would be conflicts with the dateUpdater timestamp stored in Blazegraph; perhaps that would also need to be optionally configurable, or at least need some work to map a date to a source?
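The conflict described above can be illustrated with a toy position store: if both updaters write to one shared timestamp key they overwrite each other, whereas keying the position by source keeps them independent. The dict and key names here are illustrative assumptions, standing in for whatever Blazegraph actually stores.

```python
# Hypothetical sketch: keying the stored position by source so two
# updaters don't clobber a single shared timestamp. The dict stands in
# for the triple store; the key format is an illustrative assumption.

store = {}

def save_position(source, timestamp):
    # one entry per source instead of a single global timestamp
    store[f"position:{source}"] = timestamp

save_position("local-wiki", "2018-04-24T08:00:00Z")
save_position("wikidata-subset", "2018-04-23T20:00:00Z")
print(store)  # each source keeps its own position
```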

Event Timeline

RazShuty created this task.Apr 24 2018, 8:22 AM
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptApr 24 2018, 8:22 AM
Addshore renamed this task from Make recent changes linked optionally as an input for the QS updater to Enable the query service updater to optionally use RecentChangesLinked for RC input.Apr 24 2018, 8:38 AM
Addshore updated the task description. (Show Details)
Restricted Application added projects: Wikidata, Discovery. · View Herald TranscriptApr 24 2018, 8:38 AM

It is not very hard to make the Updater feed from another source, and it is no problem, in general, to run more than one Updater in parallel, especially if they modify different entries. However, having a single timestamp to specify the position is indeed a problem for this scenario. It is possible to store another timestamp for a different updater, I think, but right now we don't have a good model for it. Maybe we need to look for a solution that includes T192963: Store Kafka poller position data in the WDQS database and have some model that can store separate data sets for different poller models.

Maybe we need to look for solution that includes T192963: Store Kafka poller position data in the WDQS database and have some model that can store separate data sets for different poller models.

I was also looking at drafting a setup for the Dockerized Wikibase and query service stuff with Kafka linking them.

Per T192963: Store Kafka poller position data in the WDQS database, does this mean that right now Kafka doesn't actually store the latest timestamp or any position in the query service? How does this affect the dateModified triple used for lag detection?

Gehel added a subscriber: Gehel.Apr 25 2018, 8:24 AM

Per T192963: Store Kafka poller position data in the WDQS database, does this mean that right now Kafka doesn't actually store the latest timestamp or any position in the query service? How does this affect the dateModified triple used for lag detection?

Kafka does store the offset for each partition, and we do use it during an updater run, but we ignore it across updater restarts. This is done so that we can easily restart / replay updates from a time in the past without messing with Kafka (I'm not entirely convinced, but this does have some merit). So the offsets stored in the WDQS database are only used at the start of the updater.
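The startup behaviour described above can be sketched as a small decision function: resume from persisted per-partition offsets when they exist, otherwise fall back to replaying from a timestamp. The function and its arguments are illustrative assumptions, not the updater's actual code.

```python
# Hypothetical sketch of the resume logic described above: use persisted
# per-partition offsets at startup if present, else fall back to a
# replay timestamp. Names and shapes are illustrative assumptions.

def resume_from(stored_offsets, fallback_timestamp):
    """Pick per-partition offsets if persisted, else a replay timestamp."""
    if stored_offsets:
        return ("offsets", stored_offsets)
    return ("timestamp", fallback_timestamp)

print(resume_from({0: 1500, 1: 1498}, "2018-04-25T00:00:00Z"))
print(resume_from({}, "2018-04-25T00:00:00Z"))
```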

The dateModified triple is a global check, so yes, if we do have multiple updaters running in parallel, we might not catch the failure of only one of them with a single check. There are other things that could be checked in that case (like the batch progress), which could be published by the updater.

The dateModified triple is a global check, so yes if we do have multiple updaters running in parallel, we might not catch the failure of only one of them with a single check.

There's a deeper question behind this - one that is also relevant for the Kafka poller alone. If we have multiple streams of input (the Kafka poller does), how do we define the "current timestamp"? We have the following options:

  1. Minimum of the stream positions - the downside is that if one of the streams rarely has any events, say once per day, the timestamp is stuck behind.
  2. Maximum of the stream positions - the danger is that if one of the streams lags (and they are always lagging a bit behind one another, since the polling is not 100% parallel, but batched), then the messages in the delta might be lost.
  3. Some other way? This is why I decided to persist Kafka offsets - it solves the problem of uneven timestamps in different streams.

One timestamp is OK when we're talking about one source (like a dump), but for multiple sources we're likely to have to use multiple timestamps - or, in the Kafka case, offsets.
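The trade-off between options 1 and 2 above can be shown with a toy example: collapsing per-stream positions into a single value either sticks behind a quiet stream (min) or risks skipping a lagging one (max), while keeping the whole per-stream map loses nothing. The stream names and epoch values are illustrative assumptions.

```python
# Sketch of options 1 and 2 above: collapsing per-stream positions into
# one timestamp. min() sticks behind a quiet stream; max() can skip
# messages from a lagging one; keeping the map loses nothing.
# Stream names and epoch-second values are illustrative.

positions = {
    "revision-create": 1_524_600_000,  # busy stream, up to date
    "page-delete":     1_524_510_000,  # quiet stream, about a day behind
}

print(min(positions.values()))  # option 1: stuck at the quiet stream
print(max(positions.values()))  # option 2: may lose the delta in between
print(positions)                # per-source positions: no information lost
```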

Vvjjkkii renamed this task from Enable the query service updater to optionally use RecentChangesLinked for RC input to sdeaaaaaaa.Jul 1 2018, 1:14 AM
Vvjjkkii triaged this task as High priority.
Vvjjkkii updated the task description. (Show Details)
Vvjjkkii removed a subscriber: Aklapper.
AfroThundr3007730 renamed this task from sdeaaaaaaa to Enable the query service updater to optionally use RecentChangesLinked for RC input.Jul 1 2018, 6:27 AM
AfroThundr3007730 raised the priority of this task from High to Needs Triage.
AfroThundr3007730 updated the task description. (Show Details)
AfroThundr3007730 added a subscriber: Aklapper.
Smalyshev triaged this task as Normal priority.Feb 27 2019, 6:39 AM
Smalyshev removed a project: User-Smalyshev.