Ensure consistency of secondary data for external consumers
Open, Needs Triage · Public

Description

Problem

When a revision is created, we perform secondary data updates (such as updating link tables, categories, etc.). These updates can be deferred via DeferredUpdates or the JobQueue.

The event for such a revision is published to external consumers (e.g. via RCFeed, EventStreams, API:RecentChanges polling, etc.) before all secondary data updates have been applied.

This creates problems when consumers need to react to an event: they have no means of knowing when to start reacting to it, i.e. when the related changes to the database are complete from the API's perspective.

Original description

After a new revision is saved, secondary data, like link tables, is updated asynchronously via DeferredUpdates. Deferred updates are executed out-of-band, after the transaction that updated the primary data (revision meta-data, etc) is complete. Depending on setup and situation, such updates may even be pushed to the job queue, and may not be processed until several minutes later.

External tools that keep track of edits on the wiki, by polling RecentChanges or using some kind of live feed, can get stale data because of this. E.g. a tool that wants to replicate the category graph would query the categorylinks table (either directly or via the API) whenever a category page was edited. But the categorylinks table may not have been updated yet, so the external graph cannot be kept up to date. The Wikidata Query Service is affected by this, regarding the page_links table, see T145712: Statement counts from pageprops do not match actual ones ( wikibase:statements and wikibase:sitelinks ).
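A minimal sketch (not from the original task, assuming the standard MediaWiki action API on en.wikipedia.org) of the race such a tool hits: the edit is already visible in recentchanges, but the categorylinks-backed query may still return pre-edit data because the deferred update has not run yet.

```python
import requests

API = "https://en.wikipedia.org/w/api.php"

def recent_category_edits():
    """Yield titles of recently edited pages in the Category namespace (ns 14)."""
    resp = requests.get(API, params={
        "action": "query",
        "list": "recentchanges",
        "rcnamespace": 14,
        "rcprop": "title|ids|timestamp",
        "rclimit": 50,
        "format": "json",
    }).json()
    for rc in resp["query"]["recentchanges"]:
        yield rc["title"]

def parent_categories(title):
    """Read the page's category links; this is backed by the categorylinks table,
    which may not yet reflect the edit that recentchanges just reported."""
    resp = requests.get(API, params={
        "action": "query",
        "prop": "categories",
        "titles": title,
        "cllimit": "max",
        "format": "json",
    }).json()
    page = next(iter(resp["query"]["pages"].values()))
    return [c["title"] for c in page.get("categories", [])]

for title in recent_category_edits():
    # There is no way to tell here whether the deferred update has run,
    # so a replicated category graph can silently go stale.
    print(title, parent_categories(title))
```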

There should be a mechanism for external tools to be notified when an edit has been fully processed.

Solution 1

For each edit, store the number of pending update jobs in a new field of the recentchanges table. When an associated job completes, it decrements that counter. Entries in the recentchanges table that have a non-zero count of pending updates can then be ignored when desired.
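A rough sketch (not part of the proposal, using SQLite as a stand-in for the recentchanges table) of the bookkeeping Solution 1 describes. The column name rc_pending_updates is hypothetical; no such field exists in MediaWiki today.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE recentchanges (
    rc_id INTEGER PRIMARY KEY,
    rc_title TEXT,
    rc_pending_updates INTEGER NOT NULL DEFAULT 0  -- hypothetical new field
)""")

def record_edit(rc_id, title, num_jobs):
    # When the revision is saved, remember how many secondary-data jobs were queued.
    db.execute("INSERT INTO recentchanges VALUES (?, ?, ?)", (rc_id, title, num_jobs))

def job_completed(rc_id):
    # Each links/categories/pageprops job decrements the counter when it finishes.
    db.execute(
        "UPDATE recentchanges SET rc_pending_updates = rc_pending_updates - 1 "
        "WHERE rc_id = ? AND rc_pending_updates > 0", (rc_id,))

def fully_processed_changes():
    # External consumers ignore entries that still have pending updates.
    return db.execute(
        "SELECT rc_id, rc_title FROM recentchanges WHERE rc_pending_updates = 0"
    ).fetchall()

record_edit(1, "Category:Physics", num_jobs=3)
job_completed(1); job_completed(1); job_completed(1)
print(fully_processed_changes())  # [(1, 'Category:Physics')]
```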

Solution 2

Defer the entry to recentchanges until all other updates have completed. This would require a guaranteed order of execution for jobs, though.

daniel created this task. · Oct 26 2016, 7:54 PM
Restricted Application added a subscriber: Aklapper. · Oct 26 2016, 7:54 PM
GWicke added a subscriber: GWicke. (Edited) · Nov 1 2016, 5:21 PM

In my experience, most asynchronous updates only depend on a single, specific event. This makes it fairly straightforward to encode the dependency graph in ChangeProp rules or hooks, where the event(s) or hook calls emitted from one update trigger the next, dependent updates. I believe your category link example is one of those simpler cases.

Another, more complex source of timing uncertainty is MySQL slave lag. The two main mechanisms we use there are a) blocking the client until the slave has caught up to a defined offset (ChronologyProtector), or b) polling periodically until a unique bit of state (typically identified by an ID) has shown up.
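A minimal sketch (my wording, not from the comment) of technique (b): poll until a unique piece of state, identified by an ID, becomes visible, with a timeout so the caller is not blocked forever by replication lag or a lost update.

```python
import time

def wait_until_visible(fetch, expected_id, timeout=30.0, interval=0.5):
    """fetch() returns the latest ID visible to the reader (e.g. on a lagged replica)."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if fetch() >= expected_id:
            return True
        time.sleep(interval)
    return False  # caller decides whether to retry, alert, or give up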

The same polling techniques can also be used to dynamically update things like recentchanges client-side, for example to annotate the RC entry with an ORES score once it becomes available. If needed, long polling (possibly using the upcoming public EventStream service) can be used to lower latency and overheads.
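A sketch of the long-polling approach using the public EventStreams recentchange stream on stream.wikimedia.org as it exists now (at the time of the comment the service was still upcoming). Server-sent events are parsed by hand here to avoid extra dependencies.

```python
import json
import requests

STREAM = "https://stream.wikimedia.org/v2/stream/recentchange"

with requests.get(STREAM, stream=True) as resp:
    for line in resp.iter_lines():
        if not line or not line.startswith(b"data: "):
            continue
        event = json.loads(line[len(b"data: "):])
        # React as soon as the event arrives instead of polling the API,
        # e.g. annotate or refresh the corresponding RC entry client-side.
        print(event["wiki"], event["title"], event["type"])
```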

Deferring everything until *all* updates have been processed suffers from many issues, and does not seem to be very realistic or desirable. For one, it would lose the obvious performance benefits of asynchronous updates by converting them into synchronous ones in the name of easier reasoning. Even establishing the full list of dependent updates will be next to impossible in any scalable event propagation system. There are a lot of issues around failure handling and scalability.

Waiting for multiple related events to have happened before proceeding will still be needed in some rare cases, but we need to be very careful in designing those properly to preserve the needed levels of robustness and performance.

Krinkle moved this task from Inbox to Backlog on the Architecture board. · Mar 29 2017, 8:49 PM
Krinkle edited projects, added TechCom; removed Architecture. · Jan 4 2018, 12:33 AM
Krinkle updated the task description. · Mar 14 2018, 8:38 PM

I think we have a bigger problem than the one described here:

External tools that keep track of edits on the wiki, by polling RecentChanges or using some kind of live feed, can get stale data because of this

It's one thing that the updates are not timely. But the situation right now seems to be that some items are never updated at all. I.e. if I look at some items now, several months after the underlying data changed, the page props data in the table is still wrong. Thus this:

There should be a mechanism for external tools to be notified when an edit has been fully processed.

would be of little help, since it looks like such items have never been fully processed.

@Smalyshev secondary data not getting updated at all is a separate problem in my mind. Though having a solution that allows us to know whether an edit has been completely processed would probably also allow us to detect situations where the processing never happens.

kchapman edited projects, added TechCom-RFC; removed TechCom. · Mar 16 2018, 2:18 AM
kchapman moved this task from Inbox to Under discussion on the TechCom-RFC board.
Krinkle renamed this task from "Ensure consistency of secondary data after edit" to "Ensure consistency of secondary data for external consumers". · Mar 21 2018, 8:52 PM
Krinkle updated the task description.
Krinkle updated the task description.

There should be a mechanism for external tools to be notified when an edit has been fully processed.

I have very little context here, but knowing when an edit is fully processed sounds pretty hard. For your examples though, perhaps each dependent change could emit its own event? Services that need to know when e.g. a categorylinks update has happened could subscribe to a categorylinks-change stream. Then you don't need to keep track of when all of an edit's dependents have updated. Instead your service can subscribe to what it cares about and react to it, roughly as sketched below.
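A sketch of that per-update-type subscription idea. The stream name categorylinks-change (and its URL) is hypothetical; no such stream exists, it only illustrates how a consumer would react to exactly the secondary update it cares about rather than to the edit itself.

```python
import json
import requests

# Hypothetical stream; modelled on the existing EventStreams URL scheme.
STREAM = "https://stream.wikimedia.org/v2/stream/categorylinks-change"

with requests.get(STREAM, stream=True) as resp:
    for line in resp.iter_lines():
        if not line.startswith(b"data: "):
            continue
        change = json.loads(line[len(b"data: "):])
        # The event would arrive only after the categorylinks update has been
        # applied, so the consumer can safely re-read the category graph here.
        print("categorylinks updated for", change.get("title"))
```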

Just a thought! :)

@Ottomata this is definitely possible, but the problem is that we have a bunch of places that produce secondary data by now: links, Wikibase quality, page props, etc. These updates can happen in any order, so a tool that has to listen for all of them can end up generating a lot of updates for the same page, which can be very inefficient, especially since it's not always possible to know exactly which data has changed. A more efficient solution would be to know the moment when no more changes for this edit are coming, though I am not 100% sure whether that's possible.
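One way to tame that flood without knowing the "no more changes" moment, sketched here as an assumption rather than an existing mechanism: coalesce the per-type events for a page and only re-read it once it has been quiet for a grace period. The grace period is a heuristic, not a guarantee that no further secondary updates are coming.

```python
import time
from collections import defaultdict

QUIET_PERIOD = 60  # seconds; assumed value, tune to the wiki's job-queue latency

last_seen = defaultdict(float)   # page title -> time of last secondary-data event
dirty = set()                    # pages waiting to be re-read

def on_secondary_event(title):
    # Called for every links / page props / quality event about the page.
    last_seen[title] = time.monotonic()
    dirty.add(title)

def flush(process):
    # Re-read each page once it has been quiet long enough.
    now = time.monotonic()
    for title in list(dirty):
        if now - last_seen[title] >= QUIET_PERIOD:
            dirty.discard(title)
            process(title)  # single re-read per page instead of one per event
```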