Ensure consistency of secondary data after edit
Open, Needs Triage · Public


After a new revision is saved, secondary data, such as link tables, is updated asynchronously via DeferredUpdates. Deferred updates are executed out-of-band, after the transaction that updated the primary data (revision metadata, etc.) is complete. Depending on setup and situation, such updates may even be pushed to the job queue, and may not be processed for minutes.

External tools that keep track of edits on the wiki, by polling recentchanges or using some kind of live feed, can receive stale data because of this. For example, a tool that wants to replicate the category graph would query the categorylinks table (either directly or via the API) whenever a category page was edited. But the categorylinks table may not have been updated yet, so the external graph cannot be kept up to date. The Wikidata Query Service is affected by this with regard to the pagelinks table; see T145712: Statement counts from pageprops do not match actual ones ( wikibase:statements and wikibase:sitelinks ).

There should be a mechanism for external tools to be notified when an edit has been fully processed.

Possible solution: for each edit, store the number of pending update jobs in a new field of the recentchanges table. Each update job decrements that counter when it completes. Entries in the recentchanges table with a non-zero count of pending updates can then be ignored when desired.
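A minimal in-memory sketch of this counter scheme (the class name, method names, and the idea of an `rc_pending_updates`-style field are all hypothetical illustrations; the real thing would live in a database column):

```python
class RecentChangesCounter:
    """In-memory stand-in for a hypothetical pending-updates
    counter on recentchanges rows."""

    def __init__(self):
        # rc_id -> number of secondary-data update jobs still outstanding
        self.pending = {}

    def record_edit(self, rc_id, num_update_jobs):
        # On save: store how many update jobs were scheduled for this edit.
        self.pending[rc_id] = num_update_jobs

    def job_completed(self, rc_id):
        # Each update job decrements the counter when it finishes.
        self.pending[rc_id] -= 1

    def is_fully_processed(self, rc_id):
        # External tools would skip entries with a non-zero counter.
        return self.pending.get(rc_id, 0) == 0
```

An external tool would then filter its recentchanges poll down to entries where `is_fully_processed` holds, and revisit the rest on a later pass.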

Alternatively, defer the entry to recentchanges until all other updates have completed. This would require a guaranteed order of execution for jobs, though.

daniel created this task. Oct 26 2016, 7:54 PM
Restricted Application added a subscriber: Aklapper. Oct 26 2016, 7:54 PM
GWicke added a subscriber: Pchelolo.
GWicke added a subscriber: GWicke. Edited Nov 1 2016, 5:21 PM

In my experience, most asynchronous updates depend only on a single, specific event. This makes it fairly straightforward to encode the dependency graph in ChangeProp rules or hooks, where the event(s) or hook calls emitted by one update trigger the next, dependent updates. I believe your category link example is one of those simpler cases.

Another, more complex source of timing uncertainty is MySQL slave lag. The two main mechanisms we use there are a) blocking the client until the slave has caught up to a defined offset (ChronologyProtector), or b) polling periodically until a unique bit of state (typically identified by an ID) has shown up.
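Mechanism (b), polling until a known bit of state shows up, can be sketched generically. The function name and the injected `check` callable are illustrative; in practice `check` would query a replica for a known row or ID:

```python
import time

def wait_for_state(check, timeout=30.0, interval=0.5, sleep=time.sleep):
    """Poll check() until it returns truthy or the timeout expires.

    check:    callable returning truthy once the expected state (e.g. a
              row with a known ID) is visible on the replica.
    interval: seconds to wait between polls.
    sleep:    injectable for testing; defaults to time.sleep.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if check():
            return True
        sleep(interval)
    return False
```

A caller waiting for replication would pass a `check` that issues the relevant SELECT; an exponential backoff on `interval` is a common refinement.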

The same polling techniques can also be used to dynamically update things like recentchanges client-side, for example to annotate the RC entry with an ORES score once it becomes available. If needed, long polling (possibly using the upcoming public EventStream service) can be used to lower latency and overheads.

Deferring everything until *all* updates have been processed suffers from several problems and does not seem realistic or desirable. It would lose the obvious performance benefits of asynchronous updates by effectively converting them into synchronous ones in the name of easier reasoning, and even establishing the full list of dependent updates would be next to impossible in any scalable event propagation system. It also raises many issues around failure handling and scalability.

Waiting for multiple related events to have happened before proceeding will still be needed in some rare cases, but we need to design those cases very carefully to preserve the required levels of robustness and performance.

Krinkle moved this task from Inbox to Backlog on the Architecture board.Mar 29 2017, 8:49 PM
Krinkle edited projects, added TechCom; removed Architecture.Thu, Jan 4, 12:33 AM