
Ensure consistency of secondary data for external consumers
Closed, Declined · Public



When a revision is created, there are secondary data updates we perform (such as link tables, categories, etc.). These updates can be deferred via DeferredUpdates or the JobQueue.

The event for such a revision is published to external consumers (e.g. via RCFeed, EventStreams, API:RecentChanges polling, etc.) before all secondary data updates are applied.

This creates problems when consumers need to react to an event: they have no means of knowing when to start reacting to it, i.e. when the related changes to the database are complete from the API's perspective.

Original description

After a new revision is saved, secondary data, like link tables, is updated asynchronously via DeferredUpdates. Deferred updates are executed out-of-band, after the transaction that updated the primary data (revision meta-data, etc) is complete. Depending on setup and situation, such updates may even be pushed to the job queue, and may not be processed until several minutes later.

External tools that keep track of edits on the wiki, by polling RecentChanges or using some kind of live feed, can get stale data because of this. E.g. a tool that wants to replicate the category graph would query the categorylinks table (either directly or via the API) whenever a category page was edited. But the categorylinks table may not have been updated yet; the external graph cannot be kept up to date. The Wikidata Query Service is affected by this, regarding the page_links table, see T145712: Use RDF statement counts from entity data, not page props ( wikibase:identifiers, wikibase:statements and wikibase:sitelinks ).

There should be a mechanism for external tools to be notified when an edit has been fully processed.

Solution 1

For each edit, store the number of pending update jobs in a new field of the recentchanges table. When an associated job completes, it decrements that counter. Entries in the recentchanges table that have a non-zero count of pending updates can then be ignored when desired.
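A minimal sketch of what Solution 1 could look like. The class, field, and method names here are purely illustrative, not actual MediaWiki APIs, and the real counter would live in a database column rather than in memory:

```python
class RecentChangeTracker:
    """Illustrative sketch of a pending-updates counter per recentchanges
    entry (Solution 1). In MediaWiki this would be a column on the
    recentchanges table, not an in-memory dict."""

    def __init__(self):
        self.pending = {}  # rc_id -> count of outstanding update jobs

    def register_jobs(self, rc_id, job_count):
        # Written once, when the edit and its deferred updates are enqueued.
        self.pending[rc_id] = job_count

    def job_completed(self, rc_id):
        # Each secondary-update job decrements the counter on completion.
        self.pending[rc_id] -= 1

    def is_fully_processed(self, rc_id):
        # Consumers skip entries whose counter is still non-zero.
        return self.pending.get(rc_id, 0) == 0
```

A consumer polling RecentChanges would then only act on entries for which `is_fully_processed` is true.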

Solution 2

Defer the entry to recentchanges until all other updates have completed. This would require a guaranteed order of execution for jobs, though.

Event Timeline

In my experience, most asynchronous updates only depend on a single, specific event. This makes it fairly straightforward to encode the dependency graph in ChangeProp rules or hooks, where the event(s) or hook calls emitted from one update trigger the next, dependent updates. I believe your category link example is one of those simpler cases.

Another, more complex source of timing uncertainty is MySQL slave lag. The two main mechanisms we use there are a) blocking the client until the slave has caught up to a defined offset (ChronologyProtector), or b) polling periodically until a unique bit of state (typically identified by an ID) has shown up.
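Technique (b) above can be sketched as follows. This is a generic illustration, assuming a caller-supplied `fetch_max_id` callable that reads the replica's highest applied ID; it is not an actual MediaWiki or ChronologyProtector API:

```python
import time

def wait_for_replication(fetch_max_id, target_id, timeout=10.0, interval=0.5):
    """Poll until a unique bit of state (identified by target_id) has
    shown up on the replica, or until the timeout expires.

    fetch_max_id: callable returning the replica's highest applied ID
    (a hypothetical accessor for illustration purposes).
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if fetch_max_id() >= target_id:
            return True  # replica has caught up to the target offset
        time.sleep(interval)
    return False  # gave up; caller must handle stale data
```

Mechanism (a), ChronologyProtector, is essentially the same wait performed server-side on behalf of the client before answering its next request.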

The same polling techniques can also be used to dynamically update things like recentchanges client-side, for example to annotate the RC entry with an ORES score once it becomes available. If needed, long polling (possibly using the upcoming public EventStream service) can be used to lower latency and overheads.

Deferring everything until *all* updates have been processed does not seem very realistic or desirable. For one, it would lose the obvious performance benefits of asynchronous updates by converting them into synchronous ones in the name of easier reasoning. Even establishing the full list of dependent updates will be next to impossible in any scalable event propagation system, and there are a lot of issues around failure handling and scalability.

Waiting for multiple related events to have happened before proceeding will still be needed in some rare cases, but we need to be very careful in designing those properly to preserve the needed levels of robustness and performance.

I think we have a bigger problem than the one described here:

External tools that keep track of edits on the wiki, by polling RecentChanges or using some kind of live feed, can get stale data because of this

It's one thing that we do not have timely updates. But the situation right now seems to be that some items are never updated at all. I.e. if I look at some items now, several months after the data was changed, the page props data in the table is still wrong. Thus this:

There should be a mechanism for external tools to be notified when an edit has been fully processed.

Would be of little help since it looks like this item has never been fully processed.

@Smalyshev secondary data not getting updated at all is a separate problem in my mind. Though having a solution that allows us to know whether an edit has been completely processed would probably also allow us to detect situations where the processing never happens.

Krinkle renamed this task from Ensure consistency of secondary data after edit to Ensure consistency of secondary data for external consumers. Mar 21 2018, 8:52 PM
Krinkle updated the task description.

There should be a mechanism for external tools to be notified when an edit has been fully processed.

I have very little context here, but knowing when an edit is fully processed sounds pretty hard. For your examples though, perhaps each dependent change could emit its own event? Services that need to know when e.g. a categorylinks update has happened could subscribe to a categorylinks-change stream. Then you don't need to keep track of when all of an edit's dependents have updated. Instead your service can subscribe to what it cares about and react to it.
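The per-stream subscription idea could look something like this minimal sketch. The topic names are illustrative; real Wikimedia streams are delivered over EventStreams rather than an in-process bus:

```python
from collections import defaultdict

class StreamBus:
    """Minimal topic-based pub/sub: consumers subscribe per change type
    (e.g. a hypothetical 'categorylinks-change' topic) instead of waiting
    for an edit to be 'fully processed'."""

    def __init__(self):
        self.subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        # A service registers only for the change types it cares about.
        self.subscribers[topic].append(handler)

    def publish(self, topic, event):
        # Each secondary update emits its own event as it completes.
        for handler in self.subscribers[topic]:
            handler(event)
```

The design trade-off is that each consumer sees fine-grained events as they happen, at the cost of never getting a single "everything is done" signal.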

Just a thought! :)

@Ottomata this is definitely possible, but the problem is that we have a bunch of places that produce secondary data by now - links, Wikibase quality, page props, etc. These updates can happen in any order, so if we have a tool that has to listen for all of them, it can generate a lot of updates for the same page, which can be very inefficient, especially since it is not always possible to know exactly which data has changed. So a more efficient solution would be if we knew there's a moment when no more changes for this edit are coming - though I am not 100% sure whether that's possible.
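One common way to absorb such bursts of per-page events is to coalesce them behind a quiet period. This is only a hypothetical sketch of that pattern, not anything MediaWiki provides; it trades latency for fewer redundant reprocessing passes:

```python
class UpdateCoalescer:
    """Coalesce bursts of secondary-update events for the same page:
    hold each page until no new event has arrived for quiet_seconds,
    then let the consumer process the page once.

    Timestamps are passed in explicitly to keep the sketch testable."""

    def __init__(self, quiet_seconds=2.0):
        self.quiet_seconds = quiet_seconds
        self.last_seen = {}  # page_id -> time of most recent event

    def record(self, page_id, now):
        # Called for every incoming event (links, page props, etc.).
        self.last_seen[page_id] = now

    def ready(self, now):
        # Pages whose last event is older than the quiet period.
        return [p for p, t in self.last_seen.items()
                if now - t >= self.quiet_seconds]
```

Note this only *reduces* duplicate work; without a real "no more changes are coming" signal, an update arriving after the quiet period still forces a second pass.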

Rereading this task reminds me of Flink's EventTime processing with watermarks and allowed lateness.

Also, this talk is long, but pretty amazing:

If we were using something like Flink to do these updates, we'd have a built in mechanism for dealing with this problem. Not that it would _solve_ the problem 100%, but I think it would at least give us a generic way of dealing with it. :)

The problem I guess is that we have a lot of replicas, so I am not sure there's a mechanism to know when all replicas have finished catching up. I guess if we had some stream that would produce min(revision) over all replicas, we could use it as some kind of watermark generator, but I don't think we have that now. In general, I think such a mechanism would solve the immediate problem, though we also have the secondary problem of jobs that perform secondary updates influencing the data (such as page props). I guess holding an update for a minute or two to wait for replicas to catch up and jobs to finish would be ok, so if we had such a system working it would make the problem easier. For now, I've made a workaround for it in places where I can detect it, but a more comprehensive solution would be nice.
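The min(revision)-over-replicas watermark described above is simple to state, assuming each replica's last applied revision ID were available somewhere (which, per the comment, it currently is not):

```python
def replication_watermark(replica_positions):
    """Watermark = minimum replicated revision ID across all replicas.

    Any event with rev_id <= the watermark is guaranteed visible on
    every replica, analogous to an event-time watermark in Flink.
    replica_positions is a hypothetical mapping of replica name to
    its last applied revision ID.
    """
    return min(replica_positions.values())
```

Consumers would then hold back events whose revision ID is above the current watermark, which naturally implements the "wait a minute or two" strategy without a fixed delay.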

Closing as declined. There doesn't seem to be sufficient need from the product side to attempt a solution, given that such a solution would probably be hard to implement. We may want to look at fixes for more narrow use cases.