Page MenuHomePhabricator

ChangeProp may have race conditions
Open, Needs TriagePublic

Description

Follow on from T328880: Clear abusive image from PagePreviews/Images on eswiki and {T271184#8585977}, it seems there's a race condition with ChangeProp and what I'll call "derived content endpoints". This is going to end up solved for the page summary endpoint by basically not caching data persistently, but the fundamental issue is that endpoints which use a concept like "latest version" (aka get page summary for the latest version of the page) can have version drift if "latest version" is propagated to the dependency instead. In the bug in question, page summary was asked for the summary for the "latest version" so it asked restbase for "the latest version" of the HTML, but in the event there was a race condition because Parsoid was busy parsing a /new/ "latest version" at the time. So RESTbase returned the "old latest version" and the summary was computed on that and stored persistently as the summary for "the latest version" even after the Parsoid parse had completed and updated the "latest version" in RESTBase. In the event the version that got summarized had been vandalized, and the /revert/ of the vandalism was the "latest version" which got missed, and the vandalism was on a template so this race condition happened for a large number of pages, affecting about half of them.

There seems to be a fundamental issue here with changeprop, where the summary didn't get invalidated here. Possibly we fired off the page summary invalidation too soon (?) aka when the restbase content was invalidated but *before* the new restbase content had been computed? This seems somewhat fundamental to the design of a separate "change prop" service, which tries to do the dependency propagation independent of the services which are computing the "new" values.

In order to handle these sorts of races, it seems like either requests during dependency propagation need to have a "don't return the result until you actually have the /latest/" flag (aka return a HTTP error which means "wait a while and try again later"), *or* cache invalidation needs to have tombstones registered so that (eg) when page summary gets an invalidation for a revision it doesn't have yet it "remembers" that invalidation and applies it when it later gets the parse corresponding to that invalidation.