Mismatched HTML + data-parsoid being processed by Parsoid on VE saves causing page corruptions
Closed, Resolved, Public

Description

Reported on #wikimedia-tech:

https://en.wikipedia.org/w/index.php?title=Special:Search&search=insource%3A%2Fmw%5C%3APageProp%5C%2FCategory%2F&searchToken=q7qlwhu465dkkxm5lps3z47w has ~15 entries at the time of this bug report. At least 10 of them are from today.
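For checking a single page's wikitext offline, the same pattern as the insource search can be applied locally. This is only a minimal sketch: the function name `find_corruption_markers` is made up for illustration, and it assumes (as the report implies) that the string `mw:PageProp/Category` showing up in saved wikitext indicates a corrupted save.

```python
import re
from typing import List

# Same literal the insource: search above looks for (the search escapes
# ":" and "/" only because of the insource regex syntax).
CORRUPTION_MARKER = re.compile(r"mw:PageProp/Category")


def find_corruption_markers(wikitext: str) -> List[int]:
    """Return the 1-based line numbers on which the marker appears.

    Hypothetical helper, not part of any MediaWiki or Parsoid API.
    """
    return [
        lineno
        for lineno, line in enumerate(wikitext.splitlines(), start=1)
        if CORRUPTION_MARKER.search(line)
    ]
```

An empty list means the marker was not found; it does not prove the page is free of other kinds of corruption.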

They correspond to today's deployment, in which we updated the Parsoid HTML version to 1.6.0 and RESTBase switched storage to Cassandra 3.

Event Timeline

It seems that in some scenarios RESTBase (RB) provides VE with HTML from the old storage (version 1.5.0 HTML), but on review changes / save it provides Parsoid with data-parsoid from the new storage (version 1.6.0), and this mismatch is the source of the corruption. It is unclear why and under what circumstances this happens. It is not common, but it is not an entirely transient issue either, since we see the occasional page on which it happens.

From IRC:

<Pchelolo> we completely forgot one more agent in this whole picture - Varnish
<subbu> so, saves bypass varnish and hence get 1.6.0 data-parsoid.
<subbu> but loads can hit varnish and get 1.5.0 html
<Pchelolo> Actually I think that saves bypass varnish and get 1.6.0 original HTML but 1.5.0 data-parsoid because data-parsoid is always requested with a TID

Although the most recent corruptions were caused by the migration (we forgot about Varnish's role in all of this), some corruptions predate the migration and still need to be investigated.
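The version mismatch discussed in the IRC exchange above can be illustrated with a small guard that compares the spec version of the HTML an editor loaded with the version of the stored render used at save time. This is a sketch under stated assumptions, not how RESTBase, VE, or Parsoid actually handle it: it assumes the version is carried in the `profile` parameter of the `Content-Type` header in the form `.../Specs/HTML/1.6.0`, and the names `profile_version` and `versions_match` are hypothetical.

```python
import re
from typing import Optional

# Assumed header shape (an assumption, not taken from this ticket):
#   text/html; charset=utf-8; profile="https://www.mediawiki.org/wiki/Specs/HTML/1.6.0"
PROFILE_VERSION = re.compile(r'profile="[^"]*/(\d+\.\d+\.\d+)"')


def profile_version(content_type: str) -> Optional[str]:
    """Extract the x.y.z version from a Content-Type profile, if present."""
    match = PROFILE_VERSION.search(content_type)
    return match.group(1) if match else None


def versions_match(loaded_content_type: str, stored_content_type: str) -> bool:
    """True only if both headers carry a version and the versions are equal.

    A missing version is treated as a mismatch so that the caller errs on
    the side of not combining data from different renders.
    """
    loaded = profile_version(loaded_content_type)
    stored = profile_version(stored_content_type)
    return loaded is not None and loaded == stored
```

With a Varnish-cached 1.5.0 response on load and a 1.6.0 render in the new storage on save, `versions_match` would return `False`, which is the situation described in this ticket.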

I added a few example page titles from the MCS logs to paste P6456. There are plenty more.

I'm doing research on this and I've chosen this particular corruption: https://en.wikipedia.org/w/index.php?title=Ozamiz&type=revision&diff=815149056&oldid=815011048

The new RESTBase storage contains 2 renders of 815011048 and one render of 815149056 with matching TIDs and 1.6.0 profile.

I also went through the Varnish webrequest log and found the request for the HTML that was used to make this particular change; that HTML was indeed served by Varnish.

I think that's enough evidence to support the theory that this was caused by Varnish-cached versions of HTML with the 1.5.0 profile together with the bump in Parsoid version. I'm inclined to resolve the ticket now; I think the mystery is solved here.

Thank you for the investigation, @Pchelolo! Let's not close the ticket just yet, though, as these might still appear until things fall out of cache. We should start the dump ASAP to mitigate this problem.

Pchelolo claimed this task.
Pchelolo edited projects, added Services (done); removed Services (next).

Things have fallen out of cache by now. I've checked the insource search pages for major wikis, and they don't show any new corruptions since yesterday. I'm resolving this ticket.