Page MenuHomePhabricator

Infinite rerender loop in RESTBase
Closed, ResolvedPublic

Description

For some reason the rerender of HTML for the page https://commons.wikimedia.org/wiki/User_talk:BKMzastavka got caught in an infinite loop creating millions of copies of the same thing in the stashing table.

This resulted in 100 writes per second into the stashing table and a partition that grown to 150 gb.

It seems like the loop is originate within RESTBase itself. One weird thing about the page is than in the latest table there was a revision which according to MW doesn't even exist:

select * from "commons_T_parsoid__ng_html".data where "_domain" = 'commons.wikimedia.org' and key = 'User_talk:BKMzastavka';

User_talk:BKMzastavka | 269290610 | ed7e5e6e-d283-11e7-830e-948a9d85eed2
User_talk:BKMzastavka | 269070421 | b00719f8-d1cf-11e7-9ac5-c1fd1670d9dd
User_talk:BKMzastavka | 269070421 | e1aca272-d188-11e7-86f8-ead141b8ee5a

The revision number 269290610 doesn't exist. So, on an attempt to save a new render of 269070421, the latest bucket reported 412 (meaning that the content should be stored in the fallback bucket). The check for content equality probably also failed because of that 269290610 revision existence. The most pressing open question is why did it begin infinitely looping?

Event Timeline

Interesting, that revision 269290610 actually did exist for the page once, but it was somehow deleted, since that revision ID exists in MySQL archive table.

Wonder how is that possible, as it should be impossible to delete the latest revision of the page.

After breaking the loop it seems to have stopped. Digging through the code for a whole day was not fruitful and even an attempt to recreate similar situation by manually inserting some non-existent revision onto the storage didn't let me reproduce this.

I will move it to the services-later board and maybe reevaluate if a similar thing happens again.

This has been an elusive one, but I've solved the puzzle after this has happened again for 2 pages.

Here's what's happened:

  1. The page https://en.wikipedia.org/wiki/Hooghly_Collegiate_School whas vandalized after revision 840365221
  2. Several damaging revisions were created afterward, ending on revision XXX89 - by now restbase was storage being updated as normal, so revision XXX89 ended up being stored in the latest table.
  3. Several revisions were deleted on mediawiki side to revert vandalism, so revision 840365221 became the latest revision. However, RESTBase does not really DELETE the XXX89 from the latest table - we thought revisions can only be suppressed, not really deleted.
  4. So by this point, normal requests to RESTBase for this page is returning revision XXX89 - that's BAD since it's leaking vandalized data and the content has a really old parsoid version - 1.6.0.
  5. By request from Parsing we enable ensure_content_type filter to enforce minimal Parsoid version of 1.7.0 - this one reads data from storage and repeats the request with no-cache header (to make a fresh render) if the content version is too old.
  6. Something requests the page - the filter gets involved, it creates a new render, tries to store it to the latest bucket. But, the revision of the content already in the latest bucket is XXX89 while parsoid gives us back 840365221 since it's now the latest one according to MW. XXX89 was deleted - so the latest bucket returns 412 as we think that was a request for a specific revision (BUG HERE) so we store in the fallback bucket, send resource_change events.
  7. Change-Prop picks up the resource-change, rerenders MCS, it requests latest HTML with no no-cache header -> we loop back to item 6.

An infinite loop gets created which only multiplies over time.

As an ASAP fix I will disable the filter to prevent the loops from forming, as @arlora said it's ok on Parsoid side, but properly fixing it is a much more complex task.

Mentioned in SAL (#wikimedia-operations) [2018-09-06T20:30:19Z] <ppchelko@deploy1001> Started deploy [restbase/deploy@53bc0a6]: Revert bumping Parsoid content-type filter T194190

Mentioned in SAL (#wikimedia-operations) [2018-09-06T20:34:16Z] <ppchelko@deploy1001> Finished deploy [restbase/deploy@53bc0a6]: Revert bumping Parsoid content-type filter T194190 (duration: 03m 56s)

Mentioned in SAL (#wikimedia-operations) [2018-09-06T20:34:32Z] <ppchelko@deploy1001> Started deploy [restbase/deploy@53bc0a6]: Revert bumping Parsoid content-type filter T194190, take 2

Mentioned in SAL (#wikimedia-operations) [2018-09-06T20:43:58Z] <ppchelko@deploy1001> Finished deploy [restbase/deploy@53bc0a6]: Revert bumping Parsoid content-type filter T194190, take 2 (duration: 09m 26s)

Mentioned in SAL (#wikimedia-operations) [2018-09-06T20:44:04Z] <ppchelko@deploy1001> Started deploy [restbase/deploy@53bc0a6]: Revert bumping Parsoid content-type filter T194190, take 3

Mentioned in SAL (#wikimedia-operations) [2018-09-06T20:51:18Z] <ppchelko@deploy1001> Finished deploy [restbase/deploy@53bc0a6]: Revert bumping Parsoid content-type filter T194190, take 3 (duration: 07m 16s)

Mentioned in SAL (#wikimedia-operations) [2018-09-06T20:51:28Z] <ppchelko@deploy1001> Started deploy [restbase/deploy@53bc0a6]: Revert bumping Parsoid content-type filter T194190, take 4

Mentioned in SAL (#wikimedia-operations) [2018-09-06T20:56:15Z] <ppchelko@deploy1001> Finished deploy [restbase/deploy@53bc0a6]: Revert bumping Parsoid content-type filter T194190, take 4 (duration: 04m 47s)

Mentioned in SAL (#wikimedia-operations) [2018-09-06T20:58:14Z] <ppchelko@deploy1001> Started deploy [restbase/deploy@53bc0a6]: Revert bumping Parsoid content-type filter T194190, take 5

Mentioned in SAL (#wikimedia-operations) [2018-09-06T21:02:15Z] <ppchelko@deploy1001> Finished deploy [restbase/deploy@53bc0a6]: Revert bumping Parsoid content-type filter T194190, take 5 (duration: 04m 01s)

Pchelolo edited projects, added Services (done); removed Services (doing).

By deploying T203823 the loop was actually broken, so now this shouldn't be happening anymore. The subtasks represent more long-term solutions, so they will stay open.

Pchelolo closed subtask Restricted Task as Declined.Oct 27 2020, 3:35 PM