Page MenuHomePhabricator

RESTBase sometimes not retaining stashed content?
Closed, ResolvedPublic

Description

There is something very weird going on. It seems that sometimes (read: under unknown circumstances) RESTBase does not store HTML and data-parsoid in the stash bucket when stash=true is passed to it. This has first been noticed in T233127 where VE receives a 404 from RESTBase when POSTing to transform/html/to/wikitext. This happens around 150 times per day.

We added logging in both VE and RB to clarify the situation, but this has only brought more question marks. After debugging this for a while, I'm now convinced it's a problem somewhere in RESTBase and/or Cassandra.

Here are some things that are confirmed not to cause this issue:

Here's an illustrative example for TID [96ca91a0-e9d3-11e9-b260-611ef1df68a5](https://logstash.wikimedia.org/goto/20c95dd0d3b062cdbb5fca06b2b33dfa):

  • VE requests Draft:Sandbox to which RB replies with a 200 with the appropriate headers
  • 8 seconds later VE POSTs to transform the HTML, but RB responds with a 404 because it can't find the stashed content

While I was trying to reproduce the error myself (without success), I noticed a very weird thing. I edited my user sandbox twice in a row with VE and successfully saved both revisions and then started a third edit. However, when I went to look into Cassandra, the first stash was not there any more:

cassandra@cqlsh:enwiki_T_parsoidd3o5Dn1wcj_Xve2tXe4_rtmeWSU> select key, tid, ttl(value) from data where "_domain" = 'en.wikipedia.org' and key = 'User:Mobrovac-WMF/sandbox:920061681:d43dc630-e904-11e9-8729-8b9cbd78a8be';

 key | tid | ttl(value)
-----+-----+------------

(0 rows)

cassandra@cqlsh:enwiki_T_parsoidd3o5Dn1wcj_Xve2tXe4_rtmeWSU> select key, tid, ttl(value) from data where "_domain" = 'en.wikipedia.org' and key = 'User:Mobrovac-WMF/sandbox:920237028:36a81d00-e9cf-11e9-ade6-3b8b9ff9d123';

 key                                                                      | tid                                  | ttl(value)
--------------------------------------------------------------------------+--------------------------------------+------------
 User:Mobrovac-WMF/sandbox:920237028:36a81d00-e9cf-11e9-ade6-3b8b9ff9d123 | 380eb1e0-e9cf-11e9-a24d-d79af64c7434 |      84190

(1 rows)

Event Timeline

mobrovac created this task.

@Pchelolo @Eevans could you take a look? Perhaps I've overlooked something, but to me this is starting to smell like Cassandra is losing data somehow.

@Pchelolo @Eevans could you take a look? Perhaps I've overlooked something, but to me this is starting to smell like Cassandra is losing data somehow.

What consistency level did you use in cqlsh (it defaults to ONE)?

What consistency level did you use in cqlsh (it defaults to ONE)?

Indeed I did not explicitly set the consistency level, but repeating the query with quorum consistency still gives an empty set.

Mentioned in SAL (#wikimedia-operations) [2019-10-09T12:20:41Z] <mobrovac@deploy1001> Started deploy [restbase/deploy@068d2ed]: Feed: Use Wikifeeds; Parsoid: Use the ETag revid for stashing and use the same ETag for stashing and response - T170455 T234928

Mentioned in SAL (#wikimedia-operations) [2019-10-09T12:30:21Z] <mobrovac@deploy1001> Finished deploy [restbase/deploy@068d2ed]: Feed: Use Wikifeeds; Parsoid: Use the ETag revid for stashing and use the same ETag for stashing and response - T170455 T234928 (duration: 09m 40s)

Mentioned in SAL (#wikimedia-operations) [2019-10-09T12:59:24Z] <mobrovac@deploy1001> Started deploy [restbase/deploy@aaadd73]: Parsoid: Retry fetching stashes with undefined as the revid - T234928

Mentioned in SAL (#wikimedia-operations) [2019-10-09T13:13:50Z] <mobrovac@deploy1001> Finished deploy [restbase/deploy@aaadd73]: Parsoid: Retry fetching stashes with undefined as the revid - T234928 (duration: 14m 26s)

Ha! I think I found the bug. The problem is that when a user first opens the page and starts editing with VE, then VE calls /page/html/{title}?stash=true, but RESTBase expects the revision ID to be present as well, which causes it to store the page under the key {title}:undefined:{tid} instead of {title}:{revid}:{tid} in the stash bucket. However, if the user tries to VE-edit a page that they edited before (i.e. it was already loaded, because they have already edited it either via VE or the wt editor), then VE calls /page/html/{title}/{revision}?stash=true.

I have deployed the fix as well as a fallback to look for {title}:undefined:{tid} in case the original stash could not be found, and so far so good - no transforms are failing! I will keep the task open and monitor for a little while longer to ensure this is truly the case (and that no other edge cases materialise).

Ha! I think I found the bug. The problem is that when a user first opens the page and starts editing with VE, then VE calls /page/html/{title}?stash=true, but RESTBase expects the revision ID to be present as well, which causes it to store the page under the key {title}:undefined:{tid} instead of {title}:{revid}:{tid} in the stash bucket. However, if the user tries to VE-edit a page that they edited before (i.e. it was already loaded, because they have already edited it either via VE or the wt editor), then VE calls /page/html/{title}/{revision}?stash=true.

I have deployed the fix as well as a fallback to look for {title}:undefined:{tid} in case the original stash could not be found, and so far so good - no transforms are failing! I will keep the task open and monitor for a little while longer to ensure this is truly the case (and that no other edge cases materialise).

Nice catch!

Change 542061 had a related patch set uploaded (by Mobrovac; owner: Mobrovac):
[operations/deployment-charts@master] RESTRouter: Use image v1.1.3

https://gerrit.wikimedia.org/r/542061

Change 542061 merged by jenkins-bot:
[operations/deployment-charts@master] RESTRouter: Use image v1.1.3

https://gerrit.wikimedia.org/r/542061

Mentioned in SAL (#wikimedia-operations) [2019-10-14T07:28:17Z] <mobrovac@deploy1001> Started deploy [restbase/deploy@e0d071f]: Remove VE logging and stop using storage for /page/metadata - T234928 T235173

Change 542794 had a related patch set uploaded (by Mobrovac; owner: Mobrovac):
[mediawiki/services/restbase/deploy@master] Revert "[TEMP] Log all VE requests"

https://gerrit.wikimedia.org/r/542794

Change 542794 merged by Mobrovac:
[mediawiki/services/restbase/deploy@master] Revert "[TEMP] Log all VE requests"

https://gerrit.wikimedia.org/r/542794

Mentioned in SAL (#wikimedia-operations) [2019-10-14T07:41:54Z] <mobrovac@deploy1001> Finished deploy [restbase/deploy@e0d071f]: Remove VE logging and stop using storage for /page/metadata - T234928 T235173 (duration: 13m 37s)

mobrovac removed a project: Patch-For-Review.

The bug has been found and fixed and temporary logging removed. We are done here.