
DiscussionTools touched unrelated parts of the page
Closed, Resolved (Public)

Description

This task is for two separate wikitext corruption issues that occurred when adding comments to pages using DiscussionTools. They were both triggered by the deployment of the replying API (T252558), but had different causes and solutions. We should have made two tasks, but it wasn't initially clear that they were unrelated, and now all of the comments are on this task, so let's keep it as is.

Issue 1

(occurred on: 6-7 August; affected edits: ~50-100, see T260393#6384098)

When the replying API was serializing the modified document to HTML, it encoded various characters as HTML entities rather than plain text, so Parsoid's selser no longer recognized unmodified parts of the page, causing dirty diffs.

Most of the changes did not damage the page, and only caused distracting diffs, e.g. namespaces in internal links being changed to the canonical ones, external links and internal link anchors being percent-encoded, spaces at the ends of lines being removed.

However, some of them exposed bugs in Parsoid that generate incorrect wikitext, in particular unnecessarily generating a |link= parameter (or a localised version) for images (T108504). Also, percent-encoded links are quite annoying in non-Latin-alphabet languages.
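For anyone unfamiliar with how selser decides what to reuse, here is a conceptual PHP sketch (not Parsoid's actual code, and the example strings are made up) of why a purely byte-level change in the serialized HTML is enough to produce a dirty diff:

<?php
// Conceptual sketch only: selser reuses the author's original wikitext for
// subtrees whose HTML is unchanged; anything that looks modified is
// re-serialized from the HTML and thereby normalized.
function selserSource( string $origHtml, string $editedHtml, string $origWikitext ): string {
	if ( $editedHtml === $origHtml ) {
		// Unmodified subtree: emit the original wikitext verbatim (no dirty diff).
		return $origWikitext;
	}
	// "Modified" subtree: re-serialize from HTML. In real Parsoid this is where
	// namespaces get canonicalized, anchors percent-encoded, trailing spaces dropped.
	return '<re-serialized from HTML, normalized>';
}

$origHtml  = '<a href="./Diskussion:Huvudsida">länk</a>';
// The buggy API re-emitted the same, untouched link with a character entity-encoded:
$buggyHtml = '<a href="./Diskussion&#x3A;Huvudsida">länk</a>';

echo selserSource( $origHtml, $origHtml,  '[[Diskussion:Huvudsida|länk]]' ), "\n"; // original wikitext kept
echo selserSource( $origHtml, $buggyHtml, '[[Diskussion:Huvudsida|länk]]' ), "\n"; // re-serialized => dirty diff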

Fixed in: https://gerrit.wikimedia.org/r/619018

Issue 2

(occurred on: 12-13 August; affected edits: 2)

When Parsoid was transforming an HTML document to wikitext, if there was very high replication lag ("replag"), and if a specific page revision was requested rather than "latest", Parsoid's selser would fetch the wikitext for unmodified parts of the page from the wrong revision, causing page corruption.

At the time some database replica servers were delayed by several hours due to maintenance.

The DiscussionTools replying API loads the latest revision of the page using a database query, then asks Parsoid for that revision. In contrast, VisualEditor asks Parsoid for the "latest" revision and then checks which revision it received, so it was never affected by this issue. (It's like this because the former was simpler to implement in server-side code and the latter was simpler in client-side code.)

Only two edits were affected:

Fixed in: https://gerrit.wikimedia.org/r/621621
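To make the failure mode concrete, here is a simplified, hypothetical sketch (not the actual PageConfigFactory code) of the lookup that went wrong, and the shape of the guard the fix adds:

<?php
// Hypothetical simplified sketch. DiscussionTools resolved "latest" on a lagged
// replica and passed that revid along; Parsoid's page-config code then looked the
// revid up again, got null, and silently fell back to the page's real latest
// revision -- so selser mixed HTML from one revision with wikitext from another.
use MediaWiki\MediaWikiServices;

$revId = 56909335; // stale "latest" revid obtained via a lagged replica
$store = MediaWikiServices::getInstance()->getRevisionStore();
$rev = $store->getRevisionById( $revId );

if ( $rev === null ) {
	// Before the fix (in effect): fall through and use the latest revision instead,
	// which is what corrupted the two edits.
	// After the fix (in spirit): refuse to continue rather than guess.
	throw new RuntimeException( "Specified revision $revId does not exist" );
}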

Testing instructions

At the Arabic, Catalan, Chinese, Czech, Dutch, Korean, Serbian and Swedish beta clusters [i], do the following:

  1. Write and publish a comment in said wiki's native language using the Reply Tool's visual mode
  2. Write and publish a comment in said wiki's native language using the Reply Tool's source mode
  3. Ensure that "Step 1" and "Step 2" above did not cause dirty diffs (read: no additional changes are made to the page beyond the addition of the text that was written with the Reply Tool)
    • ⚠️ Please pay special attention to the issues described in the "Issue 1" and "Issue 2" sections above.

i. https://meta.wikimedia.beta.wmflabs.org/wiki/Special:SiteMatrix

Related Objects

Mentioned In
T261715: MediaWiki doesn't detect edit conflicts when replica databases don't have the latest edit
T261321: Inspect Reply Tool diffs for instances of corruption
T259022: Revision 105215021 belongs to page ID 2442048, the provided Title object belongs to page ID 8437055
T108504: Parsoid/VE adds images with pointless link parameter
T257581: Visual Editor adds link= parameter to image when only caption is edited
T234549: "Properly" address missing srcText issues in PageConfigFrame / missing revID in REST API
T182740: Detect when a page save is likely to be a corruption
T260393: Review page corruption (dirty diffs) caused by DiscussionTools
T260294: "TypeError: Cannot read property 'linterrors' of undefined" when replying to a comment
T252558: Create a low bandwidth reply API using parser.php/modifier.php
Mentioned Here
T261321: Inspect Reply Tool diffs for instances of corruption
T108504: Parsoid/VE adds images with pointless link parameter
T260393: Review page corruption (dirty diffs) caused by DiscussionTools
T234549: "Properly" address missing srcText issues in PageConfigFrame / missing revID in REST API
T182740: Detect when a page save is likely to be a corruption
T227216: Adding or editing citations using VisualEditor causes major formatting issues involving pipes, equals signs and nowiki tags
T260067: (HTTP) 504 Gateway Timeout error when trying to save
T260294: "TypeError: Cannot read property 'linterrors' of undefined" when replying to a comment
T252558: Create a low bandwidth reply API using parser.php/modifier.php

Event Timeline

There are a very large number of changes, so older changes are hidden.

Thanks for fixing this issue.

It's done with the patch https://gerrit.wikimedia.org/r/619018, so let's keep using this ticket.

Noted. Thank you, Bartosz.

The root cause was fixed in 619018.

Here is a recreation of one of the failing cases on sv.wiki working as expected on sv.beta.wiki:

https://sv.wikipedia.beta.wmflabs.org/w/index.php?title=Diskussion%3AHuvudsida&type=revision&diff=23&oldid=22

Excellent. Thank you, Ed. A question for you:

  • Does the task description's "Testing instructions" section accurately and exhaustively describe the testing that should be done on this patch?
    • If it does NOT, can you please edit to make it so and then move this ticket to the Editing QA's "High priority" column?
    • If it does, can you please move this ticket to the Editing QA's "High priority" column?

Didn't notice any dirty diffs on these pages during my test, but please double check.

Excellent – thank you, Rummana. I'm going to review "live" diffs once the fix reaches all Wikipedias on Thursday.

When testing the revert in production, I got the error message "TypeError: Cannot read property 'linterrors' of undefined" several times, but it stopped happening after refreshing the page a few times, before I got the chance to find out what was causing it. I suspect it was caused by some temporary issue with the deployment, like the wrong version of some JS files being cached/served. Just documenting this here in case someone else sees that error.

Okay, I saw this error again myself. I filed T260294 about it now. I don't think it's related to the main problem here.

I assume the Thursday fix train didn't arrive this morning yet, but I also assume more examples of disruptive reply tool edits are still welcome. (Or do these fall into another category, especially the first one mentioned?)

Tacsipacsi raised the priority of this task from High to Unbreak Now!. Aug 13 2020, 5:48 PM

I assume the Thursday fix train didn't arrive this morning yet

Actually it arrived yesterday in the evening: https://sal.toolforge.org/log/c78a5HMBj_Bg1xd3qfwz (tomorrow is a holiday).

Similar breakage also happened on huwiki. Raising to UBN; it should probably be rolled back to the previous version once again.

This seems to be a lot less common than the previous problem, but we should probably still revert…

Change 620095 had a related patch set uploaded (by Bartosz Dziewoński; owner: Bartosz Dziewoński):
[mediawiki/extensions/DiscussionTools@master] Revert new reply API (again)

https://gerrit.wikimedia.org/r/620095

Change 620029 had a related patch set uploaded (by Bartosz Dziewoński; owner: Bartosz Dziewoński):
[mediawiki/extensions/DiscussionTools@wmf/1.35.0-wmf.4] Revert new reply API (again)

https://gerrit.wikimedia.org/r/620029

Change 620029 abandoned by Bartosz Dziewoński:
[mediawiki/extensions/DiscussionTools@wmf/1.35.0-wmf.4] Revert new reply API (again)

Reason:
Wrong branch

https://gerrit.wikimedia.org/r/620029

Change 620030 had a related patch set uploaded (by Bartosz Dziewoński; owner: Bartosz Dziewoński):
[mediawiki/extensions/DiscussionTools@wmf/1.36.0-wmf.4] Revert new reply API (again)

https://gerrit.wikimedia.org/r/620030

Change 620030 merged by jenkins-bot:
[mediawiki/extensions/DiscussionTools@wmf/1.36.0-wmf.4] Revert new reply API (again)

https://gerrit.wikimedia.org/r/620030

Mentioned in SAL (#wikimedia-operations) [2020-08-13T19:35:06Z] <thcipriani@deploy1001> Synchronized php-1.36.0-wmf.4/extensions/DiscussionTools: [[gerrit:620030|Revert new reply API (again)]] T259855 (duration: 00m 57s)

Urbanecm lowered the priority of this task from Unbreak Now! to High. Aug 13 2020, 8:11 PM
Urbanecm subscribed.

No longer an UBN

I wrote a small script to review all edits made with DiscussionTools since the wmf.4 deployment on the wikis where it's enabled as a beta feature, and I haven't found any other edits that have this kind of corruption – only the two that were already reported here:

CC @cscott the above 2 diffs are very strange ^^^

I can't reproduce any corruption with the same content locally. It doesn't appear to be a browser issue as the same users managed to successfully reply elsewhere on the page.

This leads me to think it was some sort of contemporaneous issue like an edit conflict or cache corruption.

Meta
I'm moving this to "Doing" considering the underlying issue(s) are still being researched.

  • @Esanders / @matmarex: considering the corruption Marc and Tacsipacsi reported [i][ii] seems to have been caused by something different from what prompted this original ticket [iii], should this ticket – T259855 – be resolved and a new ticket created for the most recent cases?

i. T259855#6383590
ii. T259855#6383650
iii. As Bartosz alluded to in T259855#6383706

Our initial diagnosis is that this is DSR corruption, aka data-parsoid corruption. This is the only mechanism that causes this sort of symptom.

There was an old Parsoid/stashing/frame source text bug with similar symptoms: T227216: Adding or editing citations using VisualEditor causes major formatting issues involving pipes, equals signs and nowiki tags.

There are a couple possible root causes for the DSR corruption:

  1. Something having to do with the edit stash mechanism (is there a mode to switch between wikitext and VE editing in this deployed version of reply tools?)
  2. Corrupted IDs during the edit, causing incorrect association of data-parsoid etc
  3. Corrupted/missing etag or revision id (note that the etag is more precise: you can parse a given revision id in multiple ways, it's possible you're using the revision id but with the wrong etag/timestamp)
  4. Possibly an edit conflict sort of situation, but probably this would be just a symptom of incorrect etag
  5. Blank etags used by VE in some situations? T260067#6374117

Relatedly, we are going to revive T182740: Detect when a page save is likely to be a corruption and try to roll it out sooner rather than later, so that we prevent such saves from going through and return an error with sufficient log info to detect root causes, instead of trying to speculate after the fact like we are doing now.

So, I stared at https://nl.wikipedia.org/w/index.php?title=Wikipedia%3AVerzoekpagina_voor_moderatoren%2FRegBlok&type=revision&diff=56909511&oldid=56909434 for a bit and here is some info that might be useful. I am looking at pretty much the end of the diff there and I am going to paste some snippets here:

If you look at the second column of the diff (the "after" part), you will notice some interesting pieces of the corruption (which I spotted because some pieces of corrupted text were lining up in the diff with headings on that same side):

  1. n PA. Deze di on the right is the same length as ===DPTC.nl=== on the left.
  2. *{{d}} - ongewe on the right is the same length as === Marenzs === on the left.
  3. So, I thought huh... maybe the section between those 2 headings on the left has the same length as the corresponding corrupted text on the right. And indeed it does:
cussie hoort verder niet op deze plaats thuis, maar op overlegpagina's. [[Gebruiker:Joris|Joris]] ([[Overleg gebruiker:Joris|overleg]]) 5 aug 2020 15:43 (CEST)
===DPTC.nl===
*{{lg|DPTC.nl}} - Domeinnaam, geen gebruiker dus. [[Gebruiker:Japiot|Japiot]] ([[Overleg gebruiker:Japiot|overleg]]) 7 aug 2020 14:06 (CEST)

on the right is the exact same length as this on the left:

*{{lg|DPTC.nl}} - Domeinnaam, geen gebruiker dus. [[Gebruiker:Japiot|Japiot]] ([[Overleg gebruiker:Japiot|overleg]]) 7 aug 2020 14:06 (CEST)
**{{d}} - ongewenste gebruikersnaam - buiten gebruik genomen; Bedankt voor het melden. [[Gebruiker:Tulp8|Tulp8]] ([[Overleg gebruiker:Tulp8|overleg]]) 7 aug 2020 14:11 (CEST)

The offset shift is exactly 175 characters. And guess what. The revision before this added 175 characters to the page.

So, here is what might have happened.

Both editors opened the same version of the page, and so have the same DSR offsets for the HTML. But, in between, editor Edoderoo saved their version which added 175 characters to the page. And, when editor Ymnes saved their edit, the offsets refer to the version of the page before Edoderoo's, but Parsoid is probably using wikitext for the page that comes from Edoderoo's save => everything is shifted by -175 characters.

I don't know where the root cause of the bug lies but, something in the "edit conflict" handling is broken.
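To make the offset arithmetic concrete, here is a tiny self-contained PHP illustration (made-up strings, not Parsoid code) of what happens when DSR offsets recorded against one revision are applied to a different revision's wikitext:

<?php
// Made-up illustration: offsets recorded against the old revision, applied to the
// new revision's wikitext (which has 175 extra characters inserted earlier in the
// page), select text from 175 characters too early.
$oldWikitext = "===DPTC.nl===\n*{{lg|DPTC.nl}} - Domeinnaam, geen gebruiker dus. ...";
$newWikitext = str_repeat( 'x', 175 ) . $oldWikitext; // the intermediate edit's +175 chars

// A node whose DSR range is [0, 13) in the old revision (the section heading):
echo substr( $oldWikitext, 0, 13 ), "\n"; // "===DPTC.nl===" -- correct
echo substr( $newWikitext, 0, 13 ), "\n"; // "xxxxxxxxxxxxx" -- shifted, wrong text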

There is a remarkable resonance to T227216 here in that Parsoid is using wikitext source from the wrong revision because Parsoid is unaware of the intermediate save. If page src is posted to Parsoid via the HTTP save request, then clearly RESTBase/VE is messing up. But, if Parsoid is fetching page source (as in T227216) for the revision, then, somehow Parsoid has the wrong revid or is ignoring the revid.

Change 620095 merged by jenkins-bot:
[mediawiki/extensions/DiscussionTools@master] Revert new reply API (again)

https://gerrit.wikimedia.org/r/620095

....
So, here is what might have happened.

Both editors opened the same version of the page, and so have the same DSR offsets for the HTML. But, in between, editor Edoderoo saved their version which added 175 characters to the page. And, when editor Ymnes saved their edit, the offsets refer to the version of the page before Edoderoo's, but Parsoid is probably using wikitext for the page that comes from Edoderoo's save => everything is shifted by -175 characters.

I don't know where the root cause of the bug lies but, something in the "edit conflict" handling is broken.

So, I looked at https://hu.wikipedia.org/w/index.php?title=Wikip%C3%A9dia%3AAdminisztr%C3%A1torok_%C3%BCzen%C5%91fala&type=revision&diff=22943756&oldid=22943733 as well now.
It has an intermediate edit that added 2332 chars which is actually that entire section in the left hand column of the diff starting with:

:Habár retorikailag kifizetődőnek tűnik a szavakon lovagolni (mintha kőbe lennének itt vésve az egyes használt kifejezések és nem az lenne a lényeg, ..

And, you will see that when Parsoid serialized the HTML from this edit, it picked the substring :Habár retorikailag kifizetődő in place of the header === A Lecsó cikk lapvédelme === and so on. The mismatch is exactly equal to the length of that intermediate comment. So it looks like a case of the wikitext source coming from the latest revision instead of from the original revid.

So, both the huwiki and nlwiki scenarios are clear "edit conflict" cases. The unresolved question is why Parsoid got the wrong wikitext source. @Pchelolo clarified on IRC that RESTBase doesn't post wikitext to Parsoid. So, to narrow it down further: (a) either RESTBase did not post the revid with its HTML -> wikitext request => Parsoid will fetch latest wikitext source OR (b) Parsoid misfetched the original wikitext for the page despite having the right revid.

An unrelated issue is why DiscussionTools / VE missed flagging these as edit conflicts. That should be separately investigated by the Editing-team but on the Parsoid end, we are narrowly interested in why Parsoid got the wrong wikitext source in these edit conflicts that slipped through to Parsoid.

An unrelated issue is why DiscussionTools / VE missed flagging these as edit conflicts. That should be separately investigated by the Editing-team but on the Parsoid end, we are narrowly interested in why Parsoid got the wrong wikitext source in these edit conflicts that slipped through to Parsoid.

I don't think edit conflict handling ever came into play here. Edit conflicts can only happen when an edit attempt uses a revision different than the latest one as the base revision (as indicated by basetimestamp or baserevid). But as far as I can tell, DiscussionTools knew which revision was latest and used that one, so no edit conflict could happen.

The issue with RESTBase or Parsoid using a different revision for transforming HTML to wikitext is entirely separate. Edit conflicts are only a thing in the wikitext land, after we have already transformed our HTML to wikitext, and when we are trying to save a page with that wikitext.

An unrelated issue is why DiscussionTools / VE missed flagging these as edit conflicts. That should be separately investigated by the Editing-team but on the Parsoid end, we are narrowly interested in why Parsoid got the wrong wikitext source in these edit conflicts that slipped through to Parsoid.

I don't think edit conflict handling ever came into play here. Edit conflicts can only happen when an edit attempt uses a revision different than the latest one as the base revision (as indicated by basetimestamp or baserevid). But as far as I can tell, DiscussionTools knew which revision was latest and used that one, so no edit conflict could happen.

The issue with RESTBase or Parsoid using a different revision for transforming HTML to wikitext is entirely separate. Edit conflicts are only a thing in the wikitext land, after we have already transformed our HTML to wikitext, and when we are trying to save a page with that wikitext.

I probably should have used a different term. What I meant was: from the corruption POV, both edits seem to be using the same base wikitext revision, and any attempt to save both would inevitably conflict and hence didn't get detected.
But, you are saying because of some bug in the communication pipeline between VE -> RB -> Parsoid, the corruption might look like an edit conflict, but on the UI end, it never was an edit conflict in any sense. That could be the case!

Does anyone know what https://logstash.wikimedia.org/app/kibana#/doc/logstash-*/logstash-mediawiki-2020.08.13/mediawiki?id=AXPmAWqf3_NNwgAU8tCg&_g=h@a77a039 and https://logstash.wikimedia.org/app/kibana#/doc/logstash-*/logstash-mediawiki-2020.08.12/mediawiki?id=AXPkRRWK3_NNwgAUHD3I&_g=h@58e4df1 mean?

Those are from logstash for the two edits that preceded the corrupt edits on nlwiki & huwiki. There are many other such stashedit errors in logstash too. I'm not sure what 'StashEdit' is, but I'm pasting these here in case it is related.

Looks like those come from PageEditStash.php and may not have a bearing on the issue, or at least I cannot see how.

Here's the theory, sanity-checked as plausible by @tstarling. Revision IDs are given for the corrupted edit at https://nl.wikipedia.org/w/index.php?title=Wikipedia:Verzoekpagina_voor_moderatoren/RegBlok&diff=next&oldid=56909434

  • In ApiDiscussionToolsEdit we get not-the-latest-revision, in this case 56909335, due to DB replica lag
    • ApiParsoidTrait::getLatestRevision has the actual DB call, to RevisionLookup::getRevisionByTitle()
    • Note that VE doesn't have this code path; it does a client-side request to RESTBase w/o revid, and uses the revid from the response. This could have different behavior under replag
  • The 56909335 revid is held by DiscussionTools and passed to RESTBase who turns around and passes it to Parsoid
  • It's important that Parsoid *thinks* it has a non-zero non-null oldid, or else it will not use selser
  • However, in PageConfigFactory::create Parsoid calls RevisionStore::getRevisionById( 56909335 ) and gets null (!)
    • This gets a replica DB connection and *should* attempt to load the revision from master instead if it fails, but:
    • The condition includes $lb->hasOrMadeRecentMasterChanges(), which was added in https://gerrit.wikimedia.org/r/c/mediawiki/core/+/200810; that must be false here. (We know $flags is 0 and ::hasStreamingReplicaServers() is true)
  • That causes Parsoid later on to fetch wikitext corresponding to the *latest* revision for the title, which is 56909434, and that succeeds
  • Parsoid then tries to do selective serialization using the HTML from 56909335 but the wikitext from 56909434, which results in corruption
  • The earlier edit removed content from line 88, which is *lower* in the document than the new reply being made on line 76. So the selser corruption only occurs starting at line 88. When EditPage detects the edit conflict (because DiscussionTools is using revid 56909335, not the latest) it tries to do a 3-way merge to resolve it (a minimal sketch of that merge follows this list). Because the area around the new reply is uncorrupted (the corruption starts below), the 3-way merge succeeds and the merged version is saved. (Otherwise the user would get a warning about an edit conflict and the corruption wouldn't be saved.)
    • The edit conflict handing in VE is likely different here as well. There may be other codepaths to catch an edit conflict, and if it is submitted to EditPage VE edits are probably much more likely to fail the attempted 3-way merge and thus not result in saved corruption.
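For reference, a minimal sketch of that 3-way merge using MediaWiki's wfMerge() (which shells out to diff3; this assumes MediaWiki is loaded and $wgDiff3 points at a working binary). Non-overlapping changes merge cleanly, so no edit-conflict warning is shown:

<?php
// Minimal sketch: wfMerge() performs the same kind of diff3-based 3-way merge
// that EditPage attempts when the base revision isn't the latest one.
$base  = "== A ==\nexisting discussion\n\n== B ==\nline removed by the other editor\n";
$mine  = "== A ==\nexisting discussion\n:New reply ~~~~\n\n== B ==\nline removed by the other editor\n";
$yours = "== A ==\nexisting discussion\n\n== B ==\n";

$result = '';
if ( wfMerge( $base, $mine, $yours, $result ) ) {
	// Both changes are applied and the save goes through; in the real case the
	// corrupted region below the reply happened to merge just as quietly.
	echo $result;
}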

I tested this by manually hacking ApiParsoidTrait::getLatestRevision() to return an old revision, and manually resetting the $rev to null after Parsoid attempts the getRevisionById() call. With these hacks I was able to reproduce corruption on my local wiki with the same appearance as this bug. As long as the corruption was below the new reply being added, I could avoid the edit conflict and the corruption would be saved.

Two follow ups:

  1. Restructure the Parsoid page config code so that if getRevisionById() returns null, we throw an error or crash, instead of silently proceeding to use the latest revid instead.
    • still need to figure out what HTTP error code to return if we can't get a revision -- 404? 500?
    • we could probably add more sanity-checking: that the version in the HTML <head> matches the revid we expect; that PageConfig::getRevision()->getId() matches the ID we expect; that selser in certain easy cases lines up properly (T182740)
  2. Consult with @aaron about whether/how to retry getRevisionById() (a rough sketch of such a retry follows this list)
    • @tstarling says, ask whether he "thinks it's better to throw an exception on replag or to hit the master or if he has some better idea"
    • "if the answer is to hit the master then that should be implemented as a new flag to getRevisionById(), otherwise it may hit the master twice if hasOrMadeRecentMasterChanges() is true"
    • "you can read Aaron's considerations on that change I just linked to -- it wasn't just replag causing master traffic, it was every missing revision, every bad URL"
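One possible shape for that retry, sketched here purely for discussion (not a merged change, and it has exactly the double-master-hit caveat noted above):

<?php
// Rough sketch for discussion only: retry against the primary DB when the replica
// claims the revision doesn't exist. Caveat: if hasOrMadeRecentMasterChanges() was
// true, the first call may already have queried the master, so this can hit it twice.
use MediaWiki\MediaWikiServices;

$revId = 56909335;
$store = MediaWikiServices::getInstance()->getRevisionStore();
$rev = $store->getRevisionById( $revId );
if ( $rev === null ) {
	// The replica may simply be lagged; READ_LATEST forces the query to the primary.
	$rev = $store->getRevisionById( $revId, IDBAccessObject::READ_LATEST );
}
if ( $rev === null ) {
	// Genuinely unknown revision: fail loudly instead of serializing against "latest".
	throw new RuntimeException( "Revision $revId not found" );
}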

Two lesser-priority follow-ups:

  • DiscussionTools could probably refactor to use the same codepath VE does, that is, make the first restbase request w/o revid, and parse the revid from the response, rather than have the extra DB query to get the latest revision first. Might not solve this bug, but it would increase commonality of codepaths and get a slight bit of extra performance.
  • We should really revisit the REST API w/ an eye to explicitly differentiating the "no revid" case (when you are creating a new page) from the "revid expected" case (every other edit). There are "missing source" bugs possible in many places in our codebase which are masked by the fact that sometimes source is expected and sometimes not (T234549: "Properly" address missing srcText issues in PageConfigFrame / missing revID in REST API).

I have some questions to help me understand:

Revision 56909335 was saved at 00:39, the next one at 04:08 and the buggy one at 05:55. Does this mean there were several hours of lag between master and replicas? (I didn't think that was realistic, so I ignored this possibility earlier when thinking about edit conflicts.)

How is this possible? Was it querying from a different replica server even more lagged than the one DiscussionTools got that revid from, so that revision really didn't exist there?

I tested this by manually hacking ApiParsoidTrait::getLatestRevision() to return an old revision, and manually resetting the $rev to null after Parsoid attempts the getRevisionById() call. With these hacks I was able to reproduce corruption on my local wiki with the same appearance as this bug. As long as the corruption was below the new reply being added, I could avoid the edit conflict and the corruption would be saved.

So this scenario can also end with an edit conflict rather than successful merge? If so, we could probably look at error logging data in DiscussionTools to see if we had an increased number of them logged around the time of the corrupted edits, to confirm this.

Two lesser-priority follow-ups:

  • DiscussionTools could probably refactor to use the same codepath VE does, that is, make the first restbase request w/o revid, and parse the revid from the response, rather than have the extra DB query to get the latest revision first. Might not solve this bug, but it would increase commonality of codepaths and get a slight bit of extra performance.

I think we've implemented that this way because it was easier to request a specific revision than to parse the response to see which revision we got, but it could be done the other way if you think it's better.

I also wonder if it would have been better to just query master to get the latest revision? That query will have to happen eventually anyway, in the same HTTP request, when we attempt to save the edit (although that code is very far away from the ApiDiscussionToolsEdit code).

Revision 56909335 was saved at 00:39, the next one at 04:08 and the buggy one at 05:55. Does this mean there was several hours of lag between master and replicas? (I didn't think that was realistic, so I ignored this possibility earlier when thinking about edit conflicts.)

One replica server was apparently at least 1h47m lagged, so when the save was initiated at 2020-08-13T05:55:05Z it didn't yet have the revision saved at 2020-08-13T04:08:37Z. And in fact it appears that db1126 was being "slowly repooled" during this time period, if I'm reading https://wikitech.wikimedia.org/wiki/Server_Admin_Log#2020-08-13 correctly. Also MySQL was stopped on db2135 at 05:43.

How is this possible? Was it querying from a different replica sever even more lagged than the one DiscussionTools got that revid from, so that revision really didn't exist there?

Yeah, crazy, right? Yet at 2020-08-13T05:56:59 we have a message in logstash that says, "Server db2082 has 217570.4634769 seconds of lag (>= 6)" so it's apparently possible (that's over 60 hrs!). Usually we'd fall back and query the master before trusting the word of one replica, but because of the oversight/deliberate design choice that @tstarling noted, this code path didn't do that. Obviously we have to pay more careful attention to replica/master distinctions in codepaths that could lead to corruption.

So this scenario can also end with an edit conflict rather than successful merge? If so, we could probably look at error logging data in DiscussionTools to see if we had an increased number of them logged around the time of the corrupted edits, to confirm this.

I didn't see any non-debug logging of edit conflicts in EditPage so I didn't look at this further. But if discussion tools logs edit conflicts, then yes, it would be worth taking a look at it. Since the prerequisite is that DiscussionTools is using a not-the-latest revision as its base, most corruption will prevent the 3way merge with the latest revision that EditPage wants to do (and I'm pretty sure EditPage is careful to talk to the master at the appropriate times). We got 'unlucky' in that the corruption began *below* the edit that was being merged, so the merge succeeded.

I think we've implemented that this way because it was easier to request a specific revision than to parse the response to see which revision we got, but it could be done the other way if you think it's better.

VE appears to sanity check the version it receives against the version it requested. That's certainly worthwhile, in that it would prevent bugs like this before they lead to corruption. (On the other hand, corruption-inducing bugs "should" never happen, so it's not worthwhile in the sense that it never "should" catch any bugs...)

I also wonder if it would have been better to just query master to get the latest revision? That query will have to happen eventually anyway, in the same HTTP request, when we attempt to save the edit (although that code is very far away from the ApiDiscussionToolsEdit code).

Yes, that's what we're going to discuss with @aaron. We don't want to DoS master by forcing every request to go through it, and in my testing I was generally surprised to note that w/ the 3way merge algorithm in EditPage things generally seemed to work Just Fine if DiscussionTools used a not-the-latest revision, so long as Parsoid actually fetched the right wikitext for it. So my inclination is to say that DiscussionTools probably doesn't have to worry about DB lag, just ensure that everything it's using is consistent. Parsoid, on the other hand, should probably try a lot harder to ensure that its revision IDs all match up, which probably will mean falling back to the master at some point.

The Sherlock Holmes quote comes to mind here: "When you have eliminated the impossible, whatever remains, however improbable, must be the truth." https://en.wikiquote.org/wiki/Sherlock_Holmes#The_Adventure_of_the_Blanched_Soldier

I started by hacking at the code to try to reproduce the symptoms, and almost nothing worked: it would throw an error, or succeed without corruption, etc. Eventually I managed to get something which matched the symptoms, but it depended on a pair of (what seemed to me) unlikely situations: that DiscussionTools would get a 2hr-old revision as "latest" and then that getRevisionById() would then return null for that revid. But, as we've dug further into it, it seems like that improbable situation is actually not so impossible after all.

Leaving a comment in response to a question on IRC by @ssastry

  • (Historical) Revision/RevisionStore have since the beginning of time always had a fallback to query the master if a rev ID that is expected to exist is not found on the replica.
  • (Historical) the RDBMS layer will generally avoid using a db connection with known lag. It retries a number of times. If most/all replicas are lagged due to maintenance or whatever, it proactively enacts "read only" mode and refuses to perform writes. In this state, it will also not allow submitting of the edit form and things like that. As such, I'd not recommend "re-trying" on (another) replica as this will have already been done by this point. Trying it again will either use the same connection or would require bypassing protection layers that should not be bypassed.
  • Tim and Subbu mentioned that this RevisionStore fallback seems to now be limited to POST requests. I forgot about that, but it makes sense indeed. It is generally undesirable to involve the master on GET requests. From what I vaguely recall, we found that the known use cases for this hack were either since fixed, or limited to POST requests, or mitigated by Chronology Protector already, or cases where the rev ID is bogus/invalid/unknown and thus querying the master is just extra load to generate the same error.
  • From what I can see, RevisionStore correctly handles this and returns null the same way it would for other presumed-unknown or invalid rev IDs. It seems there is a bug in Parsoid in that it doesn't check this and falls back to using a revision that is imho outside of the expected contract. It should return an error instead.

@cscott Thanks, that helps.

So this scenario can also end with an edit conflict rather than successful merge? If so, we could probably look at error logging data in DiscussionTools to see if we had an increased number of them logged around the time of the corrupted edits, to confirm this.

I didn't see any non-debug logging of edit conflicts in EditPage so I didn't look at this further. But if discussion tools logs edit conflicts, then yes, it would be worth taking a look at it. Since the prerequisite is that DiscussionTools is using a not-the-latest revision as its base, most corruption will prevent the 3way merge with the latest revision that EditPage wants to do (and I'm pretty sure EditPage is careful to talk to the master at the appropriate times). We got 'unlucky' in that the corruption began *below* the edit that was being merged, so the merge succeeded.

It logs erroneous responses from the API client-side via EventLogging, and that should catch edit conflicts too. @Esanders looked it up and apparently we only have "a total of 4 saveFailure events logged in August, 2 x abusefilter and 2 x discussiontools-commentid-notfound" (no edit conflicts). Also, apparently these are sampled and only 20% are recorded, so we easily could have missed them.

  • From what I can see, RevisionStore correctly handles this and returns null the same way it would for other presumed-unknown or invalid rev IDs. It seems there is a bug in Parsoid in that it doesnt' check this and falls back to using a revision that is imho outside of the expected contract. It should return an error instead.

Yes, it's clear that Parsoid had a bug here, which we're fixing: we never considered the possibility of a null return value for a revid that "actually" existed. The question was whether Parsoid should be using a query with READ_LATEST or some other Revision method so that the fallback to master works the way we "expect" it to, considering the importance of fetching the correct wikitext here. I'm a little puzzled by the reference to POST requests -- my reading of RevisionStore::newRevisionFromConds is that the fallback to master is only going to happen if LoadBalancer::hasOrMadeRecentMasterChanges() is true, and Parsoid will not have made any writes to the database before fetching the selser wikitext. (In fact, as a rule Parsoid does not currently make any database writes at all; those are all done by other code.)

Change 621621 had a related patch set uploaded (by C. Scott Ananian; owner: C. Scott Ananian):
[mediawiki/services/parsoid@master] PageConfigFactory: don't use latest revision if getRevisionById() fails

https://gerrit.wikimedia.org/r/621621

Change 621621 merged by jenkins-bot:
[mediawiki/services/parsoid@master] PageConfigFactory: don't use latest revision if getRevisionById() fails

https://gerrit.wikimedia.org/r/621621

This task should be fixed by the Parsoid team but may need to wait for changes to be deployed. @cscott when should this be deployed?

From @Esanders: Editing will merge some code this week and it will be available on Beta @ppelberg

Change 622176 had a related patch set uploaded (by C. Scott Ananian; owner: C. Scott Ananian):
[mediawiki/vendor@master] Bump wikimedia/parsoid to v0.13.0-a7

https://gerrit.wikimedia.org/r/622176

I wrote a new task description with a summary of the two issues that both got handled in this task, for anyone who'd like to catch up on the long discussion.

matmarex updated the task description.

Change 622176 merged by jenkins-bot:
[mediawiki/vendor@master] Bump wikimedia/parsoid to v0.13.0-a7

https://gerrit.wikimedia.org/r/622176

I wrote a new task description with a summary of the two issues that both got handled in this task, for anyone who'd like to catch up on the long discussion.

This is helpful; thank you for synthesizing the outcomes of y'alls investigations, @matmarex.

Two resulting questions for you:

  • 1. Is there any work left to be done to resolve "Issue 1" and "Issue 2" now that 622176 has been +2'd?
  • 2. Are the steps currently listed in the task description's "Testing instructions" section still the best way to test whether "Issue 1" and "Issue 2" have been resolved?
  1. No (or at least we don't think so)
  2. The testing instructions are for issue 1, which we were already pretty confident is fixed. Issue 2 was only ever reproduced by hacking the code to introduce some mock server issues. The best we can do is to re-test edit conflict-type situations (although we already did this with the initial release).
  1. No (or at least we don't think so)

Understood.

  1. The testing instructions are for issue 1, which we were already pretty confident is fixed. Issue 2 was only ever reproduced by hacking the code to introduce some mock server issues. The best we can do is to re-test edit conflict-type situations (although we already did this with the initial release).

Roger that. Then, let's do the following:

  • To ensure "Issue 1" is resolved, let's repeat the steps listed in the task description's "Testing instructions" section. This action is reflected in the "Resulting action" section below.
  • To ensure "Issue 2" is resolved, let's do two things [i]
    • A) Complete a full regression test of the Reply Tool, which we've already planned to do as part of the re-deployment of T252558
    • B) Once 622176 lands, inspect a sample of Reply Tool diffs to ensure there are no instances of content corruption by running the script @matmarex wrote and shared in T260393#6384098. This work will happen in this task: T261321.

Resulting action

  • @Ryasmeen, in light of what Ed shared in "2." above [i], are you able to follow the steps listed in the task description's "Testing instructions" section to determine whether "Issue 1" (also described in the task description) is indeed fixed? You can re-use the pages you tested on in T259855#6378321 for this testing.

i. @Esanders, please comment if anything here looks unexpected/problematic to you.

Re-tested this on Arabic, Swedish and Korean Beta clusters. No dirty diffs found there from my testing. Moving it to Sign-off column.

ppelberg claimed this task.

Re-tested this on Arabic, Swedish and Korean Beta clusters. No dirty diffs found there from my testing. Moving it to Sign-off column.

Great. Resolving this.