Page MenuHomePhabricator

Intermittent corruptions on saves, likely related to data-parsoid mis-matches
Closed, ResolvedPublic0 Story Points

Description

See https://en.wikipedia.org/w/index.php?title=Jungle_cat&diff=718896923&oldid=718896396

[[Middle East]] was turned into [[Middle_East]] and [[File:CSC 3800 NEW.png|right|thumb|A jungle cat in Sri Lanka]] was turned into [[File:CSC_3800_NEW.png|right|thumb|A jungle cat in Sri Lanka]], and so forth throughout the page, including on lines that had no other changes.

(There are also scattered capitalization changes (e.g., [[subspecies]] became [[Subspecies]]), which may not have been intended and aren't desirable).

Given the volume and scattered nature of the links that were changed, I suspect that these changes indicate a bug.

Originally reported in Mozilla/5.0 (Windows NT 6.3; Win64; x64; rv:48.0) Gecko/20100101 Firefox/48.0

Event Timeline

Restricted Application added subscribers: Zppix, Aklapper. · View Herald TranscriptMay 6 2016, 5:08 PM
ssastry added subscribers: mobrovac, GWicke, ssastry.

@mobrovac @GWicke this looks like a case where data-parsoid fwded to Parsoid is for the wrong render?

@ssastry: Agreed, the symptoms are consistent with some data-parsoid or element ID issue. One interesting aspect is that only some links are affected, but not all. This could indeed point towards a mis-match between data-parsoid & HTML ids.

There are no errors matching this title or related revisions in logstash, which suggests that Cassandra and RESTBase didn't encounter any abnormal situations while processing this request. I'm afraid that this will be hard to figure out unless we either find more information, or find a way to reproduce this.

ssastry renamed this task from Don't replace spaces with underscores in links to data-parsoid of the wrong render being passed to Parsoid causing intermittent corruptions on saves.May 8 2016, 4:10 AM
ssastry raised the priority of this task from Normal to High.
Elitre added a subscriber: Elitre.May 9 2016, 12:08 PM

I don't think that this is a RESTBase issue. First of all, we didn't touch/deploy anything related to VE or Parsoid recently. Secondly, the html and data-parsoid stored in RB for the affected revision of the Jungle_cat article seem to match:

Arlolra added a subscriber: Arlolra.May 9 2016, 8:26 PM

If you look at the history for these edits, including the ones referenced on the VE/Feedback page, it always seems to happen when a user makes a few subsequent edits in a flurry.

Bgwhite added a subscriber: Bgwhite.May 9 2016, 8:57 PM

CheckWiki scans for HTML syntax for lists and section headers. A scan is made everyday at 0z. The amount of <li> and <h*> in articles dramatically increased on the May 5 0z scan. Scans since then have shown dramatically higher numbers. Something changed on May 4 as these errors were not present before then.

For awhile, VE has a problem of adding <h*> into articles, but these occurred when new section headers were added. In the above problem, section was previously present and nothing around it changed.

For the category barf VE is producing, there was a similar bug, T72894, over a year ago.

Parsoid had a deployment on the 4th but it only updated a cached file. There was a more significant change on the 2nd, but that doesn't seem to fit the timeline.

However, there was an outage on the 4th, in T134537, and maybe this is somehow a consequence of what happened there.

@Bgwhite is the data you're referencing publicly available?

Looking at the deployment calendar & the server admin log, Wed 4th saw the following deployments:

Then, there was a RB outage / slowness starting Wed 5th, around 02:30-03:30UTC: T134537

Later on Thursday the 5th (09:22 UTC), there was also a RESTBase deploy:

The Parsoid change looks minor, but there were some data-parsoid related changes two days earlier. However, if this issue started only on the 4th as indicated by @Bgwhite, then those should not be relevant.

@Jdforrester-WMF @Krenair @Esanders, were there any VE changes deployed that could be potentially related to this? Anything related to how ETags are updated in VE or forwarded to the API, especially on repeat edits?

@Arlolra

is the data you're referencing publicly available?

Webpage is at: https://tools.wmflabs.org/checkwiki/cgi-bin/checkwiki.cgi?project=enwiki&view=project
Error #49 is section headers. #12 is lists. For an individual error, click on "description" to get a current and "done" article listings.

#49 rarely goes over 2 new errors a day.
#12 rarely goes over 2 new errors a week.

@Bgwhite thanks.

What I said in T134593#2277888 still seems to hold, though it isn't on every subsequent edit and doesn't need to happen all that quickly (couple minutes).

Here's a good example,
https://fr.wikipedia.org/w/index.php?title=Stade_Oc%C3%A9ane&action=history

So, there must be some editing pattern causing dp to be misattributed.

Krenair removed a subscriber: Krenair.May 9 2016, 11:19 PM

were there any VE changes deployed that could be potentially related to this? Anything related to how ETags are updated in VE or forwarded to the API, especially on repeat edits?

I did a quick look and didn't find anything related. Certainly don't remember any changes around that area.
(also, use this account for VE things please)

VE sending an outdated ETag / If-Match header could cause symptoms like this. RESTBase logs indicate that it did not fall back to the inline mw:TimeUuid tag, which very likely means that an ETag header was passed at all times. However, it's still possible that this ETag was outdated in some cases.

@Pchelolo & I tried to provoke something like that manually by quickly editing the same page in sequence, but didn't have much luck.

We are now looking into adding a sanity check comparing the mw:TimeUuid with the passed-in ETag to establish whether there are edits where those two do not agree. This should establish whether this is a RB client issue or not.

A patch for the sanity check is now up at https://github.com/wikimedia/restbase/pull/608. If things go to plan, we'll deploy this tomorrow.

A patch for the sanity check is now up at https://github.com/wikimedia/restbase/pull/608. If things go to plan, we'll deploy this tomorrow.

It has been deployed as part of today's deploy. We will monitor the situation in the upcoming period and report back should there be any findings.

Jdforrester-WMF set the point value for this task to 0.May 10 2016, 4:49 PM

Some more data:

GWicke renamed this task from data-parsoid of the wrong render being passed to Parsoid causing intermittent corruptions on saves to Intermittent corruptions on saves, likely related to data-parsoid mis-matches.May 10 2016, 6:07 PM

@Bgwhite, have you seen new instances of this issue in the last 24 hours?

@Bgwhite, have you seen new instances of this issue in the last 24 hours?

I'll check after the Checkwiki 0z run get finished.

I did not see any new instances.

@GWicke did you deploy anything that would explain why it is so?

@Bgwhite, thanks!

@Arlolra, we deployed a new version half way through the last 24 hours, but it did not contain any changes that are likely to affect the behavior in this issue.

For completeness, we did another deploy this morning which improves TID checks and logs if a (literal) null value is received for it, so hopefully this will make it easier to get to the bottom of this.

Nothing visible in the logs at this point, though.

So this is "fixed" (worked around) in production, but we don't yet know the cause?

There are no fixes or work-arounds in RB. The only related changes we deployed added some debug prints to narrow down possible causes, which haven't emitted anything so far. This means that the things those debug prints tested for were not the culprit.

There are no fixes or work-arounds in RB. The only related changes we deployed added some debug prints to narrow down possible causes, which haven't emitted anything so far. This means that the things those debug prints tested for were not the culprit.

https://github.com/wikimedia/restbase/pull/608/files looks like it didn't set the contents of tid in that error state, though?

https://github.com/wikimedia/restbase/pull/608/files looks like it didn't set the contents of tid in that error state, though?

If you mean the new check / log case in https://github.com/wikimedia/restbase/pull/608/files#diff-d91965930eda1e26c9b5db0d3041c4acR597, then that only triggers if the tid is already truthy. Previously, we didn't even look at the mw:TimeUuid value in that case, as the supplied tid takes precedence. There is no change in behavior.

See also T135171 for the related investigation into missing if-match headers in some VE requests.

ssastry moved this task from Backlog to Non-Parsoid Tasks on the Parsoid board.May 23 2016, 3:10 PM
GWicke added a comment.EditedJul 19 2016, 6:06 PM

Here is a summary of the status quo:

  1. The way the RESTBase code is structured guarantees that a) data-parsoid is only loaded when the original HTML was already found (matching the exact revision and tid), and b) data-parsoid is loaded using the exact same revision / tid specified by VisualEditor in its save request.
  2. The change pattern indicates that a large part of the page was not serialized with selser. This suggests that the original HTML differed, triggering non-selser serialization. This would primarily be the case if VisualEditor passed an incorrect title / base revision / tid tuple to RESTBase. To me, this seems to be the most likely candidate hypothesis.

We already have instrumentation in RESTBase comparing the header-supplied revision/tid with that in the HTML head <meta> element. This check did not trigger for any problematic edit so far, suggesting that VisualEditor sends a consistent header & HTML head section.

@AlexMonk-WMF, is there any chance that VE itself, or the speculative save feature, would combine an etag & HTML head from an older revision with HTML from a newer revision? Are there ways to instrument VE to rule out revision mis-matches? Are there specific usage patterns like repeat edits that trigger this?

AlexMonk-WMF removed AlexMonk-WMF as the assignee of this task.Jul 22 2016, 8:32 PM

@AlexMonk-WMF, is there any chance that VE [...] would combine an etag & HTML head from an older revision with HTML from a newer revision? Are there ways to instrument VE to rule out revision mis-matches? Are there specific usage patterns like repeat edits that trigger this?

I've reviewed the code and have no reason to think it'd mismatch revision with etags from other revisions, other than the known issues where the browser gives us broken null etag headers. That said I'm not really sure what to look for here, without reproduction steps or any specific tests to perform there's not much I can suggest. We can't really prove the absence of a bug (for example, who knows if some broken browser is randomly corrupting data as we receive/process/send it?)...

One thing I did recently notice is that sometimes quick repeat edits (save, then hit the edit tab again) in VE end up loading an old revision. The last time I experienced this, a banner saying "you are editing an old revision" popped up at the top. Could there be a race condition behind this, possibly related to MySQL replication lag? Could this lead to a state in VE where its idea of etag & HTML head diverges from the body content?

@AlexMonk-WMF, is there any chance that VE itself, or the speculative save feature, would combine an etag & HTML head from an older revision with HTML from a newer revision? Are there ways to instrument VE to rule out revision mis-matches? Are there specific usage patterns like repeat edits that trigger this?

I think instrumenting this would be a good idea, but it would probably have to be on the RESTbase/Parsoid end. They (or one of them) behave differently when the E-Tag/head mismatches with the content, so it should be easier to detect it there (or somewhere on the server side) than on the client side.

One thing I did recently notice is that sometimes quick repeat edits (save, then hit the edit tab again) in VE end up loading an old revision. The last time I experienced this, a banner saying "you are editing an old revision" popped up at the top. Could there be a race condition behind this, possibly related to MySQL replication lag? Could this lead to a state in VE where its idea of etag & HTML head diverges from the body content?

If you saw that, then the MediaWiki API response knew that the revision was old. I have found buggy behavior that would make VE do an oldid edit the second time, but I couldn't get that to result in an E-Tag mismatch: we have logic ensuring the revid in the about attribute of the <html> tag in the RB response matches what we expect. From reading the code I don't see how either the E-Tag or the <head> of the wrong revision could end up being used for saving, even when editing multiple times, and I can't reproduce that behavior either.

I have found buggy behavior that would make VE do an oldid edit the second time

I reported this as T141330: ?oldid=mostRecentRevId causes strange behavior on second edit and went down a small rabbit hole of things that scared me (stale state etc). I didn't find anything that could explain E-Tag or <head> mismatches, but I found enough other concerning cases of stale data being used and generally weird things happening in the oldid-equals-current edge case that it may be worth trying my patch for that bug and seeing if that fixes (or works around / stops exposing) this issue.

GWicke added a comment.EditedJul 28 2016, 5:36 PM

@Catrope, in my case I was editing the latest version of the article (without oldid parameter), but got an "you are editing an old version" warning after re-opening VE right after a save. I'm not sure if the URL had changed to include an oldid at that point.

I guess the most obvious next thing to do is to wait & see if the issue is fixed. I don't have other ideas for instrumentation in RB, as we are already checking for consistency between If-Match & HTML head information. We don't have an easy way to figure out if the edited body is matching the supplied revision / uuid. Parsoid is checking just that (with DOM diffing), and seems to think that it does not match when this error occurs, as evidenced by it not using selser for large parts of the page.

Culling out couple more recent links from the VE/Feedback archives so it is easier to reference them without having to go digging through archives:

Here is an older one:

So, Arlo's observations earlier are correct that this seems to happen *only* when one VE edit is followed by another VE edit. Looks like that is where the focus ought to be in VE & RESTBase.

Since <link> and <meta> tags (especially those for default sort, category links, language links, etc.) should never be generated as HTML tags, if during serialization we detect those tags getting a 'html' syntax flag, Parsoid could abort serialization and log an error. This can do two things: (a) avoid corruptions in some of these scenarios -- not all will be prevented (b) continue to log errors for debugging.

Since the VE changes, there have been no new reports on https://en.wikipedia.org/wiki/Wikipedia:VisualEditor/Feedback, which is encouraging. Lets keep this task open for a few more weeks.

GWicke closed this task as Resolved.Aug 22 2016, 6:20 PM
GWicke claimed this task.

It looks like there have been no new reports since the VE change was deployed. I am going to call this resolved.

Please reopen in case the problem resurfaces.

Thank you everyone!