Page MenuHomePhabricator

Detect when a page save is likely to be a corruption
Open, LowPublic

Description

Add logic to the selser code that to detect when unusual wikitext is being emitted as part of the serialization. For example, meta and link tags with mw:* attributes are a very likely indicator that the data-parsoid and HTML versions that Parsoid received from RESTBase are mismatched.

For now, it is worth returning a http 4xx and log info in logstash for investigation. VE could maybe display the error to the user or something else more appropriate.

Event Timeline

ssastry created this task.Dec 13 2017, 1:49 AM
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptDec 13 2017, 1:49 AM
ssastry triaged this task as High priority.Dec 13 2017, 1:51 AM
ssastry updated the task description. (Show Details)
ssastry moved this task from Needs Triage to html2wt on the Parsoid board.Dec 18 2017, 9:40 PM
Deskana lowered the priority of this task from High to Medium.Feb 13 2018, 3:35 PM
Deskana moved this task from To Triage to Freezer on the VisualEditor board.
ssastry moved this task from Backlog to Selser on the Parsoid-Edit-Support board.Sep 20 2018, 4:48 PM
Aklapper edited projects, added Parsoid; removed Parsoid-Edit-Support.Feb 29 2020, 5:15 PM
ssastry lowered the priority of this task from Medium to Low.Mar 5 2020, 10:18 PM
ssastry moved this task from html2wt to Feature requests on the Parsoid board.
ssastry added a subscriber: Arlolra.
cscott added a subscriber: cscott.Aug 17 2020, 4:31 PM

An idea here is to find some <meta typeof=..../> which is rendered invisibly by VE (ie, not as an icon indicating uneditable content, or whatever) -- add support for an invisible element if we can't find an existing one in VE to hijack.

Add both an HTML attribute *and* a data-parsoid attribute with a random nonce, like:

<meta integrity="cafecafe" data-parsoid='{"integrity":"cafecafe"}'/>

Verify during save processing that the two integrity tags match, and fail with a 500 error if not; also check that the "original" HTML and the "edited" HTML match. This should catch mismatches during data-parsoid recombination as well as mismatches between edited and original HTML.

An idea here is to find some <meta typeof=..../> which is rendered invisibly by VE (ie, not as an icon indicating uneditable content, or whatever) -- add support for an invisible element if we can't find an existing one in VE to hijack.

Add both an HTML attribute *and* a data-parsoid attribute with a random nonce, like:

<meta integrity="cafecafe" data-parsoid='{"integrity":"cafecafe"}'/>

Verify during save processing that the two integrity tags match, and fail with a 500 error if not; also check that the "original" HTML and the "edited" HTML match. This should catch mismatches during data-parsoid recombination as well as mismatches between edited and original HTML.

So, this solution has a drawback in that this will fail to detect corruption when the ids all line up but the wikitext is from the wrong revision (ex: T259855#6390408 and T227116).

So, I think we probably still consider a solution that is independent of that and is more at the selser level.

One strategy is as follows: whenever selser emits $wikitext for a DOMNode $node, we add sanity checks that verifies that for $node->nodeName, substr($wikitext, $some-length) matches expectations of what $node->nodeName would serialize to.

Ex: if $node->nodeName == 'ol', we expect substr($wikiext, 1) === '#'. If $node->nodeName == 'table', we expect substr($wikitext, 2) === '{|' and so on.

If these expectations fail, it is likely corruption.

ssastry updated the task description. (Show Details)Aug 17 2020, 7:38 PM

Wait, how would be random nonce line up if the wikitext is from the wrong revision?

Another idea: there's a warning about 'funny etag' in ApiParsoidTrait -- that should probably be a 500 error if etag is null, not a warning.

Wait, how would be random nonce line up if the wikitext is from the wrong revision?

Unless the wikitext used to generate the HTML is somehow part of the nonce, the nonce will line up.