Page MenuHomePhabricator

VE leaves around inline data-parsoid on edited content
Open, MediumPublic

Description

Odd bug. I can only reproduce it on a single page and only on officewiki, but it seems to be persistent on that page. (I tried to reproduce in other namespaces, and in mw.o and enwiki, but it worked properly there)

  • Open this gdoc
  • copy the 1 line
  • Open this this officewiki page
  • Click "Edit" to open VE.
  • paste the text.
  • Try to either "switch to source editing" or to "publish changes", and see the error message(s).

Switch to source = Error loading data from server: Error contacting the Parsoid/RESTBase server (HTTP 500)

image.png (425×928 px, 66 KB)

Publish changes = Error contacting the Parsoid/RESTBase server (HTTP 500)
image.png (425×928 px, 69 KB)

Event Timeline

Arlolra subscribed.

https://logstash.wikimedia.org/app/kibana#/doc/logstash-*/logstash-2019.12.05/parsoid-php?id=AW7W3Pnjh3Uj6x1zFgqw&_g=h@44136fa

Return value of Parsoid\Html2Wt\SerializerState::getOrigSrc() must be of the type string or null, boolean returned

#0 /srv/deployment/parsoid/deploy-cache/revs/0910e18d8477587cf44c17a2d500e0f73836c0a6/src/src/Html2Wt/DOMHandlers/DOMHandler.php(345): Parsoid\Html2Wt\SerializerState->getOrigSrc(integer, integer)
#1 /srv/deployment/parsoid/deploy-cache/revs/0910e18d8477587cf44c17a2d500e0f73836c0a6/src/src/Html2Wt/DOMHandlers/HeadingHandler.php(31): Parsoid\Html2Wt\DOMHandlers\DOMHandler->getLeadingSpace(Parsoid\Html2Wt\SerializerState, DOMElement, string)
#2 /srv/deployment/parsoid/deploy-cache/revs/0910e18d8477587cf44c17a2d500e0f73836c0a6/src/src/Html2Wt/WikitextSerializer.php(1274): Parsoid\Html2Wt\DOMHandlers\HeadingHandler->handle(DOMElement, Parsoid\Html2Wt\SerializerState, boolean)
#3 [internal function]: Parsoid\Html2Wt\WikitextSerializer->serializeDOMNode(DOMElement, Parsoid\Html2Wt\DOMHandlers\HeadingHandler)
#4 /srv/deployment/parsoid/deploy-cache/revs/0910e18d8477587cf44c17a2d500e0f73836c0a6/src/src/Html2Wt/WikitextSerializer.php(1372): call_user_func(array, DOMElement, Parsoid\Html2Wt\DOMHandlers\HeadingHandler)
#5 /srv/deployment/parsoid/deploy-cache/revs/0910e18d8477587cf44c17a2d500e0f73836c0a6/src/src/Html2Wt/SerializerState.php(669): Parsoid\Html2Wt\WikitextSerializer->serializeNode(DOMElement)
#6 /srv/deployment/parsoid/deploy-cache/revs/0910e18d8477587cf44c17a2d500e0f73836c0a6/src/src/Html2Wt/SerializerState.php(692): Parsoid\Html2Wt\SerializerState->serializeChildren(DOMElement, NULL)
#7 /srv/deployment/parsoid/deploy-cache/revs/0910e18d8477587cf44c17a2d500e0f73836c0a6/src/src/Html2Wt/WikitextSerializer.php(1645): Parsoid\Html2Wt\SerializerState->kickOffSerialize(DOMElement)
#8 /srv/deployment/parsoid/deploy-cache/revs/0910e18d8477587cf44c17a2d500e0f73836c0a6/src/src/Html2Wt/SelectiveSerializer.php(109): Parsoid\Html2Wt\WikitextSerializer->serializeDOM(DOMElement, boolean)
#9 /srv/deployment/parsoid/deploy-cache/revs/0910e18d8477587cf44c17a2d500e0f73836c0a6/src/src/WikitextContentModelHandler.php(109): Parsoid\Html2Wt\SelectiveSerializer->serializeDOM(DOMElement)
#10 /srv/deployment/parsoid/deploy-cache/revs/0910e18d8477587cf44c17a2d500e0f73836c0a6/src/src/Parsoid.php(206): Parsoid\WikitextContentModelHandler->fromHTML(Parsoid\Config\Env, DOMDocument, Parsoid\SelserData)
#11 /srv/deployment/parsoid/deploy-cache/revs/0910e18d8477587cf44c17a2d500e0f73836c0a6/src/extension/src/Rest/Handler/ParsoidHandler.php(826): Parsoid\Parsoid->html2wikitext(MWParsoid\Config\PageConfig, string, array, Parsoid\SelserData)
#12 /srv/deployment/parsoid/deploy-cache/revs/0910e18d8477587cf44c17a2d500e0f73836c0a6/src/extension/src/Rest/Handler/TransformHandler.php(95): MWParsoid\Rest\Handler\ParsoidHandler->html2wt(Parsoid\Config\Env, array, string)
#13 /srv/mediawiki/php-1.35.0-wmf.8/includes/Rest/Router.php(315): MWParsoid\Rest\Handler\TransformHandler->execute()
#14 /srv/mediawiki/php-1.35.0-wmf.8/includes/Rest/Router.php(285): MediaWiki\Rest\Router->executeHandler(MWParsoid\Rest\Handler\TransformHandler)
#15 /srv/mediawiki/php-1.35.0-wmf.8/includes/Rest/EntryPoint.php(113): MediaWiki\Rest\Router->execute(MediaWiki\Rest\RequestFromGlobals)
#16 /srv/mediawiki/php-1.35.0-wmf.8/includes/Rest/EntryPoint.php(80): MediaWiki\Rest\EntryPoint->execute()
#17 /srv/mediawiki/php-1.35.0-wmf.8/rest.php(31): MediaWiki\Rest\EntryPoint::main()
#18 /srv/mediawiki/w/rest.php(3): require(string)
#19 {main}

I can only reproduce it on a single page and only on officewiki,

I can reproduce it locally by starting with creating a blank page and then editing with the above steps.

return $start <= $end ? substr( $this->selserData->oldText, $start, $end - $start ) : null;

and then:

If string is less than start characters long, FALSE will be returned.
If length is given and is negative, then that many characters will be omitted from the end of string (after the start position has been calculated when a start is negative). If start denotes the position of this truncation or beyond, FALSE will be returned.

So, it must be that the string is less than $start characters long, so we're returning FALSE and crashing.

Note that if we "fix" this to return null, the DOMHandlers::getLeadingSpace() will set $space to null and then return that, crashing again at the exit where the return value is type checked to string.

So, it must be that the string is less than $start characters long, so we're returning FALSE and crashing.

strlen( $this->selserData->oldText ) === 0 for a blank page so, yes, trivially.

Note that if we "fix" this to return null, the DOMHandlers::getLeadingSpace() will set $space to null and then return that, crashing again at the exit where the return value is type checked to string.

Yes, but return $start <= $end ? substr( $this->selserData->oldText, $start, $end - $start ) : null; already returns null if the first condition isn't met, so it's doubly broken.

But, more to the point, I imagine what's happening is that when you paste the content from the doc, == test banana ==, it helpfully parses the wikitext syntax and leaves data-parsoid attributes behind, which are then used during serialization. So, this is more insidious than it may appear, because the dsr values would correspond to the wikitext from the snippit, not the page.

So VE bug, in that it should strip data-parsoid during paste?

(Still probably my fault, since I think I wrote that particular bit of VE code, but still.)

Checking copy(ve.init.target.docToSave.body.outerHTML), we get,

<body data-parsoid="{&quot;dsr&quot;:[0,0,0,0]}" lang="en" class="mw-content-ltr sitedir-ltr ltr mw-body-content parsoid-body mediawiki mw-parser-output" dir="ltr"><p></p><h2 id="test_banana" data-parsoid="{&quot;dsr&quot;:[0,17,2,2]}">test banana</h2></body>

(Still probably my fault, since I think I wrote that particular bit of VE code, but still.)

Before we point fingers, I should see if the same thing happens when RESTBase is involved, since data-parsoid won't be inlined there. Could be reason why @Quiddity couldn't reproduce on a public wiki. Maybe VE just expects to have to remove the ids that are returned.

Change 554965 had a related patch set uploaded (by Arlolra; owner: Arlolra):
[mediawiki/services/parsoid@master] Account for getOrigSrc returning null

https://gerrit.wikimedia.org/r/554965

Maybe VE just expects to have to remove the ids that are returned.

That seems to be the case. If I enable RESTBase locally, the Parsoid ids are dropped as expected.

However, note that this isn't specific to pasted content. VE leaves around inlined data-parsoid on any edited content. It must just assume that all data-parsoid is external now.

Arlolra renamed this task from Parsoid Error HTTP 500 on a single page on officewiki to VE leaves around inline data-parsoid on edited content.Dec 5 2019, 10:49 PM

Maybe VE just expects to have to remove the ids that are returned.

That seems to be the case. If I enable RESTBase locally, the Parsoid ids are dropped as expected.

However, note that this isn't specific to pasted content. VE leaves around inlined data-parsoid on any edited content. It must just assume that all data-parsoid is external now.

I am assuming this is because of the need to preserved syntactic variations (of pasted content) that are currently stored in data-parsoid properties. So, this is a trickier problem. Maybe VE needs to signal the cloned nature of data-parsoid with an attribute or in some other fashion.

I am assuming this is because of the need to preserved syntactic variations (of pasted content) that are currently stored in data-parsoid properties. So, this is a trickier problem. Maybe VE needs to signal the cloned nature of data-parsoid with an attribute or in some other fashion.

No, VE faithful strips parsoid ids in the RESTBase scenario for both pasted and edited content. And it leaves behind inline data-parsoid attributes in both cases.

Change 554965 merged by jenkins-bot:
[mediawiki/services/parsoid@master] Account for getOrigSrc returning null

https://gerrit.wikimedia.org/r/554965

LGoto triaged this task as Medium priority.Apr 17 2020, 4:07 PM
LGoto moved this task from Needs Triage to Needs Investigation on the Parsoid board.

Seems like either:

  1. VE should strip inline data-parsoid to be consistent in the RESTBase and non-RESTBase scenarios, or
  2. We need to rethink how we handle the non-RESTBase case and see if we can make it more similar to the RESTBase case (ie, fetch separate data-parsoid in both cases, and do the recombination client-side before submission the same way RESTBase would)
JTannerWMF subscribed.

We will look at this next fiscal year