Page MenuHomePhabricator

Flow posts being serialized from HTML -> WT without providing Parsoid data-parsoid attributes?
Closed, ResolvedPublic

Description

See https://www.mediawiki.org/w/index.php?title=Topic:Sqiakln6zj2vyqg8&topic_showPostId=sqjazy5nr58n9xdc#flow-post-sqjazy5nr58n9xdc -- the nowiki scenario there can be reproduce on the commandline as follows.

[subbu@earth lib] echo "<nowiki> foo" | node parse --dp | sed 's/<script.*<\/script>//g;' | node parse --html2wt
[warning/html2wt][enwiki/Main Page] Parsoid id found on element without a matching data-parsoid entry: ID=mwAg; ELT=<span typeof="mw:Placeholder" id="mwAg">&lt;nowiki&gt;</span>
<span>&lt;nowiki&gt;</span> foo

[subbu@earth lib] echo "<nowiki> foo" | node parse --dp | sed 's/<script.*<\/script>//g;' | node parse --html2wt | node parse --dp | sed 's/<script.*<\/script>//g;' | node parse --html2wt
[warning/html2wt][enwiki/Main Page] Parsoid id found on element without a matching data-parsoid entry: ID=mwAg; ELT=<span typeof="mw:Placeholder" id="mwAg">&lt;nowiki&gt;</span>
[warning/html2wt][enwiki/Main Page] Parsoid id found on element without a matching data-parsoid entry: ID=mwAg; ELT=<span id="mwAg"><span typeof="mw:Entity" id="mwAw">&lt;</span>nowiki<span typeof="mw:Entity" id="mwBA">&gt;</span></span>
<span>&#x3C;nowiki&#x3E;</span> foo

Note how the "<nowiki> foo" changed to "<span>&lt;nowiki&gt;</span> foo" to "<span>&#x3C;nowiki&#x3E;</span> foo" after 2 roundtrips when data-parsoid is not provided.

Effectively, Parsoid is being asked to serialize Parsoid-HTML but without access to the associated data-parsoid properties.

Is this a RESTBase issue or a Flow issue? Or, is this an expectation that Parsoid should serialize without data-parsoid? If the latter, we should understand this a bit better why.


See also: T113044: Converting {{{foo}}} from wikitext to html to wikitext returns a 500 error

Event Timeline

ssastry created this task.Oct 12 2015, 1:35 AM
ssastry updated the task description. (Show Details)
ssastry raised the priority of this task from to High.
ssastry added a subscriber: ssastry.
Restricted Application added a project: Collaboration-Team-Triage. · View Herald TranscriptOct 12 2015, 1:35 AM
Restricted Application added a subscriber: Aklapper. · View Herald Transcript
Legoktm updated the task description. (Show Details)Oct 12 2015, 1:46 AM
Legoktm set Security to None.

This is weird. On the RESTBase side, we throw a 400 when the client sends an HTML to transform, but we cannot get data-parsoid from storage. Is Flow supplying the title and revision to https://{domain}/api/rest_v1/transform/html/to/wikitext{/title}{/revision} ? Because we check for data-parsoid in storage only in that case.

We provide the title, but not revision, so things depending on the title context (e.g. magic words like {{FULLPAGENAME}}) work.

We're not stripping data-parsoid anywhere. Perhaps this is an issue for in-progress edits where they start in VE then switch to wikitext?

As far as I know, Flow expects to receive HTML that contains data-parsoid attributes, so that it can send that kind of HTML back to Parsoid as well. If RESTbase is now sending Flow HTML with data-parsoid remapped to IDs, then that's likely to fail on round-trip unless there's some way other than through revids (which Flow doesn't have) that RESTbase uses to find the ID->data-parsoid mapping again.

In T115236#1720297, @Mattflaschen wrote:

We're not stripping data-parsoid anywhere. Perhaps this is an issue for in-progress edits where they start in VE then switch to wikitext?

Hmm, maybe if the thing RESTbase relies on is the UUID in the <meta> tag in the <head>? It might be that our use of body_only causes that to get lost.

Worth chatting with @mobrovac on IRC and updating the ticket and making any required fixes.

As far as I know, Flow expects to receive HTML that contains data-parsoid attributes, so that it can send that kind of HTML back to Parsoid as well. If RESTbase is now sending Flow HTML with data-parsoid remapped to IDs, then that's likely to fail on round-trip unless there's some way other than through revids (which Flow doesn't have) that RESTbase uses to find the ID->data-parsoid mapping again.

That's likely the problem. RESTBase stores HTML and data-parsoid separately (retrieved via Parsoid's pagebundle API endpoint) and returns only the HTML. On the way back, we send data-parsoid iff the revision is supplied as well; even if we have the title, we cannot know the revision the HTML is derived from.

Hmm, maybe if the thing RESTbase relies on is the UUID in the <meta> tag in the <head>? It might be that our use of body_only causes that to get lost.

We don't need it to be in <head>, we just search for it anywhere in the sent payload. However, this is a factor only if both the title and revision are supplied. As with the revision, there is no point in having the UUID if we don't know the revision.

Elitre added a subscriber: Elitre.Oct 13 2015, 12:05 PM

As with the revision, there is no point in having the UUID if we don't know the revision.

Really? I thought just echoing back the ETag would have been enough? Otherwise how does VE's switch-from-WT-to-VE feature work?

As with the revision, there is no point in having the UUID if we don't know the revision.

Really? I thought just echoing back the ETag would have been enough? Otherwise how does VE's switch-from-WT-to-VE feature work?

Correct, but it needs to be in the form of the If-Match header. I was actually about to suggest the same to Flow - T114548: Provide an API that allows stashing temporary wikitext / html conversion metadata could help you work around this restriction as long as you know the starting title/revision. But for that you need to be able to receive the ETag header emitted by RESTBase and send it back as If-Match, cf T113728: VE should send If-Match header to the VRS to send on to RESTbase for the related VE ticket.

Really? I thought just echoing back the ETag would have been enough? Otherwise how does VE's switch-from-WT-to-VE feature work?

Correct, but it needs to be in the form of the If-Match header.

OK, we can probably do that.

I was actually about to suggest the same to Flow - T114548: Provide an API that allows stashing temporary wikitext / html conversion metadata could help you work around this restriction as long as you know the starting title/revision.

There is no starting revision, because Flow posts are not in the normal revision system. I don't necessarily like that architectural decision, but we can't change that easily. Can we still use this feature even if there is no revid (there is a title, but only so magic words parse correctly)? It seems to me that we could still get an ETag and echo it back in If-Match like T113728: VE should send If-Match header to the VRS to send on to RESTbase requests, unless there's something in RESTbase that requires a revid for this.

Hm, no, scratch that, stashing wouldn't help you actually as you main point is you're missing data-parsoid in html2wt.

A possible solution would be for RESTBase to expose a /transform/wikitext/to/pagebundle public endpoint. Flow would then be able to get back both the HTML and the associated data-parsoid produced by Parsoid. Then, when transforming back from HTML (either html2html or html2wt), you can pass in the original data-parsoid as well. That means you'd need to store data-parsoid yourself, though.

Hm, no, scratch that, stashing wouldn't help you actually as you main point is you're missing data-parsoid in html2wt.

A possible solution would be for RESTBase to expose a /transform/wikitext/to/pagebundle public endpoint. Flow would then be able to get back both the HTML and the associated data-parsoid produced by Parsoid. Then, when transforming back from HTML (either html2html or html2wt), you can pass in the original data-parsoid as well. That means you'd need to store data-parsoid yourself, though.

If this turns out to be the best way of doing it, then maybe instead of a pagebundle endpoint (which is primarily an internal concept), you should provide an api param to the /transform/wikitext/to/html endpoint that does the trick? Not sure what that does to your RB code .. but, just a thought.

If this turns out to be the best way of doing it, then maybe instead of a pagebundle endpoint (which is primarily an internal concept), you should provide an api param to the /transform/wikitext/to/html endpoint that does the trick? Not sure what that does to your RB code .. but, just a thought.

I first thought of this solution, actually :) But I have a conceptual problem with it. /transform/wikitext/to/html returns text/html (as its name suggests). Making it return JSON as well would kind of break that clarity, IMO. Technically it's achievable, though: the client sends the Accept: application/json header and adds an extra query parameter, e.g. /transform/wikitext/to/html?pagebundle=true and gets back something like:

{
  "html": "...",
  "data-parsoid": {
    // ...
  }
}

... which would have the same net result as calling /transform/wikitext/to/pagebundle.

If this turns out to be the best way of doing it, then maybe instead of a pagebundle endpoint (which is primarily an internal concept), you should provide an api param to the /transform/wikitext/to/html endpoint that does the trick? Not sure what that does to your RB code .. but, just a thought.

I first thought of this solution, actually :) But I have a conceptual problem with it. /transform/wikitext/to/html returns text/html (as its name suggests). Making it return JSON as well would kind of break that clarity,

It doesn't have to. You could return the data-parsoid embedded as a <script ..> data-parsoid-json-block here </script>

It doesn't have to. You could return the data-parsoid embedded as a <script ..> data-parsoid-json-block here </script>

Indeed. A simple s/<body>/<body><script language='javascript'>var data_parsoid = $data_parsoid<\/script>/ would do the trick (and work for both complete HTML and body_only requests). That would mean the client would need to extract it out before sending it for html2wt conversion, though.

Hm, now I see Arlo's email about getting data-parsoid in-lined in Parsoid's v3. @ssastry could then an option be to add an extra param to Parsoid's v2 transform API to return it in-lined?

@mobrovac It's already available in v2 by requesting html instead of pagebundle as the format.

http://parsoid-lb.eqiad.wikimedia.org/v2/en.wikipedia.org/html/Main_Page/664887982

@Arlolra is there a version of that URL using transform?

We currently use:

/restbase/local/v1/transform/wikitext/to/html/

ssastry added a comment.EditedOct 14 2015, 9:31 PM

So, however this missing data-parsoid issue is fixed, there is one other aspect that I would like to flag.

As part of the html <-> wt switching in Flow, when going from HTML -> wt (where the HTML came from wt in the first place), if you want to reduce normalization between switches, you want to trigger selective serialization in Parsoid, which currently relies on being able to access old and new HTML, and the original wikitext, just like in VE. I assume this selsering functionality has never been exploited by Flow?

But, while you are looking at this whole issue, worth thinking about that aspect as well. I am not familiar how and where Flow stores its wikitext/html to offer any other concrete suggestions at this time.

So, however this missing data-parsoid issue is fixed, there is one other aspect that I would like to flag.

As part of the html <-> wt switching in Flow, when going from HTML -> wt (where the HTML came from wt in the first place), if you want to reduce normalization between switches, you want to trigger selective serialization in Parsoid, which currently relies on being able to access old and new HTML, and the original wikitext, just like in VE. I assume this selsering functionality has never been exploited by Flow?

Yes, we're aware of this (though I don't know the details of how selser works), and that we're not using it currently. It's worth looking into, though.

But, while you are looking at this whole issue, worth thinking about that aspect as well. I am not familiar how and where Flow stores its wikitext/html to offer any other concrete suggestions at this time.

In production, the stored format is Parsoid HTML. The wikitext is not stored anywhere (except for topic titles, where the format is currently plain text but will soon be a minimalistic subset of wikitext based on Linker::formatLinksInComment).

We store content in External Store; there is no corresponding page for the actual HTML content (there are board and topic core pages, but probably not really relevant). Every edit of any Flow content has a UUID.

In T115236#1726500, @Mattflaschen wrote:

@Arlolra is there a version of that URL using transform?

We currently use:

/restbase/local/v1/transform/wikitext/to/html/

I think this is something that RESTBase would have to support since it always splits apart html and data-parsoid and which is what I was suggesting in T115236#1724854

Change 246427 had a related patch set uploaded (by Catrope):
Temporarily disable RESTbase support to avoid data-parsoid issues

https://gerrit.wikimedia.org/r/246427

Change 246427 had a related patch set uploaded (by Catrope):
Temporarily disable RESTbase support to avoid data-parsoid issues

https://gerrit.wikimedia.org/r/246427

This works around the issue by bypassing RESTbase and going through Parsoid directly, but we should still figure out a long-term approach that also supports selser.

Perhaps we could have a pagebundle-like API that gets the stored HTML content for a Flow UUID?

Change 246427 merged by jenkins-bot:
Temporarily disable RESTbase support to avoid data-parsoid issues

https://gerrit.wikimedia.org/r/246427

Alsee added a subscriber: Alsee.Oct 15 2015, 10:33 AM

Change 249026 had a related patch set uploaded (by Mattflaschen):
Temporarily disable RESTbase support to avoid data-parsoid issues

https://gerrit.wikimedia.org/r/249026

Change 249026 merged by jenkins-bot:
Temporarily disable RESTbase support to avoid data-parsoid issues

https://gerrit.wikimedia.org/r/249026

Arlolra moved this task from Backlog to Non-Parsoid Tasks on the Parsoid board.Feb 5 2016, 5:52 AM
J5lx added a subscriber: J5lx.Jul 16 2016, 12:05 PM
GWicke lowered the priority of this task from High to Normal.Oct 12 2016, 5:26 PM
ssastry closed this task as Resolved.Nov 21 2016, 5:49 PM
ssastry claimed this task.

Looks like this issue has been resolved once Flow moved from talking with RESTBase to talking with Parsoid. I am going to close this now. Please open a new ticket if there is anything to be addressed on the Parsoid end.