Page MenuHomePhabricator

Support various conversions in Parsoid's pb2pb endpoint
Open, MediumPublic

Description

Use cases:

  • tidy content (just a parse & re-serialize) without any flags, as a nodejs-only alternative to the Java service: T89331
  • re-expand templates (all, or specific ones): T114409, T151720
  • re-sanitize content
  • re-wrap sections and re-build TOC metadata: T114072
  • migrate to latest content version (by mime type): T104462
  • possibly, update red link information: T39902
  • Convert the language variant: T43716

Not all use cases would apply in all situations, but there might be use cases that would benefit from the ability to combine several modes.

Request format

The expected content format is indicated with an Accept header. The format applies to the pagebundle format overall, and versions are kept in sync with nested html, data-parsoid, section offset & later data-mw components. In line with the regular pagebundle format, each component has a format content-type provided as part of its sub-structure as well:

html: {
  headers: {
    'content-type': 'text/html; charset=utf-8; profile="https://mediawiki.org/wiki/Specs/DOM/1.2.1"'
  },
  body: '<html>...</html>'
}

To indicate which updates to apply, the request for the /pagebundle/to/pagebundle/ end point will contain an updates property:

updates: {
  transclusions: true, // Could specify which templates later.
  media: false,   // Could specific the exact image to update later.
  redlinks: {
    added: ['Some_Title_as_dbkey'],
    removed: []
  }
}

Related Objects

View Standalone Graph
This task is connected to more than 200 other tasks. Only direct parents and subtasks are shown here. Use View Standalone Graph to show more of the graph.

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

I have started to think a bit about the API requirements for this. There is a lot of overlap with the current pagebundle2pagebundle POST entry point:

  • updateMode already signals whether to update images, templates etc.
  • original has the original html/data-parsoid and revid.

So, I'm wondering if we could extend this to cover pagebundle2pagebundle functionality in general.

Strawman:

  • Make updateMode an updates object. Example value:
updates: {
  templates: true, // Could specify which templates later.
  images: false,   // Could specific the exact image to update later.
  format: true, // Update to latest format
}

Alternatively, we could make format implicitly the latest, but support asking for a specific format with a format update:

updates: {
  templates: true, // Could specify which templates later.
  images: false,   // Could specific the exact image to update later.
  format: 'text/html; charset=utf-8; profile="https://www.mediawiki.org/wiki/Specs/HTML/1.2.1"', // Specific format
}

To clarify, the "current pagebundle2pagebundle POST entry point" is the /transform/wikitext/to/pagebundle/ entry point that accepts original / html/data-parsoid to speed up re-renders (note. it's still unclear if we're making use of that).

That's definitely overlapping functionality which should be moved to the more appropriately named endpoint.

@Arlolra, what are your thoughts on the updates property proposal in T114413#2182361?

Red links would supply an object with details. Strawman:

updates: {
  redlinks: {
    created: ['Some_Page'],
    deleted: [],
  }
}

/cc @Pchelolo @ssastry @mobrovac

  • We will support only the last one or two versions and return http 404 for everything else. For example, we are currently at version 1.2.0 and with the next big bump to 1.3.0 (or should it be 2.0.0?) because of a data-mw split, we will only support requests for 1.2.0.

Are there technical limitations to this? My concern here is that if go down this road, then we are somewhat hypocritical vis-a-vis clients: we let them to specify which format they want, but at the same time we are not allowing them to specify a format older than current -1. Would re-purposing this task to extend to pagebundle2pagebundle help?

That's definitely overlapping functionality which should be moved to the more appropriately named endpoint.

+1 !

what are your thoughts on the update property proposal in T114413#2182361?

Looks sane to me and giving hints to Parsoid should make the job much easier. The one caveat is that we definitely need to define and document these options consistently.

  • We will support only the last one or two versions and return http 404 for everything else. For example, we are currently at version 1.2.0 and with the next big bump to 1.3.0 (or should it be 2.0.0?) because of a data-mw split, we will only support requests for 1.2.0.

Are there technical limitations to this? My concern here is that if go down this road, then we are somewhat hypocritical vis-a-vis clients: we let them to specify which format they want, but at the same time we are not allowing them to specify a format older than current -1. Would re-purposing this task to extend to pagebundle2pagebundle help?

It is really not reasonable to expect support for really old HTML versions forever. There will be far too many maintenance headaches. Minor version bumps might be easier to support for longer time than major version bumps. This problem of returning old versions gets harder especially if there is no easy way to go directly from HTML v_new -> HTML v_old ... we need to leave around if-conditions in the code to parse wikitext to the old version. In short, undesirable. The versioning support is really there to give clients breathing room and time to upgrade to newer versions, but not really a guarantee that the old version will be supported forever.

@ssastry Yeah. Also, in practice I don't think incompatible changes will be made that often. We should also then put a support lifespan to the format wiki pages.

For example, we are currently at version 1.2.0 and with the next big bump to 1.3.0 (or should it be 2.0.0?)

I think it should be 2.0 if we want to follow semantic versioning.

@ssastry Yeah. Also, in practice I don't think incompatible changes will be made that often. We should also then put a support lifespan to the format wiki pages.

Exactly .. which is why supporting at most 1 previous major version doesn't seem like a big deal.

Change 288021 had a related patch set uploaded (by Arlolra):
WIP: Provide HTML2HTML endpoint in Parsoid

https://gerrit.wikimedia.org/r/288021

@GWicke I assume the format you're specifying above was intended to be for the output. However, in https://gerrit.wikimedia.org/r/#/c/286072/, we determined the output would be versioned by the accept header.

I started moving the overlapping functionality, T114413#2182567, in https://gerrit.wikimedia.org/r/#/c/288021/

What would be good to sort out is a unified way of inputting the various things we want to support,

  1. a reparse (currently you pass in original.html and a revid, and we fetch the wikitext)
  2. an revision update (currently you pass in previous.html and a revid, again we fetch the wikitext)
  3. new a content version update (presumably we'll just be given the html)

Presumably in 1) and 2), if we aren't given a revid, we can serialize the html we've been given before parsing.

Considering T98995, moving that functionality w/o an HTTP API version bump should be fine.

Since these are all passing in html (and the former part in the :from/to/:format path), we should try to do away with the original/previous nesting.

body {
  html { ... }
}

Since we've already got this framing syntax, that should be enough to infer that a content version needs updating.

body {
  html {
     body ''
     content-type ''  // if this doesn't match what we got from the accept header, update
  }
}

@Arlolra & myself discussed this on IRC earlier. Here is a summary:

  • As @Arlolra mentioned, the Accept header is used for content format negotiation, including content format conversions. The profile parameter will be pagebundle-specific, but will share the version with HTML, data-parsoid & data-mw.

Body structure:

  • The updateMode is replaced with an updates object as sketched in T114413#2182361, detailing requested updates to original HTML (if provided).
  • One of previous or original content is provided, each of which contains html, data-parsoid (and later, data-mw) sub-objects with headers & body, as well as (for previous), the revision ID.

Currently, there's no such thing as updateMode, that's just a variable in RESTBase.

RESTBase sends a string as { update: '...' } that's derived from x-restbase-mode header.
https://github.com/wikimedia/restbase/blob/ed9bab4823c077dba6a416eaccf9d65b0f8fce4c/sys/parsoid.js#L221-L223

However, all this time, Parsoid thought it was getting an object called update,
https://github.com/wikimedia/parsoid/blob/master/lib/api/routes.js#L462-L469

which is similar to the updates there, but it stopped at the first matching key and used that as the mode.

There's a test for this in Parsoid here,
https://github.com/wikimedia/parsoid/blob/master/tests/mocha/api.js#L301-L335

We had agreed upon that in T75955#940444

Also, in those examples, you're using images and templates, but Parsoid seems to expect files and templates.

@Arlolra, I had totally forgotten that I had proposed basically the same thing before. Signs of getting older, I guess ;/

So, shall we go for plural (updates) in the new version?

It looks like the CacheUpdateJob was sending files and templates,
https://github.com/wikimedia/mediawiki-extensions-Parsoid/blob/4c0d4ff1106236a2aacf202054ef2dc943e7047d/ParsoidCacheUpdateJob.php#L166-L167

I'm fine with the plural updates. What do you suggest for the modes, since we're start anew here?

What do you suggest for the modes, since we're start anew here?

Perhaps media and templates, to be a bit more general than files?

ssastry raised the priority of this task from Medium to High.May 27 2016, 11:24 PM
ssastry moved this task from Needs Triage to In Progress on the Parsoid board.

Change 288021 had a related patch set uploaded (by Arlolra):
T114413: Provide HTML2HTML endpoint in Parsoid

https://gerrit.wikimedia.org/r/288021

Perhaps media and templates, to be a bit more general than files?

@ssastry is pointing out in review that this should be transclusions instead of templates. Are you ok with that?

Change 293351 had a related patch set uploaded (by Arlolra):
WIP: Convert between content versions

https://gerrit.wikimedia.org/r/293351

@ssastry said in review,

I understand the reasoning behind using the html2html for parsing a new revision based on an edit ... but, it still feels a tad awkward .. but, I think it might be ameliorated by having some explicit indication of what kind of html2html request this is .. i.e. rather than do inferring html2html type and doing implicit things based on what kind of information is present, make the html2html type explicit and assert presence of other information.

@ssastry is pointing out in review that this should be transclusions instead of templates. Are you ok with that?

We'll go with that for now.

Change 288021 merged by jenkins-bot:
T114413: Provide HTML2HTML endpoint in Parsoid

https://gerrit.wikimedia.org/r/288021

This endpoint exists now so I'm considering closing this as done.

Should we open individual tasks for the particular use cases?

Since this is a tracking task covering several html2html use cases, should we keep it open? We already have specialized tasks (ex: T39902 for red links), but I think it is helpful to point to this task for the overall API concept / plan.

Arlolra reopened this task as Open.EditedAug 9 2016, 6:39 PM

@GWicke Can you cleanup the description/title for the tasks intended use now then? And dependent tasks.

GWicke renamed this task from Provide HTML2HTML end point in Parsoid to Pagebundle2pagebundle (html2html) end point in Parsoid.Aug 9 2016, 6:50 PM
GWicke updated the task description. (Show Details)

@Arlolra: Summarized the discussion in the task description.

Thanks, but what I meant was,

s/Pagebundle2pagebundle (html2html) end point in Parsoid/Tracking task for various conversions in the pb2pb endpoint/

"Pagebundle2pagebundle (html2html) end point in Parsoid" is done, that's why I closed this.

Also, remove all the data-mw related tasks from being dependent on this, since they no longer are.

Unless, maybe I'm misunderstanding your intention.

"Pagebundle2pagebundle (html2html) end point in Parsoid" is done

Many of the use cases we intend to support with this end point are not done yet, which is why I think it is useful to keep this task open. By specifying how the pb2pb end point is designed to work, this task is part of the design spec for those other use cases.

Arlolra renamed this task from Pagebundle2pagebundle (html2html) end point in Parsoid to Support various conversions in Parsoid's pb2pb endpoint.Aug 9 2016, 7:12 PM
Arlolra added a project: Tracking-Neverending.

Many of the use cases we intend to support with this end point are not done yet, which is why I think it is useful to keep this task open.

But there isn't really agreement that we actually intend to support any of those use cases. See T114413#2126753 for what I considered done in this task.

By reopening, we're broadening the task, and it should reflect that we already have a pb2pb endpoint that satisfies the initial requirements.

Arlolra lowered the priority of this task from High to Medium.Nov 4 2016, 5:27 PM
Arlolra moved this task from In Progress to Needs Triage on the Parsoid board.

LanguageConverter support might use this endpoint. Some of the "next generation wikitext" proposals could use it, too.

Not saying that they should, necessarily. Just adding to the possible use cases.

Proposing:

updates: {
  variant: { source: 'en', target: 'en-x-piglatin' }
}

for language conversion. Special handling of null and undefined for source, and with a special 'x-roundtrip' value for target which will attempt to convert the (edited) html back to the original source variant. See T43716#4262541 for details.

What's the status of version-changing pb2pb endpoint in Parsoid? Once we have at least one transform working in beta I can start implementing the RB side.

We would also need some kind of version matrix endpoint so that. RESTBase can know which versions could be transformed to which versions in Parsoid, but that is more like an optimization.

Current version on the implementation deployed to beta makes me a bit confused.

I thought the plan was to downgrade HTML on the way out of RESTBase and on the way back provide downgraded new HTML and latest old HTML&Data-Parsoid from storage for a transform T202666.

However, the current implementation of pb2pb endpoint requires posting Data-Parsoid for the downgrade. Is it absolutely required to have Data-Parsoid for the downgrading to be possible? If no, could we please not require it? Otherwise, logic in RESTBase gets more complex since we need to fetch Data-Parsoid prior to downgrade as well, and this adds one more sequential read to the read-downgrade-serve path, that adds additional latency.

Also, if version-matching Data-Parsoid is required for the downgrade, I'm not sure how T202666 could work, then we'd need to temporarily stash downgraded Data-Parsoid in RESTBase under a fake TID.

I thought the plan was to downgrade HTML on the way out of RESTBase and on the way back provide downgraded new HTML and latest old HTML&Data-Parsoid from storage for a transform T202666.

Still the plan, I will try to get to the serialization part today.

Is it absolutely required to have Data-Parsoid for the downgrading to be possible?

It's not required per se, it's just the input validation that's complaining. Could you pass in an empty data-parsoid following the assumed structure so that we don't need to special case it?

Also, if version-matching Data-Parsoid is required for the downgrade, I'm not sure how T202666 could work, then we'd need to temporarily stash downgraded Data-Parsoid in RESTBase under a fake TID.

Right, fear not.

It's not required per se, it's just the input validation that's complaining. Could you pass in an empty data-parsoid following the assumed structure so that we don't need to special case it?

Seems a bit hacky.. how hard would it be to special-case? I'm ok to follow this path too with some good commenting, but still, a bit hacky..

Seems a bit hacky.. how hard would it be to special-case? I'm ok to follow this path too with some good commenting, but still, a bit hacky..

Judging by the number of uses of the word "hacky" in this sentence, we've agreed to just fix things in Parsoid :)

ok. 2 appearances of the word "hacky" change @Arlolra mind. Good to know :)

Thank you. I'll be testing it with empty Data-Parsoid for now, there's still some work to be done on RB side

@Arlolra I think there's one more issue there:

return an error 406 if the requested major version cannot be satisfied,

Right now if I request 1.9.0 from Parsoid, it returns 2.0.0 instead of throwing. We can fix that in RB, but I think Parsoid should throw too.

Right now if I request 1.9.0 from Parsoid, it returns 2.0.0 instead of throwing. We can fix that in RB, but I think Parsoid should throw too.

That's because of this strictAcceptCheck: false. See https://github.com/wikimedia/mediawiki-services-parsoid-deploy/commit/a8e759fc80b5fb9be4bcb9c4d3c1f83395249dc1 and https://github.com/wikimedia/parsoid/blob/master/lib/api/routes.js#L217-L227

Sorry, I will change that for the beta cluster and restart Parsoid there.

Thank you! Ping me on IRC, I'll finalize restbase PR

https://gerrit.wikimedia.org/r/c/mediawiki/services/parsoid/deploy/+/460449 is deployed to the beta cluster

arlolra@deployment-parsoid09:/etc/parsoid$ curl -I -H 'Accept: text/html; charset=utf-8; profile="https://www.mediawiki.org/wiki/Specs/HTML/1.9.0"' localhost:8000/en.wikipedia.org/v3/page/html/Main_Page/847600508
HTTP/1.1 406 Not Acceptable
X-Powered-By: Express
Access-Control-Allow-Origin: *
Content-Type: text/html; charset=utf-8
content-language: undefined
Vary: undefined, Accept-Encoding
Content-Length: 300
ETag: W/"12c-oEyqh1BpjIAycv2rl+7LdnLRQoo"
Date: Thu, 13 Sep 2018 23:09:43 GMT
Connection: keep-alive

Should be good to go.

Noted that these values are suspicious and being fixed,

content-language: undefined
Vary: undefined, Accept-Encoding

Change 460454 had a related patch set uploaded (by C. Scott Ananian; owner: C. Scott Ananian):
[mediawiki/services/parsoid@master] Ensure that Content-Language and Vary headers are not "undefined"

https://gerrit.wikimedia.org/r/460454

Change 460454 merged by jenkins-bot:
[mediawiki/services/parsoid@master] Ensure that Content-Language and Vary headers are not "undefined"

https://gerrit.wikimedia.org/r/460454