Support language variant conversion in Parsoid
Open, Low priority, Public

Description

This is the top-level tracker bug for LanguageConverter support in Parsoid.

Plan of record, roughly:

  • Phase 1: Parse all LC constructs into DOM (and round-trip them).

    This is sufficient to allow VE to edit LC wikis in the same fashion as the wikitext editor, with a mix of variants displayed during editing.
  • Phase 2: Actually run conversion on the DOM, using the parsed constructs.

    This is sufficient for "read-view" use of Parsoid output, for example in the mobile frontend, for Google indexing, etc.
  • Phase 3 (speculative): Use selective serialization to allow VE to operate on the converted text.

    This allows "single variant" editing, without the chaotic mix of variants shown in wikitext editing, and uses selective serialization to preserve the original variant of unedited text.
  • Phase 4 (speculative): Introduce new LC syntax or Glossary features which are a better match for future plans.

    This would avoid the "from this point forward" behavior of LC rules, which complicates incremental update, as well as the use of templates as a workaround for per-page glossaries. We might also introduce more pervasive language tagging in the source, to better match LC uses where the character set can't be used to distinguish variants (toy example: Pig Latin vs. English). See the markup sketch below.
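
For reference, the LC constructs in question look roughly like this in wikitext (an illustrative sketch of the common forms only; see the LanguageConverter documentation for the full flag syntax):

  -{zh-hans:计算机; zh-hant:電腦}-     <- inline rule: display a different form per variant
  -{H|zh-hans:计算机; zh-hant:電腦}-   <- hidden rule: adds a conversion-table entry that
                                         applies "from this point forward" on the page
  -{R|text}-                           <- raw: display the enclosed text with no conversion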

Details

Reference
bz41716

Related Objects

jmadler added a subscriber: jmadler.Jan 6 2016, 5:13 AM
brion added a subscriber: brion.Aug 24 2016, 8:57 PM

We discussed this a little in the ArchCom meeting; here are some quick notes:

  • add a 'phase 0' to define a 'sane subset' of the existing markup behavior that we recommend supporting
  • figure out how to do the 'phase 1' in parsoid-land <- this gets us to a place where we might be able to use VE on non-Chinese wikis using LC
    • then figure out how to get VE to make the definition blocks display/edit sanely (phase 2) <- should be enough to get Chinese editable but with the mixed scripts
  • later, figure out how (or if) to do full VE-side application of conversion for display during editing without changing the underlying data that gets saved back (phase 3) -- this is potentially *very* hard.

The phase 0 syntax limitation would essentially mark some things as 'undefined behavior' for a spec -- such as using the vocab definitions to change markup or HTML elements -- and would make the display simpler and the editing MUCH simpler.
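
As a toy illustration (hypothetical markup, not taken from the meeting), the sort of rule phase 0 would likely declare undefined is one whose variant text smuggles in markup:

  -{zh-hans:<b>简体</b>; zh-hant:text}-   <- variant output that injects HTML elements;
                                            under a phase 0 'sane subset' this would be
                                            'undefined behavior' rather than supported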

LanguageConverter markup is back on my plate over in Parsoid land. I'll be dusting off the existing Parsoid patches as a first step.

Excellent! Brion's phase 0 ("define a 'sane subset' of the existing markup behavior that we recommend supporting") seems like something that should be filed as an RFC. Should we try to do that as part of T142803, or does it need its own RFC?

Change 140235 had a related patch set uploaded (by C. Scott Ananian):
WIP: parse language converter markup.

https://gerrit.wikimedia.org/r/140235

Liuxinyu970226 changed the task status from Open to Stalled.Jan 1 2017, 5:43 AM

Stalled per that patch: "Main test build failed." or "Merge Failed." has happened too many times.

Legoktm changed the task status from Stalled to Open.Jan 1 2017, 10:04 AM
Legoktm added a subscriber: Legoktm.

It's not stalled. Unless you actually know that a task is stalled, please don't mark it as such.

@cscott I was following the comments / commit message of this ticket. I see your Gerrit patch; it looks like it is waiting on:
https://gerrit.wikimedia.org/r/#/c/333997/

That patch is itself waiting on a lot of pages to be fixed up with some additional markup:
https://www.mediawiki.org/wiki/Parsoid/Language_conversion/Preprocessor_fixups

Is that about the state of things?

Is that process being automated or did you figure out a solution? Are there any other dependencies or anything else blocking?

Is there anything you need help with?

cscott added a comment.EditedMay 16 2017, 3:17 PM

There's an active effort on-wiki to make fixups, and quite a large number of pages have been fixed. The effort has been mentioned in Tech News for the past two weeks (https://meta.wikimedia.org/wiki/Tech/News/2017/19, https://meta.wikimedia.org/wiki/Tech/News/2017/20), and it looks likely to be merged next week (or so) for gradual roll-out.

On the Parsoid side, the blocking predecessor patch is currently https://gerrit.wikimedia.org/r/350867 which got a C+1 today and will likely be merged shortly. We'll want to deploy that carefully and watch for any new round-trip issues. (There are some bookkeeping issues with parser tests between core and Parsoid, but they are straightforward to address.) Assuming that deploying 350867 goes well, the actual language converter patch is https://gerrit.wikimedia.org/r/140235 and should be straightforward to deploy, although we'll want to double-check that there aren't any unexpected VE interactions.

That will complete the first stage, which is correctly parsing language converter markup; that's "phase 1" in the summary above. The next step is to actually process the parsed markup and apply conversions, which allows "read view" use of Parsoid output (for mobile, etc.) and opens the door to work on some VE support.

@Fjalapeno wrt "Is there anything you need help with" -- talk to User:DePiep if you would like to help with the on-wiki fixup (or just jump in at https://www.mediawiki.org/wiki/Parsoid/Language_conversion/Preprocessor_fixups/20170501 ). If you're asking about helping on the code side, I'd say I could use some help on the VE side, starting with "phase 2" above -- now that Parsoid can emit markup for LanguageConverter constructs, VE needs a specialized editor to allow users to edit those constructs. That would bring VE to equivalence with the wikitext editor for zhwiki and friends.

@cscott thanks for the update… sorry for my late reply… Hackathon and then vacation. I'll check in on the preprocessor fixups and see how that's going.

Change 140235 merged by jenkins-bot:
[mediawiki/services/parsoid@master] Parse and serialize language converter markup.

https://gerrit.wikimedia.org/r/140235

cscott updated the task description. (Show Details)Jun 29 2017, 2:25 PM

We just merged a patch for "Phase 1" support of LC in Parsoid (using the phase descriptions I just updated in the task summary).

Mentioned in SAL (#wikimedia-operations) [2017-07-31T20:33:25Z] <cscott> Updated Parsoid to version 08114f35 (T43716, T154718, T166413)

Jdforrester-WMF updated the task description. (Show Details)

Change 396538 had a related patch set uploaded (by C. Scott Ananian; owner: C. Scott Ananian):
[mediawiki/services/parsoid@master] Create skeleton of language variant support in Parsoid API

https://gerrit.wikimedia.org/r/396538

Change 396538 merged by jenkins-bot:
[mediawiki/services/parsoid@master] Create skeleton of language variant support in Parsoid API

https://gerrit.wikimedia.org/r/396538

ssastry moved this task from Backlog to Read Views on the Parsoid board.Jan 11 2018, 9:43 PM
Amire80 moved this task from Untriaged to Script conversion on the I18n board.Feb 4 2018, 10:48 AM

Given that one of the transformations has now been merged, we actually need a way to access it and to transform the HTML stored in RESTBase. What do you think about the following API in Parsoid?

POST /transform/html/to/html
Params:
 - Accept-Language in headers
 - html in body
cscott added a subscriber: Arlolra.Jun 5 2018, 8:03 PM

@Arlolra Is that ^ consistent with the other html2html endpoints we've implemented?

Unfortunately, that endpoint is a bit of a mess; see https://github.com/wikimedia/parsoid/blob/master/lib/api/routes.js#L568-L570

But, I'd expect something more along the lines of the following to work,

POST /transform/pagebundle/to/pagebundle
Params:
 - Accept-Language in headers
 - original.html in body

Hm, we would be sending only the HTML, though, and would need only the HTML back. Supplying the data-parsoid as well would increase the load and latency. Do you expect /html/to/html to be revised soon?

ssastry added a comment.EditedJun 6 2018, 2:24 PM

Note that pb2pb is that endpoint. Depending on the specific conversion operation, only some parts of the pagebundle might actually be required. So, you don't have to post data-parsoid in this case.

Consider the case where we split data-mw into a different bucket: data-mw would then be posted as a separate param when it is required for the conversion. So, pb2pb is the correct generic endpoint.

In T114413#2365456, I indicated that for all pb2pb endpoints we should introduce an additional parameter that explicitly specifies the required conversion, to eliminate complexity (the mess that Arlo refers to above). So, we will likely add that to this pb2pb endpoint.

cscott added a comment.Jun 6 2018, 5:00 PM

Note to self: I probably should make sure LanguageConverter doesn't require access to data-parsoid.

cscott added a comment.EditedJun 6 2018, 9:26 PM

After some discussion on IRC (and review of T114413) I'm proposing the following API:

POST /transform/pagebundle/to/pagebundle
Request:

original: {
 html: {
  headers: {
    'content-type': 'text/html; charset=utf-8; profile="https://mediawiki.org/wiki/Specs/DOM/1.7.0"'
  },
  body: '<html>...</html>'
 },
},
updates: {
  variant: { source: 'en', target: 'en-x-piglatin' }
}

The variant.source property can be omitted (i.e., left undefined), in which case Parsoid will attempt to guess the source variant in order to support round-tripping. Setting source to null will disable round-trip support (useful for display-only use cases). Setting target to the special value 'x-roundtrip' will use embedded round-trip metadata to attempt to convert the HTML back to the original source variant.

For example, Visual Editor might use variant: { source: 'en', target: 'en-x-piglatin' } on English Wikipedia, where it is known that all articles are stored in English, not Pig Latin. (Some other wikis have similar "we always write in one specific variant" conventions.) When saving the edited document, it would use variant: { source: 'en-x-piglatin', target: 'x-roundtrip' } to convert it back to the original English text.

If an editor were to shift VE from zh-cn to zh-tw in the middle of an edit, two requests would probably have to be made: variant: { source: 'zh-cn', target: 'x-roundtrip' } to restore the original wikitext, then variant: { target: 'zh-tw' } on the result in order to convert to the user's new variant preference. At the moment we don't support combining these requests, but we might do so in the future.

MCS would use variant: { source: null, target: '...' } when localizing summaries or Wikidata text for display.

At the moment we don't support combining a variant update with another sort of update (redlinks, etc), but we might do so in the future.

EDIT: updated with Arlo's correction below.
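
To make the proposal concrete, here's a minimal client sketch against that endpoint (assumptions, not settled API: a Parsoid service at http://localhost:8000, the v3 route layout, the en.wikipedia.org domain, and the node-fetch package):

  const fetch = require('node-fetch');

  // POST a pagebundle and request a variant conversion via updates.variant.
  async function convertVariant(html, source, target) {
    const res = await fetch('http://localhost:8000/en.wikipedia.org/v3/transform/pagebundle/to/pagebundle', {
      method: 'POST',
      headers: { 'content-type': 'application/json' },
      body: JSON.stringify({
        original: {
          html: {
            headers: {
              'content-type': 'text/html; charset=utf-8; profile="https://mediawiki.org/wiki/Specs/DOM/1.7.0"'
            },
            body: html
          }
        },
        // Omit source to let Parsoid guess it; pass null to disable round-tripping.
        updates: { variant: { source: source, target: target } }
      })
    });
    return res.json(); // a pagebundle; the converted document should be in .html.body
  }

  // e.g.: convertVariant('<html>...</html>', 'en', 'en-x-piglatin')
  //         .then(pb => console.log(pb.html.body));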

For consistency,

Request:

original: { html: { ... } },
updates: { ... }

This API strikes me as complicated for an HTML-to-HTML transliteration. Namely, RB would need to completely reconstruct every request made to it for any variant other than the default one, instead of simply getting the HTML and adding the Accept-Language header to it. For round-tripping, do I understand correctly that two requests would need to be made: one to tell Parsoid we want round-tripping, and another to specify the actual target? Wouldn't something like { source: 'zh-cn', target: 'zh-tw', roundtrip: true } work?

cscott added a comment.Jun 7 2018, 8:29 PM

No, round-tripping is the default. Specify source: null to explicitly disable it, but since the only reason to do so is to slim down the HTML a bit, maybe I don't even need to complicate the API for that.

The two requests example above is just for on-the-fly variant switching *while editing*. In that case you need to do a little dance instead of trying to convert directly from one variant to the other in order to ensure the round-trip information is preserved.

In most cases, you'd take the HTML from Parsoid, stuff it into a JSON blob as original.html, add updates.variant.target = 'my-target-variant', and send it to the pb2pb endpoint.

Or, you know, just ask for the variant you want directly from wt2html using the Accept-Language header...
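
That direct path would presumably look something like this (a sketch in the same style as the proposals above; the exact route shape is whatever the deployed wt2html endpoint uses):

POST /transform/wikitext/to/html
Params:
 - Accept-Language: zh-tw in headers
 - wikitext in body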

cscott added a comment.Jun 8 2018, 4:54 PM

I'm inclined to implement the pb2pb endpoint compliant with T114413 first; then, if we find that there's a significant efficiency loss from JSON-encoding the HTML, we can talk about adding a new specialized endpoint?

We had a meeting with @cscott yesterday; here are a couple of notes from our discussion worth mentioning:

  1. By default, we will return the "natural" variant: the HTML corresponding to the mixed-variant wikitext stored in the database. The default will also be returned if no Accept-Language is provided, or if Accept-Language has a value that isn't supported for a particular page's language.
  2. Looking at the domain in RESTBase is not enough for splitting the Varnish cache, or for deciding whether to even look at Accept-Language and go to Parsoid for transformation, since LanguageConverter is actually enabled on all wikis; for example, even on English Wikipedia certain pages can have a different page language that supports conversion. This is mostly important for multi-language wikis like mediawiki.org. For cache-splitting, Parsoid could provide the page language in some meta tag (one possible shape is sketched below); however, that's not very convenient for deciding whether to go to Parsoid for transformation, and at the least it's not easy to bootstrap, because all pages would have to be re-rendered and re-stored for it to work reliably. I'm evaluating the possibility of including this info in the title_revision table so that RESTBase could decide on its own.
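
For reference, one possible shape for that page-language hint (a sketch only, not a settled design): Parsoid could emit something like

  <meta http-equiv="content-language" content="zh"/>

in the document <head>, and responses for convertible pages could carry a Vary: Accept-Language header so that Varnish splits the cache only where conversion is actually possible.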