Support language variant conversion in Parsoid
Open, Low priority, Public

Description

This is the top-level tracker bug for LanguageConverter support in Parsoid.

Plan of record, roughly:

  • Phase 1: Parse all LC constructs into DOM (and round-trip them).

    This is sufficient to allow VE to edit LC wikis in the same fashion as the wikitext editor, with a mix of variants displayed during editing.
  • Phase 2: Actually run conversion on the DOM, using the parsed constructs.

    This is sufficient for "read-view" use of Parsoid output, for example in the mobile frontend, for Google indexing, etc.
  • Phase 3 (speculative): Use selective serialization to allow VE to operate on the converted text.

    This allows "single variant" editing, without the chaotic mix of variants shown in wikitext editing, and uses selective serialization to preserve the original variant of unedited text.
  • Phase 4 (speculative): Introduce new LC syntax or Glossary features which are a better match for future plans.

    This would avoid the "from this point forward" behavior of LC rules, which complicates incremental update, as well as the use of templates as a workaround for per-page glossaries. We might also introduce more pervasive language tagging in the source, to better match LC uses where the character set can't be used to distinguish variants (toy example: Pig Latin vs. English). See the markup sketch below.
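
For reference, the LC constructs in question look roughly like this in wikitext (an illustrative sketch of the common forms only; see the LanguageConverter documentation for the full flag syntax):

  -{zh-hans:计算机; zh-hant:電腦}-     <- inline rule: display a different form per variant
  -{H|zh-hans:计算机; zh-hant:電腦}-   <- hidden rule: adds a conversion-table entry that
                                         applies "from this point forward" on the page
  -{R|text}-                           <- raw: display the enclosed text with no conversion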

Details

Reference
bz41716

Related Objects

jmadler added a subscriber: jmadler.Jan 6 2016, 5:13 AM
brion added a subscriber: brion.Aug 24 2016, 8:57 PM

We discussed this a little in the ArchCom meeting; here are some quick notes:

  • add a 'phase 0' to define a 'sane subset' of the existing markup behavior that we recommend supporting
  • figure out how to do the 'phase 1' in parsoid-land <- this gets us to a place where we might be able to use VE on non-Chinese wikis using LC
    • then figure out how to get VE to make the definition blocks display/edit sanely (phase 2) <- should be enough to get Chinese editable but with the mixed scripts
  • later, figure out how (or if) to do full VE-side application of conversion for display during editing without changing the underlying data that gets saved back (phase 3) -- this is potentially *very* hard.

The phase 0 syntax limitation would essentially mark some things as 'undefined behavior' for a spec -- such as using the vocab definitions to change markup or HTML elements -- and would make the display simpler and the editing MUCH simpler.
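
As a toy illustration (hypothetical markup, not taken from the meeting), the sort of rule phase 0 would likely declare undefined is one whose variant text smuggles in markup:

  -{zh-hans:<b>简体</b>; zh-hant:text}-   <- variant output that injects HTML elements;
                                            under a phase 0 'sane subset' this would be
                                            'undefined behavior' rather than supported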

LanguageConverter markup is back on my plate over in Parsoid land. I'll be dusting off the existing Parsoid patches as a first step.

Excellent! Brion's phase 0 ("define a 'sane subset' of the existing markup behavior that we recommend supporting") seems like something that should be filed as an RFC. Should we try to do that as part of T142803, or does it need its own RFC?

Change 140235 had a related patch set uploaded (by C. Scott Ananian):
WIP: parse language converter markup.

https://gerrit.wikimedia.org/r/140235

Liuxinyu970226 changed the task status from Open to Stalled.Jan 1 2017, 5:43 AM

Stalled per that patch: "Main test build failed." or "Merge Failed." has happened too many times.

Legoktm changed the task status from Stalled to Open.Jan 1 2017, 10:04 AM
Legoktm added a subscriber: Legoktm.

It's not stalled. Unless you actually know that a task is stalled, please don't mark it as such.

@cscott I was following the comments / commit message of this ticket. I see your Gerrit patch; it looks like it is waiting on:
https://gerrit.wikimedia.org/r/#/c/333997/

That patch is itself waiting on a lot of pages to be fixed up with some additional markup:
https://www.mediawiki.org/wiki/Parsoid/Language_conversion/Preprocessor_fixups

Is that about the state of things?

Is that process being automated or did you figure out a solution? Are there any other dependencies or anything else blocking?

Is there anything you need help with?

cscott added a comment.EditedMay 16 2017, 3:17 PM

There's an active effort on-wiki to make fixups, and quite a large number of pages have been fixed. The effort has been mentioned in Tech News for the past two weeks (https://meta.wikimedia.org/wiki/Tech/News/2017/19, https://meta.wikimedia.org/wiki/Tech/News/2017/20), and it looks likely to be merged next week (or so) for gradual roll-out.

On the Parsoid side, the blocking predecessor patch is currently https://gerrit.wikimedia.org/r/350867 which got a C+1 today and will likely be merged shortly. We'll want to deploy that carefully and watch for any new round-trip issues. (There are some bookkeeping issues with parser tests between core and Parsoid, but they are straightforward to address.) Assuming that deploying 350867 goes well, the actual language converter patch is https://gerrit.wikimedia.org/r/140235 and should be straightforward to deploy, although we'll want to double-check that there aren't any unexpected VE interactions.

That will complete the first stage, which is correctly parsing language converter markup; that's "phase 1" in the summary above. The next step is to actually process the parsed markup and apply conversions, which allows "read view" use of Parsoid output (for mobile, etc.) and opens the door to work on some VE support.

@Fjalapeno wrt "Is there anything you need help with" -- talk to User:DePiep if you would like to help with the on-wiki fixup (or just jump in at https://www.mediawiki.org/wiki/Parsoid/Language_conversion/Preprocessor_fixups/20170501 ). If you're asking about helping on the code side, I'd say I could use some help on the VE side, starting with "phase 2" above -- now that Parsoid can emit markup for LanguageConverter constructs, VE needs a specialized editor to allow users to edit those constructs. That would bring VE to equivalence with the wikitext editor for zhwiki and friends.

@cscott thanks for the update… sorry for my late reply… Hackathon and then vacation. I'll check in on the preprocessor fixups and see how that's going.

Change 140235 merged by jenkins-bot:
[mediawiki/services/parsoid@master] Parse and serialize language converter markup.

https://gerrit.wikimedia.org/r/140235

cscott updated the task description. (Show Details)Jun 29 2017, 2:25 PM

We just merged a patch for "Phase 1" support of LC in Parsoid (using the phase descriptions I just updated in the task summary).

Mentioned in SAL (#wikimedia-operations) [2017-07-31T20:33:25Z] <cscott> Updated Parsoid to version 08114f35 (T43716, T154718, T166413)

Jdforrester-WMF updated the task description. (Show Details)

Change 396538 had a related patch set uploaded (by C. Scott Ananian; owner: C. Scott Ananian):
[mediawiki/services/parsoid@master] Create skeleton of language variant support in Parsoid API

https://gerrit.wikimedia.org/r/396538

Change 396538 merged by jenkins-bot:
[mediawiki/services/parsoid@master] Create skeleton of language variant support in Parsoid API

https://gerrit.wikimedia.org/r/396538

ssastry moved this task from Backlog to Read Views on the Parsoid board.Jan 11 2018, 9:43 PM
Amire80 moved this task from Untriaged to Script conversion on the I18n board.Feb 4 2018, 10:48 AM

Given that one of the transformations has now been merged, we actually need a way to access it and to transform the HTML stored in RESTBase. What do you think about the following API in Parsoid?

POST /transform/html/to/html
Params:
 - Accept-Language in headers
 - html in body
cscott added a subscriber: Arlolra.Jun 5 2018, 8:03 PM

@Arlolra Is that ^ consistent with the other html2html endpoints we've implemented?

Unfortunately, that endpoint is a bit of a mess; see https://github.com/wikimedia/parsoid/blob/master/lib/api/routes.js#L568-L570

But, I'd expect something more along the lines of the following to work,

POST /transform/pagebundle/to/pagebundle
Params:
 - Accept-Language in headers
 - original.html in body

Hm, we would be sending only the HTML, though, and would need only the HTML back. Supplying the data-parsoid as well would increase the load and latency. Do you expect /html/to/html to be revised soon?

ssastry added a comment.EditedJun 6 2018, 2:24 PM

Note that pb2pb is that endpoint. Depending on the specific conversion operation, only some parts of the pagebundle might actually be required. So, you don't have to post data-parsoid in this case.

Consider the case where we split data-mw into a different bucket: data-mw would then be posted as a separate param when it is required for the conversion. So, pb2pb is the correct generic endpoint.

In T114413#2365456, I indicated that for all pb2pb endpoints we should introduce an additional parameter that explicitly specifies the required conversion, to eliminate complexity (the mess that Arlo refers to above). So, we will likely add that to this pb2pb endpoint.

cscott added a comment.Jun 6 2018, 5:00 PM

Note to self: I probably should make sure LanguageConverter doesn't require access to data-parsoid.

cscott added a comment.EditedJun 6 2018, 9:26 PM

After some discussion on IRC (and review of T114413) I'm proposing the following API:

POST /transform/pagebundle/to/pagebundle
Request:

original: {
 html: {
  headers: {
    'content-type': 'text/html; charset=utf-8; profile="https://mediawiki.org/wiki/Specs/DOM/1.7.0"'
  },
  body: '<html>...</html>'
 },
},
updates: {
  variant: { source: 'en', target: 'en-x-piglatin' }
}

The variant.source property can be omitted (i.e., left undefined), in which case Parsoid will attempt to guess the source variant in order to support round-tripping. Setting source to null will disable round-trip support (useful for display-only use cases). Setting target to the special value 'x-roundtrip' will use embedded round-trip metadata to attempt to convert the HTML back to the original source variant.

For example, Visual Editor might use variant: { source: 'en', target: 'en-x-piglatin' } on English Wikipedia, where it is known that all articles are stored in English, not Pig Latin. (Some other wikis have similar "we always write in one specific variant" conventions.) When saving the edited document, it would use variant: { source: 'en-x-piglatin', target: 'x-roundtrip' } to convert it back to the original English text.

If an editor were to shift VE from zh-cn to zh-tw in the middle of an edit, two requests would probably have to be made: variant: { source: 'zh-cn', target: 'x-roundtrip' } to restore the original wikitext, then variant: { target: 'zh-tw' } on the result in order to convert to the user's new variant preference. At the moment we don't support combining these requests, but we might do so in the future.

MCS would use variant: { source: null, target: '...' } when localizing summaries or Wikidata text for display.

At the moment we don't support combining a variant update with another sort of update (redlinks, etc), but we might do so in the future.

EDIT: updated with Arlo's correction below.
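
To make the proposal concrete, here's a minimal client sketch against that endpoint (assumptions, not settled API: a Parsoid service at http://localhost:8000, the v3 route layout, the en.wikipedia.org domain, and the node-fetch package):

  const fetch = require('node-fetch');

  // POST a pagebundle and request a variant conversion via updates.variant.
  async function convertVariant(html, source, target) {
    const res = await fetch('http://localhost:8000/en.wikipedia.org/v3/transform/pagebundle/to/pagebundle', {
      method: 'POST',
      headers: { 'content-type': 'application/json' },
      body: JSON.stringify({
        original: {
          html: {
            headers: {
              'content-type': 'text/html; charset=utf-8; profile="https://mediawiki.org/wiki/Specs/DOM/1.7.0"'
            },
            body: html
          }
        },
        // Omit source to let Parsoid guess it; pass null to disable round-tripping.
        updates: { variant: { source: source, target: target } }
      })
    });
    return res.json(); // a pagebundle; the converted document should be in .html.body
  }

  // e.g.: convertVariant('<html>...</html>', 'en', 'en-x-piglatin')
  //         .then(pb => console.log(pb.html.body));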

For consistency,

Request:

original: { html: { ... } },
updates: { ... }

This API strikes me as complicated for an HTML-to-HTML transliteration. Namely, RB would need to completely reconstruct every request made to it for any variant other than the default one, instead of simply getting the HTML and adding the Accept-Language header to it. For round-tripping, do I understand correctly that two requests would need to be made: one to tell Parsoid we want round-tripping, and another to specify the actual target? Wouldn't something like { source: 'zh-cn', target: 'zh-tw', roundtrip: true } work?

cscott added a comment.Jun 7 2018, 8:29 PM

No, round-tripping is the default. Specify source: null to explicitly disable it, but since the only reason to do so is to slim down the HTML a bit, maybe I don't even need to complicate the API for that.

The two requests example above is just for on-the-fly variant switching *while editing*. In that case you need to do a little dance instead of trying to convert directly from one variant to the other in order to ensure the round-trip information is preserved.

In most cases, you'd take the HTML from Parsoid, stuff it into a JSON blob as original.html, add updates.variant.target = 'my-target-variant', and send it to the pb2pb endpoint.

Or, you know, just ask for the variant you want directly from wt2html using the Accept-Language header...
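
That direct path would presumably look something like this (a sketch in the same style as the proposals above; the exact route shape is whatever the deployed wt2html endpoint uses):

POST /transform/wikitext/to/html
Params:
 - Accept-Language: zh-tw in headers
 - wikitext in body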

cscott added a comment.Jun 8 2018, 4:54 PM

I'm inclined to implement the pb2pb endpoint compliant with T114413 first; then, if we find that there's a significant efficiency loss from JSON-encoding the HTML, we can talk about adding a new specialized endpoint?

We had a meeting with @cscott yesterday; here are a couple of notes from our discussion worth mentioning:

  1. By default, we will return the "natural" variant: the HTML corresponding to the mixed-variant wikitext stored in the database. The default will also be returned if no Accept-Language is provided, or if Accept-Language has a value that isn't supported for a particular page's language.
  2. Looking at the domain in RESTBase is not enough for splitting the Varnish cache, or for deciding whether to even look at Accept-Language and go to Parsoid for transformation, since LanguageConverter is actually enabled on all wikis; for example, even on English Wikipedia certain pages can have a different page language that supports conversion. This is mostly important for multi-language wikis like mediawiki.org. For cache-splitting, Parsoid could provide the page language in some meta tag (one possible shape is sketched below); however, that's not very convenient for deciding whether to go to Parsoid for transformation, and at the least it's not easy to bootstrap, because all pages would have to be re-rendered and re-stored for it to work reliably. I'm evaluating the possibility of including this info in the title_revision table so that RESTBase could decide on its own.
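
For reference, one possible shape for that page-language hint (a sketch only, not a settled design): Parsoid could emit something like

  <meta http-equiv="content-language" content="zh"/>

in the document <head>, and responses for convertible pages could carry a Vary: Accept-Language header so that Varnish splits the cache only where conversion is actually possible.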