
[EPIC] Support language variant conversion in Parsoid
Open, Low
Public

Description

This is the top-level tracker bug for LanguageConverter support in Parsoid.

Plan of record, roughly:

  • Phase 1: Parse all LC constructs into DOM (and round-trip them).

    This is sufficient to allow VE to edit LC wikis in the same fashion as the wikitext editor, with a mix of variants displayed during editing.
  • Phase 3 (speculative): Use selective serialization to allow VE to operate on the converted text.

    This allows "single variant" editing, without the chaotic mix of variants shown in wikitext editing, and uses selective serialization to preserve the original variant of unedited text.
  • Phase 4 (speculative): Introduce new LC syntax or Glossary features which are a better match for future plans.

    This would avoid the "from this point forward" behavior of LC rules, which complicates incremental updates, and would also avoid the use of templates as a workaround for per-page glossaries. We might also introduce more pervasive language tagging in the source, to better match LC uses where the character set can't be used to distinguish the variant (toy example: Pig Latin vs. English).

Related Objects

This task is connected to more than 200 other tasks. Only direct parents and subtasks are shown here.
Status     Subtype     Assigned         Task
Resolved               Jdforrester-WMF
Resolved               cscott
Invalid                None
Duplicate              None
Resolved               MarkTraceur
Resolved               Jdlrobson
Resolved               Pchelolo
Resolved               Jdlrobson
Open                   None
Open                   None
Open                   None
Open                   None
Open                   None
Resolved               cscott
Invalid                GWicke
Resolved               liangent
Open                   None
Duplicate  BUG REPORT  None
Resolved               cscott
Open                   None
Open                   None
Resolved   BUG REPORT  Jgiannelos
Open                   None
Open       BUG REPORT  None
Open       BUG REPORT  None
Open       BUG REPORT  None
Open       BUG REPORT  None

Event Timeline

There are a very large number of changes, so older changes are hidden.

No, round-tripping is the default. Specify source: null to explicitly disable it, but the only reason to do so is to slim down the HTML a bit, so maybe I don't even need to complicate the API for that.

The two-request example above is just for on-the-fly variant switching *while editing*. In that case you need to do a little dance, instead of trying to convert directly from one variant to the other, to ensure the round-trip information is preserved.

In most cases, you'd take the HTML from parsoid, stuff it into a JSON blob as original.html and then add updates.variant.target = 'my-target-variant' and send it to the pb2pb endpoint.
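
The request body described above could be assembled roughly as follows. This is a minimal sketch, not the actual Parsoid client code: only the field names original.html and updates.variant.target and the source: null opt-out come from the comments here. The helper name build_variant_request, the use of a plain string for original.html, and the sample HTML are assumptions for illustration; the real API may wrap the HTML differently.

```python
import json


def build_variant_request(parsoid_html, target_variant, disable_roundtrip=False):
    """Build a JSON body for a variant-conversion pb2pb request (illustrative sketch)."""
    variant = {"target": target_variant}
    if disable_roundtrip:
        # "source: null" is the explicit opt-out mentioned above; JSON null is None in Python.
        variant["source"] = None
    body = {
        # Assumption: the HTML is passed as a plain string under original.html.
        "original": {"html": parsoid_html},
        "updates": {"variant": variant},
    }
    # ensure_ascii=False keeps non-ASCII text (e.g. Chinese) readable in the payload.
    return json.dumps(body, ensure_ascii=False)


# Hypothetical usage: convert previously fetched Parsoid HTML to one target variant.
payload = build_variant_request("<p>sample Parsoid HTML</p>", "my-target-variant")
```

The resulting string would then be sent as the body of a POST to the pb2pb endpoint; the exact URL is left out because it is not stated in this thread.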

Or, you know, just ask for the variant you want directly from wt2html using the Accept-Language header...
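
For the direct route, a request carrying the Accept-Language header might look like the sketch below. The function name build_wt2html_request and the example.org URL are placeholders assumed for illustration; the thread only states that the header is sent to wt2html. The request is built but not sent.

```python
import urllib.request


def build_wt2html_request(endpoint_url, variant):
    """Prepare (but do not send) a GET request asking for a specific variant."""
    return urllib.request.Request(
        endpoint_url,
        # The desired variant is carried in the standard Accept-Language header.
        headers={"Accept-Language": variant},
        method="GET",
    )


# Hypothetical usage; the URL is a placeholder, not a real Parsoid endpoint.
request = build_wt2html_request("https://example.org/wt2html/Some_Page", "zh-tw")
```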

I'm inclined to first implement the pb2pb endpoint in compliance with T114413. Then, if we find a significant efficiency loss from JSON-encoding the HTML, we can talk about adding a new specialized endpoint?

We had a meeting with @cscott yesterday, and here are a couple of notes from our discussion worth mentioning:

  1. By default, we will return the "natural" variant: the HTML corresponding to the mixed-variant wikitext stored in the database. This default is returned if no Accept-Language header is provided, or if the header has a value that's not supported for the particular language.
  2. Looking at the domain in RESTBase is not enough for splitting the Varnish cache, or for deciding whether to even look at the Accept-Language header and go to Parsoid for transformation. LanguageConverter is actually enabled on all wikis, so even on English Wikipedia certain pages can have a different page language that supports conversion. This matters mostly for multi-language wikis like mediawiki.org. For cache splitting, Parsoid could provide the page language in some meta tag. For deciding whether to go to Parsoid for transformation, however, that's not very convenient, or at least not easy to bootstrap, because all pages must be re-rendered and re-stored for it to work reliably. I'm evaluating the possibility of including this info in the title_revision table so that RESTBase can decide on its own.
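
The fallback rule in these notes can be sketched as a small decision function keyed on the page language rather than the wiki domain. Everything named here is hypothetical: the VARIANTS_BY_PAGE_LANGUAGE table is a tiny illustrative subset, choose_variant is not an existing RESTBase or Parsoid function, and the header parsing deliberately ignores q-values for brevity. A return value of None stands for the "natural" (unconverted) variant.

```python
# Illustrative subset only; a real service would take this from MediaWiki configuration.
VARIANTS_BY_PAGE_LANGUAGE = {
    "zh": {"zh-hans", "zh-hant", "zh-cn", "zh-tw", "zh-hk"},
    "sr": {"sr-ec", "sr-el"},
}


def choose_variant(accept_language, page_language):
    """Return the variant to convert to, or None to serve the natural variant."""
    # The decision depends on the page language, not on the wiki domain.
    supported = VARIANTS_BY_PAGE_LANGUAGE.get(page_language, set())
    if not accept_language:
        return None  # no header: serve the natural variant
    for item in accept_language.split(","):
        # Drop any ";q=..." parameter; q-value ordering is ignored in this sketch.
        tag = item.split(";", 1)[0].strip().lower()
        if tag in supported:
            return tag
    return None  # unsupported value: fall back to the natural variant
```

With this shape, a page whose language has no variants never needs a trip to Parsoid for conversion, which is the decision the notes want RESTBase to be able to make on its own.
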
ssastry raised the priority of this task from Low to Needs Triage. Sep 20 2018, 4:01 PM
ssastry triaged this task as High priority.

In my opinion, it would be best to dump the whole LanguageConverter -{ }- markup, which is used to define a specific variant translation for one term, and use data from Wikidata instead. Wikidata can store language variant info and can be used across all Wikimedia projects, rather than having volunteers manually maintain the same CGroup across multiple projects.

LGoto lowered the priority of this task from High to Medium. Mar 13 2020, 4:19 PM
LGoto moved this task from Missing Functionality to Future Ideas on the Parsoid board.

See also the Glossary RFC (T484). Unfortunately, glossaries tend to be topic-specific (the dictionary you'd use for a pop culture article about movies may not be appropriate for a science article), but the glossary could certainly source the variants from Wikidata. It would also be useful to be able to reference glossaries globally, so that pages in zh in places other than zhwiki (i.e., Commons, mediawiki.org, wikimania.org, etc.) can use the constructed glossaries.

ssastry renamed this task from Support language variant conversion in Parsoid to [EPIC] Support language variant conversion in Parsoid. Jul 15 2020, 5:55 PM
ssastry added a project: Parsoid-Rendering.

@cscott: Hi, I'm resetting the task assignee due to inactivity. Please feel free to reclaim this task if you plan to work on this - it would be welcome! Also see https://www.mediawiki.org/wiki/Bug_management/Assignee_cleanup for more information - thanks!

This ticket is important for the openZIM/Kiwix community, and in particular for its Chinese audience. See https://github.com/openzim/mwoffliner/issues/840

MSantos lowered the priority of this task from Medium to Low. Jun 26 2023, 3:16 PM