Support language variant conversion in Parsoid
Open, Low, Public

Description

This is the top-level tracker bug for LanguageConverter support in Parsoid.

Plan of record, roughly:

Phase 1: Parse all LC constructs into DOM (and round-trip them).

This is sufficient to allow VE to edit LC wikis in the same fashion as the wikitext editor, with a mix of variants displayed during editing.

Phase 2: Actually run conversion on the DOM, using the parsed constructs.

This is sufficient for "read-view" use of Parsoid output, for example in mobile frontend, for Google indexing, etc.

Phase 3 (speculative): Use selective serialization to allow VE to operate on the converted text.

This allows "single variant" editing, without the chaotic mix of variants shown in wikitext editing, and uses selective serialization to preserve the original variant of unedited text.

Phase 4 (speculative): Introduce new LC syntax or Glossary features which are a better match for future plans.

This would avoid the "from this point forward" behavior of LC rules, which complicates incremental update, and would also avoid the use of templates as a workaround for per-page glossaries. We might also introduce more pervasive language tagging in the source, to better match LC uses where the character set can't be used to distinguish variants (toy example: Pig Latin vs. English).

Details

Reference
bz41716


dchan added a comment. Sep 11 2013, 8:01 PM

I think we should be *extremely* restrictive about where language rules can leak. This is because they lead to several problems:

(1) Rule changes make it hard to give a faithful real-time view of *any* plaintext.

(2) Rule changes can cause unexpected errors in distant text.

(3) Few people can proofread both zh-Hans and zh-Hant. Therefore, almost anyone who makes an edit will be unable to proofread at least one of the variants it might affect.

On the other hand, leaking currently allows pages to import rules. I think we should preserve this facility but make it more separate.

  1. In general, there should be no leakage: rules should be page-global, and should not leak into or out of templates. This means template *arguments* should be subject to the rules of the page in which they are written, but text generated by a template should not.
  2. As an exception to the "no leakage" rule, there should be a new type of template called a Glossary, whose only purpose is to leak rules into the calling page. That way, language rules are completely separate and independent of any other template behaviour. These Glossaries should be referenced at the top of the page only.
  3. The page which defines a template is free to use rules and Glossaries too. But they will only affect the text generated by the template itself -- they won't leak into any text defined in the calling page. This includes the arguments passed into the template, because they're written in the calling page.

As you can see, this is just cscott's "Global" proposal, but with the additional restriction that the templates that leak rules cannot have any other functionality.

David, Roan, Scott, Subbu and I met in the office to discuss this. Short summary of the plans for the next steps:

  1. Find nesting issues and see if we can fix them up with a bot. Also investigate use cases for markup in variant conversion rules.
  2. Parse all -{ }- syntax and represent it in the DOM. Exact spec TBD in https://www.mediawiki.org/wiki/Parsoid/MediaWiki_DOM_spec#Language_conversion_blocks. Render the default variant according to the fallback chain for output-producing rules.
  3. Enable editing of inline (once-only) rules in the VE. Most rule table modifications seem to be templated and will not be applied, so are not directly relevant. Rules that only modify the table but produce no output directly in page content can be represented as mw:Placeholder and will simply be preserved.

This will make the VE usable for typical editors on variant-enabled wikis without requiring the variant conversion overhaul to be done first.
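
A minimal Python sketch of the fallback-chain rendering mentioned in step 2. The `FALLBACKS` table and the `pick_variant_text` helper are hypothetical names made up for illustration; the real chains come from MediaWiki's language configuration and are not reproduced here.

```python
from typing import Dict, List, Optional

# Hypothetical fallback chains, for illustration only.
FALLBACKS: Dict[str, List[str]] = {
    "zh-cn": ["zh-hans", "zh"],
    "zh-tw": ["zh-hant", "zh-hk", "zh"],
}


def pick_variant_text(rule: Dict[str, str], variant: str) -> Optional[str]:
    """Return the text an output-producing rule yields for `variant`.

    Tries the variant itself first, then each code in its fallback chain.
    """
    for code in [variant, *FALLBACKS.get(variant, [])]:
        if code in rule:
            return rule[code]
    return None  # no entry matched; the caller decides what to render


# An inline rule that only defines two variants.
rule = {"zh-hans": "博客", "zh-hk": "網誌"}
print(pick_variant_text(rule, "zh-cn"))
print(pick_variant_text(rule, "zh-tw"))
```

Under these assumed chains, a zh-cn reader gets the zh-hans text and a zh-tw reader falls through to the zh-hk text, since the rule defines neither zh-tw nor zh-hant.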

For the longer-term strategy, we (mostly) agreed on:

  1. Add the capability to associate an ordered list of glossaries with a page. These can either be stored in a separate namespace, or something like Special:Glossary. They should be revision-controlled and machine-readable for processing and UI purposes (JSON).
  2. Add the capability to add page-specific rules that override glossary rules. Only glossaries and global rules associated with the top-level page itself are considered. This makes the set of conversion rules independent of dynamic template expansions.
  3. Apply the combined rule set to the entire page including templated content. Rationales:
    • Simple mental model
    • Efficient to implement
    • Consistent conversion of passed-in content, even if it is massaged further during transclusion expansion
    • Content in templates (labels, also real content in some infoboxes) themselves can still be protected or converted differently with local inline rules, as is done right now
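
The precedence described in steps 1 and 2 could be sketched roughly as follows, in Python. This is only an illustration: `combine_rules` is a hypothetical helper, rules are reduced to a plain term-to-replacement mapping for a single target variant, and the choice that later glossaries in the ordered list win over earlier ones is an assumption that the plan above does not specify.

```python
from typing import Dict, List

# source term -> replacement, for one target variant (simplified)
Rules = Dict[str, str]


def combine_rules(glossaries: List[Rules], page_rules: Rules) -> Rules:
    """Build the single rule set that is applied to the whole page."""
    combined: Rules = {}
    for glossary in glossaries:
        # Assumption: a later glossary overrides an earlier one.
        combined.update(glossary)
    # Page-specific rules override glossary rules (step 2).
    combined.update(page_rules)
    return combined


glossaries = [{"blog": "博客", "mouse": "鼠标"}, {"blog": "部落格"}]
page_rules = {"mouse": "滑鼠"}
print(combine_rules(glossaries, page_rules))
```

Because the inputs are only the glossaries and rules attached to the top-level page, the result does not depend on what any template expands to.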

The details on how this can be implemented depend on whether we reach our goal of implementing multi-part revision storage that we can use for metadata by the next quarter.

PS @David: Conversion rules should be passed into a pure function that converts each template expansion. Nothing at all should leak; otherwise our function would no longer be pure, and we could no longer efficiently update template expansions independently.
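
To illustrate why purity matters here, a toy Python sketch follows. `convert_expansion` is a hypothetical name, and plain sequential `str.replace` stands in for the real converter, which does considerably more than this. The point is only that the result depends on nothing but the two arguments, so it can be cached and recomputed per expansion.

```python
from functools import lru_cache
from typing import Tuple

# Immutable rule set (pairs of source term, replacement) so it can be a cache key.
RuleSet = Tuple[Tuple[str, str], ...]


@lru_cache(maxsize=None)
def convert_expansion(expansion: str, rules: RuleSet) -> str:
    """Convert one template expansion.

    Pure: reads only its arguments and mutates no shared rule table,
    so an expansion is reconverted only when it or the rules change.
    """
    out = expansion
    for source, target in rules:
        out = out.replace(source, target)  # toy stand-in for real conversion
    return out


rules: RuleSet = (("blog", "weblog"),)
print(convert_expansion("my blog", rules))
```

If rules defined inside one expansion could leak into the next, the result for a given expansion would depend on everything before it on the page, and this kind of per-expansion caching would no longer be valid.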

(In reply to comment #37)

> David, Roan, Scott, Subbu and I met in the office to discuss this. Short
> summary of the plans for the next steps:
>
> 1. Find nesting issues and see if we can fix them up with a bot. Also investigate use cases for markup in variant conversion rules.

Why do we want to get rid of nested -{}- markups? It's useful in some cases.

See [[模块:Template:地区用词]] which has a wrapper at [[Template:地区用词3]] (proposed replacement for [[Template:地区用词]]).

Try to expand a [[Template:地区用词3]] call and see its result:

{{地区用词3|zh-cn=cn|zh-tw=tw}}

(In reply to comment #38)

> (In reply to comment #37)
> > David, Roan, Scott, Subbu and I met in the office to discuss this. Short
> > summary of the plans for the next steps:
> >
> > 1) Find nesting issues and see if we can fix them up with a bot. Also
> > investigate use cases for markup in variant conversion rules.
>
> Why do we want to get rid of nested -{}- markups? It's useful in some cases.
>
> See [[模块:Template:地区用词]] which has a wrapper at [[Template:地区用词3]] (proposed
> replacement for [[Template:地区用词]]).
>
> Try to expand a [[Template:地区用词3]] call and see its result:
>
> {{地区用词3|zh-cn=cn|zh-tw=tw}}

Liangent: Gwicke did not fully explain the nesting issue we were talking about.

What we had in mind was use in attributes. Example: -{zh-cn=<span style='color:red';zh-tw=<span style='color:green'}-foo</span>. We are proposing using a bot to fix this to: <span style='-{zh-cn=color:red;zh-tw=color:green}-'>foo</span>. The rewritten form has the property that all HTML snippets have a well-formed DOM representation, whereas the original does not.

(In reply to comment #39)

> (In reply to comment #38)
> > (In reply to comment #37)
> > > David, Roan, Scott, Subbu and I met in the office to discuss this. Short
> > > summary of the plans for the next steps:
> > >
> > > 1) Find nesting issues and see if we can fix them up with a bot. Also
> > > investigate use cases for markup in variant conversion rules.
> >
> > Why do we want to get rid of nested -{}- markups? It's useful in some cases.
> >
> > See [[模块:Template:地区用词]] which has a wrapper at [[Template:地区用词3]] (proposed
> > replacement for [[Template:地区用词]]).
> >
> > Try to expand a [[Template:地区用词3]] call and see its result:
> >
> > {{地区用词3|zh-cn=cn|zh-tw=tw}}
>
> Liangent: Gwicke did not fully explain the nesting issue we were talking about.
>
> What we had in mind was use in attributes. Ex:-{zh-cn=<span style='color:red';zh-tw=<span style='color:green'}-foo</span>. We are proposing using a bot to fix this to: <span style='-{zh-cn=color:red;zh-tw=color:green}-'>foo</span>. The rewritten form has the property that all HTML snippets have a well-formed DOM representation whereas the original does not.

Oh that's what we've discussed before - and that's fine.

Change 50767 abandoned by GWicke:
Revert "(bug 41716) Add variant config to siprop=general"

Reason:
This ship has sadly sailed. Too late to clean it up I guess. Sigh.

https://gerrit.wikimedia.org/r/50767

@liangent: can you describe the use cases for "the other kind" of nested markup?
That is, -{ }- inside -{ }-?

Our proposed DOM tree (https://www.mediawiki.org/wiki/Parsoid/MediaWiki_DOM_spec/Language_conversion_blocks) can handle:

-{ foo -{ bar }- bat }-

but not

foo-{zh-cn:blog -{ nested }-; zh-hk:WEBJOURNAL; zh-tw:WEBLOG;}- quux

etc.

How are nested -{ }- markups of this sort actually used?
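
For locating such cases in existing wikitext, a rough Python sketch of a scanner that reports each balanced -{ }- block and its nesting depth. `find_lc_blocks` is a hypothetical helper and not Parsoid's tokenizer. It ignores nowiki, comments, templates and ambiguous sequences such as `}-{`, and it does not distinguish nesting in plain block content from nesting inside a variant's value, which is the distinction that matters for the DOM spec.

```python
from typing import List, Tuple


def find_lc_blocks(text: str) -> List[Tuple[int, int, int]]:
    """Return (start, end, depth) for every balanced -{ ... }- block.

    depth is 0 for an outermost block, 1 for a block nested in it, etc.
    """
    blocks: List[Tuple[int, int, int]] = []
    open_starts: List[int] = []  # positions of currently unclosed "-{"
    i = 0
    while i < len(text) - 1:
        pair = text[i:i + 2]
        if pair == "-{":
            open_starts.append(i)
            i += 2
        elif pair == "}-" and open_starts:
            start = open_starts.pop()
            blocks.append((start, i + 2, len(open_starts)))
            i += 2
        else:
            i += 1
    return sorted(blocks)


text = "foo-{zh-cn:blog -{ nested }-; zh-hk:WEBJOURNAL; zh-tw:WEBLOG;}- quux"
for start, end, depth in find_lc_blocks(text):
    print(depth, text[start:end])
```

Any block reported with depth 1 or more is a candidate for the nesting survey; deciding which kind of nesting it is would still need a parse of the variant list in the enclosing block.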

They're mostly used in templates. See also [[zh:Template:DISPLAYTITLE]] and [[zh:Module:Template:地区用词]].

Change 140235 had a related patch set uploaded by Cscott:
WIP: parse language converter markup.

https://gerrit.wikimedia.org/r/140235

I've written up some notes about nested conversion blocks and other discoveries at https://www.mediawiki.org/wiki/Parsoid/MediaWiki_DOM_spec/Language_conversion_blocks#Notes

(In reply to Liangent from comment #0)

> Phase 1: Capsule conversion syntax (-{}- markups) into non-editable blocks
> to avoid breakage.

(In reply to Gabriel Wicke from comment #2)

> It would be good to look into implementing phase 1 (recognize and protect
> language conversion content).

I understand this has some value in itself for PDF export, see bug 34919 comment 17.

(In reply to Nemo from comment #46)

> I understand this has some value in itself for PDF export, see bug 34919
> comment 17.

And more were filed, like bug 71815. Should they all depend on this?

Does this really depend on bug 43547? Maybe this should just be converted to a tracking bug so that we're free to add dependencies without hairsplitting.

ssastry moved this task from Backlog to Needs Discussion on the Parsoid board. Dec 9 2014, 5:23 PM
ssastry moved this task from Needs Discussion to In Progress on the Parsoid board. Feb 2 2015, 2:50 PM
ssastry moved this task from In Progress to Backlog on the Parsoid board. May 20 2015, 9:41 PM

Change 50136 abandoned by Jforrester:
(bug 41716) Tokenize language variant conversions

Reason:
Wrong repo now.

https://gerrit.wikimedia.org/r/50136

Restricted Application added a subscriber: Aklapper. Oct 25 2015, 8:36 AM
jmadler added a subscriber: jmadler. Jan 6 2016, 5:13 AM
GWicke added a subscriber: bearND. Jan 6 2016, 6:27 PM
brion added a subscriber: brion. Aug 24 2016, 8:57 PM

We discussed this a little in the ArchCom meeting; here are some quick notes:

  • add a 'phase 0' to define a 'sane subset' of the existing markup behavior that we recommend supporting
  • figure out how to do the 'phase 1' in parsoid-land <- this gets us to a place where we might be able to use VE on non-Chinese wikis using LC
    • then figure out how to get VE to make the definition blocks display/edit sanely (phase 2) <- should be enough to get Chinese editable but with the mixed scripts
  • later, figure out how (or if) to do full VE-side application of conversion for display during editing without changing the underlying data that gets saved back (phase 3) -- this is potentially *very* hard.

The phase 0 syntax limitation would essentially mark some things as 'undefined behavior' for a spec -- such as using the vocab definitions to change markup or html elements -- and would make the display simpler and the editing MUCH simpler.

LanguageConverter markup is back on my plate over in Parsoid land. I'll be dusting off the existing Parsoid patches as a first step.

> LanguageConverter markup is back on my plate over in Parsoid land. I'll be dusting off the existing Parsoid patches as a first step.

Excellent! Brion's phase 0 ("define a 'sane subset' of the existing markup behavior that we recommend supporting") seems like something that should be filed as an RFC. Should we try to do that as part of T142803, or does it need its own RFC?

Change 140235 had a related patch set uploaded (by C. Scott Ananian):
WIP: parse language converter markup.

https://gerrit.wikimedia.org/r/140235

Liuxinyu970226 changed the task status from Open to Stalled. Jan 1 2017, 5:43 AM

Stalled per that patch: "Main test build failed." or "Merge Failed." has happened too many times.

Legoktm changed the task status from Stalled to Open. Jan 1 2017, 10:04 AM
Legoktm added a subscriber: Legoktm.

> Stalled per that patch: "Main test build failed." or "Merge Failed." has happened too many times.

It's not stalled. Unless you actually know that a task is stalled, please don't mark it as such.

@cscott I was following the comments / commit message of this ticket. I see your Gerrit patch, it looks like it is waiting on:
https://gerrit.wikimedia.org/r/#/c/333997/

Which itself is waiting on a lot of pages to be fixed up with some additional markup:
https://www.mediawiki.org/wiki/Parsoid/Language_conversion/Preprocessor_fixups

Is that about the state of things?

Is that process being automated or did you figure out a solution? Are there any other dependencies or anything else blocking?

Is there anything you need help with?

cscott added a comment. Edited May 16 2017, 3:17 PM

There's an active effort on-wiki to make fixups, and quite a large number of pages have been fixed. The effort has been mentioned in Tech News for the past two weeks: https://meta.wikimedia.org/wiki/Tech/News/2017/19 https://meta.wikimedia.org/wiki/Tech/News/2017/20 and it looks likely to be merged in next week (or so) for gradual roll out.

On the Parsoid side, the blocking predecessor patch is currently https://gerrit.wikimedia.org/r/350867 which got a C+1 today and will likely be merged shortly. We'll want to deploy that carefully and watch for any new round-trip issues. (There are some bookkeeping issues with parser tests between core and Parsoid, but they are straightforward to address.) Assuming that deploying 350867 goes well, the actual language converter patch is https://gerrit.wikimedia.org/r/140235 and should be straightforward to deploy, although we'll want to double-check that there aren't any unexpected VE interactions.

That will complete the first stage, which is correctly parsing language converter markup. That's "phase 1" in the summary above. The next step is to actually process the parsed markup and apply conversions, which allows "read view" use of Parsoid markup for mobile, and to work on some VE support.

@Fjalapeno wrt "Is there anything you need help with" -- talk to User:DePiep if you would like to help with on-wiki fixup (or just jump in at https://www.mediawiki.org/wiki/Parsoid/Language_conversion/Preprocessor_fixups/20170501 ). If you're asking about helping on the code side, I'd say I could use some help on the VE side, starting with "phase 2" above -- now that Parsoid can emit markup for LanguageConverter constructs, VE needs a specialized editor to allow users to edit those constructs. That would bring VE to equivalence with the wikitext editor for zhwiki and friends.

@cscott thanks for the update… sorry for my late reply… Hackathon and then vacation. I'll check in on the preprocessor fixups and see how that's going.

Change 140235 merged by jenkins-bot:
[mediawiki/services/parsoid@master] Parse and serialize language converter markup.

https://gerrit.wikimedia.org/r/140235

cscott updated the task description. Jun 29 2017, 2:25 PM

We just merged a patch for "Phase 1" support of LC in Parsoid (using the phase descriptions I just updated in the task summary).

Mentioned in SAL (#wikimedia-operations) [2017-07-31T20:33:25Z] <cscott> Updated Parsoid to version 08114f35 (T43716, T154718, T166413)

Jdforrester-WMF updated the task description.