Support language variant conversion in Parsoid
Open, Public

Description

I don't expect this to happen soon; I'm just leaving this bug here.

The required features are described below. Some may belong to VisualEditor.

Phase 1: Encapsulate conversion syntax (-{}- markup) in non-editable blocks to avoid breakage.

Phase 2: Enable editing of these conversion blocks.

Phase 3: Convert all text in the DOM to the requested variant, and convert it back to the original variant when constructing wikitext. Don't change text to another variant if the user didn't edit that word in the DOM. Shadowing the whole text may be needed here.


Version: unspecified
Severity: normal

bzimport added projects: Parsoid, I18n. Via Conduit, Nov 22 2014, 1:08 AM
bzimport set Reference to bz41716.
liangent created this task. Via Legacy, Nov 2 2012, 10:45 PM
GWicke added a comment. Via Conduit, Nov 2 2012, 10:46 PM

Oh, that will be fun ;)

GWicke added a comment. Via Conduit, Feb 19 2013, 6:59 PM

It would be good to look into implementing phase 1 (recognize and protect language conversion content).

MarkTraceur added a comment. Via Conduit, Feb 20 2013, 12:55 AM

Liangent, can you please link us to documentation about how this works? Initial searches have been less than fruitful.

MarkTraceur added a comment. Via Conduit, Feb 20 2013, 1:17 AM

Next question: Should a construct like {{variantopen}}令{{variantclose}} work (assume it expands to -{令}-)? If not, would it be difficult to phase that construct out as deprecated and go forward with Parsoid not supporting it?

Thanks for your help, we'd love to get Parsoid working with zh-wikis.

liangent added a comment. Via Conduit, Feb 24 2013, 11:55 AM

(In reply to comment #3)

> Liangent, can you please link us to documentation about how this works? Initial searches have been less than fruitful.

Do you mean how it's done in the PHP parsing process, or what is expected to be done (a specification of the related syntax)?

liangent added a comment. Via Conduit, Feb 24 2013, 12:02 PM

(In reply to comment #4)

> Next question: Should a construct like {{variantopen}}令{{variantclose}} work (assume it expands to -{令}-)? If not, would it be difficult to phase that construct out as deprecated and go forward with Parsoid not supporting it?
>
> Thanks for your help, we'd love to get Parsoid working with zh-wikis.

That construct works in the PHP converter.

MarkTraceur added a comment. Via Conduit, Mar 27 2013, 10:02 PM

We really need to know about how this is *supposed* to go, and we need English documentation for it if our team is going to work on it. The current offerings are all in other languages I think.

GWicke added a comment. Via Conduit, Mar 28 2013, 5:24 PM

A few notes from IRC:

[09:59] <gwicke> marktraceur: I browsed the LanguageConverter source a bit
[09:59] <gwicke> there is an autoConvert method that simply converts all text based on a dictionary lookup
[10:00] <gwicke> it only excludes markup and script/code blocks
[10:00] <gwicke> the default search language for Chinese seems to be zh-hans (simplified)
[10:01] <gwicke> am not sure when the special conversion syntax is used in practice
[10:03] --> tewwy has joined this channel (~tychay@wikimedia/Tychay).
[10:03] <gwicke> conversion is restricted to those blocks when using convert() and convertTo()
[10:03] <gwicke> plus special conversion for link targets and headings
[10:04] <gwicke> the conversion itself is performed using autoConvert (the dictionary-based method)
[10:06] * cscott is reading backlog
[10:07] <cscott> yeah, i mentioned getting minority-language buy-in in the meeting yesterday, thinking specifically of how hard it's been to get i18n feedback
[10:07] --> HaeB has joined this channel (~quassel@wikipedia/HochaufeinemBaum).
[10:08] <cscott> this languageconverter thing is changing simplified chinese to traditional, and vice-versa? ie, mainland-to-taiwan and back?
[10:13] <gwicke> cscott: there are four variants for Chinese I think
[10:13] <gwicke> Serbian and some other languages have variants too
[10:14] <gwicke> marktraceur: so my reading is that normally convert() is used, which only converts marked-up blocks (-{ }-)
[10:15] <gwicke> except for search, which uses autoconvert directly
[10:16] <gwicke> the conversion is also lossy, but less ambiguous when converting from traditional to simplified for example
[10:16] <gwicke> now the question is how we should represent all this in the DOM
[10:20] *** edsanders|away is now known as edsanders.
[10:20] <gwicke> on one hand it would be nice to abstract the issue, but with the conversion being lossy that does not seem to be possible without preserving the original (potentially mixed-variant) text

liangent added a comment. Via Conduit, Mar 29 2013, 11:23 AM

(In reply to comment #9)

> A few notes from IRC:

Let me explain more:

The main entry point should be convertTo(), with convert() as a shortcut that uses the "preferred" (= automatically guessed from the request) variant. It accepts an almost-parsed HTML document (a string) with -{}- markup embedded.

convertTo() is just a loader. It calls recursiveConvert* afterwards, which parses the -{}- syntax and breaks the text into pieces based on the -{}- markup. These pieces are fed into autoConvert().

autoConvert() extracts the text snippets that actually need conversion (with HTML tags, <code> blocks etc. excluded, but "title" attributes in HTML tags included again...), then sends these snippets to translate().

translate() finally transforms the text using a strtr()-like mechanism.
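
The call chain described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the PHP LanguageConverter code: the function names mirror the ones mentioned in the comment, the rule table ZH_HANT is a made-up toy dictionary, only the plain -{text}- form (text protected from conversion) is handled, nesting is ignored, and the "title" attribute case is left out.

```python
import re

# Hypothetical toy rule table (simplified -> traditional); not real converter data.
ZH_HANT = {"发": "發", "头发": "頭髮", "计算机": "計算機"}

# Plain, non-nested -{ ... }- blocks; flags and per-variant rules are not parsed here.
BLOCK = re.compile(r"-\{(.*?)\}-", re.S)
# <code> elements and any other HTML tag are treated as markup to skip.
MARKUP = re.compile(r"(<code\b.*?</code>|<[^>]*>)", re.S)


def translate(text, table):
    """strtr()-like replacement: longest key wins, replaced text is not rescanned."""
    if not table:
        return text
    keys = sorted(table, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in keys))
    return pattern.sub(lambda m: table[m.group(0)], text)


def auto_convert(html, table):
    """Convert only the text between tags; markup pieces pass through unchanged."""
    parts = MARKUP.split(html)
    # With one capturing group, odd indexes hold the matched markup.
    return "".join(p if i % 2 else translate(p, table) for i, p in enumerate(parts))


def convert_to(html, table):
    """Split on -{ }- blocks; block contents are emitted as-is, the rest is converted."""
    parts = BLOCK.split(html)
    return "".join(p if i % 2 else auto_convert(p, table) for i, p in enumerate(parts))


print(convert_to("头发<code>头发</code>-{头发}-发", ZH_HANT))
# prints: 頭髮<code>头发</code>头发發
```

Sorting the keys by length before building the alternation is what gives the longest-match behavior of strtr(); a chain of str.replace() calls would risk converting already converted text again.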

liangent added a comment. Via Conduit, Mar 30 2013, 8:42 AM

(In reply to comment #12)

> Some more info:
> http://www.mediawiki.org/wiki/Parsoid/Language_conversion

Maybe you want to avoid pasting IPs in those join-messages onto the wiki next time. :)

GWicke added a comment. Via Conduit, Mar 30 2013, 4:06 PM

The channel is public anyway, but pasting them on the wiki certainly makes it easier to search for names. It might be a good idea for you to get an IRC hostmask cloak, so that the IP does not show up in IRC logs.

liangent added a comment. Via Conduit, Mar 30 2013, 4:09 PM

(In reply to comment #14)

> The channel is public anyway, but pasting them on the wiki certainly makes it easier to search for names. It might be a good idea for you to get an IRC hostmask cloak, so that the IP does not show up in IRC logs.

I already have one, but I often see this happening:

[09:23] --> spectie has joined this channel (~fran@***).
[09:23] <-- spectie has left this server (Changing host).
[09:23] --> spectie has joined this channel (~fran@unaffiliated/spectie).

I guess it happens when the user runs /msg nickserv identify xxx after they join the channel, and the sequence is usually decided by their IRC client.

liangent added a comment. Via Conduit, Jul 1 2013, 9:43 PM

About global state of dictionaries: the table affected by -{H| }- is used for link & categorylink resolution too. We may want to keep this behavior.

Aklapper added a comment. Via Conduit, Jul 4 2013, 10:33 AM

[Parsoid component reorg by merging JS/General and General. See bug 50685 for more information. Filter bugmail on this comment. parsoidreorg20130704]

liangent added a comment. Via Conduit, Jul 5 2013, 8:54 PM

(In reply to comment #16)

> About global state of dictionaries: the table affected by -{H| }- is used for link & categorylink resolution too. We may want to keep this behavior.

One thing more about -{H| }-: the current behavior is that it only affects text after it, and this behavior is sometimes deliberately used. We may want to keep it.

GWicke added a comment. Via Conduit, Jul 9 2013, 2:40 AM

(In reply to comment #18)

> (In reply to comment #16)
> > About global state of dictionaries: the table affected by -{H| }- is used for link & categorylink resolution too. We may want to keep this behavior.
>
> One thing more about -{H| }-: the current behavior is that it only affects text after it, and this behavior is sometimes deliberately used. We may want to keep it.

For us mutable global state is very hard to support in any sane way. Having page-global dictionary definitions or self-contained manual conversions is fine, but changing global state in the middle of the page (even from a dynamically changing template) conflicts with a lot of optimizations and is hard to represent in a UI.
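
The order dependence under discussion can be made concrete with a small Python sketch. It is a deliberately simplified model, not the real converter: it assumes rules are written as -{H|zh-cn:A; zh-tw:B}-, that such a block produces no output of its own, and that for a target variant every other variant's form maps to the target's form. The names parse_rule and convert_with_hidden_rules are hypothetical.

```python
import re

# Hidden rule blocks of the assumed form -{H|variant:text; variant:text}-
H_RULE = re.compile(r"-\{H\|(.*?)\}-", re.S)


def parse_rule(body, target):
    """Turn 'zh-cn:A; zh-tw:B' into {A: B} for target 'zh-tw' (simplified)."""
    forms = {}
    for item in body.split(";"):
        if ":" in item:
            variant, text = item.split(":", 1)
            forms[variant.strip()] = text.strip()
    if target not in forms:
        return {}
    return {text: forms[target] for variant, text in forms.items() if variant != target}


def convert_with_hidden_rules(wikitext, target):
    """Walk the page left to right; each rule only affects the text that follows it."""
    table = {}  # mutable state that grows while the page is processed
    out = []
    for i, part in enumerate(H_RULE.split(wikitext)):
        if i % 2:  # odd pieces are rule bodies: update the table, emit nothing
            table.update(parse_rule(part, target))
        else:      # even pieces are text: convert with the table as it stands now
            for src, dst in table.items():
                part = part.replace(src, dst)
            out.append(part)
    return "".join(out)


print(convert_with_hidden_rules("博客 -{H|zh-cn:博客; zh-tw:部落格}-博客", "zh-tw"))
# prints: 博客 部落格
```

The first occurrence stays unconverted because the rule had not been seen yet. The result of converting any fragment therefore depends on everything before it on the page, which is why independently expanded templates are hard to reconcile with this model.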

liangent added a comment. Via Conduit, Jul 14 2013, 5:54 PM

*** Bug 51325 has been marked as a duplicate of this bug. ***

GWicke added a comment. Via Conduit, Jul 18 2013, 12:25 AM

Changed the title back to "Support language variant conversion in Parsoid" as this is not just about the syntax.

liangent added a comment. Via Conduit, Jul 18 2013, 12:26 AM

(In reply to comment #21)

> Changed the title back to "Support language variant conversion in Parsoid" as this is not just about the syntax.

There are too many things, far more than what I mentioned in comment 0, and I may add some separate bugs from time to time... or do you want to use this one as a "meta" bug?

GWicke added a comment. Via Conduit, Jul 18 2013, 1:17 AM

Yes, this is the meta bug that depends on several other bugs (see the "Depends on" field).

Once we have a good overview of the issues we should probably get together to discuss possible solutions. Will you be at Wikimania?

cscott added a comment. Via Conduit, Aug 9 2013, 2:40 AM

See also bug 52661 -- the language converter should be integrated better with the preprocessor, in both PHP and Parsoid.

GWicke added a comment. Via Conduit, Aug 11 2013, 4:09 AM

(In reply to comment #24)

> See also bug 52661 -- the language converter should be integrated better with the preprocessor, in both PHP and Parsoid.

The language converter is actually a post-processor rather than a preprocessor. Why should that change?

liangent added a comment. Via Conduit, Aug 11 2013, 4:41 AM

(In reply to comment #26)

> (In reply to comment #24)
> > See also bug 52661 -- the language converter should be integrated better with the preprocessor, in both PHP and Parsoid.
>
> The language converter is actually a post-processor rather than a preprocessor. Why should that change?

Their point is to have the preprocessor understand this markup, to avoid interpreting it as something else.

cscott added a comment. Via Conduit, Aug 12 2013, 3:31 PM

@GWicke wrt comment 26 -- because it would fix the bugs documented in bug 52661. (In particular, <gallery> is in sad shape right now.)

cscott added a comment. Via Conduit, Aug 15 2013, 4:30 PM

See also http://www.mediawiki.org/wiki/Requests_for_comment/Scoped_language_converter

gwicke has an alternate proposal, which I'm sure he'll link here at some point.

As I understand it, we will parse the language converter markup, and then we will have a post-processing step which will actually apply the rules and markup to convert the text into the desired variant. As discussed (to some extent) in bug 15161, ideally visual editor would present the text in the user's preferred variant and then we would leverage the selser mechanism to ensure a change in variant applies only to the edited portion of the text. Again, ideally DOM blocks would be annotated with the variant they were written in.

GWicke added a comment. Via Conduit, Aug 15 2013, 5:24 PM

A rough outline of my proposal as developed during Wikimania with Liangent is at http://www.mediawiki.org/wiki/Parsoid/MediaWiki_DOM_spec#Language_conversion_blocks. Scott would like to add additional syntax, while I am proposing a two-phase approach that 1) aims at supporting visual editing of existing content and 2) builds the infrastructure for clean language variant conversion based on page-global and category-global rules and then migrates dynamic rule table modifications out of articles and templates.

liangent added a comment. Via Conduit, Aug 15 2013, 6:07 PM

(In reply to comment #30)

> A rough outline of my proposal as developed during Wikimania with Liangent is at http://www.mediawiki.org/wiki/Parsoid/MediaWiki_DOM_spec#Language_conversion_blocks. Scott would like to add additional syntax, while I am proposing a two-phase approach that 1) aims at supporting visual editing of existing content and 2) builds the infrastructure for clean language variant conversion based on page-global and category-global rules and then migrates dynamic rule table modifications out of articles and templates.

However, this plan is only doable after Parsoid becomes the default parser, and the whole migration must be done at exactly the same time as Parsoid becomes so... to keep everything working, because (1) the PHP parser doesn't understand your schema and (2) Parsoid doesn't understand the PHP parser's -{A| }- markup.

GWicke added a comment. Via Conduit, Aug 15 2013, 6:49 PM

(In reply to comment #31)

> However, this plan is only doable after Parsoid becomes the default parser, and the whole migration must be done at exactly the same time as Parsoid becomes so... to keep everything working, because (1) the PHP parser doesn't understand your schema and (2) Parsoid doesn't understand the PHP parser's -{A| }- markup.

No, this does not depend on Parsoid becoming the default parser. It does depend on efficient access to global conversion rules at parse time, which is true for both approaches.

The main difference is that I favor direct (and mostly automatic) migration of rules to versioned page metadata for efficient access and gadget / UI-based editing. The processing model is also designed to be efficient with independent transclusion expansions as done in Parsoid.

Scott prefers to store rules in page and template content instead, and lets rules leak out of templates.

liangent added a comment. Via Conduit, Aug 15 2013, 7:00 PM

(In reply to comment #32)

> (In reply to comment #31)
> > However, this plan is only doable after Parsoid becomes the default parser, and the whole migration must be done at exactly the same time as Parsoid becomes so... to keep everything working, because (1) the PHP parser doesn't understand your schema and (2) Parsoid doesn't understand the PHP parser's -{A| }- markup.
>
> No, this does not depend on Parsoid becoming the default parser. It does depend on efficient access to global conversion rules at parse time, which is true for both approaches.
>
> The main difference is that I favor direct (and mostly automatic) migration of rules to versioned page metadata for efficient access and gadget / UI-based editing. The processing model is also designed to be efficient with independent transclusion expansions as done in Parsoid.
>
> Scott prefers to store rules in page and template content instead, and lets rules leak out of templates.

Scott's "Category" proposal doesn't seem to leak?

GWicke added a comment. Via Conduit, Aug 15 2013, 11:09 PM

(In reply to comment #33)

> Scott's "Category" proposal doesn't seem to leak?

The category variant in Scott's RFC is close to what I have been advocating for a while. He does not rule out leaking of rules out of templates, but mentions the problems associated with doing so. So it might or might not be leaking.

See https://www.mediawiki.org/wiki/Requests_for_comment/Page_and_category_based_language_variant_conversion for a more detailed write-up of my proposal.

cscott added a comment. Via Conduit, Aug 16 2013, 4:50 PM

Can you improve your RFC to specify more precisely the scoping you anticipate for 'global' rules? In particular, it seems that a global rule defined in a page *does* affect the content of templates included in the page (a sort of leak). What happens when a template defines a global rule? Does it get added to the inherited global rules from the parent page, and then apply to any subtemplates?

(FWIW, my Category proposal does state that page-scope templates do not leak -- neither into templates nor up to enclosing context. I'm not 100% sure that's desirable, but that's how it currently reads.)

dchan added a comment. Via Conduit, Sep 11 2013, 8:01 PM

I think we should be *extremely* restrictive about where language rules can leak. This is because they lead to several problems:

(1) Rule changes make it hard to give a faithful real-time view of *any* plaintext.

(2) Rule changes can cause unexpected errors in distant text.

(3) Few people can proofread both zh-Hans and zh-Hant. Therefore, almost anyone who makes an edit will be unable to proofread at least one of the variants it might affect.

On the other hand, leaking currently allows pages to import rules. I think we should preserve this facility but make it more separate.

  1. In general, there should be no leakage: rules should be page-global, and should not leak into or out of templates. This means template *arguments* should be subject to the rules of the page in which they are written, but text generated by a template should not.
  2. As an exception to the "no leakage" rule, there should be a new type of template called a Glossary, whose only purpose is to leak rules into the calling page. That way, language rules are completely separate and independent of any other template behaviour. These Glossaries should be referenced at the top of the page only.
  3. The page which defines a template is free to use rules and Glossaries too. But they will only affect the text generated by the template itself -- they won't leak into any text defined in the calling page. This includes the arguments passed into the template, because they're written in the calling page.

As you can see, this is just cscott's "Global" proposal, but with the additional restriction that the templates that leak rules cannot have any other functionality.

GWicke added a comment. Via Conduit, Sep 13 2013, 7:04 AM

David, Roan, Scott, Subbu and I met in the office to discuss this. Short summary of the plans for the next steps:

  1. Find nesting issues and see if we can fix them up with a bot. Also investigate use cases for markup in variant conversion rules.
  2. Parse all -{ }- syntax and represent it in the DOM. Exact spec TBD in https://www.mediawiki.org/wiki/Parsoid/MediaWiki_DOM_spec#Language_conversion_blocks. Render the default variant according to the fallback chain for output-producing rules.
  3. Enable editing of inline (once-only) rules in the VE. Most rule table modifications seem to be templated and will not be applied, so are not directly relevant. Rules that only modify the table but produce no output directly in page content can be represented as mw:Placeholder and will simply be preserved.

This will make the VE usable for typical editors on variant-enabled wikis without requiring the variant conversion overhaul to be done first.

For the longer-term strategy, we (mostly) agreed on:

  1. Add the capability to associate an ordered list of glossaries with a page. These can either be stored in a separate namespace, or something like Special:Glossary. They should be revision-controlled and machine-readable for processing and UI purposes (JSON).
  2. Add the capability to add page-specific rules that override glossary rules. Only glossaries and global rules associated with the top-level page itself are considered. This makes the set of conversion rules independent of dynamic template expansions.
  3. Apply the combined rule set to the entire page including templated content. Rationales:
    1. Simple mental model
    2. Efficient to implement
    3. Consistent conversion of passed-in content, even if it is massaged further during transclusion expansion
    4. Content in templates (labels, also real content in some infoboxes) themselves can still be protected or converted differently with local inline rules, as is done right now

The details on how this can be implemented depend on whether we reach our goal of implementing multi-part revision storage that we can use for metadata by the next quarter.

PS @David: Conversion rules should be passed into a pure function that converts each template expansion. Nothing at all should leak; otherwise our function would no longer be pure, and we could no longer efficiently update template expansions independently.
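
As a rough illustration of the longer-term model above (an ordered list of glossaries, page-specific overrides, and one combined rule set passed to a pure function per expansion), here is a minimal Python sketch. The function names and the sample glossaries are hypothetical, and the precedence among glossaries (later entries override earlier ones) is an assumption that the summary does not spell out.

```python
import re


def combine_rules(glossaries, page_rules):
    """Merge glossaries in order, then let page-specific rules override them.

    Assumption: later glossaries take precedence over earlier ones.
    """
    combined = {}
    for glossary in glossaries:
        combined.update(glossary)
    combined.update(page_rules)
    return combined


def convert_expansion(text, rules):
    """Pure function: the result depends only on the arguments; rules are never mutated."""
    if not rules:
        return text
    keys = sorted(rules, key=len, reverse=True)  # prefer the longest match
    pattern = re.compile("|".join(re.escape(k) for k in keys))
    return pattern.sub(lambda m: rules[m.group(0)], text)


# Hypothetical data: two glossaries attached to a page plus one page-level override.
glossaries = [{"软件": "軟體"}, {"软件": "軟件", "博客": "網誌"}]
rules = combine_rules(glossaries, {"博客": "部落格"})

# The same rule set is applied to page text and to each template expansion separately.
print(convert_expansion("博客软件", rules))
# prints: 部落格軟件
```

Because the rule set is fixed before any template is expanded, each expansion can be converted, cached, and refreshed on its own.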

liangent added a comment. Via Conduit, Sep 13 2013, 12:09 PM

(In reply to comment #37)

> David, Roan, Scott, Subbu and I met in the office to discuss this. Short summary of the plans for the next steps:
>
> 1. Find nesting issues and see if we can fix them up with a bot. Also investigate use cases for markup in variant conversion rules.

Why do we want to get rid of nested -{}- markup? It's useful in some cases.

See [[模块:Template:地区用词]] which has a wrapper at [[Template:地区用词3]] (proposed replacement for [[Template:地区用词]]).

Try to expand a [[Template:地区用词3]] call and see its result:

{{地区用词3|zh-cn=cn|zh-tw=tw}}

ssastry added a comment. Via Conduit, Sep 13 2013, 3:13 PM

(In reply to comment #38)

> (In reply to comment #37)
> > David, Roan, Scott, Subbu and I met in the office to discuss this. Short summary of the plans for the next steps:
> >
> > 1) Find nesting issues and see if we can fix them up with a bot. Also investigate use cases for markup in variant conversion rules.
>
> Why do we want to get rid of nested -{}- markup? It's useful in some cases.
>
> See [[模块:Template:地区用词]] which has a wrapper at [[Template:地区用词3]] (proposed replacement for [[Template:地区用词]]).
>
> Try to expand a [[Template:地区用词3]] call and see its result:
>
> {{地区用词3|zh-cn=cn|zh-tw=tw}}

Liangent: Gwicke did not fully explain the nesting issue we were talking about.

What we had in mind was use in attributes. Ex:-{zh-cn=<span style='color:red';zh-tw=<span style='color:green'}-foo</span>. We are proposing using a bot to fix this to: <span style='-{zh-cn=color:red;zh-tw=color:green}-'>foo</span>. The rewritten form has the property that all HTML snippets have a well-formed DOM representation whereas the original does not.

liangent added a comment. Via Conduit, Sep 13 2013, 3:54 PM

(In reply to comment #39)

> (In reply to comment #38)
> > (In reply to comment #37)
> > > David, Roan, Scott, Subbu and I met in the office to discuss this. Short summary of the plans for the next steps:
> > >
> > > 1) Find nesting issues and see if we can fix them up with a bot. Also investigate use cases for markup in variant conversion rules.
> >
> > Why do we want to get rid of nested -{}- markup? It's useful in some cases.
> >
> > See [[模块:Template:地区用词]] which has a wrapper at [[Template:地区用词3]] (proposed replacement for [[Template:地区用词]]).
> >
> > Try to expand a [[Template:地区用词3]] call and see its result:
> >
> > {{地区用词3|zh-cn=cn|zh-tw=tw}}
>
> Liangent: Gwicke did not fully explain the nesting issue we were talking about.
>
> What we had in mind was use in attributes. Ex:-{zh-cn=<span style='color:red';zh-tw=<span style='color:green'}-foo</span>. We are proposing using a bot to fix this to: <span style='-{zh-cn=color:red;zh-tw=color:green}-'>foo</span>. The rewritten form has the property that all HTML snippets have a well-formed DOM representation whereas the original does not.

Oh that's what we've discussed before - and that's fine.

gerritbot added a comment. Via Conduit, Oct 8 2013, 10:29 PM

Change 50767 abandoned by GWicke:
Revert "(bug 41716) Add variant config to siprop=general"

Reason:
This ship has sadly sailed. Too late to clean it up I guess. Sigh.

https://gerrit.wikimedia.org/r/50767

cscott added a comment. Via Conduit, Jun 17 2014, 3:49 PM

@liangent: can you describe the use cases for "the other kind" of nested markup?
That is, -{ }- inside -{ }-?

Our proposed DOM tree (https://www.mediawiki.org/wiki/Parsoid/MediaWiki_DOM_spec/Language_conversion_blocks) can handle:

-{ foo -{ bar }- bat }-

but not

foo-{zh-cn:blog -{ nested }-; zh-hk:WEBJOURNAL; zh-tw:WEBLOG;}- quux

etc.

How are nested -{ }- markups of this sort actually used?
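
For readers trying to picture the two shapes above, a balanced scanner that turns -{ }- nesting into a tree might look like the following Python sketch. It is only an illustration: parse_blocks is a hypothetical helper, it does not interpret flags or per-variant rules inside a block, and it says nothing about how such a tree should be represented in the Parsoid DOM, which is the actual open question.

```python
def parse_blocks(text):
    """Return a nested list: strings are plain text, lists are -{ ... }- blocks."""
    root = []
    stack = [root]  # innermost open block is stack[-1]
    buf = []        # plain characters collected since the last delimiter

    def flush():
        if buf:
            stack[-1].append("".join(buf))
            buf.clear()

    i = 0
    while i < len(text):
        if text.startswith("-{", i):
            flush()
            node = []
            stack[-1].append(node)
            stack.append(node)
            i += 2
        elif text.startswith("}-", i) and len(stack) > 1:
            flush()
            stack.pop()
            i += 2
        else:
            # Ordinary character (an unmatched "}-" is also kept as text).
            buf.append(text[i])
            i += 1
    flush()
    return root


print(parse_blocks("-{ foo -{ bar }- bat }-"))
# prints: [[' foo ', [' bar '], ' bat ']]
print(parse_blocks("foo-{zh-cn:blog -{ nested }-; zh-hk:WEBJOURNAL; zh-tw:WEBLOG;}- quux"))
# prints: ['foo', ['zh-cn:blog ', [' nested '], '; zh-hk:WEBJOURNAL; zh-tw:WEBLOG;'], ' quux']
```

In the second case the inner block sits in the middle of one variant's rule text, so splitting the outer block on ";" and ":" has to be aware of the nested node.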

liangent added a comment. Via Conduit, Jun 17 2014, 3:55 PM

They're mostly used in templates. See also [[zh:Template:DISPLAYTITLE]] and [[zh:Module:Template:地区用词]].

gerritbot added a comment. Via Conduit, Jun 17 2014, 10:07 PM

Change 140235 had a related patch set uploaded by Cscott:
WIP: parse language converter markup.

https://gerrit.wikimedia.org/r/140235

cscott added a comment. Via Conduit, Jun 18 2014, 3:33 PM

I've written up some notes about nested conversion blocks and other discoveries at https://www.mediawiki.org/wiki/Parsoid/MediaWiki_DOM_spec/Language_conversion_blocks#Notes

Nemo_bis added a comment. Via Conduit, Sep 25 2014, 8:32 PM

(In reply to Liangent from comment #0)

> Phase 1: Encapsulate conversion syntax (-{}- markup) in non-editable blocks to avoid breakage.

(In reply to Gabriel Wicke from comment #2)

> It would be good to look into implementing phase 1 (recognize and protect language conversion content).

I understand this has some value in itself for PDF export, see bug 34919 comment 17.

Nemo_bis added a comment. Via Conduit, Oct 17 2014, 3:27 PM

(In reply to Nemo from comment #46)

> I understand this has some value in itself for PDF export, see bug 34919 comment 17.

And more were filed, like bug 71815. Should they all depend on this?

Does this really depend on bug 43547? Maybe this should just be converted to a tracking bug so that we're free to add dependencies without hairsplitting.

ssastry moved this task to Needs Discussion on the Parsoid workboard. Via Web, Dec 9 2014, 5:23 PM
MarkTraceur removed a subscriber: MarkTraceur. Via Web, Dec 9 2014, 5:33 PM
ssastry moved this task to Q1 on the Parsoid workboard. Via Web, Feb 2 2015, 2:50 PM
ssastry moved this task to Backlog on the Parsoid workboard. Via Web, May 20 2015, 9:41 PM
gerritbot added a subscriber: gerritbot. Via Conduit, Jun 4 2015, 8:40 PM

Change 50136 abandoned by Jforrester:
(bug 41716) Tokenize language variant conversions

Reason:
Wrong repo now.

https://gerrit.wikimedia.org/r/50136
