Support language variant conversion in Parsoid
Open, Low, Public

Description

I'm not expecting this will happen soon. Just leave this bug here.

The required features are described below. Some may belong to VisualEditor.

Phase 1: Encapsulate conversion syntax (-{}- markup) in non-editable blocks to avoid breakage.

Phase 2: Enable editing of these conversion blocks.

Phase 3: Convert all text in DOM to requested variant, and convert it back to original variant when constructing wikitext. Don't change text to another variant if user doesn't edit that word in DOM. Shadowing whole text may be needed here.


Version: unspecified
Severity: normal

bzimport set Reference to bz41716.
liangent created this task. Nov 2 2012, 10:45 PM

Oh, that will be fun ;)

It would be good to look into implementing phase 1 (recognize and protect language conversion content).

Liangent, can you please link us to documentation about how this works? Initial searches have been less than fruitful.

Next question: Should a construct like {{variantopen}}令{{variantclose}} work (assume it expands to -{令}-)? If not, would it be difficult to phase that construct out as deprecated and go forward with Parsoid not supporting it?

Thanks for your help, we'd love to get Parsoid working with zh-wikis.

(In reply to comment #3)

Liangent, can you please link us to documentation about how this works?
Initial searches have been less than fruitful.

Do you mean how it's done in the PHP parsing process, or what is expected to be done (a specification of the related syntax)?

(In reply to comment #4)

Next question: Should a construct like {{variantopen}}令{{variantclose}} work
(assume it expands to -{令}-)? If not, would it be difficult to phase that
construct out as deprecated and go forward with Parsoid not supporting it?

Thanks for your help, we'd love to get Parsoid working with zh-wikis.

That construct works in the PHP converter.

We really need to know about how this is *supposed* to go, and we need English documentation for it if our team is going to work on it. The current offerings are all in other languages I think.

A few notes from IRC:

[09:59] <gwicke> marktraceur: I browsed the LanguageConverter source a bit
[09:59] <gwicke> there is an autoConvert method that simply converts all text based on a dictionary lookup
[10:00] <gwicke> it only excludes markup and script/code blocks
[10:00] <gwicke> the default search language for Chinese seems to be zh-hans (simplified)
[10:01] <gwicke> am not sure when the special conversion syntax is used in practice
[10:03] --> tewwy has joined this channel (~tychay@wikimedia/Tychay).
[10:03] <gwicke> conversion is restricted to those blocks when using convert() and convertTo()
[10:03] <gwicke> plus special conversion for link targets and headings
[10:04] <gwicke> the conversion itself is performed using autoConvert (the dictionary-based method)
[10:06] * cscott is reading backlog
[10:07] <cscott> yeah, i mentioned getting minority-language buy-in in the meeting yesterday, thinking specifically of how hard it's been to get i18n feedback
[10:07] --> HaeB has joined this channel (~quassel@wikipedia/HochaufeinemBaum).
[10:08] <cscott> this languageconverter thing is changing simplified chinese to traditional, and vice-versa? ie, mainland-to-taiwan and back?
[10:13] <gwicke> cscott: there are four variants for Chinese I think
[10:13] <gwicke> Serbian and some other languages have variants too
[10:14] <gwicke> marktraceur: so my reading is that normally convert() is used, which only converts marked-up blocks (-{ }-)
[10:15] <gwicke> except for search, which uses autoconvert directly
[10:16] <gwicke> the conversion is also lossy, but less ambiguous when converting from traditional to simplified for example
[10:16] <gwicke> now the question is how we should represent all this in the DOM
[10:20] *** edsanders|away is now known as edsanders.
[10:20] <gwicke> on one hand it would be nice to abstract the issue, but with the conversion being lossy that does not seem to be possible without preserving the original (potentially mixed-variant) text

(In reply to comment #9)

A few notes from IRC:

Let me explain more:

The main entry point should be convertTo(), with convert() as a shortcut to use the "preferred" (= automatically guessed from request) variant. It accepts an almost-parsed HTML document (string) with -{}- markups embedded.

convertTo() is just a loader. It then calls recursiveConvert*, which parses the -{}- syntax and breaks the text into pieces at the -{}- markup. These pieces are fed into autoConvert().

autoConvert() extracts the text snippets that actually need conversion (HTML tags, <code> blocks etc. are excluded, but "title" attributes inside HTML tags are included again...), then sends these snippets to translate().

Finally, translate() transforms the text using a strtr()-like mechanism.
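To make the pipeline described above easier to follow, here is a minimal Python sketch of it. It is not the PHP LanguageConverter code: the function names only mirror the methods mentioned above, the table is made up, -{ }- blocks are treated as plain "do not convert" spans (flags and per-variant rules are ignored), and the "title" attribute handling is left out.

```python
import re

# Hypothetical conversion table for one target variant (illustration only).
TABLE = {"blog": "weblog", "log": "journal"}

def translate(text, table):
    """strtr()-like replacement: at each position the longest matching key
    wins, and replaced text is never scanned again."""
    keys = sorted(table, key=len, reverse=True)
    out, i = [], 0
    while i < len(text):
        for key in keys:
            if key and text.startswith(key, i):
                out.append(table[key])
                i += len(key)
                break
        else:
            out.append(text[i])
            i += 1
    return "".join(out)

# HTML tags and whole <code> blocks are excluded from conversion.
SKIP = re.compile(r"(<code\b.*?</code>|<[^>]*>)", re.DOTALL | re.IGNORECASE)

def auto_convert(html, table):
    parts = SKIP.split(html)
    # With one capturing group, odd indices hold the skipped markup.
    return "".join(p if i % 2 else translate(p, table)
                   for i, p in enumerate(parts))

# Simplification: no nesting, no flags, no per-variant rules.
MARKUP = re.compile(r"-\{(.*?)\}-", re.DOTALL)

def convert_to(html, table):
    parts = MARKUP.split(html)
    # Odd indices are the contents of -{ }- blocks, emitted unconverted.
    return "".join(p if i % 2 else auto_convert(p, table)
                   for i, p in enumerate(parts))
```

With this table, convert_to("my blog -{blog}-", TABLE) converts only the first "blog". The longest-match rule in translate() is what keeps "blog" from being rewritten through the shorter "log" entry.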

(In reply to comment #12)

Some more info:
http://www.mediawiki.org/wiki/Parsoid/Language_conversion

Maybe you want to avoid pasting IPs in those join-messages onto the wiki next time. :)

The channel is public anyway, but pasting them on the wiki certainly makes it easier to search for names. It might be a good idea for you to get an IRC hostmask cloak, so that the IP does not show up in IRC logs.

(In reply to comment #14)

The channel is public anyway, but pasting them on the wiki certainly makes it
easier to search for names. It might be a good idea for you to get an IRC
hostmask cloak, so that the IP does not show up in IRC logs.

I already have one, but I often see this happening:

[09:23] --> spectie has joined this channel (~fran@***).
[09:23] <-- spectie has left this server (Changing host).
[09:23] --> spectie has joined this channel (~fran@unaffiliated/spectie).

I guess it happens when users /msg nickserv identify xxx after they join the channel, and the sequence is usually decided by their IRC client.

About global state of dictionaries: the table affected by -{H| }- is used for link & categorylink resolution too. We may want to keep this behavior.

[Parsoid component reorg by merging JS/General and General. See bug 50685 for more information. Filter bugmail on this comment. parsoidreorg20130704]

(In reply to comment #16)

About global state of dictionaries: the table affected by -{H| }- is used for
link & categorylink resolution too. We may want to keep this behavior.

One more thing about -{H| }-: currently it only affects the text after it, and this behavior is sometimes used deliberately. We may want to keep it.

GWicke added a comment. Jul 9 2013, 2:40 AM

(In reply to comment #18)

(In reply to comment #16)
> About global state of dictionaries: the table affected by -{H| }- is used for
> link & categorylink resolution too. We may want to keep this behavior.

One thing more about -{H| }-: the current behavior is that it only affects
text after it and this behavior is sometimes deliberately used. We may want to
keep it.

For us mutable global state is very hard to support in any sane way. Having page-global dictionary definitions or self-contained manual conversions is fine, but changing global state in the middle of the page (even from a dynamically changing template) conflicts with a lot of optimizations and is hard to represent in a UI.

*** Bug 51325 has been marked as a duplicate of this bug. ***

Changed the title back to "Support language variant conversion in Parsoid" as this is not just about the syntax.

(In reply to comment #21)

Changed the title back to "Support language variant conversion in Parsoid" as
this is not just about the syntax.

There are too many things involved, far more than what I mentioned in comment 0, and I may add some separate bugs from time to time... or do you want to use this one as a "meta" bug?

Yes, this is the meta bug that depends on several other bugs (see the "Depends on" field).

Once we have a good overview of the issues we should probably get together to discuss possible solutions. Will you be at Wikimania?

cscott added a comment. Aug 9 2013, 2:40 AM

See also bug 52661 -- the language converter should be integrated better with the preprocessor, in both PHP and Parsoid.

(In reply to comment #24)

See also bug 52661 -- the language converter should be integrated better with
the preprocessor, in both PHP and Parsoid.

The language converter is actually a post-processor rather than a preprocessor. Why should that change?

(In reply to comment #26)

(In reply to comment #24)
> See also bug 52661 -- the language converter should be integrated better with
> the preprocessor, in both PHP and Parsoid.

The language converter is actually a post-processor rather than a
preprocessor. Why should that change?

Their point is to have the preprocessor understand that markup, to avoid interpreting it as something else.

@GWicke wrt comment 26 -- because it would fix the bugs documented in bug 52661. (In particular, <gallery> is in sad shape right now.)

See also http://www.mediawiki.org/wiki/Requests_for_comment/Scoped_language_converter

gwicke has an alternate proposal, which I'm sure he'll link here at some point.

As I understand it, we will parse the language converter markup, and then we will have a post-processing step which will actually apply the rules and markup to convert the text into the desired variant. As discussed (to some extent) in bug 15161, ideally visual editor would present the text in the user's preferred variant and then we would leverage the selser mechanism to ensure a change in variant applies only to the edited portion of the text. Again, ideally DOM blocks would be annotated with the variant they were written in.

A rough outline of my proposal as developed during Wikimania with Liangent is at http://www.mediawiki.org/wiki/Parsoid/MediaWiki_DOM_spec#Language_conversion_blocks. Scott would like to add additional syntax, while I am proposing a two-phase approach that 1) aims at supporting visual editing of existing content and 2) builds the infrastructure for clean language variant conversion based on page-global and category-global rules and then migrates dynamic rule table modifications out of articles and templates.

(In reply to comment #30)

A rough outline of my proposal as developed during Wikimania with Liangent is
at http://www.mediawiki.org/wiki/Parsoid/MediaWiki_DOM_spec#Language_conversion_blocks.
Scott would like to add additional syntax, while I am proposing a two-phase
approach that 1) aims at supporting visual editing of existing content and 2)
builds the infrastructure for clean language variant conversion based on
page-global and category-global rules and then migrates dynamic rule table
modifications out of articles and templates.

However, this plan is only doable after Parsoid becomes the default parser, and the whole migration must happen at exactly the same time as that switch... to keep everything working. This is because (1) the PHP parser doesn't understand your schema and (2) Parsoid doesn't understand the PHP parser's -{A| }- markup.

(In reply to comment #31)

However this plan is only doable after Parsoid become the default parser, and
all migration process must be done at exactly the same time as Parsoid
becoming so... to keep everything working. Because (1) PHP parser doesn't
understand your schema (2) Parsoid doesn't understand PHP parser's -{A| }-
markups.

No, this is not depending on Parsoid becoming the default parser. It does depend on efficient access to global conversion rules at parse time, which is true for both approaches.

The main difference is that I favor direct (and mostly automatic) migration of rules to versioned page metadata for efficient access and gadget / UI-based editing. The processing model is also designed to be efficient with independent transclusion expansions as done in Parsoid.

Scott prefers to store rules in page and template content instead, and lets rules leak out of templates.

(In reply to comment #32)

(In reply to comment #31)
> However this plan is only doable after Parsoid become the default parser, and
> all migration process must be done at exactly the same time as Parsoid
> becoming so... to keep everything working. Because (1) PHP parser doesn't
> understand your schema (2) Parsoid doesn't understand PHP parser's -{A| }-
> markups.

No, this is not depending on Parsoid becoming the default parser. It does
depend on efficient access to global conversion rules at parse time, which is
true for both approaches.

The main difference is that I favor direct (and mostly automatic) migration
of rules to versioned page metadata for efficient access and gadget / UI-based
editing. The processing model is also designed to be efficient with
independent transclusion expansions as done in Parsoid.

Scott prefers to store rules in page and template content instead, and lets
rules leak out of templates.

Scott's "Category" proposal doesn't seem to leak?

(In reply to comment #33)

Scott's '"Category" Proposal' seems not leaking?

The category variant in Scott's RFC is close to what I have been advocating for a while. He does not rule out leaking of rules out of templates, but mentions the problems associated with doing so. So it might or might not be leaking.

See https://www.mediawiki.org/wiki/Requests_for_comment/Page_and_category_based_language_variant_conversion for a more detailed write-up of my proposal.

Can you improve your RFC to specify more precisely the scoping you anticipate for 'global' rules? In particular, it seems that a global rule defined in a page *does* affect the content of templates included in the page (a sort of leak). What happens when a template defines a global rule? Does it get added to the inherited global rules from the parent page, and does it then apply to any subtemplates?

(FWIW, my Category proposal does state that page-scope templates do not leak -- neither into templates nor up to the enclosing context. I'm not 100% sure that's desirable, but that's how it currently reads.)

dchan added a comment. Sep 11 2013, 8:01 PM

I think we should be *extremely* restrictive about where language rules can leak. This is because they lead to several problems:

(1) Rule changes make it hard to give a faithful real-time view of *any* plaintext.

(2) Rule changes can cause unexpected errors in distant text.

(3) Few people can proofread both zh-Hans and zh-Hant. Therefore, almost anyone who makes an edit will be unable to proofread at least one of the variants it might affect.

On the other hand, leaking currently allows pages to import rules. I think we should preserve this facility but make it more separate.

  1. In general, there should be no leakage: rules should be page-global, and should not leak into or out of templates. This means template *arguments* should be subject to the rules of the page in which they are written, but text generated by a template should not.
  2. As an exception to the "no leakage" rule, there should be a new type of template called a Glossary, whose only purpose is to leak rules into the calling page. That way, language rules are completely separate and independent of any other template behaviour. These Glossaries should be referenced at the top of the page only.
  3. The page which defines a template is free to use rules and Glossaries too. But they will only affect the text generated by the template itself -- they won't leak into any text defined in the calling page. This includes the arguments passed into the template, because they're written in the calling page.

As you can see, this is just cscott's "Global" proposal, but with the additional restriction that the templates that leak rules cannot have any other functionality.
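The "no leakage" scoping described in this comment can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: convert() is a naive substring replacement standing in for real variant conversion, a template body has a single hypothetical {arg} placeholder, and each rule set is assumed to already include whatever its Glossaries contribute.

```python
def convert(text, rules):
    """Naive stand-in for variant conversion: plain substring replacement."""
    for source, target in rules.items():
        text = text.replace(source, target)
    return text

def expand_template(body, argument, template_rules, page_rules):
    """Expand a template with one {arg} placeholder (hypothetical syntax).

    Text written in the template is converted with the template's own rules;
    the argument, written in the calling page, uses the page's rules.
    """
    before, _, after = body.partition("{arg}")
    return (convert(before, template_rules)
            + convert(argument, page_rules)
            + convert(after, template_rules))
```

For example, with template rules {"colour": "color"} and page rules {"lift": "elevator"}, expanding the body "colour lift: {arg}" with the argument "colour lift" gives "color lift: colour elevator": neither rule set reaches text written on the other side.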

David, Roan, Scott, Subbu and I met in the office to discuss this. Short summary of the plans for the next steps:

  1. Find nesting issues and see if we can fix them up with a bot. Also investigate use cases for markup in variant conversion rules.
  2. Parse all -{ }- syntax and represent it in the DOM. Exact spec TBD in https://www.mediawiki.org/wiki/Parsoid/MediaWiki_DOM_spec#Language_conversion_blocks. Render the default variant according to the fallback chain for output-producing rules.
  3. Enable editing of inline (once-only) rules in the VE. Most rule table modifications seem to be templated and will not be applied, so are not directly relevant. Rules that only modify the table but produce no output directly in page content can be represented as mw:Placeholder and will simply be preserved.

This will make the VE usable for typical editors on variant-enabled wikis without requiring the variant conversion overhaul to be done first.

For the longer-term strategy, we (mostly) agreed on:

  1. Add the capability to associate an ordered list of glossaries with a page. These can either be stored in a separate namespace, or something like Special:Glossary. They should be revision-controlled and machine-readable for processing and UI purposes (JSON).
  2. Add the capability to add page-specific rules that override glossary rules. Only glossaries and global rules associated with the top-level page itself are considered. This makes the set of conversion rules independent of dynamic template expansions.
  3. Apply the combined rule set to the entire page including templated content. Rationales:
     • Simple mental model
     • Efficient to implement
     • Consistent conversion of passed-in content, even if it is massaged further during transclusion expansion
     • Content in templates (labels, also real content in some infoboxes) themselves can still be protected or converted differently with local inline rules, as is done right now
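As a rough illustration of steps 1 and 2, the combined rule set could be built as in the Python sketch below. The JSON layout (a flat object mapping source text to replacement text) and the precedence between glossaries (later ones override earlier ones) are assumptions made for the example; the thread only states that the list is ordered and that page-specific rules override glossary rules.

```python
import json

def combined_rules(glossaries_json, page_rules):
    """Merge an ordered list of JSON glossaries, then apply page overrides.

    Assumption: each glossary is a JSON object of {source: replacement},
    and later glossaries take precedence over earlier ones.
    """
    merged = {}
    for raw in glossaries_json:
        merged.update(json.loads(raw))
    merged.update(page_rules)  # page-specific rules always win
    return merged
```

Because the result depends only on the top-level page's glossaries and rules, it can be computed once and then applied to the whole page, templated content included.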

The details on how this can be implemented depend on whether we reach our goal of implementing multi-part revision storage that we can use for metadata by the next quarter.

PS @David: Conversion rules should be passed into a pure function that converts each template expansion. Nothing at all should leak- otherwise our function would no longer be pure, and we could no longer efficiently update template expansions independently.
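A minimal Python sketch of the "pure function" idea follows. The names and the naive replacement are hypothetical and this is not Parsoid code; the point is only that when the rules are an explicit, immutable argument, an expansion's converted output depends on nothing else and can be cached or recomputed independently.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def convert_expansion(expansion_html, rules):
    """Convert one template expansion.

    `rules` is a tuple of (source, target) pairs: immutable and hashable,
    so the result depends only on the two arguments.
    """
    for source, target in rules:
        expansion_html = expansion_html.replace(source, target)
    return expansion_html
```

If a rule inside one expansion could change the table used by later text, the cache key would have to include everything that precedes the expansion on the page, which is the efficiency problem described above.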

(In reply to comment #37)

David, Roan, Scott, Subbu and me met in the office to discuss this. Short
summary of the plans for the next steps:

  1. Find nesting issues and see if we can fix them up with a bot. Also investigate use cases for markup in variant conversion rules.

Why do we want to get rid of nested -{}- markups? It's useful in some cases.

See [[模块:Template:地区用词]] which has a wrapper at [[Template:地区用词3]] (proposed replacement for [[Template:地区用词]]).

Try to expand a [[Template:地区用词3]] call and see its result:

{{地区用词3|zh-cn=cn|zh-tw=tw}}

(In reply to comment #38)

(In reply to comment #37)
> David, Roan, Scott, Subbu and me met in the office to discuss this. Short
> summary of the plans for the next steps:
>
> 1) Find nesting issues and see if we can fix them up with a bot. Also
> investigate use cases for markup in variant conversion rules.

Why do we want to get rid of nested -{}- markups? It's useful in some cases.

See [[模块:Template:地区用词]] which has a wrapper at [[Template:地区用词3]] (proposed
replacement for [[Template:地区用词]]).

Try to expand a [[Template:地区用词3]] call and see its result:

{{地区用词3|zh-cn=cn|zh-tw=tw}}

Liangent: Gwicke did not fully explain the nesting issue we were talking about.

What we had in mind was use in attributes. Example: -{zh-cn=<span style='color:red';zh-tw=<span style='color:green'}-foo</span>. We are proposing to use a bot to rewrite this as: <span style='-{zh-cn=color:red;zh-tw=color:green}-'>foo</span>. The rewritten form has the property that every HTML snippet has a well-formed DOM representation, whereas the original does not.

(In reply to comment #39)

(In reply to comment #38)
> (In reply to comment #37)
> > David, Roan, Scott, Subbu and me met in the office to discuss this. Short
> > summary of the plans for the next steps:
> >
> > 1) Find nesting issues and see if we can fix them up with a bot. Also
> > investigate use cases for markup in variant conversion rules.
>
> Why do we want to get rid of nested -{}- markups? It's useful in some cases.
>
> See [[模块:Template:地区用词]] which has a wrapper at [[Template:地区用词3]] (proposed
> replacement for [[Template:地区用词]]).
>
> Try to expand a [[Template:地区用词3]] call and see its result:
>
> {{地区用词3|zh-cn=cn|zh-tw=tw}}

Liangent: Gwicke did not fully explain the nesting issue we were talking
about.

What we had in mind was use in attributes.
Ex:-{zh-cn=<span style='color:red';zh-tw=<span style='color:green'}-foo</span>.
We are proposing using a bot to fix this to:
<span style='-{zh-cn=color:red;zh-tw=color:green}-'>foo</span>.
The rewritten form has the property that all HTML snippets have a well-formed
DOM representation whereas the original does not.

Oh that's what we've discussed before - and that's fine.

Change 50767 abandoned by GWicke:
Revert "(bug 41716) Add variant config to siprop=general"

Reason:
This ship has sadly sailed. Too late to clean it up I guess. Sigh.

https://gerrit.wikimedia.org/r/50767

@liangent: can you describe the use cases for "the other kind" of nested markup?
That is, -{ }- inside -{ }-?

Our proposed DOM tree (https://www.mediawiki.org/wiki/Parsoid/MediaWiki_DOM_spec/Language_conversion_blocks) can handle:

-{ foo -{ bar }- bat }-

but not

foo-{zh-cn:blog -{ nested }-; zh-hk:WEBJOURNAL; zh-tw:WEBLOG;}- quux

etc.

How are nested -{ }- markups of this sort actually used?
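To make the two nesting cases concrete, here is a small Python sketch that turns -{ }- markup into a nested list. It is not the Parsoid tokenizer and ignores flags and rule syntax entirely; it only shows where a nested block ends up.

```python
def parse(text):
    """Return a tree: strings are plain text, lists are -{ }- blocks."""
    root = []
    stack = [root]  # innermost open block is stack[-1]
    buf = []

    def flush():
        if buf:
            stack[-1].append("".join(buf))
            buf.clear()

    i = 0
    while i < len(text):
        if text.startswith("-{", i):
            flush()
            block = []
            stack[-1].append(block)
            stack.append(block)
            i += 2
        elif text.startswith("}-", i) and len(stack) > 1:
            flush()
            stack.pop()
            i += 2
        else:
            buf.append(text[i])
            i += 1
    flush()
    return root
```

parse("-{ foo -{ bar }- bat }-") yields a block nested between two plain-text runs of the outer block. For the second example above, the nested block lands in the middle of the text of the zh-cn rule, so the rule's replacement text is no longer a plain string.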

They're mostly used in templates. See also [[zh:Template:DISPLAYTITLE]] and [[zh:Module:Template:地区用词]].

Change 140235 had a related patch set uploaded by Cscott:
WIP: parse language converter markup.

https://gerrit.wikimedia.org/r/140235

I've written up some notes about nested conversion blocks and other discoveries at https://www.mediawiki.org/wiki/Parsoid/MediaWiki_DOM_spec/Language_conversion_blocks#Notes

(In reply to Liangent from comment #0)

Phase 1: Capsule conversion syntax (-{}- markups) into non-editable blocks
to avoid breakage.

(In reply to Gabriel Wicke from comment #2)

It would be good to look into implementing phase 1 (recognize and protect
language conversion content).

I understand this has some value in itself for PDF export, see bug 34919 comment 17.

(In reply to Nemo from comment #46)

I understand this has some value in itself for PDF export, see bug 34919
comment 17.

And more were filed, like bug 71815. Should they all depend on this?

Does this really depend on bug 43547? Maybe this should just be converted to a tracking bug so that we're free to add dependencies without hairsplitting.

ssastry moved this task from Backlog to Needs Discussion on the Parsoid board. Dec 9 2014, 5:23 PM
ssastry moved this task from Needs Discussion to In Progress on the Parsoid board. Feb 2 2015, 2:50 PM
ssastry moved this task from In Progress to Backlog on the Parsoid board. May 20 2015, 9:41 PM

Change 50136 abandoned by Jforrester:
(bug 41716) Tokenize language variant conversions

Reason:
Wrong repo now.

https://gerrit.wikimedia.org/r/50136

Restricted Application added a subscriber: Aklapper. Oct 25 2015, 8:36 AM
jmadler added a subscriber: jmadler. Jan 6 2016, 5:13 AM
GWicke added a subscriber: bearND. Jan 6 2016, 6:27 PM
