(Develop spec for) Test wiki for English language variants
Open, Needs Triage, Public

Description

A test wiki for English language variants would allow developers who don't know a language where variants are common to experience the feature. More is described on this page:

https://www.mediawiki.org/wiki/User:RobLa-WMF/Variants

(As of this writing in August 2016, this task is about fleshing out the page above; it will turn into a request once we're ready to request something.)

Event Timeline

@cscott hinted on IRC that Pig Latin support may be appropriate for this wiki. See T17161#2354695 for the associated complexities.

Existing patchset for Pig Latin support: https://gerrit.wikimedia.org/r/72053 (probably partially code-rotted)

An alternative mechanism to handle variant conversion via translation: https://meta.wikimedia.org/wiki/Grants:IdeaLab/Amazing_Article_Annotations. It should be fairly easy to create a machine translation service from English to Pig Latin. The opposite direction would need some "real" MT to resolve ambiguities, but simple bigram or trigram analysis should be sufficient in 99% of cases. For example, appleway -> apple, not wapple, because only the former is a real English word.
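The appleway example can be sketched in a few lines. This is only an illustration, not the code in patchset 72053. It assumes one common Pig Latin convention (vowel-initial words get "way" appended; otherwise the leading consonant cluster moves to the end and "ay" is appended). The function names and the tiny lexicon are made up for the example. It performs only the dictionary check; bigram or trigram scoring would still be needed when more than one candidate is a real word.

```python
VOWELS = set("aeiou")


def to_pig_latin(word):
    """Encode one lowercase word using the convention assumed above."""
    if word[0] in VOWELS:
        return word + "way"
    i = 0
    while i < len(word) and word[i] not in VOWELS:
        i += 1
    return word[i:] + word[:i] + "ay"


def candidates(pig):
    """Return every string that encodes to `pig` under to_pig_latin."""
    found = set()
    if not pig.endswith("ay"):
        return found
    # Case 1: the original started with a vowel and "way" was appended.
    if pig.endswith("way") and len(pig) > 3 and pig[0] in VOWELS:
        found.add(pig[:-3])
    # Case 2: a leading consonant cluster was moved to the end before "ay".
    stem = pig[:-2]
    for k in range(1, len(stem) + 1):
        if stem[-k] in VOWELS:
            break
        found.add(stem[-k:] + stem[:-k])
    # Keep only candidates that really round-trip.
    return {c for c in found if to_pig_latin(c) == pig}


def from_pig_latin(pig, lexicon):
    """Filter the candidates against a word list (hypothetical lexicon)."""
    return sorted(c for c in candidates(pig) if c in lexicon)
```

Here candidates("appleway") contains both apple and wapple, and filtering against a lexicon that contains apple leaves only apple.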

Next step might be to ping liangent about 72053 and see if we can resuscitate pig latin support, now that we have a concrete use case for it?

Personally, though, this is lower priority for me until I have some compelling language conversion features to demonstrate. In the same vein, the new "pig latin" wiki will initially have very little content: even if we import articles from enwiki, few of them will have languageconverter markup, so few will be suitable for demonstration purposes.

So perhaps another necessary step is to figure out what exactly we want to demonstrate and what content we need in order to do so?

We have several multi-lingual wikis (commons, wikidata, meta, etc). My understanding is that none of them support variants, nor do we have plans to. It seems we either need to fish or cut bait with variants. So few of us speak languages that use the feature that it's hard to have an informed conversation about which we should choose:
a) make our variant support more robust
b) deprecate variants, and migrate variant-using wikis to something else

My anglo-centric bias is "let's have an English wiki that we can use to tinker with this feature".

An example of a "more robust" version of this feature would be to tie it to wikidata, so that instead of:

The burger came with -{en-GB:chips,en-US:fries}-

...we could also have:

The burger came with -{en-GB:chips,wikidata:Q152088}-

...and then use Wikidata as a glossary hub for variants.
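A rough sketch of how such a lookup might resolve, purely as an illustration. The comma-separated form mirrors the examples in this comment, and the wikidata: key is the proposal here, not existing syntax. The GLOSSARY dict is a placeholder standing in for a Wikidata query (its entry was not fetched from Wikidata), and the function name is hypothetical.

```python
import re

# Placeholder glossary standing in for a Wikidata lookup (item id -> variant -> term).
# The entry below is illustrative only and was not fetched from Wikidata.
GLOSSARY = {"Q152088": {"en-GB": "chips", "en-US": "fries"}}

MARKUP = re.compile(r"-\{(.*?)\}-")


def render(text, variant, glossary=GLOSSARY):
    """Replace each -{...}- span with the term for `variant`, if one is known."""

    def pick(match):
        rules = {}
        for part in match.group(1).split(","):
            key, _, value = part.partition(":")
            rules[key.strip()] = value.strip()
        if variant in rules:
            # Explicit per-variant text in the markup wins.
            return rules[variant]
        item = rules.get("wikidata")
        if item in glossary and variant in glossary[item]:
            # Otherwise fall back to the glossary hub.
            return glossary[item][variant]
        # Unknown variant: leave the markup untouched.
        return match.group(0)

    return MARKUP.sub(pick, text)
```

With this sketch, an en-GB reader gets the inline term, and an en-US reader gets whatever the glossary entry supplies, so editors would not need to list every variant by hand.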

It would be kinda wild if we could deploy tech that would make this problem easier for editors of enwiki:
https://en.wikipedia.org/wiki/Wikipedia:Manual_of_Style#National_varieties_of_English

Here's an interesting orthogonal consideration: language converter is turned on by default on all wikis, but it only parses language converter markup (-{ ... }-) if a variant is defined for the current page language. Since English has no variants, -{ }- on enwiki is not parsed as language converter markup; it appears as literal text.

As soon as you define a variant for english, like pig latin, the -{ }- markup would change its rendering on enwiki.

There is a mechanism to disable variants ($wgDisabledVariants), but disabling a variant does not prevent language converter from being enabled. (Perhaps core could be changed to do so.)

There is also a mechanism to disable language converter altogether ($wgDisableLangConversion). If we go ahead with pig-latin as the "test wiki" version of language converter, then to avoid changing the rendering of existing pages we might also have to (a) explicitly disable language converter in production (see the next comment for why this would be a bad idea!), or (b) disable the en-pig-latin variant in production and patch core to look for the presence of non-disabled variants before doing language conversion.

Or (my preference) we could (c) separate out the "parse language converter markup" decision from the "is a variant defined for this language", and first patch core to *always parse* language converter markup when $wgDisableLangConversion is false. My gut sense is that this shouldn't change the rendering of many pages, and those pages could be identified and fixed up if they matter. But it would eliminate a troublesome dependency where, in order to properly parse a page, we need to know both the currently defined page language *and* the set of variants currently defined in mediawiki-core. (And this set of variants may change over time, complicating archiving.)
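To make the difference between the current behavior and options (b) and (c) concrete, here is a toy model of the "should this page's -{ }- markup be parsed?" decision. It is a sketch of the behavior as described in this comment, not MediaWiki's actual code. The class, field, and function names are invented for the example, and the variant code en-pig-latin is just the name used in this thread.

```python
from dataclasses import dataclass, field


@dataclass
class WikiConfig:
    """Toy model of the settings discussed above; not MediaWiki's real config object."""
    # Stands in for $wgDisableLangConversion.
    disable_lang_conversion: bool = False
    # Stands in for $wgDisabledVariants.
    disabled_variants: set = field(default_factory=set)
    # Page language -> variant codes defined in core.
    variants: dict = field(default_factory=dict)


def parses_today(cfg, page_lang):
    """Behavior as described above: parse only if the page language has a variant defined."""
    return not cfg.disable_lang_conversion and bool(cfg.variants.get(page_lang))


def parses_option_b(cfg, page_lang):
    """Option (b): ignore disabled variants when deciding whether to parse."""
    active = [v for v in cfg.variants.get(page_lang, []) if v not in cfg.disabled_variants]
    return not cfg.disable_lang_conversion and bool(active)


def parses_option_c(cfg, page_lang):
    """Option (c): always parse unless conversion is disabled outright."""
    return not cfg.disable_lang_conversion
```

Under option (c) the answer no longer depends on the page language or on the set of variants defined in core, which is the dependency this comment argues should go away.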

> We have several multi-lingual wikis (commons, wikidata, meta, etc). My understanding is that none of them support variants, nor do we have plans to.

Not quite true. My understanding from looking at the wiki configuration is that *all* of our wikis have language variants enabled. However, the markup is only parsed if the page language corresponds to a language with a variant defined in core.

So a Foo page on meta wiki may *appear* not to have language converter turned on, but the translation page Foo/zh (for example) will have language converter enabled. See for example: https://www.mediawiki.org/wiki/Help:Extension:Translate/Translation_example/zh
which displays the language converter's "variant" drop down in the top menu.

> An example of a "more robust" version of this feature would be to tie it to wikidata, so that instead of:
>
> The burger came with -{en-GB:chips,en-US:fries}-
>
> ...we could also have:
>
> The burger came with -{en-GB:chips,wikidata:Q152088}-
>
> ...and then use Wikidata as a glossary hub for variants.

My version of a "more robust" variants feature would be to tie it to better support for machine translation, and to refactor our existing script conversion PHP code into "trivial" translation engines from (say) sr-ec to sr-el, etc. Then every time a change is made to the sr-ec wiki, "machine translation" steps in to suggest an equivalent change to sr-el, and this whole workflow is polished hard until it gleams brighter than the existing workflow using language converter.

A side effect is that maintaining minority language wikis based on edits to a majority language wiki, where the translation step is more interesting than simple script conversion, *also* becomes easier. And the reverse is true as well -- ideally the enwiki coverage of non-American and non-European places/concepts improves a lot as well because the enwiki editors can easily adopt machine-translated chunks from the local wikis.

> It would be kinda wild if we could deploy tech that would make this problem easier for editors of enwiki:
> https://en.wikipedia.org/wiki/Wikipedia:Manual_of_Style#National_varieties_of_English

As you know, I have repeatedly suggested that our moral compass for "is variant support good enough" is "would enwiki be willing to use it for American, British, Indian, Australian, etc English". If the workflow is too complicated for enwiki, it is high hubris to force it on the rest of the world. (See "polished hard until it gleams" above.)

But that's only my own perspective.

@cscott, so is this a good TL;DR summary?

Language variants syntax works as a "parser-extension" that is enabled on a per-page basis. You are proposing that it should be enabled on a per-wiki basis.

That's a fine TL;DR summary. Not expected to affect many pages, unless -{ ... }- syntax is used for its literal characters on existing pages. A wiki grep should determine whether that's the case.
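A rough sketch of what such a grep could look like, assuming the page text is already available as an in-memory mapping of title to wikitext (a real check would run against a dump or a search index). The pattern and function name are made up for the example, and the pattern is only an approximation: it does not know about nowiki sections, comments, or other contexts where the characters would be inert anyway.

```python
import re

# Non-greedy match for a "-{" ... "}-" span, allowed to cross line breaks.
CONVERTER_SPAN = re.compile(r"-\{.*?\}-", re.DOTALL)


def find_converter_spans(pages):
    """Map page title -> list of -{...}- spans found in its wikitext."""
    hits = {}
    for title, wikitext in pages.items():
        spans = CONVERTER_SPAN.findall(wikitext)
        if spans:
            hits[title] = spans
    return hits
```

Pages that show up in the result are the ones whose rendering could change if the markup were always parsed, so they are the ones worth reviewing by hand.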

Yes, I would be surprised if many pages were affected by the proposed change (which makes sense to me as well). This merits a separate ticket so the test wiki discussion on this ticket retains its focus. That would also let people flag any use cases for the current behavior that we aren't accounting for.

> Language variants syntax works as a "parser-extension" that is enabled on a per-page basis. You are proposing that it should be enabled on a per-wiki basis.

Is that a blocker to enabling pig latin (i.e. $wgUsePigLatinVariant = true;) on some test wiki (e.g. a new piglatin.wikimedia.beta.wmflabs.org)? Even if it's a per-page opt-in, it would be quite useful to be able to test language variant behavior somewhere, without having to learn Chinese.

Change 438079 had a related patch set uploaded (by C. Scott Ananian; owner: C. Scott Ananian):
[operations/mediawiki-config@master] Enable testing LanguageConverter in sandboxes on deploymentwiki

https://gerrit.wikimedia.org/r/438079