MediaWiki needs a fictitious variant for English for easier variant development work
Closed, Resolved, Public

Description

Let's start 2013 with a fun bug.

It would be very useful when working on the skins or other bits of the interface.

Currently, to test variant handling, you have to switch the entire wiki to a language that uses variant conversion. Unfortunately, it seems that all of them use non-Latin alphabets, which makes working with them considerably harder if you don't know them than it would be if they at least used a Latin script.

With this, you would just set $wgUsePigLatinVariant (or something similar) to true and be able to test.

Of course we could use piratey or lolcat language instead of pig latin, but pig latin seems the nicest to me (and the easiest to implement).

Details

Reference: bz43547
bzimport raised the priority of this task from to Low.
bzimport set Reference to bz43547.
bzimport added a subscriber: Unknown Object (MLST).
matmarex created this task. Jan 1 2013, 12:24 AM

Please don't mark bugs as resolved (particularly resolved/wontfix) without a comment. Re-opening for now.

Dereckson added a comment. Edited Jan 1 2013, 12:38 AM

If we want to explore this option, I would recommend taking a look at two small applications, jive and valspeak, as they are very easy to implement.

I have to note that these applications could violate modern expectations of decency and politeness.

For example, the jive application [1] is built with the following code:

jive.c
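#include <stdio.h>

/* yylex() comes from the lex-generated scanner; here it is declared to
   return the next translated chunk of text as a string. */
char *yylex();

int main()
{
        char *line;
        /* Print each chunk until the scanner returns a null pointer. */
        while ((line = yylex()) != NULL) {
                printf("%s", line);
        }
        return 0;
}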

And this lex list:
http://stuff.mit.edu/afs/sipb/user/rfrench/src/jive/jive.l

The valspeak lex list:
https://groups.google.com/group/net.sources/msg/be14e2cfcdf7eb06?dmode=source

[1] https://groups.google.com/group/net.sources/tree/browse_frm/month/1986-10/e5cbd0d14f430065?rnum=81&_done=/group/net.sources/browse_frm/month/1986-10?&pli=1

Strangely, the jive application made it into the FreeBSD ports, but valspeak did not.

(In reply to comment #1)

Please don't mark bugs as resolved (particularly resolved/wontfix) without a comment.

Comment: looks like what you want is rather bug 38486.

(In reply to comment #3)

Comment: looks like what you want is rather bug 38486.

While slightly similar, this doesn't seem like the same thing to me (although solutions to one of them could solve the other as well, in some cases).

This bug is from the perspective of a skin developer; that one is from the perspective of an extension developer. I went through some pain to implement the interface for variants in CologneBlue when I was rewriting it, and I've been sitting on this idea since.

Well, I'm not sure. I think we should just try implementing something ;) (I could work on this pig Latin variant if there was a reasonable chance of it getting merged).

[Reopening for now; maybe it's a dupe, but certainly not a wontfix, is it?]

The language converter works on chars or substrings instead of (space-delimited) words, so I don't think it's easy to adapt a word-based conversion here (correct me if I'm wrong). Since this is not a serious variant development, maybe we can choose l33t or something similar instead?

Hey, just a converter between UPPERCASE and lowercase is enough.

Btw, in my experience Serbian is best for testing things when your eyes are used to a Latin script.

The converter works abstractly on text not individual characters. So I'd think it should be completely fine to do word based conversions.

Heck, I'm already doing word based conversions. I hacked the variant system with a custom language to make it so that character names in a certain animanga wiki can vary by what version of the series you've been accustomed to (The original romaji, manga translation, subtitled translations, dub translations, and translations of alternate dubs can have different translations for the same character). Though a recent upgrade may have broke some of the code I used to hackily load the language variants in to mw in the first place.

Note that we have another bug contemplating the idea of implementing US, Canada, UK, etc... English as variants.

(In reply to comment #9)

The converter works abstractly on text not individual characters. So I'd think it should be completely fine to do word based conversions.

Heck, I'm already doing word based conversions. I hacked the variant system with a custom language to make it so that character names in a certain animanga wiki can vary by what version of the series you've been accustomed to (The original romaji, manga translation, subtitled translations, dub translations, and translations of alternate dubs can have different translations for the same character). Though a recent upgrade may have broke some of the code I used to hackily load the language variants in to mw in the first place.

Note that we have another bug contemplating the idea of implementing US, Canada, UK, etc... English as variants.

Yeah you have to override LanguageConverter::translate() then, and can't make use of the default conversion table holder (ReplacementArray).
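
To make the distinction in this exchange concrete, here is a minimal sketch in Python (not MediaWiki code). It contrasts a table-driven conversion, where a fixed lookup table is enough, with a word-based one, where the output depends on the structure of each word and so cannot be written as a fixed table. All names here (UPPER_TABLE, table_convert, pig_latin_word, word_convert) are hypothetical, the table only imitates the idea of a replacement table rather than ReplacementArray itself, and the pig latin rules are one common convention, not necessarily the ones a real variant would use.

```python
import re

# Table-driven conversion: a fixed character -> character mapping
# (a toy lowercase -> UPPERCASE "variant").
UPPER_TABLE = {chr(c): chr(c).upper() for c in range(ord("a"), ord("z") + 1)}


def table_convert(text, table):
    """Replace every character found in the table; leave the rest alone."""
    return "".join(table.get(ch, ch) for ch in text)


VOWELS = "aeiouAEIOU"
WORD_RE = re.compile(r"[A-Za-z]+")


def pig_latin_word(word):
    """Convert one ASCII word using a simple, assumed set of pig latin rules."""
    # Vowel-initial words just get "way" appended.
    if word[0] in VOWELS:
        return word + "way"
    # Otherwise find the leading consonant cluster.
    for i, ch in enumerate(word):
        if ch in VOWELS:
            head, tail = word[:i], word[i:]
            break
    else:
        return word + "ay"  # no vowel at all
    converted = tail + head.lower() + "ay"
    # Keep the initial capital where the original word had one.
    if word[0].isupper():
        converted = converted[0].upper() + converted[1:]
    return converted


def word_convert(text):
    """Word-based conversion: only runs of ASCII letters are rewritten."""
    return WORD_RE.sub(lambda m: pig_latin_word(m.group(0)), text)
```

With these definitions, table_convert("Main Page", UPPER_TABLE) gives "MAIN PAGE", while word_convert("Main Page") gives "Ainmay Agepay". The second result needs per-word logic, which is the part that has to live in an overridden translation step.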

Change 72053 had a related patch set uploaded by Liangent:
(bug 43547) New language variant en-x-piglatin DO NOT MERGE

https://gerrit.wikimedia.org/r/72053

(In reply to comment #9)

..
Note that we have another bug contemplating the idea of implementing US, Canada, UK, etc... English as variants.

bug 31015? It would be useful to start that set of variants, with just the basic transforms; the implementation of a viable set of conversions could then develop over time.

Liuxinyu970226 set Security to None. Oct 2 2015, 1:21 PM
Restricted Application added a subscriber: Aklapper. Oct 2 2015, 1:21 PM

Change 280927 had a related patch set uploaded (by Nemo bis):
Convert all content text into Pig Latin

https://gerrit.wikimedia.org/r/280927

He7d3r updated the task description. Apr 2 2016, 3:06 PM

Change 280927 abandoned by Kaldari:
Convert all content text into Pig Latin

Reason:
April Fools Day is over.

https://gerrit.wikimedia.org/r/280927

Change 72053 had a related patch set uploaded (by Bartosz Dziewoński):
New language variant 'en-x-piglatin' for easier variant testing

https://gerrit.wikimedia.org/r/72053

Relevant Gerrit comment today from @cscott:

Rebased. I'll see if I can fix some of the problems here. RobLa is interested in this for T143628.

(In reply to comment #9)

The converter works abstractly on text not individual characters. So I'd think it should be completely fine to do word based conversions.

Heck, I'm already doing word based conversions. I hacked the variant system with a custom language to make it so that character names in a certain animanga wiki can vary by what version of the series you've been accustomed to (The original romaji, manga translation, subtitled translations, dub translations, and translations of alternate dubs can have different translations for the same character). Though a recent upgrade may have broke some of the code I used to hackily load the language variants in to mw in the first place.

Note that we have another bug contemplating the idea of implementing US, Canada, UK, etc... English as variants.

Yeah you have to override LanguageConverter::translate() then, and can't make use of the default conversion table holder (ReplacementArray).

I'm interested in exploring Liangent's comment above as well -- there are really good cross-language word-segmentation tools in libicu, for example: http://php.net/manual/en/class.intlbreakiterator.php

(In fact there are transliteration tools there, too: http://icu-project.org/apiref/icu4c/classicu_1_1Transliterator.html , which might be adequate replacements for some of our hand-rolled latin-cyrillic conversions.)

It would be nice to move some of the guts of language converter out to ICU, which is broadly maintained and can be bound from just about every programming language imaginable (including nodeJS, cf https://www.npmjs.com/package/icu-bidi). That would make it easier for the wider world to interoperate with the language converter functionality, and for us to get more contributions from the wider world for obscurer languages.

It seems that what Liangent is saying is that word-based conversion is technically possible, but the abstract facilities present don't support it, so you have to "roll your own". So I guess I'm proposing to build out an abstract superclass for "word-based conversion" which implements the abstract word-splitting functions using the ICU break iterator functions. And then x-piglatin -- as well as the other European transliterations -- could inherit from the "word-based converter" class.
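
The shape of that proposal can be sketched as follows, in Python rather than PHP and purely as an illustration. WordBasedConverter, segments, convert_word and both subclasses are hypothetical names, and the regular expression is only a stand-in for a real word-break iterator such as ICU's; nothing here reflects the actual LanguageConverter class hierarchy.

```python
import re
from abc import ABC, abstractmethod


class WordBasedConverter(ABC):
    """Hypothetical base class: segments text, then converts words only."""

    # Stand-in for a proper break iterator: alternating runs of word
    # characters and non-word characters.
    _SEGMENT_RE = re.compile(r"\w+|\W+")
    _WORD_RE = re.compile(r"\w")

    def segments(self, text):
        """Split text into word and non-word segments, losing nothing."""
        return self._SEGMENT_RE.findall(text)

    @abstractmethod
    def convert_word(self, word):
        """Return the converted form of a single word."""

    def translate(self, text):
        out = []
        for seg in self.segments(text):
            # Only word segments go through the variant-specific hook;
            # spaces and punctuation are copied through unchanged.
            out.append(self.convert_word(seg) if self._WORD_RE.match(seg) else seg)
        return "".join(out)


class UppercaseConverter(WordBasedConverter):
    """Trivial variant: UPPERCASE every word."""

    def convert_word(self, word):
        return word.upper()


class PigLatinConverter(WordBasedConverter):
    """Toy pig latin variant using one simple, assumed rule set."""

    def convert_word(self, word):
        # Leave numbers and non-ASCII words untouched.
        if not (word.isascii() and word.isalpha()):
            return word
        # Length of the leading consonant cluster (may be zero).
        cut = re.match(r"[^aeiouAEIOU]*", word).end()
        head, tail = word[:cut], word[cut:]
        if not head:
            return word + "way"
        return tail + head.lower() + "ay"
```

Under these assumptions, PigLatinConverter().translate("the skin, 2013") yields "ethay inskay, 2013". Each variant only supplies convert_word, and swapping the regular expression for a real break iterator would change segments alone.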

Assuming that ICU's break iterator does something sensible for Chinese, perhaps the existing zhwiki code could be refactored to use the new abstract class as well.

And, looking forward, I'd like to use something like the ICU's sentence-break-iterator function to do selective (re)serialization when editing in a variant. So if you edit in your variant, the sentence containing your edit would get saved "the way you saw it when you edited it", but the other sentences in the paragraph would be left alone. This is slightly finer-grained selective serialization than Parsoid currently does -- Parsoid currently looks at HTML nodes, so at the paragraph (<p> tag) level, not down at the sentence level. (Under the covers I might implement this by wrapping each sentence in a synthetic <span> tag using the ICU sentence-break-iterator, using Parsoid's selective serialization, then stripping the synthetic <span>s.)
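
As a rough illustration of the sentence-level idea (not of how Parsoid works), the sketch below compares the text the editor was shown with the text they saved, sentence by sentence, and keeps the original source for every sentence they did not touch. The names split_sentences and selective_save are hypothetical, the regular expression is a naive stand-in for a sentence-break iterator, and matching sentences by position is a simplifying assumption.

```python
import re

# Naive sentence splitter: a sentence ends at ., ! or ? followed by
# whitespace, or at the end of the text. Trailing whitespace stays attached
# to its sentence so the pieces can be rejoined without loss.
_SENTENCE_RE = re.compile(r".+?(?:[.!?]+\s+|$)", re.S)


def split_sentences(text):
    return _SENTENCE_RE.findall(text)


def selective_save(original, rendered, edited):
    """Return the text to save after an edit made in a variant.

    original: source text as stored
    rendered: what the editor was shown (original converted to their variant)
    edited:   what the editor submitted
    """
    orig_s = split_sentences(original)
    rend_s = split_sentences(rendered)
    edit_s = split_sentences(edited)
    # If sentences were added or removed, positions no longer line up;
    # fall back to saving the whole edit as submitted.
    if not (len(orig_s) == len(rend_s) == len(edit_s)):
        return edited
    # Untouched sentences keep their original form; touched ones are saved
    # the way the editor saw and changed them.
    return "".join(
        o if e == r else e for o, r, e in zip(orig_s, rend_s, edit_s)
    )
```

For example, with original "The skin works. The parser is slow.", an all-uppercase rendering of it, and an edit that changes only "SLOW" to "FAST", selective_save returns "The skin works. THE PARSER IS FAST.".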

Jdforrester-WMF closed this task as Resolved. Jun 14 2017, 11:50 PM
Jdforrester-WMF assigned this task to liangent.

Change 72053 merged by jenkins-bot:
[mediawiki/core@master] New language variant 'en-x-piglatin' for easier variant testing

https://gerrit.wikimedia.org/r/72053

Hooray!

It should be noted that while the patch was written by @liangent, it took a lot of Parser/LanguageConverter work from @cscott for the English-language parser tests to stop failing with a language converter active. This is why it took us three years to merge this :)

Johan added a subscriber: Johan. Jun 15 2017, 9:33 PM

@cscott Just making sure, since it was marked with user-notice (thank you, always appreciated): Is this relevant to folks who wouldn't be reached by an email to engineering-l and/or wikitech-l? I.e. folks who deal with local scripts on their specific wikis and so on?

@Johan Perhaps. I thought it might be a fun filler if you have space, since pig latin is fun! But it's also the first step in teaching more English-literate developers how the variant conversion system in MediaWiki works. Low priority for tech news, but maybe there will be a slow week for news.

OK, I'll revisit it tomorrow and decide. Thank you for tagging it either way. (:

(Not too many options, but already more technical than usual this week.)