
Multilingual JavaScript
Open, Needs Triage, Public, Feature

Description

It would be nice if global templates had a scripting language that was truly multilingual. By this I mean that every keyword, comment, variable name, and method call could be localized.

The only programming languages I'm aware of with this property are Scratch, Blockly, and eToys. (See https://www.wired.com/story/coding-is-for-everyoneas-long-as-you-speak-english/ )

However, I believe we can make a variant of JavaScript which is fully multilingual, and use this to write Scribunto modules (T61101).

Event Timeline

cscott created this task. Aug 17 2019, 8:57 PM
Restricted Application added a subscriber: Aklapper. Aug 17 2019, 8:57 PM
cscott added a comment. Edited Aug 17 2019, 9:05 PM

The basic idea is to remap:

var foo = "bar";
var obj = {};
obj.bar = 3;
return obj[foo];

to

const $1 = Symbol("bar");

var $2 = $1;
var $3 = {};
$3[$1] = 3;
return $3[$2];

Note that string constants (including the implicit strings used by dot notation, since a.b is the same as a["b"]) are mapped into JavaScript Symbol objects.
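As a quick check of the remapping above, here is a minimal sketch that wraps both forms in functions (an addition for illustration only, so that `return` is legal) and runs them in standard JavaScript. The function names `original` and `remapped` are hypothetical.

```javascript
// Original form, wrapped in a function so that `return` is legal.
function original() {
  var foo = "bar";
  var obj = {};
  obj.bar = 3;
  return obj[foo];
}

// Remapped form: the string "bar" becomes one shared Symbol, and every
// identifier is replaced by an abstract name.
function remapped() {
  const $1 = Symbol("bar");
  var $2 = $1;
  var $3 = {};
  $3[$1] = 3;
  return $3[$2];
}

console.log(original(), remapped()); // 3 3
```

Both forms look up the same property, because $2 holds the very Symbol that was used as the key.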

The various $1/$2/$3 strings can be remapped into localized names:

$1 = en-us: bar, en-x-piglatin: arbay
$2 = en-us: foo, en-x-piglatin: oofay
$3 = en-us: obj, en-x-piglatin: objay

When this remapping is done, distinct Symbols might end up with the same localized name; the remapping adds suffixes (_1, _2, etc.) as necessary to tell them apart. It is the abstracted version (with $1, $2, etc.) which is actually executed; we are thus guaranteed that every localized version will produce the same results.
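The suffixing step could look roughly like the sketch below. This is not the actual implementation: the table layout, the function name `localizedNames`, and the extra entry `$4` (a second, distinct Symbol that happens to share a translation with `$1`) are all assumptions made for illustration.

```javascript
// Hypothetical translation table: abstract id -> { locale: name }.
// $4 is an invented entry that collides with $1 in every locale.
const translations = {
  $1: { 'en-us': 'bar', 'en-x-piglatin': 'arbay' },
  $2: { 'en-us': 'foo', 'en-x-piglatin': 'oofay' },
  $3: { 'en-us': 'obj', 'en-x-piglatin': 'objay' },
  $4: { 'en-us': 'bar', 'en-x-piglatin': 'arbay' },
};

// Give each abstract id a display name in `locale`, appending _1, _2, ...
// whenever the preferred name has already been handed out.
function localizedNames(table, locale) {
  const used = new Set();
  const result = {};
  for (const id of Object.keys(table)) {
    const base = table[id][locale] ?? id; // untranslated ids keep their abstract name
    let name = base;
    for (let n = 1; used.has(name); n++) {
      name = `${base}_${n}`;
    }
    used.add(name);
    result[id] = name;
  }
  return result;
}

const piglatinNames = localizedNames(translations, 'en-x-piglatin');
console.log(piglatinNames);
```

With this table, `$1` keeps the name `arbay` and `$4` is shown as `arbay_1`; the executed code still refers to `$1` and `$4`, so the display names never affect behaviour.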

@cscott: Is this the same as T150417: Allow users to code in localized programming languages and should be merged, or what's the difference?

I've added this as a subtask of T150417. That task seemed to focus specifically on localizing Lua; I'm looking at JavaScript in this task. Because we don't (yet) have a Scribunto API for JavaScript (T61101), we're not quite as constrained by backwards compatibility of existing code.

You could also imagine this JavaScript dialect being used for (say) VisualEditor and other client-side code, but in that case we'd have an existing code base to port, which would constrain the problem more. I think for this task I don't need to be compatible with anything (including non-localized JavaScript), and I can design the Scribunto/template API specifically to facilitate localization.

Anomie added a subscriber: Anomie.

I'm inclined to agree that this is a duplicate of T150417. Whether Lua or JavaScript, there are some significant complications to this sort of thing when you're dealing with something that's fundamentally a text file, which is probably why the three examples are all programmed using a custom editor oriented around dragging and dropping "blocks" instead, so the human-readable names really are arbitrary labels and the editor always knows exactly which code-object everything refers to.

There was more discussion of this at https://groups.google.com/a/wikimedia.org/forum/#!topic/parsing-team/PvZ6wAjYnVE, but since that is a private list I hesitate to copy wholesale from there. I'll try to copy my replies to relevant points without revealing anyone else's comments.


https://www.wired.com/story/coding-is-for-everyoneas-long-as-you-speak-english/

I note that the one example it gives regarding wikitext comes down to just the fact that you can use arbitrary scripts for template parameters when creating a template, rather than pointing to anything in wikitext itself like the #REDIRECT keyword.

The article goes into some detail about the transition from Latin to vernacular languages making it easier for people to learn to read and write. But it overlooks the downside. When everything is written in Latin, you only have to know Latin and you can read anything anyone else wrote and anyone else can read what you write. When everyone writes in their own vernacular, what you write in English can't be read by someone who only reads Italian, Russian, and/or Chinese (and the same goes for most other pairs of languages).

Yes, programming languages are more structured than human languages, particularly when you're translating the same language with different keywords rather than something like JS to PHP (which the article glosses as trivial, while from the Parsoid-PHP project we found it wasn't). But we still sort of have that Tower of Babel problem even in wikitext. If you write your wikitext with the canonical English keywords it'll work when copied to any other wiki. But if you're on dewiki and use German keywords, you'll run into problems if someone tries to copy your wikitext to a wiki in a different language.

The obvious solution to that problem would be to let every language's keywords work in every other language. But then you can run into collisions where different languages would want to use the same word for different things. To use HTML as an example, look at the <p> tag, meaning "paragraph". Cribbing from https://en.wiktionary.org/wiki/paragraph#Translations, in Dutch, Polish, and Norwegian you might want that tag to be <a> instead (short for alinea, akapit, and avsnitt), in Hungarian you might want <b> (bekezdés). Meanwhile, from https://en.wiktionary.org/wiki/bold#Translations, Estonian might want to use <p> (paks) to mean "bold". Then, too, depending on your programming language you might run into collisions for situations like where an English programmer wants to name a variable holding a paragraph "para" but it errors out because "para" is reserved as the Spanish translation of the "for" keyword.

Or you could have each source file declare its language, and include a feature to translate the keywords (probably a tokenize-and-reserialize) if someone wants to read it in a different language. But the keywords aren't usually the big problem. Lua only has 21 listed at https://www.lua.org/manual/5.1/manual.html#2.1, for example. Most of the translation comes in libraries (and also in documentation and error reporting): you need to also translate every class name, method name, parameter name, and so on. And your language needs to integrate robust i18n, to make it extremely easy for user-written libraries (including embedded text strings) to also be translated, and you need a community that will actually create all those translations of everything.
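A tokenize-and-reserialize pass for keywords might be sketched as follows. This is only an illustration under stated assumptions: the Spanish keyword table is invented (only "para" for "for" comes from the discussion above), the tokenizer handles just double-quoted strings and identifier-like words, and the names `tokenize`, `translateKeywords`, and `collidingIdentifiers` are hypothetical.

```javascript
// Hypothetical keyword table (English -> Spanish), for illustration only.
const keywordsEs = new Map([
  ['for', 'para'],
  ['if', 'si'],
  ['return', 'devolver'],
]);

// Very rough tokenizer: double-quoted strings, identifier-like words
// (any Unicode letters), runs of whitespace, or any other single character.
const TOKEN = /"(?:[^"\\]|\\.)*"|[\p{L}_$][\p{L}\p{N}_$]*|\s+|./gsu;

function tokenize(source) {
  return source.match(TOKEN) ?? [];
}

// Swap keywords, leaving string literals and everything else untouched.
function translateKeywords(source, table) {
  return tokenize(source)
    .map((tok) => (table.has(tok) ? table.get(tok) : tok))
    .join('');
}

// Words in the source that would read as keywords after translation.
function collidingIdentifiers(source, table) {
  const translated = new Set(table.values());
  return [...new Set(tokenize(source).filter((tok) => translated.has(tok)))];
}

console.log(translateKeywords('for (;;) { return "for"; }', keywordsEs));
console.log(collidingIdentifiers('var para = 1;', keywordsEs));
```

The first call rewrites the two keywords but leaves the string literal "for" alone; the second reports `para`, the collision described earlier, where an English programmer's variable name matches a translated keyword. As noted, this only covers keywords; library, method, and parameter names are the larger part of the problem.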


and you need a community that will actually create all those translations of everything.

[something that boils down to "we already have translatewiki"]

True. But would that community scale to translating all the additional things? Including the more technical context, versus free text intended for the general reader?

Plus translation of code names has considerations that text doesn't, including avoiding collisions and avoiding breaking references if a translation changes (cf. T209211).


The basic idea is to remap:

var foo = "bar";
var obj = {};
obj.bar = 3;
return obj[foo];

to

const $1 = Symbol("bar");
var $2 = $1;
var $3 = {};
$3[$1] = 3;
return $3[$2];

So what does someone actually editing code in this language see? The latter seems obfuscated almost to the point of incomprehensibility. If you try to translate to the former for display and then back for storage, how does the server know how to map things back to the abstract representation? Say the original contained a function fooBar() and the submitted edit has fooBaz() instead. Did the user intend to rename the function, or did they delete fooBar() and add fooBaz() as two separate changes in one edit?

Or even in the example given, if that string "bar" is used elsewhere too, should a change to it here be reflected everywhere else too? Or what happens if the code is like

var func = "somePrefix" + type;
return obj[func]();

rather than just a strange string constant?

cscott added a comment. Edited Aug 19 2019, 7:03 PM

and you need a community that will actually create all those translations of everything.

Not everything will be translated, but everything should be translatable. That's the mantra. No built-in barriers.

Plus translation of code names has considerations that text doesn't, including avoiding collisions and avoiding breaking references if a translation changes (cf. T209211).

Taken care of by the compiler. Suffixes are automatically added as necessary to disambiguate; the underlying representation makes clear when two identifiers which happen to have identical translations are distinct, and that is enforced by the runtime. (Symbol('foo') !== Symbol('foo')).

So what does someone actually editing code in this language see?

They see their native language.

The translated code is what is executed.

Or even in the example given, if that string "bar" is used elsewhere too, should a change to it here be reflected everywhere else too? Or what happens if the code is like

var func = "somePrefix" + type;
return obj[func]();

rather than just a strange string constant?

This is prevented by the runtime type system. What looks like a string is actually a Symbol; you can't add symbols or concatenate them.
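For reference, this is how Symbols behave in standard JavaScript today; the proposed dialect may of course differ. The variable names are arbitrary.

```javascript
// Two Symbols with the same description are still distinct values.
const symA = Symbol('foo');
const symB = Symbol('foo');
console.log(symA === symB); // false

// Implicit concatenation with a Symbol throws a TypeError.
let concatError = null;
try {
  const func = 'somePrefix' + symA;
  console.log(func); // not reached
} catch (e) {
  concatError = e;
}
console.log(concatError instanceof TypeError); // true
```

Note that standard JavaScript still allows explicit conversion, for example `String(symA)` or `symA.description`, so a dialect relying on this property would have to decide how to treat those escape hatches.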

This looks very much like JavaScript, but it's not quite JavaScript. That's why it's best to keep this distinct from T150417---allowing a compatibility break is a lot easier than trying to localize already-existing code.

I'll hack together a working demo at some point, which should make things clearer.

Not everything will be translated, but everything should be translatable. That's the mantra. No built-in barriers.

If only keywords and one or two libraries are translated, you wind up with a confusing mix of things in English and another language. Besides remembering the names of things, you also have to remember which language the concept is expressed in. At what point of incompleteness does barely-done i18n become worse than no i18n?

Plus translation of code names has considerations that text doesn't, including avoiding collisions and avoiding breaking references if a translation changes (cf. T209211).

Taken care of by the compiler. Suffixes are automatically added as necessary to disambiguate; the underlying representation makes clear when two identifiers which happen to have identical translations are distinct, and that is enforced by the runtime. (Symbol('foo') !== Symbol('foo')).

So you'll see code full of "foo1" versus "foo2" in your editor, and have to remember to use that any place where you'd normally type "foo"? But only in this file; it might be the other way around (or even unsuffixed) in a different module. And hopefully you don't already have a "foo2" that would collide with the suffixed "foo". Or I suppose it might be "foo_from_LibraryABC" and "foo_from_LibraryXYZ" all over the place, which at least mitigates some of that.

I suppose going the other way you're hoping that strong typing will allow parse-time static analysis to determine which "symbol" everything refers to. Here's a test case for you along those lines:

var func;
if ( something ) {
     func = "foo";
} else {
     func = "bar";
}
printf( "Calling function %s\n", func ); // or console.log or whatever
objOfClassABC[func]();
objOfClassXYZ[func]();

One variable, but not the same symbol being used. I suppose when I reopened that file your thing would somehow have had to transform it into something like

var func_for_ClassABC;
var func_for_ClassXYZ;
if ( something ) {
     func_for_ClassABC = ClassABC.symbolTable.foo;
     func_for_ClassXYZ = ClassXYZ.symbolTable.foo;
} else {
     func_for_ClassABC = ClassABC.symbolTable.bar;
     func_for_ClassXYZ = ClassXYZ.symbolTable.bar;
}
printf( "Calling function %s\n", func_??? ); // I have no idea what you'd do here
objOfClassABC[func_for_ClassABC]();
objOfClassXYZ[func_for_ClassXYZ]();

although how it'll manage to correctly turn one "func" into two in every case I don't know.

Or even in the example given, if that string "bar" is used elsewhere too, should a change to it here be reflected everywhere else too? Or what happens if the code is like

var func = "somePrefix" + type;
return obj[func]();

rather than just a strange string constant?

This is prevented by the runtime type system. What looks like a string is actually a Symbol; you can't add symbols or concatenate them.

You're proposing a language with no string concatenation? That seems nearly unusable for something to be used with wikitext.

Or else you're proposing a language with some "strings" that are strings and others that are "Symbols", with no differentiating them at definition-time? That seems pretty bad too. At least make these "Symbols" be explicitly declared as such somehow.

This looks very much like JavaScript, but it's not quite JavaScript.

At what point do you lose enough of JS that calling it "JavaScript" is as bad as JavaScript including "Java" in its name back in the 90s? ;)

I'll hack together a working demo at some point, which should make things clearer.

Good luck. I think you'll need it. ;)

kchapman added subscribers: CCicalese_WMF, kchapman.

Dev Advocacy, is this something you all want to push?

@CCicalese_WMF could you review this?

daniel changed the subtype of this task from "Task" to "Feature Request". Aug 20 2019, 6:10 PM

Dev Advocacy is this something you all want to push?

Speaking for myself as a human independent of my role on the Technical Engagement team... I think the idea of translatable/remappable source code to better support folks from various language backgrounds is great.

And I think that inventing a system locally within the Wikimedia movement is a really problematic idea. In my opinion, we already do too many "special" things in MediaWiki and Wikimedia which create barriers for learning. Not being able to search the internet for questions/answers, blog posts, books, videos created by the much larger software development community makes us special in a bad way rather than a good way. The Wikimedia movement does not have enough technical writing and curricula development specialists to support the technology that we have developed to date. Adding a net new thing that is as complex as localized programming languages in my opinion will lead to a new and complex system that has no external support and thus a huge burden for adoption. As a related example, look to Hack and Go as "new" programming languages and try to measure the number of calendar years and human years that have been needed to create them vs the same measures needed for them to reach any amount of "mainstream" adoption as documentation and education materials lag behind the leading edge of the technical implementation work.

I completely believe that we could muster sufficient engineering skills and hours to implement something, but I am very skeptical that we could then carry out the slower and more difficult work of creating the needed corpus of training and support materials, linters, testing frameworks, etc. that would be needed to make the new language variant(s) viable for widespread adoption within the Wikimedia wikis, let alone the larger MediaWiki universe.

Dev Advocacy is this something you all want to push?

Not sure what "push" implies. While I like the proposed outcome I cannot judge what the implications / maintenance costs of the proposed implementation would be.

cscott added a comment. Edited Aug 22 2019, 7:14 PM

This is prevented by the runtime type system. What looks like a string is actually a Symbol; you can't add symbols or concatenate them.

You're proposing a language with no string concatenation? That seems nearly unusable for something to be used with wikitext.

Deliberately so: T114454: [RFC] Visual Templates: Authoring templates with Visual Editor enforces code/data/layout separation so template "code" should never be manipulating wikitext strings. At most it should be manipulating "localizable" strings, which have methods (such as the infamous _(...) method) to turn them into "real" strings, but this sort of code should be discouraged. "Make the easy things easy and the hard things possible".

This looks very much like JavaScript, but it's not quite JavaScript.

At what point do you lose enough of JS that calling it "JavaScript" is as bad as JavaScript including "Java" in its name back in the 90s? ;)

And I think that inventing a system locally within the Wikimedia movement is a really problematic idea. In my opinion, we already do too many "special" things in MediaWiki and Wikimedia which create barriers for learning. Not being able to search the internet for questions/answers, blog posts, books, videos created by the much larger software development community makes us special in a bad way rather than a good way.

Agreed completely. The crux of this will be whether the resulting code is read/writable in native languages by someone who isn't super-aware of how the underlying translation system works or how its semantics subtly differ from JavaScript. If this can be done such that normal users don't see the edge cases, then I believe we can leverage community familiarity with JavaScript. If the result is not sufficiently similar to bog-standard JavaScript, then I agree it's not worth doing. Anyway, that's the hypothesis I'd like to test.

The main difference in my prototype so far is that you need to explicitly import function names in order to ensure you're using the right (localized) symbol for them. That is, instead of:

import { A } from `someplace`;

... A.foo() ...

You need to write:

import { A, #foo } from `someplace`;

... A.foo() ...

and that's enough to ensure that A.foo() and B.foo() are disambiguated properly where that matters. If you don't import #foo you get a syntax error when .foo is parsed, as you would if you wrote A and A had not been defined (at least in modern JavaScript tooling; legacy JavaScript would resolve .foo and A to undefined and you'd get a runtime error).
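One way to read the `#foo` import in standard JavaScript terms is that each module owns its own Symbol for `foo` and exports it alongside its objects. The sketch below is only an analogy, not the prototype's actual mechanism; `moduleA`, `moduleB`, and their contents are hypothetical stand-ins for real modules.

```javascript
// Stand-ins for two modules; each creates and owns its own Symbol for "foo".
const moduleA = (() => {
  const foo = Symbol('foo');
  return { A: { [foo]: () => 'A.foo' }, foo };
})();
const moduleB = (() => {
  const foo = Symbol('foo');
  return { B: { [foo]: () => 'B.foo' }, foo };
})();

// Rough analogue of: import { A, #foo } from `someplace`;
const { A, foo } = moduleA;
const { B } = moduleB;

console.log(A[foo]());  // A.foo
console.log(foo in B);  // false: B's "foo" is a different Symbol
```

Because the two Symbols are distinct, a lookup with the imported `foo` can only ever reach A's method, however the two names are displayed in a given locale.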

For example, in a learning environment you'd probably want to make sure that console and #log are imported by default and that console.log() does some sensible thing when you give it a localizable string (aka Symbol) instead of a "real" string, so the newbie's console.log("Hello, world") would still work, albeit with some warnings about how you could change that to make sure that everyone around the world could see an appropriately translated message. The localized version would just be console.log(_("Hello, world")), not a huge change. (But we don't generally use console.log in template code anyway.)
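A rough sketch of that behaviour in standard JavaScript follows. Everything here is an assumption made for illustration: the catalog layout, the `_()` signature with an explicit locale argument, the `log` wrapper, and the Pig Latin text are not taken from the prototype.

```javascript
// Hypothetical localizable string and message catalog keyed by Symbol.
const HELLO = Symbol('Hello, world');
const catalog = new Map([
  [HELLO, { 'en-us': 'Hello, world', 'en-x-piglatin': 'Ellohay, orldway' }],
]);

// Hypothetical _() helper: turn a localizable string into a "real" string,
// falling back to the Symbol's description when no translation exists.
function _(message, locale = 'en-us') {
  const entry = catalog.get(message);
  return entry?.[locale] ?? message.description;
}

// Hypothetical console.log stand-in that tolerates localizable strings
// but nudges the author towards calling _() explicitly.
function log(value, locale = 'en-us') {
  if (typeof value === 'symbol') {
    console.warn('log() received a localizable string; consider wrapping it in _()');
    value = _(value, locale);
  }
  console.log(value);
  return value;
}

log(HELLO);                        // warns, then prints the en-us text
log(_(HELLO, 'en-x-piglatin'));    // prints the translated text, no warning
```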

eprodromou added a subscriber: eprodromou.

We discussed this in our feature requests conversation in CPT today. Although we think this is a laudable effort, and we would be supportive of someone building such a language pre-processor for Scribunto, it doesn't fit in scope for the work we do.

I'm un-tagging us from this discussion. If it gets completed, and there's more work for CPT to do, we'll get back into the conversation.

Whether Lua or JavaScript, there are some significant complications to this sort of thing when you're dealing with something that's fundamentally a text file, which is probably why the three examples are all programmed using a custom editor oriented around dragging and dropping "blocks" instead, so the human-readable names really are arbitrary labels and the editor always knows exactly which code-object everything refers to.

People have certainly tried to make non-English programming languages, though I don't think any of them have ever really been popular. For example, Perl has a Latin version and a Klingon version:

http://users.monash.edu/~damian/papers/HTML/Perligata.html & https://metacpan.org/pod/Lingua::tlhInganHol::yIghun

Oh and there's this apparently http://www.babylscript.com/