add <langconvert> parser tag
Open, Needs Triage · Public

Description

For the Balinese palm-leaf project grant I'm working on, I will need the ability to transliterate text in Balinese script to Latin script. The output will look something like this Palmleaf.org page.

The transliteration rules will be implemented in the new Balinese LanguageConverter class (under review). However, the existing LanguageConverter facilities are not sufficient, because I don't want to convert whole pages into either Balinese or Latin script. Rather, I want the Latin transliteration to supplement the Balinese original and appear below it. This means that I need a way to convert particular chunks of wikitext from one Balinese variant to another, and insert the result in a flexible manner.

To do this, I propose adding a <langconvert> tag to CoreParserTags.php to allow flexible access to LanguageConverter. It takes two attributes: from (the source language variant) and to (the target language variant). For example, <langconvert from="sr-el" to="sr-ec">zdravo</langconvert> would return "здраво" (Latin Serbian to Cyrillic Serbian).
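
To make the from/to semantics concrete, here is a minimal standalone sketch. It is not the code from the patch: the function name toyLangConvert, the TOY_TABLES constant, and the tiny Serbian table are all hypothetical, and returning the text unchanged for an unknown variant pair is an assumption rather than the tag's documented behaviour.

```php
<?php
// Toy table for illustration only: a few Serbian Latin -> Cyrillic pairs,
// keyed by the "from|to" variant pair. This is NOT MediaWiki's conversion data.
const TOY_TABLES = [
    'sr-el|sr-ec' => [
        'lj' => 'љ', 'nj' => 'њ',
        'a' => 'а', 'd' => 'д', 'o' => 'о', 'r' => 'р', 'v' => 'в', 'z' => 'з',
    ],
];

// Hypothetical stand-in for what the tag does with its two attributes.
function toyLangConvert( string $from, string $to, string $text ): string {
    $table = TOY_TABLES["$from|$to"] ?? null;
    if ( $table === null ) {
        // Unknown variant pair: leave the text unchanged (an assumption).
        return $text;
    }
    // strtr() with an array prefers the longest matching key, so the
    // digraphs "lj" and "nj" are converted before single letters.
    return strtr( $text, $table );
}
```

With this toy table, toyLangConvert( 'sr-el', 'sr-ec', 'zdravo' ) returns "здраво", matching the example above.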

Event Timeline

kamholz created this task. Sep 17 2020, 12:30 AM
Restricted Application added a subscriber: Aklapper. Sep 17 2020, 12:30 AM

Change 627938 had a related patch set uploaded (by David Kamholz; owner: David Kamholz):
[mediawiki/core@master] implement #transliterate parser function

https://gerrit.wikimedia.org/r/627938

Bugreporter added a subscriber: Bugreporter.

It will also be useful in Chinese.

kamholz updated the task description. Sep 17 2020, 12:54 AM

Change 636108 had a related patch set uploaded (by David Kamholz; owner: David Kamholz):
[mediawiki/core@master] Implement Balinese language converter

https://gerrit.wikimedia.org/r/636108

kamholz updated the task description. Oct 24 2020, 12:14 AM
MBinder_WMF edited projects, added Parsoid (Tracking); removed Parsoid. Dec 10 2020, 8:12 PM

Change 627938 merged by jenkins-bot:
[mediawiki/core@master] Implement <langconvert> tag

https://gerrit.wikimedia.org/r/627938

kamholz renamed this task from add #transliterate parser function to add <langcovnert> parser tag. Dec 15 2020, 8:40 PM
kamholz updated the task description.
Reedy renamed this task from add <langcovnert> parser tag to add <langconvert> parser tag. Dec 16 2020, 2:50 AM
Johan added a subscriber: Johan.

Added to https://meta.wikimedia.org/wiki/Tech/News/2020/52 – please let me know if there are any mistakes in the text.

Looks great, thanks!

Change 651011 had a related patch set uploaded (by Tim Starling; owner: Tim Starling):
[mediawiki/core@master] Parser test for Balinese language conversion

https://gerrit.wikimedia.org/r/651011

Change 636108 merged by jenkins-bot:
[mediawiki/core@master] Implement Balinese language converter

https://gerrit.wikimedia.org/r/636108

Please (also) support BCP 47-conformant language codes like sr-Latn or sr-Cyrl, instead of sr-el and sr-ec, for the <langconvert> tag.

BCP 47-conformant language codes are already required for HTML attributes such as lang:

<span lang="sr-Cyrl">здраво</span>

Currently the HTML lang attribute requires a BCP 47-conformant language code, while the langconvert element requires MediaWiki's internal language codes:

<span lang="sr-Cyrl"><langconvert from="sr-el" to="sr-ec">zdravo</langconvert></span>

It would be better to always use BCP 47-conformant language codes:

<span lang="sr-Cyrl"><langconvert from="sr-Latn" to="sr-Cyrl">zdravo</langconvert></span>

I agree that this would be better. Unfortunately, SrConverter internally uses sr-ec and sr-el rather than BCP 47-compliant codes, so the correct converter can only be identified via those codes. There is a mechanism for converting from sr-ec to sr-Cyrl, but in this case we'd have to go the other way, and I'm not aware of any such conversion mechanism built into the core classes (Language, LanguageCode, LanguageFactory, LanguageConverter).

The correct fix is probably to change the internal codes in SrConverter and any other converters that use non-standard codes, but those codes are used in other places too, such as i18n, so I assume that wouldn't be an easy change. I'm open to other ways to do this, but I'd need to understand all of the implications, and I don't know enough about the possible impacts to judge that well right now.

cscott added a subscriber: cscott. Edited · Sun, Jan 10, 3:20 PM

Can we open a new phab task for this? I apologize for not noticing/flagging this earlier. There are a number of tasks already in phab to deprecate and remove the old mediawiki codes (including sr-ec, sr-el, etc) and it would be a significant step backwards to have the old names written into article wikitext, which would require manually updating all that wikitext in the future.

I *think* there's already functionality to convert from BCP-47 codes to internal mediawiki ones, since that is needed by the REST APIs (for example) which use BCP-47 codes in their HTTP standard language request headers. But if not, I'm happy to write that function for use.

EDIT: LanguageConverter::validateVariant() is the existing method which accepts BCP 47 codes and converts them to internal codes. Ideally you'd have code like this:

$internalCode = $converter->validateVariant( $givenCode );
if ( LanguageCode::bcp47( $internalCode ) !== $givenCode ) {
   // error or warning or something, at least a tracking category
}
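
The round-trip idea in the snippet above can be sketched without MediaWiki. Everything here is hypothetical and for illustration only: the INTERNAL_TO_BCP47 table (limited to the two Serbian pairs named in this thread) and the helpers toInternalCode() and classifyVariantCode() are stand-ins, not core APIs, and the comparison is case-sensitive for simplicity.

```php
<?php
// Hypothetical lookup table; MediaWiki maintains its own mapping.
const INTERNAL_TO_BCP47 = [
    'sr-ec' => 'sr-Cyrl',
    'sr-el' => 'sr-Latn',
];

// Hypothetical helper: accept either an internal code or a BCP 47 code and
// return the internal code, or null when the code is not in the table.
function toInternalCode( string $code ): ?string {
    if ( array_key_exists( $code, INTERNAL_TO_BCP47 ) ) {
        return $code;
    }
    $internal = array_search( $code, INTERNAL_TO_BCP47, true );
    return $internal === false ? null : $internal;
}

// Hypothetical helper: classify an attribute value as 'ok' (already BCP 47),
// 'legacy' (internal spelling, a candidate for a tracking category),
// or 'unknown'.
function classifyVariantCode( string $given ): string {
    $internal = toInternalCode( $given );
    if ( $internal === null ) {
        return 'unknown';
    }
    // Round-trip: a code that does not map back to itself is a legacy spelling.
    return INTERNAL_TO_BCP47[$internal] === $given ? 'ok' : 'legacy';
}
```

Under these assumptions, classifyVariantCode( 'sr-Cyrl' ) returns 'ok' and classifyVariantCode( 'sr-ec' ) returns 'legacy', which is the case the warning or tracking category is meant to catch.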

Just created a new task for this.