
Figure out how to handle language variants with maps
Open, Low · Public

Description

Some languages use multiple variants, often with different scripts, and use LanguageConverter to convert between them. For example, on Serbian Wikipedia, users can use either the Latin script or the Cyrillic script, with the interface translated separately, and the content transliterated automatically.

We should figure out how to make maps work well with this. We probably need to send a different language code to Kartotherian for Serbian depending on whether it's being viewed in Latin or Cyrillic. What we need to do here depends on how MediaWiki represents these things, and also on how OSM represents labels in these languages.

Event Timeline

Catrope created this task. · May 3 2018, 10:25 PM
Restricted Application added a subscriber: Aklapper. · May 3 2018, 10:25 PM
MaxSem added a subscriber: MaxSem. · Edited · May 3 2018, 10:54 PM

For reference, here's how the variants are organized in OSM for Serbian. The formal rules are here (in Serbian); a quick recap:

  • Local names in Serbia and name:sr are always Cyrillic.
  • Latin-script names go in name:sr-Latn.

Counts of points carrying each tag:

gis=# select count(*) as num from planet_osm_point where tags ? 'name:sr';
  num
-------
 14168

gis=# select count(*) as num from planet_osm_point where tags ? 'name:sr-Latn';
  num
-------
 12405
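
A minimal sketch of how a label could be picked under the tagging rules above, assuming MediaWiki's Serbian variant codes are sr-ec (Cyrillic) and sr-el (Latin). The function name, the fallback order, and the sample tags are illustrative assumptions, not what Kartotherian actually does.

```python
# Hypothetical mapping from MediaWiki Serbian variant codes to OSM name keys,
# listed in order of preference. The fallback order is an assumption.
VARIANT_TO_NAME_KEYS = {
    "sr": ["name:sr", "name"],
    "sr-ec": ["name:sr", "name"],                   # Cyrillic variant
    "sr-el": ["name:sr-Latn", "name:sr", "name"],   # Latin, falling back to Cyrillic
}


def pick_label(tags, variant):
    """Return the first non-empty name tag for the variant, or None."""
    for key in VARIANT_TO_NAME_KEYS.get(variant, ["name"]):
        value = tags.get(key)
        if value:
            return value
    return None


tags = {"name": "Београд", "name:sr": "Београд", "name:sr-Latn": "Beograd"}
print(pick_label(tags, "sr-el"))  # Beograd
print(pick_label(tags, "sr-ec"))  # Београд
```

Given the counts above, roughly one in eight points with name:sr has no name:sr-Latn, so a Latin-variant request needs some fallback; whether that should be the Cyrillic name or a transliteration is left open here.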

Daniel Koc, from the OSM talk mailing list, also mentions the following in this vein:

  • Belarusian has two writing systems (see http://openstreetmap.by for 4 languages demo).
  • Chinese has a few different writing systems.
  • Buginese can use Lontara or Latin script.

It's not 4 writing systems. It's Russian, English and 2 variants of Belarusian grammar: Narkamauka (be) and Taraskievica (be-tarask). I wonder if there's even any difference between the latter 2 in practice.

gis=# select count(*) from planet_osm_point where tags ? 'name:be-tarask' and tags ? 'name:be' and (tags->'name:be') <> (tags->'name:be-tarask');
 count
-------
   315

And some of these differences aren't even linguistic, like Аб'яднаныя Арабскія Эміраты vs. Аб’яднаныя Арабскія Эміраты, which differ only in the apostrophe character. Other differences come from Taraskievica using old, obsolete names for some places, e.g. the Slavic Жыжмары instead of Жыежмарэй, which reflects the town's present Lithuanian name Žiežmariai much better.
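
To separate typographic differences from real ones, the names could be normalized before comparing. This is only a sketch: the helper names and the set of apostrophe characters treated as equivalent are assumptions, and the assignment of Жыежмарэй to be and Жыжмары to be-tarask follows the description above.

```python
import unicodedata

# Map typographic apostrophe variants onto the ASCII apostrophe.
# Which characters to treat as equivalent is an assumption for this sketch.
APOSTROPHES = str.maketrans({"\u2019": "'", "\u02bc": "'"})


def normalize_name(name):
    """NFC-normalize and unify apostrophes so typographic differences vanish."""
    return unicodedata.normalize("NFC", name).translate(APOSTROPHES).strip()


def differs_after_normalization(name_be, name_be_tarask):
    """True when the two names still differ once typography is normalized."""
    return normalize_name(name_be) != normalize_name(name_be_tarask)


# Straight apostrophe vs. U+2019: a purely typographic difference.
print(differs_after_normalization(
    "Аб'яднаныя Арабскія Эміраты", "Аб\u2019яднаныя Арабскія Эміраты"))  # False
# A genuinely different name.
print(differs_after_normalization("Жыежмарэй", "Жыжмары"))  # True
```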

Pnorman added a subscriber: Pnorman. · May 8 2018, 6:54 PM

Just to note, the IANA registry says

Type: language
Subtag: be
Description: Belarusian
Added: 2005-10-16
Suppress-Script: Cyrl

So be is the preferred way to write be-Cyrl: BCP 47 says the script subtag SHOULD be omitted, because Cyrl is the Suppress-Script for be.

For Serbian, it states

Type: language
Subtag: sr
Description: Serbian
Added: 2005-10-16
Macrolanguage: sh
Comments: see cnr for Montenegrin

Type: language
Subtag: sh
Description: Serbo-Croatian
Added: 2005-10-16
Scope: macrolanguage
Comments: sr, hr, bs are preferred for most modern uses

Neither record has a Suppress-Script, so BCP 47 doesn't help with which script is preferred for Serbian; this would probably need to be resolved with CLDR.
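
A rough sketch of the Suppress-Script rule, using only the registry values quoted above. The table and function name are made up for illustration, and the parsing is deliberately naive (it assumes the script, if present, is the second subtag and ignores extlang subtags).

```python
# Suppress-Script values copied from the registry records quoted above.
# "sr" has no Suppress-Script field, so it is absent from this table.
SUPPRESS_SCRIPT = {"be": "Cyrl"}


def drop_suppressed_script(tag):
    """Remove a script subtag when it equals the language's Suppress-Script."""
    subtags = tag.split("-")
    if len(subtags) >= 2:
        language, script = subtags[0].lower(), subtags[1].title()
        # Script subtags are exactly four letters; anything else is left alone.
        if len(script) == 4 and script.isalpha() \
                and SUPPRESS_SCRIPT.get(language) == script:
            return "-".join([language] + subtags[2:])
    return tag


print(drop_suppressed_script("be-Cyrl"))    # be
print(drop_suppressed_script("be-tarask"))  # be-tarask
print(drop_suppressed_script("sr-Latn"))    # sr-Latn
print(drop_suppressed_script("sr-Cyrl"))    # sr-Cyrl
```

For Serbian both script-qualified tags survive unchanged, which is exactly the ambiguity noted above.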

There are two practical considerations:

  1. What is the data we're actually dealing with? Max looked at that above.
  2. We're not doing a full BCP 47 implementation of language codes, let alone CLDR. Neither does MediaWiki.

As an example of what full BCP 47 consideration would involve: if someone wants en-GB-oed, the preferred value of that code is en-GB-oxendict, and they'd probably want en-GB if that doesn't exist, but would also accept en-GB-scotland, en, or maybe en-CA-newfound. Babel doesn't handle any of this afaik, and I don't think MediaWiki does in full. But there isn't any need to do that, nor are there plans to.
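
For illustration, the simplest part of that (preferred value, then truncation in the style of RFC 4647 lookup) might look like the sketch below. The names are hypothetical, the preferred-value table holds only the one tag mentioned above, and sibling variants such as en-GB-scotland or en-CA-newfound are deliberately not considered.

```python
# Preferred-Value mapping for the one deprecated tag mentioned above.
PREFERRED = {"en-gb-oed": "en-GB-oxendict"}


def fallback_chain(tag):
    """Return the tag's preferred form followed by progressively shorter prefixes."""
    tag = PREFERRED.get(tag.lower(), tag)
    subtags = tag.split("-")
    chain = []
    while subtags:
        chain.append("-".join(subtags))
        subtags.pop()
        # As in RFC 4647 lookup, a single-character subtag cannot end a tag.
        while subtags and len(subtags[-1]) == 1:
            subtags.pop()
    return chain


def best_match(requested, available):
    """Pick the first entry of the fallback chain present in `available` (exact match)."""
    for candidate in fallback_chain(requested):
        if candidate in available:
            return candidate
    return None


print(fallback_chain("en-GB-oed"))            # ['en-GB-oxendict', 'en-GB', 'en']
print(best_match("en-GB-oed", {"en", "fr"}))  # en
```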

Kocio added a subscriber: Kocio. · Jun 2 2018, 12:29 AM
Restricted Application added a subscriber: Petar.petkovic. · Jun 2 2018, 12:30 AM
Vvjjkkii renamed this task from Figure out how to handle language variants with maps to kndaaaaaaa. · Jul 1 2018, 1:12 AM
Vvjjkkii removed Catrope as the assignee of this task.
Vvjjkkii triaged this task as High priority.
Vvjjkkii updated the task description.
Vvjjkkii removed a subscriber: Aklapper.
CommunityTechBot renamed this task from kndaaaaaaa to Figure out how to handle language variants with maps. · Jul 2 2018, 4:30 PM
CommunityTechBot assigned this task to Catrope.
CommunityTechBot raised the priority of this task from High to Needs Triage.
CommunityTechBot updated the task description.
CommunityTechBot added a subscriber: Aklapper.
Catrope removed Catrope as the assignee of this task. · Jul 5 2018, 6:00 PM
MSantos triaged this task as Low priority. · Sep 25 2018, 3:22 PM
MSantos moved this task from Unsorted to Feature requests on the Maps (Kartographer) board.