
Figure out how to handle language variants with maps
Open, Low, Public, Feature

Description

Some languages use multiple variants, often with different scripts, and use LanguageConverter to convert between them. For example, on Serbian Wikipedia, users can use either the Latin script or the Cyrillic script, with the interface translated separately, and the content transliterated automatically.

We should figure out how to make maps work well with this. We probably need to send a different language code to Kartotherian for Serbian depending on whether it's being viewed in Latin or Cyrillic. What we need to do here depends on how MediaWiki represents these things, and also on how OSM represents labels in these languages.

Event Timeline

For reference, here's how multiple variants are organized in OSM for Serbian. The formal rules are here (in Serbian); a quick recap:

  • Local names in Serbia and name:sr are always Cyrillic.
  • Latin names go in name:sr-Latn.

gis=# select count(*) as num from planet_osm_point where tags ? 'name:sr';
  num
-------
 14168

gis=# select count(*) as num from planet_osm_point where tags ? 'name:sr-Latn';
  num
-------
 12405
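
To make the tagging rules above concrete, here is a minimal sketch of how a label could be picked per variant. The mapping table and the pick_label helper are hypothetical and not part of Kartotherian; sr-ec and sr-el are the MediaWiki variant codes for Cyrillic and Latin Serbian. Since the counts above show fewer name:sr-Latn tags than name:sr tags, the Latin variant falls back to Cyrillic here.

```python
# Hypothetical mapping from MediaWiki variant codes to OSM tag keys,
# listed in order of preference. Not taken from any existing code.
VARIANT_TO_OSM_KEYS = {
    "sr-ec": ["name:sr", "name"],                  # Cyrillic
    "sr-el": ["name:sr-Latn", "name:sr", "name"],  # Latin, then Cyrillic
}


def pick_label(tags, variant):
    """Return the first non-empty label for the variant, or None."""
    for key in VARIANT_TO_OSM_KEYS.get(variant, ["name"]):
        value = tags.get(key)
        if value:
            return value
    return None


tags = {"name": "Београд", "name:sr": "Београд", "name:sr-Latn": "Beograd"}
print(pick_label(tags, "sr-el"))
print(pick_label(tags, "sr-ec"))
```

Whether falling back to Cyrillic (instead of transliterating) is acceptable for Latin readers is an open design question, not something this sketch settles.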

Daniel Koc, from the OSM talk mailing list, also mentions the following in this vein:

  • Belarusian has two writing systems (see http://openstreetmap.by for 4 languages demo).
  • Chinese has a few different writing systems.
  • Buginese can use Lontara or Latin script.

It's not 4 writing systems. It's Russian, English and 2 variants of Belarusian grammar: Narkamauka (be) and Taraskievica (be-tarask). I wonder if there's even any difference between the latter 2 in practice.

gis=# select count(*) from planet_osm_point where tags ? 'name:be-tarask' and tags ? 'name:be' and (tags->'name:be') <> (tags->'name:be-tarask');
 count
-------
   315

And some of these differences aren't even linguistic, like Аб'яднаныя Арабскія Эміраты vs. Аб’яднаныя Арабскія Эміраты, which differ only in the apostrophe character. Other differences come from Taraskievica using old, obsolete names for some places, e.g. Slavic Жыжмары instead of Жыежмарэй, which reflects the town's present Lithuanian name Žiežmariai much better.
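
A rough way to separate the typographic differences from the real ones is to normalize apostrophes before comparing. This is only a sketch: the set of characters treated as equivalent is my assumption, and the helper names are made up for illustration.

```python
# Apostrophe-like characters folded into ASCII "'" before comparison.
# Which characters count as equivalent is an assumption.
APOSTROPHES = str.maketrans({"\u2019": "'", "\u02bc": "'"})


def normalize(name):
    """Fold apostrophe variants and trim surrounding whitespace."""
    return name.translate(APOSTROPHES).strip()


def really_differs(name_be, name_be_tarask):
    """True if the two names still differ after normalization."""
    return normalize(name_be) != normalize(name_be_tarask)


print(really_differs("Аб\u0027яднаныя Арабскія Эміраты",
                     "Аб\u2019яднаныя Арабскія Эміраты"))
print(really_differs("Жыжмары", "Жыежмарэй"))
```

The first pair only differs in the apostrophe, so it should no longer count as a difference; the second pair is a genuine naming difference and still does.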

Just to note, the IANA registry says

Type: language
Subtag: be
Description: Belarusian
Added: 2005-10-16
Suppress-Script: Cyrl

So be is the preferred way to say be-Cyrl, with BCP 47 saying the be form is a SHOULD because Cyrl is the Suppress-Script.
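
As an illustration of what honouring Suppress-Script means, here is a small sketch that drops a script subtag when it matches the registry's Suppress-Script for the language. The table only contains the be record quoted above, and canonical_tag is a hypothetical helper; it ignores extlang subtags and is nowhere near a full BCP 47 implementation.

```python
# Only the record quoted above; a real table would come from the
# IANA Language Subtag Registry.
SUPPRESS_SCRIPT = {"be": "Cyrl"}


def canonical_tag(tag):
    """Drop a script subtag that equals the language's Suppress-Script."""
    subtags = tag.split("-")
    lang = subtags[0].lower()
    rest = subtags[1:]
    # A script subtag is four letters and directly follows the language
    # (extlang subtags are ignored in this sketch).
    if rest and len(rest[0]) == 4 and rest[0].isalpha():
        script = rest[0].title()
        if SUPPRESS_SCRIPT.get(lang) == script:
            rest = rest[1:]
        else:
            rest = [script] + rest[1:]
    return "-".join([lang] + rest)


print(canonical_tag("be-cyrl"))
print(canonical_tag("be-Latn"))
print(canonical_tag("be-tarask"))
```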

For Serbian, it states

Type: language
Subtag: sr
Description: Serbian
Added: 2005-10-16
Macrolanguage: sh
Comments: see cnr for Montenegrin

Type: language
Subtag: sh
Description: Serbo-Croatian
Added: 2005-10-16
Scope: macrolanguage
Comments: sr, hr, bs are preferred for most modern uses

BCP 47 doesn't help with which is preferred, so this would probably need to be resolved with CLDR.

There are two practical considerations:

  1. What is the data we're actually dealing with? Max looked at that above.
  2. We're not doing a full BCP 47 implementation of language codes, let alone CLDR. Neither does MediaWiki.

As an example of what full BCP 47 consideration would involve, if someone wants en-GB-oed, the preferred value of that code is en-GB-oxendict, and they'd probably want en-GB if that doesn't exist, but would also accept en-GB-scotland, en, or maybe en-CA-newfound. Babel doesn't handle any of this afaik, and I don't think MediaWiki does in full. But there isn't any need to do that, nor plans to do so.
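
For comparison, the simplest piece of this, truncation-based fallback in the style of RFC 4647 lookup, can be sketched as below. The PREFERRED table holds only the one mapping mentioned above, the function names are made up, and the sketch deliberately does not cover sibling matches such as en-GB-scotland or en-CA-newfound. Matching against the available set is case-sensitive here for brevity.

```python
# Only the mapping mentioned above; keys are lowercased tags.
PREFERRED = {"en-gb-oed": "en-GB-oxendict"}


def fallback_chain(tag):
    """Return the tag and its progressively truncated forms."""
    tag = PREFERRED.get(tag.lower(), tag)
    subtags = tag.split("-")
    chain = []
    while subtags:
        chain.append("-".join(subtags))
        subtags.pop()
        # RFC 4647 lookup also drops a trailing single-character subtag.
        if subtags and len(subtags[-1]) == 1:
            subtags.pop()
    return chain


def lookup(requested, available):
    """Return the first candidate from the chain that is available."""
    for candidate in fallback_chain(requested):
        if candidate in available:
            return candidate
    return None


print(fallback_chain("en-GB-oed"))
print(lookup("en-GB-oed", {"en", "en-CA-newfound"}))
```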

Vvjjkkii renamed this task from Figure out how to handle language variants with maps to kndaaaaaaa. Jul 1 2018, 1:12 AM
Vvjjkkii removed Catrope as the assignee of this task.
Vvjjkkii triaged this task as High priority.
Vvjjkkii updated the task description.
Vvjjkkii removed a subscriber: Aklapper.
CommunityTechBot renamed this task from kndaaaaaaa to Figure out how to handle language variants with maps. Jul 2 2018, 4:30 PM
CommunityTechBot assigned this task to Catrope.
CommunityTechBot raised the priority of this task from High to Needs Triage.
CommunityTechBot updated the task description.
CommunityTechBot added a subscriber: Aklapper.
MSantos moved this task from Unsorted to Feature requests on the Maps (Kartographer) board.
awight subscribed.

Exploring this for a minute. We've found that minor changes would enable this feature:

  • The mapdata library must wire the lang parameter through to uselang in the API
  • LanguageConverter will need to munge URLs appearing in Kartographer static map image tags.

Change 889980 had a related patch set uploaded (by Awight; author: Awight):

[mediawiki/core@master] [POC] Vary lang parameter for some URLs

https://gerrit.wikimedia.org/r/889980

Change 889981 had a related patch set uploaded (by Awight; author: Awight):

[mapdata@master] [WIP] Wire language through to MediaWiki

https://gerrit.wikimedia.org/r/889981

Change 889983 had a related patch set uploaded (by Awight; author: Awight):

[mediawiki/services/kartotherian@master] [WIP] Wire the language parameter through to mapdata

https://gerrit.wikimedia.org/r/889983

Change 889980 abandoned by Awight:

[mediawiki/core@master] [POC] Vary lang parameter for some URLs

Reason:

Happy to see this in the rear-view mirror!

https://gerrit.wikimedia.org/r/889980

The two attached patches seem to be enough. They both need tests and a bit of packaging attention, but nothing here seems controversial.

wait wait...

The lang is not the user interface language? It is the requested content language.
So I might require a lang="en" map from a "uselang=zh-Hant" user interface... how would you solve that here?

I mean it might work (I didn't fully check the patches), but we'd definitely have to add a bunch of comments everywhere to make sure everyone in the future fully understands what is going on.

I think you're right: my first suggestion to munge URLs in LanguageConverter was naive and definitely going nowhere fast. I've abandoned this direction.

The remaining patches simply wire img src lang= parameters through to the mapdata API, which won't result in any rendering differences in the snapshot. The lang parameter does control background tile label language, which is unchanged here. The only reason we need this wiring is that the hashed group ID is unstable, and the wiring results in matching parser output that has the expected hash.
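
To illustrate why the hash is unstable: if the group ID is derived from the post-conversion content, a variant that transliterates popup text produces a different ID. The sketch below assumes the ID is "_" plus a SHA-1 of the serialized GeoJSON; the exact serialization used by Kartographer is an assumption here, and group_id is a made-up helper.

```python
import hashlib
import json


def group_id(geojson):
    """Hypothetical group ID: '_' + SHA-1 of the serialized GeoJSON."""
    serialized = json.dumps(geojson, ensure_ascii=False, sort_keys=True)
    return "_" + hashlib.sha1(serialized.encode("utf-8")).hexdigest()


def feature(title):
    """Build a one-feature GeoJSON list with the given popup title."""
    return [{
        "type": "Feature",
        "geometry": {"type": "Point", "coordinates": [20.46, 44.82]},
        "properties": {"title": title},
    }]


cyrillic = feature("Београд")   # content as parsed for one variant
latin = feature("Beograd")      # same content after transliteration

print(group_id(cyrillic) == group_id(latin))
print(group_id(cyrillic) == group_id(feature("Београд")))
```

A request made under the wrong variant would compute one ID while the page refers to the other, which is the mismatch the wiring avoids.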

@TheDJ content vs. interface is more problematic than I'd hoped, and I found that there's already some confusing behavior.

For a page in language A, a map with no explicit language, and a user interface set to language C, the existing code renders like this:

  • Snapshot maps will have labels in language A.
  • When C is a variant of A which causes popup content changes, snapshots are broken.
  • Full-screen maps will open with labels in language A, but popups will be rendered using interface language C (not going to look different unless there are localizable tags in the content, or C is a language variant of A).

If the map has explicit label language B, then:

  • Snapshots have labels in language B.
  • Full-screen maps will have labels in language B and popups in language A.
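
The behavior described above can be written down as a small table-like function. This is only a restatement of the observations in this comment, not code from Kartographer; the type and function names are made up.

```python
from typing import NamedTuple, Optional


class Rendering(NamedTuple):
    snapshot_labels: str
    fullscreen_labels: str
    fullscreen_popups: str


def current_behaviour(page_lang: str, interface_lang: str,
                      map_lang: Optional[str] = None) -> Rendering:
    """Languages used today, per the observations listed above."""
    if map_lang is None:
        # No explicit map language: labels follow the page language (A),
        # popups follow the interface language (C).
        return Rendering(page_lang, page_lang, interface_lang)
    # Explicit map language (B): labels use B, popups use the page language.
    return Rendering(map_lang, map_lang, page_lang)


print(current_behaviour("sr", "sr-el"))
print(current_behaviour("sr", "sr-el", map_lang="en"))
```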

Thanks for pointing out the issue!

Change 889981 merged by jenkins-bot:

[mapdata@master] Wire language through to MediaWiki

https://gerrit.wikimedia.org/r/889981

What is left to review here? My understanding is that the POC proposal for hashing pre-expansion content is ready to turn into its own follow-up task, but will not be prioritized for team work.

Change 889983 abandoned by WMDE-Fisch:

[mediawiki/services/kartotherian@master] Wire the language parameter through to mapdata

Reason:

Currently it's not clear what the right behavior should be and we want to avoid unwanted effects.

https://gerrit.wikimedia.org/r/889983

If nobody replies, go ahead? :)

Aklapper changed the subtype of this task from "Task" to "Feature Request". Nov 15 2023, 5:19 PM