
name:<local lang code> is not always available in OSM
Closed, Resolved · Public

Description

tl;dr In many cases, looking at a map with lang=<local lang> shows fewer labels (and more fallbacks, mostly English) than looking at the same map without the i18n feature.

Example
Localized map with lang=zh: https://maps.wikimedia.org/?s=osm-intl-i18n&lang=zh#13/39.9122/116.3925
Original map of the same area: https://maps.wikimedia.org/#13/39.9122/116.3925

The guideline from OSM seems to be to provide name=<local name> AND name:<local lang code>=<local name>, but the latter is not always provided, so we end up falling back on many imperfect options instead of showing the local name.

Since we don't have information about the language of the name attribute, I don't know how we can solve this. Maybe we can just accept that the data is not good enough, but it feels like a regression when looking at a map with lang=<local lang>.
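The effect described above can be sketched in a few lines of Python. This is a minimal illustration, not the actual Maps code: the function `pick_label`, the single-entry fallback list, and the sample tags are assumptions made up for the example.

```python
def pick_label(tags, lang, fallbacks=("en",)):
    """Return name:<lang>, else the first available fallback, else plain name."""
    for code in (lang, *fallbacks):
        value = tags.get(f"name:{code}")
        if value:
            return value
    # Last resort: the untagged local name, whose language is unknown.
    return tags.get("name")


# Hypothetical object: the local name is Chinese, but only name:en was added.
street = {"name": "长安街", "name:en": "Chang'an Avenue"}

# With lang=zh the English fallback wins over the Chinese local name.
print(pick_label(street, "zh"))  # Chang'an Avenue
```

Because nothing says that the plain name is already in Chinese, the lookup never gets a chance to prefer it over the English fallback.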

Event Timeline


I think you meant name=<local name>, not name:<local name>. "name" should be the last fallback if other options are not available. "int_name" should also be considered (it's the most common tag besides the name itself -- over 400,000). As for telling the language of the "name", I proposed that OSM have regions with a primary language code (currently being discussed). Please participate.

I think you meant name=<local name>, not name:<local name>.

Yes, that's what I meant. Updated.

"name" should be the last fallback if other options are not available.

For the cases I'm describing here, "name" is by far the best fallback, but we can't tell without knowing the lang of "name."

"int_name" should also be considered (it's the most common tag besides the name itself -- over 400,000).

I don't see how "int_name" can be used if we don't know which lang it is written in.

As for telling the language of the "name", I proposed that OSM have regions with a primary language code (currently being discussed). Please participate.

Is there a wiki page for such proposals? There was a proposal about having name:language=<language used in the name tag>, but it was rejected. Even if it had been accepted, the data wouldn't appear overnight, but I thought it was a decent solution.

I looked at some numbers. For China*, there are 29332 named cities or towns in OSM. 7508 have no name:* tags, and 16288 have a name:zh. The labels for these would not appear in a different language for a Chinese map of China. This leaves 5536 which have name, a name:*, but not name:zh. If I restrict this to just cities, the proportion which have name:* but not name:zh decreases.

* Specifically, Geofabrik's extract for China, which covers a slightly bigger area.

SQL to generate the results from an osm2pgsql database is below:

-- extract_names() is not defined in this thread; as used here it appears to
-- return the name:* tags keyed by language code, or NULL when there are none.
SELECT
    *,
    -- Objects with some name:* tag but no name:zh: their label changes language.
    "Total named cities and towns"
      - "Without any names but name=*"
      - "Has Chinese Han name specified" AS "Would switch language"
  FROM (SELECT
    COUNT(*) AS "Total named cities and towns",
    COUNT(*) FILTER (WHERE extract_names(tags) IS NULL) AS "Without any names but name=*",
    COUNT(*) FILTER (WHERE extract_names(tags) ? 'zh') AS "Has Chinese Han name specified"
  FROM planet_osm_point
  WHERE
    name IS NOT NULL
    AND place IN ('city', 'town')) _;
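For readers without an osm2pgsql database at hand, the three buckets the query counts can be sketched in Python on plain tag dictionaries. The function `classify` and the sample objects are hypothetical, and unlike whatever `extract_names()` does, this sketch treats every `name:*` key as a language code.

```python
from collections import Counter


def classify(tags, lang="zh"):
    """Put one named object into the same bucket the query would count it in."""
    # Simplification: any key of the form name:<x> is taken as language <x>.
    langs = {key.split(":", 1)[1] for key in tags if key.startswith("name:")}
    if not langs:
        return "only name"        # "Without any names but name=*"
    if lang in langs:
        return "has name:" + lang  # "Has Chinese Han name specified"
    return "would switch"          # "Would switch language"


# Made-up sample objects, one per bucket.
sample = [
    {"name": "北京市"},
    {"name": "上海市", "name:zh": "上海市", "name:en": "Shanghai"},
    {"name": "广州市", "name:en": "Guangzhou"},
]

counts = Counter(classify(tags) for tags in sample)
print(counts)
```

Only the third object would change its label on a lang=zh map, since it has a `name:*` tag but no `name:zh`.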

@SBisson I think adding an extra tag to every single OSM object with the name tag is a bit excessive - there are 62 million of them. A much better solution IMO is to make it possible to calculate that language based on the geo position of the object. My suggestion was to create new "language meta-regions" with the language tags. Another solution is to add language tags to the existing regions. Both have pros/cons, but so far it was only discussed in that mailing thread (link above). If either of these solutions is implemented, it should be possible to generate a geo index for lookups - something that can be done during vtile data generation.
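A rough sketch of what such a position-based lookup could look like follows. Everything here is an assumption for illustration: the `LanguageRegion` type, the two very coarse bounding boxes, and both function names are made up, and real language regions would be polygons served from a proper spatial index rather than a list scan.

```python
from typing import NamedTuple, Optional


class LanguageRegion(NamedTuple):
    lang: str
    west: float
    south: float
    east: float
    north: float

    def contains(self, lon: float, lat: float) -> bool:
        return self.west <= lon <= self.east and self.south <= lat <= self.north


# Hypothetical, deliberately crude boxes; not real OSM data.
REGIONS = [
    LanguageRegion("zh", 73.0, 18.0, 135.0, 54.0),
    LanguageRegion("en", -125.0, 24.0, -66.0, 50.0),
]


def language_at(lon: float, lat: float, regions=REGIONS) -> Optional[str]:
    """Return the primary language of the first region containing the point."""
    for region in regions:
        if region.contains(lon, lat):
            return region.lang
    return None


def pick_label_with_region(tags, lang, lon, lat, fallback="en"):
    """Prefer name:<lang>; use plain name when the region's language matches."""
    explicit = tags.get(f"name:{lang}")
    if explicit:
        return explicit
    if language_at(lon, lat) == lang:
        return tags.get("name")
    return tags.get(f"name:{fallback}") or tags.get("name")


street = {"name": "长安街", "name:en": "Chang'an Avenue"}
# Coordinates taken from the example map URL in the description.
print(pick_label_with_region(street, "zh", 116.3925, 39.9122))  # 长安街
```

With the region language known, the plain name can be trusted for lang=zh inside the region, while a request for another language still goes through the usual fallbacks.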

In T192662#4147291, @Pnorman wrote:

I looked at some numbers. For China*, there are 29332 named cities or towns in OSM. 7508 have no name:* tags, and 16288 have a name:zh. The labels for these would not appear in a different language for a Chinese map of China. This leaves 5536 which have name, a name:*, but not name:zh. If I restrict this to just cities, the proportion which have name:* but not name:zh decreases.

Thanks Paul. In Stephane's examples above (internationalized vs. current), it looks like what's missing here is not the city name but many street and district names. Is there a way to sample data for those?

Also, can you show us the USA as well?

Thanks!

For Chinese highways

│ Total named highways           │ 412114 │
│ Without any names but name=*   │ 247410 │
│ Has Chinese Han name specified │ 50415  │
│ Would switch language          │ 114289 │

So about 19% of city and town labels (5536 of 29332) would switch languages, and 28% of road labels (114289 of 412114). I'll need to load the US data, so I can post that this evening.

For the Geofabrik US West region

│ Total named highways             │ 2223811 │
│ Without any names but name=*     │ 2215127 │
│ Has English latin name specified │ 270     │
│ Would switch language            │ 8414    │

│ Total named cities and towns     │ 1236 │
│ Without any names but name=*     │ 1025 │
│ Has English latin name specified │ 63   │
│ Would switch language            │ 148  │
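The "Would switch language" rows and the corresponding shares can be recomputed from the counts posted in this thread. The helper name `switch_share` is made up for this sketch; the numbers are copied from the tables and comments above.

```python
def switch_share(total, only_name, has_local):
    """Return (count, percentage) of objects whose label would switch language."""
    would_switch = total - only_name - has_local
    return would_switch, 100 * would_switch / total


# (total named, without any names but name=*, has local-language name)
datasets = {
    "China cities and towns": (29332, 7508, 16288),
    "China highways": (412114, 247410, 50415),
    "US West highways": (2223811, 2215127, 270),
    "US West cities and towns": (1236, 1025, 63),
}

for label, counts in datasets.items():
    count, pct = switch_share(*counts)
    print(f"{label}: {count} ({pct:.1f}%)")
```

This prints roughly 18.9% and 27.7% for China, against 0.4% and 12.0% for the US West extract.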

Joe asked for my comments here, although I'm a bit lost, because I'm really not an OSM expert. The big thing that I fail to understand is how can it happen that some labels are not shown at all. In practice, this is indeed a regression, but isn't there supposed to be some kind of a fallback?

The big thing that I fail to understand is how can it happen that some labels are not shown at all. In practice, this is indeed a regression, but isn't there supposed to be some kind of a fallback?

With the i18n changes we'd never fail to show a label that is currently shown. Different labels may show because labels are different sizes. For example, a Chinese label will generally take less space than an English one, so fewer English labels can be shown.
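The space argument can be illustrated with a one-dimensional toy model. This is not how the map renderer places labels (real placement is two-dimensional collision detection); the function `place_labels`, the character widths, and the line width are all assumptions chosen for the example.

```python
def place_labels(labels, line_width, char_width=1.0, gap=1.0):
    """Greedily place labels on one line, skipping any that no longer fit."""
    placed, used = [], 0.0
    for text in labels:
        needed = len(text) * char_width + (gap if placed else 0.0)
        if used + needed > line_width:
            continue  # no room left for this label
        placed.append(text)
        used += needed
    return placed


zh = ["东城区", "西城区", "朝阳区", "海淀区"]
en = ["Dongcheng District", "Xicheng District",
      "Chaoyang District", "Haidian District"]

# Assume a Han character is twice as wide as a Latin letter.
print(len(place_labels(zh, 40, char_width=2.0)))  # 4
print(len(place_labels(en, 40, char_width=1.0)))  # 2
```

Even with each Han character counted as double width, all four Chinese labels fit in the space that holds only two of the English ones.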

Joe asked for my comments here, although I'm a bit lost, because I'm really not an OSM expert. The big thing that I fail to understand is how can it happen that some labels are not shown at all. In practice, this is indeed a regression, but isn't there supposed to be some kind of a fallback?

There's a series of fallbacks, explained thoroughly here: T192701: Create an optimized language-fallback system for Maps internationalization based on investigations

The big thing that I fail to understand is how can it happen that some labels are not shown at all. In practice, this is indeed a regression, but isn't there supposed to be some kind of a fallback?

With the i18n changes we'd never fail to show a label that is currently shown. Different labels may show because labels are different sizes. For example, a Chinese label will generally take less space than an English one, so fewer English labels can be shown.

The only time a label won't be shown is if it doesn't, at all, have a local value (name=) which, as I understand it, doesn't happen.

This isn't really a regression; the previous code would supposedly show a random language instead if there were no fallbacks at all and no local language, which is really not a very good result.

As I understand it, there should never be labels that have no local values, though. Is that right, @Pnorman?

SBisson claimed this task.

This task contains some interesting discussion but is not actionable. It was superseded by T192701: Create an optimized language-fallback system for Maps internationalization based on investigations.