
Investigate: Retrieve translated / localized place and country names from Wikidata
Open, In Progress, Needs Triage, Public, Spike

Description

Problem
The MaxMind database is not updated very often (the version we have on production is from 2018) and the coverage of translations for place names is rather slim to begin with.

Proposed Solution
Convert the GeoNames ID & ASN provided by MaxMind into Wikidata entities.

This can be done with a search like this for GeoName ID:
https://www.wikidata.org/w/index.php?search=haswbstatement%3AP1566%3D4140963&title=Special:Search
or for ASN:
https://www.wikidata.org/w/index.php?search=haswbstatement%3AP3797%3D7922&title=Special%3ASearch

Ideally, for performance reasons, the search should be performed directly against Elasticsearch rather than through MediaWiki's API.
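For illustration, here is a minimal sketch of the same lookup through the action API (Python; the function name is made up, and a production version would query Elasticsearch directly as noted above):

import requests

# Find the Wikidata item carrying a given GeoNames ID (P1566).
# This is the same query as the Special:Search URLs above.
def wikidata_item_for_geoname(geoname_id):
    resp = requests.get(
        "https://www.wikidata.org/w/api.php",
        params={
            "action": "query",
            "list": "search",
            "srsearch": f"haswbstatement:P1566={geoname_id}",
            "format": "json",
        },
        timeout=10,
    )
    hits = resp.json()["query"]["search"]
    return hits[0]["title"] if hits else None  # e.g. "Q1489"

print(wikidata_item_for_geoname(4140963))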

Event Timeline

Problem
The MaxMind database is not updated very often (the version we have on production is from 2018) and the coverage of translations for place names is rather slim to begin with.

Despite now using the MaxMind Enterprise database, the coverage of translations still appears to be slim. Quoting https://support.maxmind.com/geoip-faq/specifications-and-implementation/what-languages-does-geoip2-support/:

GeoIP2 products can return localized place names in Brazilian Portuguese (pt-BR), English (en), French (fr), German (de), Japanese (ja), Russian (ru), Simplified Chinese (zh-CN), and Spanish (es).

The above said, I can't help but feel this isn't our issue to solve. Text in at least the main and property namespaces on Wikidata is made available under the CC0 license. We should encourage (or even help!) MaxMind to improve their coverage using Wikidata, via the queries that @dbarratt so helpfully included in the task description.

Prtksxna renamed this task from Retrieve translated / localized place and company names from Wikidata to Retrieve translated / localized place and countory names from Wikidata.Aug 8 2022, 4:00 PM
Prtksxna renamed this task from Retrieve translated / localized place and countory names from Wikidata to Retrieve translated / localized place and country names from Wikidata.
STran renamed this task from Retrieve translated / localized place and country names from Wikidata to Investigate: Retrieve translated / localized place and country names from Wikidata.Aug 15 2022, 6:29 PM
STran claimed this task.
STran moved this task from IP Info to Cards ready for development on the Anti-Harassment board.

It's been years since this ticket was written and I don't have context on why Wikidata was the preferred solution. I'll cover it as well as an alternative. This is a bit long, and not every step may be necessary.

tl;dr

If we don't care about updating the data, ignore everything except 4.1 and 4.3 and pick from one of those.

@Prtksxna:

  1. There's a proposal in the original ticket to convert data we have into Wikidata. How much do we value possibly updating or improving the data? Some of this already exists on Wikidata, but I can't guarantee it all does. Once the GeoName data exists in some form that interfaces with our own translation framework, we can improve translations ourselves.
  2. This ticket mentions both location and ASN data but from what I can tell, there isn't a localization concern for ASNs (by which I mean MaxMind doesn't localize them). Do you think we need to be concerned with making ASNs localized as well?
  3. How robust do you want translations to be? It seems unreasonable to expect everything to be exhaustively translated before it's ever used but should we be looking into ways to flag where we might need translations of countries/locations that come up?

My personal opinion is that:

  • The lowest cost solution is to just localize using whatever MaxMind returns. It's not perfect, and whatever MaxMind translates is what we get. We should consider doing this regardless of anything else on top, because IPInfo is technically an extension available to everyone, and not everyone has a Wikidata instance to call.
  • We can move forward with localizing via Wikidata if we think what's already there is good enough. The infrastructure exists for every component of this ask.
  • If we want to import additional data, I think that should be done in parallel while we build out the functionality in IPInfo.
  • It's possible, depending on how robust we want this and how much effort we want to put into it, to use whatever Wikidata has alongside an integration of MaxMind's own localization as a fallback. Subsequently importing the data into Wikidata would make the MaxMind fallback redundant, but I suspect this would be faster than working through an import and could be a viable stopgap.
  • I think the i18n solution is kind of janky, but it would avoid a call to an external server (Wikidata) and would therefore (theoretically) be more performant. If we want to move forward with it, we definitely need to think hard about ways to avoid unnecessary bloat before any kind of implementation. I think doing this "right" is very janky.

For reference, here's an example IP lookup (pulled from a MaxMind Enterprise db and edited down). When I need to use a specific example, unless otherwise stated, I'll be referring to this:

[
    {
        "Records": [
            {
                "Record": {
                    "city": {
                        "geoname_id": 3530597,
                        "names": {
                            "de": "Mexiko-Stadt",
                            "en": "Mexico City",
                            "es": "Ciudad de México",
                            "fr": "Mexico",
                            "ja": "メキシコシティ",
                            "pt-BR": "Cidade do México",
                            "ru": "Мехико",
                            "zh-CN": "墨西哥城"
                        }
                    },
                    "continent": {
                        "code": "NA",
                        "geoname_id": 6255149,
                        "names": {
                            "de": "Nordamerika",
                            "en": "North America",
                            "es": "Norteamérica",
                            "fr": "Amérique du Nord",
                            "ja": "北アメリカ",
                            "pt-BR": "América do Norte",
                            "ru": "Северная Америка",
                            "zh-CN": "北美洲"
                        }
                    },
                    "country": {
                        "geoname_id": 3996063,
                        "iso_code": "MX",
                        "names": {
                            "de": "Mexiko",
                            "en": "Mexico",
                            "es": "México",
                            "fr": "Mexique",
                            "ja": "メキシコ合衆国",
                            "pt-BR": "México",
                            "ru": "Мексика",
                            "zh-CN": "墨西哥"
                        }
                    },
                    "postal": {
                        "code": 00000
                    },
                    "registered_country": {
                        "geoname_id": 6252001,
                        "iso_code": "US",
                        "names": {
                            "de": "Vereinigte Staaten",
                            "en": "United States",
                            "es": "Estados Unidos",
                            "fr": "États Unis",
                            "ja": "アメリカ",
                            "pt-BR": "EUA",
                            "ru": "США",
                            "zh-CN": "美国"
                        }
                    },
                    "subdivisions": [
                        {
                            "geoname_id": 3527646,
                            "iso_code": "CMX",
                            "names": {
                                "en": "Mexico City",
                                "es": "Ciudad de México",
                                "zh-CN": "墨西哥城市"
                            }
                        }
                    ],
                    "traits": {
                        "autonomous_system_number": 20473,
                        "autonomous_system_organization": "AS-CHOOPA",
                        "connection_type": "Corporate",
                        "domain": "vultrusercontent.com",
                        "isp": "Choopa, LLC",
                        "organization": "Choopa, LLC",
                        "user_type": "hosting"
                    }
                }
            }
        ]
    }
]
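To make the structure concrete, here's a rough sketch of pulling the city GeoName ID and a localized city name out of a decoded record shaped like the one above (Python; function and variable names are illustrative):

# Extract the city GeoName ID and a localized city name from a decoded
# MaxMind record shaped like the example above, falling back to English
# when the requested language is missing.
def city_info(record, lang):
    city = record.get("city", {})
    names = city.get("names", {})
    return city.get("geoname_id"), names.get(lang, names.get("en"))

# e.g. city_info(record, "ja") -> (3530597, "メキシコシティ")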

The task can be broken down into a few steps:

1. Get the data

As of writing, there isn't a database we control that has the full list of GeoName IDs or ASNs. The data is available online: GeoNames data is readily available under a Creative Commons license, whereas we might need to put in some work to source ASNs from a provider whose licensing suits us.

For context, GeoNames is a crowdsourced database with "over 25,000,000 geographical names corresponding to over 11,800,000 unique features" [1]. GeoName IDs can be found at https://www.geonames.org/. They provide txt files, and we're probably interested in allCountries.zip and cities500.zip [2].

A quick search into ASNs suggests we need to ask registrars specifically for lists. Per IANA, "AS Numbers can be obtained from the registry in your region." [3] I did, however, find a list online (https://www.bgplookingglass.com/list-of-autonomous-system-numbers) that seems credible, but I'm not sure how up to date it is or what its licensing and terms of use are. Additionally, I'm not sure we even need to get ASN data from somewhere else unless we think MaxMind isn't updating ASNs quickly enough. ASNs don't seem to have the same localization concern we have with location data.

[1] https://en.wikipedia.org/wiki/GeoNames
[2] https://download.geonames.org/export/dump/readme.txt
[3] https://www.iana.org/assignments/as-numbers/as-numbers.xhtml
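As a rough illustration of what consuming the GeoNames dump would involve, here's a sketch that indexes entries by GeoName ID (Python; per the readme [2], the files are tab-separated with the ID in column 0, the primary name in column 1, and comma-separated alternate names in column 3; the file path is hypothetical):

import csv

# Index GeoNames dump entries (e.g. cities500.txt) by GeoName ID.
def load_geonames(path):
    index = {}
    with open(path, encoding="utf-8", newline="") as fh:
        for row in csv.reader(fh, delimiter="\t", quoting=csv.QUOTE_NONE):
            index[int(row[0])] = {
                "name": row[1],
                "alternate_names": row[3].split(",") if row[3] else [],
            }
    return index

geonames = load_geonames("cities500.txt")  # hypothetical local path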


2. Store the data somewhere

For privacy and uptime reasons, it makes sense to maintain control over our own store of this data. If we don't, we have to get access to a third-party API and call that. I'm going to assume we're not doing that and instead continue this investigation as if storing the data is a requirement.

Solution 2.1: Wikidata

This is what's proposed in the original ticket. I don't have a local cluster set up to test CirrusSearch, so I'm going to assume the curl results from Wikidata are similar to what we can expect from using the extension. From what I understand, CirrusSearch is a wrapper around querying Elasticsearch, so that we don't have to traverse up and down the stack in the extension [4]. Accessing Mexico City's JSON representation (https://www.wikidata.org/wiki/Special:EntityData/Q1489.json) gives me access to its localization data:

(abridged)

"labels": {
  "fr": {
    "language": "fr",
    "value": "Mexico"
  },
  "it": {
    "language": "it",
    "value": "Città del Messico"
  },
  "de": {
    "language": "de",
    "value": "Mexiko-Stadt"
  },

[4] https://www.mediawiki.org/wiki/Extension:CirrusSearch
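For comparison, here's a minimal sketch of that non-search path, assuming we already know the entity ID (as we do for Q1489): fetch Special:EntityData and read a label, falling back to English (Python; the function name is illustrative):

import requests

# Fetch an entity's labels from Special:EntityData and pick one for the
# requested language, falling back to English when it's missing.
def wikidata_label(entity_id, lang):
    resp = requests.get(
        f"https://www.wikidata.org/wiki/Special:EntityData/{entity_id}.json",
        timeout=10,
    )
    labels = resp.json()["entities"][entity_id]["labels"]
    pick = labels.get(lang) or labels.get("en")
    return pick["value"] if pick else None

print(wikidata_label("Q1489", "it"))  # "Città del Messico"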

Solution 2.2: i18n json files

Alternatively, the data can be stored in qqq.json and per-language files (en.json, etc.). For instance:

en.json

{
  "geoname-id-3527646": "Mexico City"
}

qqq.json

{
  "geoname-id-3527646": "Name of the city associated with the id in the key, Mexico City"
}

It can then be accessed through our msg utilities in both JS and PHP.


3. Importing new data

As of writing, Wikidata's entities aren't exhaustive. Mexico City exists, but a random city I pulled from AD.txt doesn't (https://www.wikidata.org/w/index.php?search=haswbstatement%3AP1566%3D3039177&ns0=1&ns120=1). Granted, I don't think we need to import all 11M+ entries, given that the vast majority of them aren't needed by MaxMind's dataset. We should consider 1. how much data to import and 2. how often we plan to update the data.

Programmatically importing 10 entries should (ideally) be no different from programmatically importing 100,000. Considering there are apparently 11M+ GeoName entities, we might consider importing data only on demand.

Pros: We don't bloat the data store we use. Translations can be targeted.
Cons: Translations won't happen until after the import, and therefore almost certainly after the data is first needed. Then again, one would not reasonably expect translations to have happened preemptively, so perhaps this is just something we'll have to work with.

Solution 3.1: Importing to Wikidata

https://www.wikidata.org/wiki/Wikidata:Data_Import_Guide
It exists, but it looks like it requires quite a bit of work. This makes sense if we have a targeted database of information to import in one go. It makes less sense if we want to import data as we need it.

Pros: The data will exist on Wikidata for use elsewhere, and there seems to be a formalized (if somewhat complicated) process and community support for doing so.
Cons: I have no idea how updating will work.
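For a sense of scale, here's a hedged sketch of what one programmatic import step might look like through the standard Wikibase wbeditentity module (Python; this needs an authenticated session and CSRF token, which are omitted here, and whether bot-driven creation is acceptable is governed by the import guide above):

import json
import requests

# Create a new Wikidata item with an English label and a GeoNames ID
# (P1566) statement via the Wikibase wbeditentity module. `session` is
# assumed to be an authenticated requests.Session and `csrf_token` a
# token fetched via action=query&meta=tokens; both are omitted here.
def create_place_item(session, csrf_token, label_en, geoname_id):
    data = {
        "labels": {"en": {"language": "en", "value": label_en}},
        "claims": [{
            "mainsnak": {
                "snaktype": "value",
                "property": "P1566",
                "datavalue": {"value": str(geoname_id), "type": "string"},
            },
            "type": "statement",
            "rank": "normal",
        }],
    }
    return session.post(
        "https://www.wikidata.org/w/api.php",
        data={
            "action": "wbeditentity",
            "new": "item",
            "data": json.dumps(data),
            "token": csrf_token,
            "format": "json",
        },
        timeout=10,
    ).json()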

Solution 3.2: Importing to an i18n file

In brief, we would probably need a script that does the following (a rough sketch follows the pros/cons below):

  • aggregate GeoName IDs and country/city names from the MaxMind db
  • pipe them to JSON files (at the very least en.json and qqq.json, but other language files could be created as well, given that MaxMind does contain some localization; that may run afoul of MaxMind's licensing, though if it's GeoNames data it may be okay)

Pros: Updates can be differential.
Cons: It's kind of a weird way to use the translation system, and the JSON files would have to bloat something (possibly WikimediaMessages? Adding all the lines to IPInfo's core en.json would make it unusable. It's a suboptimal dependency, and that alone might make this solution a non-starter.)
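A rough sketch of that script, under the assumption that we can iterate over decoded MaxMind records shaped like the example lookup earlier (Python; the paths and key format are illustrative, mirroring the en.json/qqq.json example in 2.2):

import json

# Aggregate GeoName IDs and English names from decoded MaxMind records
# and emit en.json / qqq.json message files.
def write_i18n_files(records, out_dir="i18n"):
    en, qqq = {}, {}
    for record in records:
        for part in ("city", "country"):
            place = record.get(part, {})
            gid, names = place.get("geoname_id"), place.get("names", {})
            if gid is None or "en" not in names:
                continue
            key = f"geoname-id-{gid}"
            en[key] = names["en"]
            qqq[key] = (
                f"Name of the {part} associated with the id in the key, "
                f"{names['en']}"
            )
    for name, messages in (("en", en), ("qqq", qqq)):
        with open(f"{out_dir}/{name}.json", "w", encoding="utf-8") as fh:
            json.dump(messages, fh, ensure_ascii=False, indent="\t")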

Solution 3.3: Just use whatever MaxMind has

MaxMind has some localization. If it's good enough, we do know the user's language at the time we're accessing the DB and can use that to determine what name to return.


4. Get the data from the data store via IPInfo

Regardless of which data store is used, IPInfo will most likely have to grab the data when the API call hits and pipe the translated message to the front-end. It seems like best practice would be to pass a key along and let the front-end JS deal with parsing it, but if I recall correctly, ResourceLoader asks you to specify which strings you expect to use, and we don't know that at the time the API call kicks off (well, we know that one of the possibly thousands of strings will be used, and it would be unreasonable to import all of them "just in case").

4.1 From Wikidata:
  1. Use CirrusSearch to query Wikidata, which should return something JSON-y with localized country/city names. From what I can tell, you have to search for an entity containing the GeoName ID.
  2. Get the user's language preference at the time of the API call
  3. Pass back the localized location

Something to note here is that this could make CirrusSearch a dependency. I'm not sure we should expect the hypothetical other users of IPInfo to want this dependency.

4.2 From an i18n file:
  1. Use the GeoName ID, which presumably matches up to a key, and let the msg utilities handle parsing it
  2. Pass back the localized location
4.3 Use MaxMind's localization:
  1. Get the user's language preference at the time of the API call
  2. Try to find it in the record (double-check that our language codes match up with MaxMind's. I'm not sure if something like zh-CN does)
  3. Pass back the localized location

In all cases, we should always default back to MaxMind's en string, since it's never guaranteed that any of the localized data will exist.
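A minimal sketch of that fallback for 4.3 (Python; the fallback chain is illustrative, and as noted in step 2 the codes in it would first have to be normalized to MaxMind's, e.g. mapping our zh variants onto zh-CN):

# Resolve a localized place name from a MaxMind "names" map by walking
# the user's language fallback chain, defaulting to the "en" string.
def localized_name(names, fallback_chain):
    for code in fallback_chain:
        if code in names:
            return names[code]
    return names.get("en")

# e.g. localized_name(record["country"]["names"], ["ja"]) -> "メキシコ合衆国"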

Thanks @STran!

There's a proposal in the original ticket to convert data we have into Wikidata. How much do we value possibly updating or improving the data? Some of this already exists on Wikidata, but I can't guarantee it all does. Once the GeoName data exists in some form that interfaces with our own translation framework, we can improve translations ourselves.

What do you mean by "interfaces with our own translation framework"? Is that something like 4.2, or something else?

This ticket mentions both location and ASN data but from what I can tell, there isn't a localization concern for ASNs (by which I mean MaxMind doesn't localize them). Do you think we need to be concerned with making ASNs localized as well?

I had forgotten that ASN was part of this ticket. Yeah, not sure if we want to do this right now. Don't we only show the ASN number anyway?
Also, what do you mean by there being no localization concern because MaxMind doesn't localize them? I don't follow.

How robust do you want translations to be? It seems unreasonable to expect everything to be exhaustively translated before it's ever used but should we be looking into ways to flag where we might need translations of countries/locations that come up?

We definitely don't need to have everything before we start. I'd say even very little with a way to improve (as you mentioned) is better than what we have right now.


The lowest cost solution is to just localize using whatever MaxMind returns. It's not perfect, and whatever MaxMind translates is what we get. We should consider doing this regardless of anything else on top, because IPInfo is technically an extension available to everyone, and not everyone has a Wikidata instance to call.

I agree. Let us do this to begin with. Would it be possible to use the language fallback chains here too and show the language closest to the user's preference?

What do you mean by "interfaces with our own translation framework"? Is that something like 4.2, or something else?

I mean this to refer generally to data we control and provide updates to: Wikidata, a JSON file, or something else.

I had forgotten that ASN was part of this ticket. Yeah, not sure if we want to do this right now. Don't we only show the ASN number anyway?

ASN stands for Autonomous System Number and the number belongs to an organization, which we also show.

Also, what do you mean by there being no localization concern because MaxMind doesn't localize them? I don't follow.

Ah, my mistake: I was referring to the ASN when I meant the organization. MaxMind doesn't localize organization names. "AS-CHOOPA" is always going to be "AS-CHOOPA."

Prtksxna changed the subtype of this task from "Task" to "Spike".Sep 24 2022, 5:30 AM

Thanks @STran!

Oh yeah, I didn't realize we were showing the organization name too. I don't think we need to be concerned about the ASN name right now.

So we've completed 4.3 in T316665: Use MaxMind translations in country and location data. I think there is also value in doing 4.1 because:

  • We'll at least get country names in many more languages than MaxMind data
  • We'll have a fallback when MaxMind (or a different service) doesn't have translations
  • We get better translations as Wikidata improves

However, when and whether we prioritize it would depend on how big this is. As a next step, could we spec this out (or make a task to spec it out) and keep that in our backlog? Do you think it is worth talking to folks in the Language team about the solution we're thinking of?

Prtksxna moved this task from Closed to In Progress on the IP Info board.

Thanks for taking a quick look at this ticket @santhosh! Could you please evaluate the different approaches mentioned in T266273#8184182 and see if the Wikidata approach would make sense for our use case?

Also, you mentioned another approach in our Slack messages. Could you please elaborate on it here?

(reopening the ticket while collecting feedback)

Hi @Prtksxna, we discussed this in the Language team.

I was mentioning CLDR as another place where place-name localizations exist. Among the options, we do not think localizing through our i18n system is appropriate here: the translatewiki.net workflow and localization messages are not intended for this use case. For this kind of data we have an extension named CLDR, which takes data from CLDR releases and integrates it into MediaWiki. We ask community members to join the CLDR data contribution site and do the localization there, so that we can later use it in MediaWiki. This is how localized language names are handled, for example.

For your use case, the CLDR data will be insufficient; it may be the smallest data set among the options you have already considered. And contributing to CLDR is not easy (access permissions, data vetting protocol, release delays). Wikidata would be the ideal choice here. Even then, you should be prepared with a reasonable fallback for when data is missing from Wikidata.