Create another region mapping for geographic gap
Closed, Resolved (Public)

Description

It would be great to have articles mapped to the main grant regions:

  • Middle East and Africa
  • South Asia
  • East, Southeast Asia, and Pacific (ESEAP)
  • Latin America and The Caribbean
  • United States and Canada
  • Northern and Western Europe
  • Central and Eastern Europe (CEE) and Central Asia

The mappings are on Hive at ntsako.country_meta_data. We should map our country_code_iso_2 country codes to the regions in the wmf_region column.

Thanks!

Event Timeline

Interesting, I wasn't aware of this dataset on hive.

The knowledge gaps pipeline depends on https://github.com/lukes/ISO-3166-Countries-with-Regional-Codes to map country_name, country_code, continent, and sub_continent (code). It would be preferable to use a single dependency for mappings of geographic entities, and a WMF-based source would be even better. Can we use this table as the authoritative source? I.e., if there are changes to the countries or mappings, will it be updated?

Looking at the schema of the table, it would contain all the information needed and more, including the wmf_region that you are suggesting to add.

I didn't check at the country level, but there is a difference in how the "sub continents" are defined (the WMF version seems better). The wmf_region values seem to be derived from the sub continents:

{'Central & Eastern Europe & Central Asia',
 'East, Southeast Asia, & Pacific',
 'Latin America & Caribbean',
 'Middle East & North Africa',
 'North America',
 'Northern & Western Europe',
 'South Asia',
 'Sub-Saharan Africa',
 'UNCLASSED'}

At the moment the content gap metrics for geography are geographic_sub_continent, geographic_continent, geographic (which is the country based on a reverse lookup of lat/lon; we should rename it), and geographic_region (which is based on the cultural model using these properties; we should rename it). @Miriam, my concern is that too many geographic content gaps are detrimental. Given the request for using wmf_region, do you have a recommendation for which geography gaps we should compute?

My recommendation: geography_country based on the reverse geo lookup of the P625 property, geography_region for the WMF region derived from the country, and geographic_continent derived from the country. Remove the current geographic_region content gap metric (the features for the cultural model are still generated and stored, so we could use them in the future).
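A minimal sketch of the recommendation above, assuming the country code per article is already available from the reverse geo lookup. The rows in COUNTRY_META are illustrative placeholders rather than data copied from ntsako.country_meta_data, the column name "continent" is an assumption about the table schema, and falling back to 'UNCLASSED' for unknown codes is a design choice of this sketch, not confirmed pipeline behaviour.

```python
from typing import NamedTuple, Optional

# Placeholder rows standing in for ntsako.country_meta_data.
# Only country_code_iso_2 and wmf_region are column names taken from the task;
# "continent" and all row values are assumptions made for illustration.
COUNTRY_META = [
    {"country_code_iso_2": "IN", "wmf_region": "South Asia", "continent": "Asia"},
    {"country_code_iso_2": "BR", "wmf_region": "Latin America & Caribbean", "continent": "South America"},
]

# Index the rows by country code for constant-time lookups.
BY_CODE = {row["country_code_iso_2"]: row for row in COUNTRY_META}


class GeoGaps(NamedTuple):
    """The three geography categories proposed for an article."""
    geography_country: Optional[str]
    geography_region: str
    geographic_continent: Optional[str]


def derive_geo_gaps(country_code: Optional[str]) -> GeoGaps:
    """Derive the region and continent categories from a country code."""
    row = BY_CODE.get(country_code)
    if row is None:
        # Unknown or missing code: keep the country as-is, mark the region
        # as unclassified, and leave the continent empty.
        return GeoGaps(country_code, "UNCLASSED", None)
    return GeoGaps(country_code, row["wmf_region"], row["continent"])
```

With this shape, region and continent are pure functions of the country, so only the country needs to be computed per article.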
Hi Fabian, yes I totally agree with your recommendation. Thank you so much for this and sorry I missed this comment!

leila triaged this task as Medium priority. (Apr 4 2023, 7:53 PM)
leila moved this task from Backlog to In Progress on the Research board.

The code for this has been merged. A knowledge pipeline is running and results should be available early next week. The linked merge request has some plots for the new 'wmf regions' gap that look reasonable.

The pipeline now uses ntsako.country_meta_data for all geography entity mapping. We should discuss where this data should be stored officially and agree on a process for updating it when necessary.
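As one possible input to such an update process, the sketch below lists country codes seen by the pipeline that have no usable region in the mapping. It is only an illustration: find_unmapped is a hypothetical helper, the rows are assumed to have been read from the table into dictionaries, and treating 'UNCLASSED' the same as a missing row is an assumption.

```python
def find_unmapped(country_codes, meta_rows):
    """Return the sorted codes that are absent from meta_rows or marked UNCLASSED.

    country_codes: iterable of country_code_iso_2 values seen by the pipeline.
    meta_rows: iterable of dicts with country_code_iso_2 and wmf_region keys.
    """
    region_by_code = {
        row["country_code_iso_2"]: row["wmf_region"] for row in meta_rows
    }
    return sorted(
        code
        for code in set(country_codes)
        # A missing row is treated like an explicit UNCLASSED entry.
        if region_by_code.get(code, "UNCLASSED") == "UNCLASSED"
    )


# Illustrative rows only; not copied from ntsako.country_meta_data.
sample_meta = [
    {"country_code_iso_2": "IN", "wmf_region": "South Asia"},
    {"country_code_iso_2": "AQ", "wmf_region": "UNCLASSED"},
]
unmapped = find_unmapped(["IN", "AQ", "XX", "XX"], sample_meta)
```

Running a check like this after each pipeline run would surface new or changed country codes before they silently end up in the unclassified bucket.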