Page MenuHomePhabricator

Provide recommendations for Regional data
Closed, ResolvedPublic

Description

GDI team is putting together the country labels table with regional breakdown. It is currently stored in ntsako.country_meta_data which we are working to prepare for production.
Product Analytics will be providing recommendations for table structure, standard naming conventions, etc. This will inform the work in T313270

Event Timeline

kzimmerman triaged this task as Medium priority.Oct 11 2022, 5:11 PM
kzimmerman moved this task from Triage to Current Quarter on the Product-Analytics board.

Here are my recommendations after reviewing the data in ntsako.country_meta_data

Redundant fields
Consider removing the following fields if they aren’t of particular importance to any specific team or group of users

  • unesco_label: 35 missing countries
  • hdi_label: 59 missing countries
  • world_bank_label:
    • 34 missing countries
    • modified version of iban_label
  • imf_label
    • 53 missing countries
    • modified version of iban_label
  • insights_country_label
    • 5 missing countries
    • Same as either maxmind_label or country_area_label
  • official_un_area_label
    • 3 missing countries
    • Same as either maxmind_label or country_area_label

Missing country
Please consider populating the missing values

  • Channel Islands (not in either of the tables[ntsako.country_meta_data and canonical_data.countries]; pls remove from Sheet)
  • Heard Island and McDonald Islands (from canonical_data.countries, missing in ntsako.country_meta_data)
  • iban_label : 11 missing countries
  • grant_label: 125 missing countries
  • affiliate_label: 172 missing countries
  • m49_country_code: 3 missing countries
  • nine_ninety_region_class: 6 missing countries
  • There is one row where 5 out of 27 columns are Unclassified, and the rest are NULL [SELECT * from ntsako.country_meta_data WHERE country_area_label is NULL]

Rename fields

  • country_area_label + country_area_label → country_name [1]
  • un_continent_description → continent_name
  • un_subcontinent_description → subcontinent_name
  • maxmind_label → maxmind_country_name
  • global_north_or_south → global_market

Other suggestions

  • [1] canonical_country_label and country_area_label
    • There are some Incorrect values for countries in both columns. eg. Christmas Island and Cocos (Keeling) Islands have incorrect canonical_country_label. Suggest using a combination of canonical_country_label and country_area_label to create the new country_name column
  • Please add column descriptions and proper meanings for all the columns. Avoid expanding the column name and adding that as the description. For eg., nine_ninety_region_class : The 990 region classification of the country; does not help anyone understand the actual meaning of this column.

I am keeping this task open so we can re-work any of the suggestions.

Thank you @Mayakp.wiki for your careful review and input. Having reviewed and integrated most of the comments I recommend we move forward with productionizing our table minus added labels columns for potential agency country labels lookups and only post the binary indicator columns of whether the country should be found in those listings to reduce maintenance burden until there are other users identified.

Updating and defaulting to the canonical 'countries' table labels for country names here the suggested final columns labels and descriptions I now proposed include:

  • imf_country Binary indicator as to whether the country is a standard location among International Monetary Fund nations
  • pwt_country Binary indicator as to whether the country is a standard location among Penn World Table nations
  • un_country Binary indicator as to whether the country is a standard location among United Nations nations
  • world_bank_country Binary indicator as to whether the country is a standard location among World Bank nations
  • continent_name United Nations continent classification with the addition of Heard Island and McDonald Islands, Kosovo, and Taiwan
  • subcontinent_name United Nations subcontinent classification with the addition of Heard Island and McDonald Islands, Kosovo, and Taiwan
  • nine_ninety_region The regional classification for the annual 990 financial report which is publicly shared by the Foundation
  • grant_committee_region The regional classification of countries to regional Grantmaking committees
  • grants_country_label The country label assigned to this country in the Fluxx database
  • emerging_classification The 2017 Community Resources country classification of developed, emerging, and least developed Wikimedia communities.
  • wmf_market_research 2020 Foundation Market Research classifications
  • global_market Global market classification as Global North or Global South
  • wmf_region The current Wikimedia Foundation regional classification
  • country_code_iso_2 ISO 3166 2-alpha country code (aligned to country code of canonical 'countries' table)
  • country_code_iso_3 ISO 3166 3-alpha country code
  • country_code_m49 United Nations m49 numeric location code
  • canonical_country_name The country label aligned to canonical 'countries' table

Dropping the remaining columns to the GDI background only for now including:

  • HDI Labels
  • IBAN label
  • IMF Label
  • Maxmind label
  • Official UN Area Label
  • UNESCO Label
  • Affiliate Labels
  • Insights geo_country label
  • World Bank Labels

Please let me know if you see any remaining issues with labels or descriptions shared.

@ntsako, @JAnstee_WMF: here's an offhand thought, in case it's not too late.

In a lot of tables, we include "is" in field names to indicate that it's a binary field. For example, we just added is_protected to canonical_data.countries to indicated whether the country is on the country protection list. Similarly, mediawiki_history is full of these fields (page_is_redirect, user_is_anonymous, etc.).

I'd suggest doing the same here, so you'd have is_imf_country, is_pwt_country, is_un_country, and is_world_bank_country.

Resolving per discussion with @nshahquinn-wmf and @JAnstee_WMF
future requests will go to Movement-Insights board