Standardize usage of geographic entities for knowledge gaps
Closed, Resolved (Public)

Description

The table gdi.country_meta_data is the source of information regarding geographical metadata. In particular, it maps countries to the "wmf_region", which is often used in reporting.

However, Research worked with geographical models before that table was created, and uses a different set of regions, defined here. The differences between these two datasets are:

Regions as defined by Research, without a match in gdi.country_meta_data:

image.png (571×833 px, 76 KB)

Regions as defined by gdi.country_meta_data, without a match in the Research base regions:

image.png (627×835 px, 121 KB)
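A comparison like the two lists above could be reproduced with a sketch along the following lines. This is only an illustration: it assumes both datasets can be reduced to plain lists of region names and that matching on a case-insensitive name is good enough, which may not reflect how the original comparison was done. The function name and the sample names are hypothetical and are not the actual contents of either dataset.

```python
def unmatched(research_regions, gdi_countries):
    """Return (only_in_research, only_in_gdi) as sorted lists of names.

    Names are compared case-insensitively after stripping whitespace
    (an assumption; the real datasets may need a more careful key).
    """
    research = {name.strip().casefold(): name for name in research_regions}
    gdi = {name.strip().casefold(): name for name in gdi_countries}
    only_research = sorted(research[k] for k in research.keys() - gdi.keys())
    only_gdi = sorted(gdi[k] for k in gdi.keys() - research.keys())
    return only_research, only_gdi


# Made-up sample names for illustration only.
research_sample = ["France", "Japan", "Example Territory"]
gdi_sample = ["france", "Japan ", "Another Example"]

only_research, only_gdi = unmatched(research_sample, gdi_sample)
print("Only in research base regions:", only_research)
print("Only in gdi.country_meta_data:", only_gdi)
```

Matching on names is fragile (spelling variants, renamed countries), which is one reason a stable identifier is preferable as the join key.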

The purpose of this task is to track the consolidation / alignment of the base regions definition and the gdi country metadata.

Motivating use case: the calculation of intersections between e.g. the gender and geography gaps (T336766).

The geography gap (available at the country and wmf_region level) uses a geospatial model, which reverse geocodes the lat/lon coordinates from the P625 Wikidata property. However, the overlap between articles with lat/lon coordinates and articles about humans is almost zero, since people are not generally associated with coordinates. Instead, there is a "cultural" geography model that makes use of properties associated with countries; its output is currently mapped to a named geographic entity using this mapping file. The issue is that the base regions (which are mostly countries) currently can't be mapped to the "source of truth" for geographical data at the WMF (gdi.country_meta_data), and in particular to the "wmf_region" values that are commonly used in the reports.

Details

Due Date
Mar 29 2024, 6:00 AM

Event Timeline

KHernandez-WMF triaged this task as Medium priority.
KHernandez-WMF set Due Date to Mar 29 2024, 6:00 AM.
fkaelin changed the task status from Open to In Progress. Jan 23 2024, 2:29 PM

This work is done with this MR, which migrated the KG pipeline to the canonical_data.countries table. That table now includes the Wikidata QID of each country, which made it possible to replace the base region mapping file. For the cultural gap in particular, the "re-mapping" of some territories that are not in the canonical countries table was retained to expand the coverage of the gap.
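The lookup described in the comment above might look roughly like the sketch below. It is not the code from the MR: the function name, the dictionary layout, the "wmf_region" field on the canonical rows, and all QIDs and region labels are hypothetical placeholders chosen for illustration.

```python
def region_for_qid(qid, canonical_countries, territory_remap):
    """Resolve a Wikidata QID to a region label, or None if unknown.

    canonical_countries: dict keyed by country QID (assumed layout).
    territory_remap: dict from a territory QID to the QID of a
        canonical country, used only when the QID is not canonical.
    """
    if qid not in canonical_countries:
        # Fall back to the retained re-mapping for territories.
        qid = territory_remap.get(qid, qid)
    row = canonical_countries.get(qid)
    return row["wmf_region"] if row else None


# Placeholder data; these are not real QIDs or real wmf_region values.
canonical_countries = {
    "Q_COUNTRY_A": {"name": "Country A", "wmf_region": "REGION_1"},
    "Q_COUNTRY_B": {"name": "Country B", "wmf_region": "REGION_2"},
}
territory_remap = {"Q_TERRITORY_X": "Q_COUNTRY_A"}

direct = region_for_qid("Q_COUNTRY_B", canonical_countries, territory_remap)
remapped = region_for_qid("Q_TERRITORY_X", canonical_countries, territory_remap)
missing = region_for_qid("Q_UNKNOWN", canonical_countries, territory_remap)
print(direct, remapped, missing)
```

Keeping the re-mapping as a separate fallback means the canonical table stays the single source for countries, while territories outside it still contribute to the cultural gap.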