
[Research Engineering Request] Productionize article-country data dependencies
Open, Needs Triage, Public

Description

Goal

Airflow jobs to create two data dependencies for the article-country model (context: T371897).

Motivation

I am working on building an article-country model to be hosted on LiftWing. That model currently has two data dependencies that I have produced manually via a Jupyter notebook, and I would like help converting them into official Airflow jobs that we could run manually as needed to refresh the data. This would make updating the model once it's hosted on LiftWing far simpler. This isn't blocking anything at the moment, but ideally we would have it in place when we put the model on LiftWing (I'm aiming for late Q1 / early Q2) so we can refresh the dependencies then and keep them in a stable, easy-to-maintain place.

Details

The two artifacts I would need are:

  • List of categories that map to countries via Wikidata. The code for this can be found in this notebook. It's quite straightforward -- a set of simple transformations over the Wikidata entity table -- and the resulting artifact is only a few MB (see the first sketch after this list).
  • Frequency at which each country is linked to across all Wikipedia articles. This one is a bit more complicated but I think still pretty straightforward. In practice it has three stages:
    • Compute which articles are associated with which countries based on Wikidata cultural properties (notebook) and Wikidata coordinates (notebook). This mostly duplicates the existing content-gap metrics, so hopefully we can just reuse those results? The only difference I know of is that I take a more computationally heavy but slightly more accurate approach to assessing where an article's coordinates place it: I check which country the coordinates actually fall within (see the point-in-country sketch below), whereas the current content-gap approach finds the nearest city, which can return incorrect predictions along country borders or for items that are in no country (e.g., the Mariana Trench). I think it's probably okay to accept this source of error for the purposes of this data dependency.
    • Use the category->country mapping produced as the first artifact to bulk-compute which articles are associated with which countries based on their categories (see the category-join sketch below): https://github.com/geohci/wiki-region-groundtruth/blob/main/notebooks/01c-categories.ipynb#Compute-bulk-predictions
    • Combine the data from these two tables with pagelinks to determine the rate at which each country is linked (see the aggregation sketch below): https://github.com/geohci/wiki-region-groundtruth/blob/main/notebooks/analysis-wiki-region-tfidf.ipynb (note: I only used the Wikidata-based countries in this one, but in reality it should use both those and the category-based ones)
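
For concreteness, here is a rough PySpark sketch of the first artifact (the category->country mapping). It assumes the wmf.wikidata_entity Hive table and uses P17 (country) as the linking property; the actual notebook may use a different property list, and the field names here are from memory, so treat it as illustrative rather than a drop-in implementation:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

SNAPSHOT = "2024-09-02"  # hypothetical; the Airflow job would parameterize this

entities = (
    spark.read.table("wmf.wikidata_entity")
    .where(F.col("snapshot") == SNAPSHOT)
)

# Keep items that correspond to category pages (any sitelink whose title
# starts with the "Category:" prefix).
category_items = (
    entities
    .withColumn("sitelink", F.explode("siteLinks"))
    .where(F.col("sitelink.title").startswith("Category:"))
    .select(
        "id",
        "claims",
        F.col("sitelink.site").alias("wiki"),
        F.col("sitelink.title").alias("category_title"),
    )
)

# Pull the country (P17) statements off each category item. The dataValue
# for wikibase-entityid snaks is a JSON string, hence get_json_object.
category_countries = (
    category_items
    .withColumn("claim", F.explode("claims"))
    .where(F.col("claim.mainSnak.property") == "P17")
    .select(
        "wiki",
        "category_title",
        F.get_json_object("claim.mainSnak.dataValue.value", "$.id").alias("country_qid"),
    )
    .distinct()
)

category_countries.write.mode("overwrite").parquet("/tmp/category_to_country")  # hypothetical output path
```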
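
And a minimal sketch of the point-in-country check mentioned in the first stage, assuming country polygons from something like Natural Earth are available (the file path and column name are hypothetical). For bulk use you would vectorize this with a geopandas spatial join rather than calling it per row:

```python
from typing import Optional

import geopandas as gpd
from shapely.geometry import Point

# Hypothetical local copy of a country-boundary dataset (e.g., Natural Earth).
countries = gpd.read_file("natural_earth_countries.geojson")

def country_for_point(lat: float, lon: float) -> Optional[str]:
    """Return the country containing (lat, lon), or None if the point is
    in no country (e.g., the Mariana Trench)."""
    point = Point(lon, lat)  # shapely expects (x=lon, y=lat)
    matches = countries[countries.geometry.contains(point)]
    if matches.empty:
        return None
    return matches.iloc[0]["ISO_A2"]  # column name depends on the dataset
```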
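
The bulk category-based predictions (second stage) then amount to a join of each wiki's categorylinks against that mapping. A sketch continuing from the first one, assuming the replicated wmf_raw.mediawiki_categorylinks table; table and column names may need adjusting:

```python
from pyspark.sql import functions as F

cat_country = spark.read.parquet("/tmp/category_to_country")

categorylinks = (
    spark.read.table("wmf_raw.mediawiki_categorylinks")
    .where(F.col("wiki_db") == "enwiki")  # hypothetical: run per wiki
    .select(F.col("cl_from").alias("page_id"), F.col("cl_to").alias("category"))
)

# MediaWiki's cl_to stores the title without the "Category:" prefix and
# with underscores instead of spaces, so normalize the mapping to match.
cat_country_norm = (
    cat_country
    .where(F.col("wiki") == "enwiki")
    .select(
        F.regexp_replace(
            F.regexp_replace("category_title", "^Category:", ""), " ", "_"
        ).alias("category"),
        "country_qid",
    )
)

article_countries_by_category = (
    categorylinks
    .join(cat_country_norm, on="category")
    .select("page_id", "country_qid")
    .distinct()
)
```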
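
Finally, the link-rate computation (third stage) is a join-and-aggregate over pagelinks. A sketch continuing from the previous ones, assuming link targets have already been resolved to page IDs (the pagelinks schema varies across snapshots, so the column names are hypothetical) and that article_countries_by_wikidata stands in for the stage-one output:

```python
from pyspark.sql import functions as F

# Union of the Wikidata-based predictions (stage one, assumed here) and the
# category-based predictions from the previous sketch, each with
# (page_id, country_qid) columns.
article_countries = article_countries_by_wikidata.unionByName(
    article_countries_by_category
).distinct()

# Hypothetical pagelinks table with targets already resolved to page IDs.
pagelinks = (
    spark.read.table("wmf_raw.mediawiki_pagelinks")
    .where(F.col("wiki_db") == "enwiki")
    .select(
        F.col("pl_from").alias("source_page_id"),
        F.col("pl_target_page_id").alias("target_page_id"),
    )
)

links_to_countries = pagelinks.join(
    article_countries.withColumnRenamed("page_id", "target_page_id"),
    on="target_page_id",
)

total_links = links_to_countries.count()

country_link_rates = (
    links_to_countries
    .groupBy("country_qid")
    .agg(F.count("*").alias("n_links"))
    .withColumn("link_rate", F.col("n_links") / F.lit(total_links))
)
```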