
[Research Engineering Request] Productionize article-country data dependencies
Open, Needs Triage, Public

Description

Goal

Airflow jobs to create two data dependencies for the article-country model (context: T371897).

Motivation

I am working on building an article-country model to be hosted on LiftWing. That model currently has two data dependencies that I have produced manually via a Jupyter notebook, but I would like help converting them into official Airflow jobs that we could run manually as needed to refresh the data. This would make updating the model once it's hosted on LiftWing far simpler. This isn't blocking anything at the moment, but ideally we would have it in place when we put the model on LiftWing (I'm aiming for late Q1 / early Q2) so we can refresh the dependencies then and keep them in a stable, easy-to-maintain place.

Details

The two artifacts I would need are:

  • List of categories that map to countries via Wikidata. The code for this can be found in this notebook. It's quite straightforward -- just a set of simple transformations over the Wikidata entity table and the resulting artifact is just a few MB.
  • Frequency at which each country is linked to across all Wikipedia articles. This one is a bit more complicated but I think still pretty straightforward. In practice it has three stages:
    • Compute which articles are associated with which countries based on Wikidata cultural properties (notebook) and Wikidata coordinates (notebook). This I think mostly duplicates the existing content gap metrics, so hopefully we can just reuse those results? The only difference that I know of is that I take a more computationally heavy but also slightly more accurate approach to assessing where an article's coordinates place it. I actually check which country the coordinates are located in, whereas the current content gap approach finds the nearest city, which could return incorrect predictions along country borders or for items that are in no country (e.g., Mariana Trench). I think it's probably okay to accept this source of error for the purposes of this data dependency.
    • Use the category->country mapping produced as the first artifact to bulk compute which articles are associated with which countries based on their categories: https://github.com/geohci/wiki-region-groundtruth/blob/main/notebooks/01c-categories.ipynb#Compute-bulk-predictions
    • Combine the data from these two tables with pagelinks to determine the rate at which each country is linked: https://github.com/geohci/wiki-region-groundtruth/blob/main/notebooks/analysis-wiki-region-tfidf.ipynb (note: I actually only used the Wikidata-based countries in this one but in reality it should use that and the category-based ones)
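The first artifact above (the category→country mapping) could be sketched roughly as below. The property IDs P31 (instance of) and P17 (country) are real Wikidata properties, but the entity records, field layout, and helper name are toy stand-ins for the actual Wikidata entity table, which the real Airflow job would read instead:

```python
# Toy sketch of the category->country artifact: pick out Wikidata items that
# are categories (P31 = Q4167836, "Wikimedia category") and carry a country
# (P17) claim. Entity rows here are simplified stand-ins for the entity table.

CATEGORY_CLASS = "Q4167836"  # Wikidata item for "Wikimedia category"

def categories_to_countries(entities):
    """Return {category title: country QID} for category items with a P17 claim."""
    mapping = {}
    for e in entities:
        is_category = CATEGORY_CLASS in e.get("P31", [])
        countries = e.get("P17", [])
        if is_category and countries:
            mapping[e["title"]] = countries[0]  # keep the first country claim
    return mapping

# Toy rows standing in for the Wikidata entity table.
entities = [
    {"title": "Category:Rivers of France", "P31": ["Q4167836"], "P17": ["Q142"]},
    {"title": "Category:Physics", "P31": ["Q4167836"], "P17": []},
    {"title": "Paris", "P31": ["Q515"], "P17": ["Q142"]},
]
print(categories_to_countries(entities))  # {'Category:Rivers of France': 'Q142'}
```

The real transformation runs over the full entity dump, which is why the resulting artifact is only a few MB despite the size of the input.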
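The coordinate check described in the first stage (testing which country a point actually falls in, rather than snapping to the nearest city) amounts to a point-in-polygon test against country boundaries. A minimal sketch, assuming simplified polygon boundary data (a real job would use proper country boundary geometries, e.g. via a geospatial library):

```python
# Ray-casting point-in-polygon test: is (lat, lon) inside the polygon?
# The square below is a toy polygon standing in for a country boundary.

def point_in_polygon(lat, lon, polygon):
    """polygon is a list of (lat, lon) vertices; returns True if the point is inside."""
    inside = False
    n = len(polygon)
    for i in range(n):
        y1, x1 = polygon[i]
        y2, x2 = polygon[(i + 1) % n]
        # Does a horizontal ray from the point cross this edge?
        if (y1 > lat) != (y2 > lat):
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside

square = [(0, 0), (0, 10), (10, 10), (10, 0)]
print(point_in_polygon(5, 5, square))   # True: inside the boundary
print(point_in_polygon(15, 5, square))  # False: in no polygon, so no country
```

A point that falls inside no polygon simply gets no country, which matches the Mariana Trench example above, whereas a nearest-city lookup would always return some answer.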
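The final stage above (combining the article→country tables with pagelinks to get per-country link rates) could look roughly like this. All names and data here are hypothetical toy inputs; the real job would join the Wikidata- and category-based tables with the pagelinks table:

```python
# Toy sketch of the link-rate stage: for each link target that has associated
# countries, tally those countries, then normalize by the total link count.

from collections import Counter

def country_link_rates(article_countries, pagelinks):
    """Fraction of links whose target is associated with each country.

    article_countries: {article title: [country QIDs]}
    pagelinks: list of (source, target) link pairs
    """
    counts = Counter()
    for _, target in pagelinks:
        for country in article_countries.get(target, []):
            counts[country] += 1
    total = len(pagelinks)
    return {country: n / total for country, n in counts.items()}

article_countries = {"Paris": ["Q142"], "Berlin": ["Q183"]}
pagelinks = [("A", "Paris"), ("B", "Paris"), ("C", "Berlin"), ("D", "Physics")]
print(country_link_rates(article_countries, pagelinks))
# {'Q142': 0.5, 'Q183': 0.25}
```

As noted above, `article_countries` should be the union of the Wikidata-based and category-based tables rather than the Wikidata-based one alone.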

Event Timeline

Just a note from the research engineering perspective: this task is part of the larger reproducible data pipeline discussion.

@Isaac is this task still valid? Does T361637 have a dependency on this?

is this task still valid?

Yes - the article-country model is on LiftWing, but the two dependencies listed in this task are static and have no official way of updating beyond re-running my Jupyter notebooks. Additionally, T387041: Generate Airflow DAG for creating article-country SQLite DB lists a third dependency that is also still valid. I don't know if this would fall under REng or ML at this point though.

Does T361637 have a dependency on this?

Depends on the priority put on long-term maintenance. The model is working and updating these dependencies is not urgent, but eventually there will be a need to refresh (as has happened with add-a-link) and at that point, these will be necessary. Personally I think it's good to establish these as Airflow jobs now so the code/knowledge is not lost, and because the model is being used, it has clearly moved out of an experimental state.