
[Research Engineering Request] Productionize article-country data dependencies
Open, Needs Triage, Public

Description

Goal

Airflow jobs to create two data dependencies for the article-country model (context: T371897).

Motivation

I am working on building an article-country model to be hosted on LiftWing. That model currently has two data dependencies that I have produced manually via a Jupyter notebook, but I would like help converting them into official Airflow jobs that we could run manually as needed to refresh the data. This would make updating the model once it's hosted on LiftWing far simpler. This isn't blocking anything at the moment, but ideally we would have it in place when we put the model on LiftWing (I'm aiming for late Q1 / early Q2) so we can refresh the dependencies then and keep them in a stable, easy-to-maintain place.

Details

The two artifacts I would need are:

  • List of categories that map to countries via Wikidata. The code for this can be found in this notebook. It's quite straightforward -- just a set of simple transformations over the Wikidata entity table and the resulting artifact is just a few MB.
  • Frequency at which each country is linked to across all Wikipedia articles. This one is a bit more complicated but I think still pretty straightforward. In practice it has three stages:
    • Compute which articles are associated with which countries based on Wikidata cultural properties (notebook) and Wikidata coordinates (notebook). This I think mostly duplicates the existing content gap metrics, so hopefully we can just reuse those results? The only difference that I know of is that I take a more computationally heavy but also slightly more accurate approach to assessing where an article's coordinates place it. I actually check which country the coordinates are located in, whereas the current content gap approach finds the nearest city, which could return incorrect predictions along country borders or for items that are in no country (e.g., Mariana Trench). I think it's probably okay to accept this source of error for the purposes of this data dependency.
    • Use the category->country mapping produced as the first artifact to bulk compute which articles are associated with which countries based on their categories: https://github.com/geohci/wiki-region-groundtruth/blob/main/notebooks/01c-categories.ipynb#Compute-bulk-predictions
    • Combine the data from these two tables with pagelinks to determine the rate at which each country is linked: https://github.com/geohci/wiki-region-groundtruth/blob/main/notebooks/analysis-wiki-region-tfidf.ipynb (note: I actually only used the Wikidata-based countries in this one but in reality it should use that and the category-based ones)
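The first artifact above (the category→country mapping) could be sketched roughly as below. The property IDs P31 (instance of) and P17 (country) are real Wikidata properties, but the entity records, field layout, and helper name are toy stand-ins for the actual Wikidata entity table, which the real Airflow job would read instead:

```python
# Toy sketch of the category->country artifact: pick out Wikidata items that
# are categories (P31 = Q4167836, "Wikimedia category") and carry a country
# (P17) claim. Entity rows here are simplified stand-ins for the entity table.

CATEGORY_CLASS = "Q4167836"  # Wikidata item for "Wikimedia category"

def categories_to_countries(entities):
    """Return {category title: country QID} for category items with a P17 claim."""
    mapping = {}
    for e in entities:
        is_category = CATEGORY_CLASS in e.get("P31", [])
        countries = e.get("P17", [])
        if is_category and countries:
            mapping[e["title"]] = countries[0]  # keep the first country claim
    return mapping

# Toy rows standing in for the Wikidata entity table.
entities = [
    {"title": "Category:Rivers of France", "P31": ["Q4167836"], "P17": ["Q142"]},
    {"title": "Category:Physics", "P31": ["Q4167836"], "P17": []},
    {"title": "Paris", "P31": ["Q515"], "P17": ["Q142"]},
]
print(categories_to_countries(entities))  # {'Category:Rivers of France': 'Q142'}
```

The real transformation runs over the full entity dump, which is why the resulting artifact is only a few MB despite the size of the input.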
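The coordinate check described in the first stage (testing which country a point actually falls in, rather than snapping to the nearest city) amounts to a point-in-polygon test against country boundaries. A minimal sketch, assuming simplified polygon boundary data (a real job would use proper country boundary geometries, e.g. via a geospatial library):

```python
# Ray-casting point-in-polygon test: is (lat, lon) inside the polygon?
# The square below is a toy polygon standing in for a country boundary.

def point_in_polygon(lat, lon, polygon):
    """polygon is a list of (lat, lon) vertices; returns True if the point is inside."""
    inside = False
    n = len(polygon)
    for i in range(n):
        y1, x1 = polygon[i]
        y2, x2 = polygon[(i + 1) % n]
        # Does a horizontal ray from the point cross this edge?
        if (y1 > lat) != (y2 > lat):
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside

square = [(0, 0), (0, 10), (10, 10), (10, 0)]
print(point_in_polygon(5, 5, square))   # True: inside the boundary
print(point_in_polygon(15, 5, square))  # False: in no polygon, so no country
```

A point that falls inside no polygon simply gets no country, which matches the Mariana Trench example above, whereas a nearest-city lookup would always return some answer.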
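The final stage above (combining the article→country tables with pagelinks to get per-country link rates) could look roughly like this. All names and data here are hypothetical toy inputs; the real job would join the Wikidata- and category-based tables with the pagelinks table:

```python
# Toy sketch of the link-rate stage: for each link target that has associated
# countries, tally those countries, then normalize by the total link count.

from collections import Counter

def country_link_rates(article_countries, pagelinks):
    """Fraction of links whose target is associated with each country.

    article_countries: {article title: [country QIDs]}
    pagelinks: list of (source, target) link pairs
    """
    counts = Counter()
    for _, target in pagelinks:
        for country in article_countries.get(target, []):
            counts[country] += 1
    total = len(pagelinks)
    return {country: n / total for country, n in counts.items()}

article_countries = {"Paris": ["Q142"], "Berlin": ["Q183"]}
pagelinks = [("A", "Paris"), ("B", "Paris"), ("C", "Berlin"), ("D", "Physics")]
print(country_link_rates(article_countries, pagelinks))
# {'Q142': 0.5, 'Q183': 0.25}
```

As noted above, `article_countries` should be the union of the Wikidata-based and category-based tables rather than the Wikidata-based one alone.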

Event Timeline

Just a note from the research engineering perspective: this task is part of the larger reproducible data pipeline discussion.

@Isaac is this task still valid? Does T361637 have a dependency on this?

is this task still valid?

Yes - the article-country model is on LiftWing, but the two dependencies listed in this task are static and have no official way of updating beyond re-running my Jupyter notebooks. Additionally, T387041: Generate Airflow DAG for creating article-country SQLite DB lists a third dependency that is also still valid. I don't know if this would fall under REng or ML at this point though.

Does T361637 have a dependency on this?

Depends on the priority put on long-term maintenance. The model is working and updating these dependencies is not urgent, but eventually there will be a need to refresh (as has happened with add-a-link) and at that point, these will be necessary. Personally I think it's good to establish these as Airflow jobs now so the code/knowledge is not lost, and because the model is being used, it has clearly moved out of an experimental state.