
[Research Engineering Request] Produce regular snapshots of all Wikipedia article topics
Open, Medium, Public

Description

Goal

Produce a regular (likely monthly) snapshot of all Wikipedia articles and their predicted topics from the language-agnostic article-topic model. The snapshot should be available via HDFS (e.g., a Hive table), though it would also be nice to release it publicly as a dump.
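
For concreteness, a minimal sketch of what such a Hive table could look like; the table name, columns, and partitioning here are hypothetical illustrations, not a committed schema:

```
# Hypothetical output schema (PySpark); all names are illustrative only.
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

spark.sql("""
    CREATE TABLE IF NOT EXISTS research.article_topics (
        wiki_db STRING COMMENT 'Wiki database code, e.g. enwiki',
        page_id BIGINT COMMENT 'Article page ID',
        topic   STRING COMMENT 'Predicted topic from the taxonomy',
        score   FLOAT  COMMENT 'Model confidence for this article+topic pair'
    )
    PARTITIONED BY (snapshot STRING COMMENT 'YYYY-MM of the source snapshots')
    STORED AS PARQUET
""")
```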

Why

For large-scale analyses of content / editing trends, aggregating data by topic is a useful way to understand the underlying dynamics. We have a topic classification model available on LiftWing that can do this for any article on Wikipedia, but APIs aren't a great fit for handling millions of requests (as could easily be the case here). Making topic predictions for every article on Wikipedia is relatively simple, however, given access to the classification model and the link data already present on the HDFS cluster.

For example: T290042 and T351114

Engineering required

Likely an Airflow job that collects the input data (article links) for all articles, runs it through the model, and saves the results to the appropriate table or dump. The proposed monthly cadence is because that is how frequently the pagelinks/redirects tables are loaded into Hive, and they are core components of the pipeline for producing the model's input features (Wikidata snapshots are also required, but those happen at more frequent intervals). Some details are provided in T290042#7326209, but if we assume that the model is not going to be retrained as part of this process, then the relevant links are:

  • Generate links data for each article: https://github.com/geohci/wikipedia-language-agnostic-topic-classification/blob/master/outlinks/01a_build_outlinks_data_cluster.ipynb
    • Note: this is older, so it should probably be updated slightly -- e.g., using the canonical_data.wikis table to narrow down to just Wikipedia articles -- and the pagelinks table may have changed in format (see the first sketch after this list).
  • Bulk predict assuming a TSV of all the article links and trained model: https://github.com/geohci/wikipedia-language-agnostic-topic-classification/blob/master/utils/bulk_predict.py
    • Note: I can just provide the current model binary, but an ideal process would probably download it directly from where LiftWing stores the model binaries so that the snapshot is clearly tied to a model version there.
    • Note: this builds a dense output -- i.e., the scores for every article and topic -- but the actual predictions are much sparser, and in practice you can probably produce a table with only the article+topic pairs that exceed a threshold of 0.15. We recommend folks use a threshold of 0.5 for assigning topics, but the lower 0.15 threshold gives some leeway to adjust this while still removing the vast majority of irrelevant article+topic pairs (see the second sketch after this list).
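
A rough PySpark sketch of the input-data step above, with the suggested canonical_data.wikis filter. Table and column names follow the older notebook and warehouse conventions but are assumptions -- the pagelinks schema in particular may have changed:

```
# Rough sketch of the outlinks input step; names are assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()
snapshot = "2023-11"  # hypothetical monthly snapshot partition

outlinks = spark.sql(f"""
    SELECT pl.wiki_db,
           pl.pl_from AS page_id,
           concat_ws(' ', collect_list(pl.pl_title)) AS outlinks
    FROM wmf_raw.mediawiki_pagelinks pl
    JOIN canonical_data.wikis w
      ON (w.database_code = pl.wiki_db
          AND w.database_group = 'wikipedia')  -- narrow to Wikipedia wikis
    WHERE pl.snapshot = '{snapshot}'
      AND pl.pl_namespace = 0       -- article-namespace targets only
      AND pl.pl_from_namespace = 0  -- article-namespace sources only
    GROUP BY pl.wiki_db, pl.pl_from
""")

# The full notebook additionally resolves redirects and maps each outlink
# to its Wikidata QID, which are the model's actual input features.
outlinks.write.option("sep", "\t").csv("/tmp/article_outlinks")  # hypothetical path
```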
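And a rough sketch of the sparse bulk-prediction step, assuming a fastText model binary (as in the linked bulk_predict.py) and a per-article TSV along the lines of the output above; file paths and the exact input format are assumptions:

```
# Sparse bulk prediction: keep only article+topic pairs above the threshold.
import csv

import fasttext

THRESHOLD = 0.15  # leaves leeway below the recommended 0.5 assignment cutoff

model = fasttext.load_model("model.bin")  # hypothetical local copy of the binary

with open("article_outlinks.tsv") as fin, open("article_topics.tsv", "w") as fout:
    reader = csv.reader(fin, delimiter="\t")
    writer = csv.writer(fout, delimiter="\t")
    for wiki_db, page_id, outlinks in reader:
        # k=-1 requests all labels; the threshold keeps the output sparse
        labels, scores = model.predict(outlinks, k=-1, threshold=THRESHOLD)
        for label, score in zip(labels, scores):
            writer.writerow(
                [wiki_db, page_id, label.replace("__label__", ""), round(float(score), 3)]
            )
```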

In theory, you might also be able to use the LiftWing eventstream for the model to incrementally update a snapshot, as the Search platform does. In practice, however, that would also require tracking article deletions, moves, etc., so producing a monthly snapshot from scratch is probably the simplest path. A skeleton of what that job could look like is sketched below.
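
A minimal Airflow skeleton under those assumptions; the DAG id, schedule, task split, and callables are hypothetical placeholders, and a real implementation would follow the conventions of the production Airflow setup (e.g., waiting on the upstream pagelinks/redirects/Wikidata snapshots via sensors):

```
# Hypothetical monthly Airflow DAG; all names are illustrative placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def build_outlinks_input():
    ...  # hypothetical: the Spark job sketched above


def bulk_predict_topics():
    ...  # hypothetical: thresholded predictions with the model binary


def write_hive_snapshot():
    ...  # hypothetical: load results into the partitioned Hive table


with DAG(
    dag_id="article_topics_monthly_snapshot",
    schedule="@monthly",  # matches the cadence of the pagelinks/redirects imports
    start_date=datetime(2023, 12, 1),
    catchup=False,
) as dag:
    inputs = PythonOperator(task_id="build_outlinks_input", python_callable=build_outlinks_input)
    predict = PythonOperator(task_id="bulk_predict_topics", python_callable=bulk_predict_topics)
    publish = PythonOperator(task_id="write_hive_snapshot", python_callable=write_hive_snapshot)

    inputs >> predict >> publish
```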

Event Timeline

Isaac renamed this task from Produce regular snapshots of all Wikipedia article topics to [Research Engineering Request] Produce regular snapshots of all Wikipedia article topics. Nov 20 2023, 6:44 PM

We reviewed this task in the backlog grooming meeting on November 21st. Given the limited engineering capacity at this time and the prioritization discussions (with input from @fkaelin and @Miriam), we decided to prioritize T351674 instead. We will keep this task open, as it is possible that we can pick it up in the coming 6 months, and we will review it again in future backlog grooming meetings.