
Add Link engineering: Script for updating datasets used by the mwaddlink service
Closed, Resolved (Public)

Description

mwaddlink preprocesses standard wiki data, such as the links tables, into datasets used by the recommender service. Preprocessing happens on the stats servers, but the live service should only communicate with the production servers, so the datasets need to be moved; see T266826: Add Link engineering: Pipeline for moving MySQL database(s) from stats1008 to production MySQL server for more detail. Similarly, the datasets need to be copied when setting up the service in a local development environment. The datasets are published on the web, so fetching them is easy. Accessing the production database is harder, but MediaWiki maintenance scripts can already do it, so a maintenance script is the easiest approach.

We need a maintenance script that:

  • is able to detect when the datasets have changed (probably by means of the hashes published along with the full files)
  • downloads the datasets from https://analytics.wikimedia.org/published/datasets/one-off/research-mwaddlink/ (these are large files, so it should perhaps support resuming downloads)
  • loads the datasets into the specified cluster/DB, presumably with some kind of locking mechanism as the service can't operate while the data is half-loaded
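
The change-detection and resume steps above could look roughly like the following sketch. It assumes the published hashes are SHA-256 digests in `sha256sum`-style lines (`<digest>  <filename>`) and that the web server honours HTTP `Range` requests; neither is confirmed in this task, and all function names are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large datasets never have to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def parse_checksum_line(line: str) -> str:
    """Extract the hex digest from a '<digest>  <filename>' line (assumed format)."""
    return line.split()[0].lower()


def dataset_changed(local_file: Path, published_checksum_line: str) -> bool:
    """Return True when the local copy is missing or differs from the published hash."""
    if not local_file.exists():
        return True
    return sha256_of_file(local_file) != parse_checksum_line(published_checksum_line)


def resume_headers(partial_file: Path) -> dict:
    """Build an HTTP Range header that continues after the bytes already on disk."""
    offset = partial_file.stat().st_size if partial_file.exists() else 0
    return {"Range": f"bytes={offset}-"} if offset else {}
```

A downloader would pass `resume_headers(...)` to its HTTP client and append to the partial file only if the server answers `206 Partial Content`; on a plain `200` it should start over. Re-checking the hash after the download completes guards against a corrupted resume.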

Event Timeline

This is soft-blocked on T266826: Add Link engineering: Pipeline for moving MySQL database(s) from stats1008 to production MySQL server (publishing the datasets) and on determining / setting up the target database.

In addition, while this would mostly involve importing tables, the link model is a JSON file whose contents need to be written with an INSERT into the lr_model table. There is some discussion of that here.
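
A minimal sketch of that INSERT step, for illustration only. The column names (`lr_wiki`, `lr_model`) and the function name are hypothetical, since the real schema is not given in this task, and `sqlite3` stands in for the production MySQL connection. The delete and insert run in one transaction so a reader never sees a half-replaced model.

```python
import json
import sqlite3

# Hypothetical schema; the real lr_model table layout is not specified in this task.
SCHEMA = "CREATE TABLE IF NOT EXISTS lr_model (lr_wiki TEXT PRIMARY KEY, lr_model TEXT)"


def store_link_model(conn: sqlite3.Connection, wiki_id: str, model_json: str) -> None:
    """Replace the stored link model for one wiki inside a single transaction."""
    json.loads(model_json)  # raises ValueError if the file is not valid JSON
    with conn:  # commits on success, rolls back if any statement fails
        conn.execute("DELETE FROM lr_model WHERE lr_wiki = ?", (wiki_id,))
        conn.execute(
            "INSERT INTO lr_model (lr_wiki, lr_model) VALUES (?, ?)",
            (wiki_id, model_json),
        )
```

Validating the JSON before opening the transaction means a malformed model file leaves the previous row untouched.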

Change 660334 had a related patch set uploaded (by Kosta Harlan; owner: Kosta Harlan):
[research/mwaddlink@main] load-datasets: Allow download/import of all published datasets

https://gerrit.wikimedia.org/r/660334

kostajh renamed this task from Add Link engineering: Maintenance script for updating datasets used by the mwaddlink service to Add Link engineering: Script for updating datasets used by the mwaddlink service. Feb 11 2021, 1:48 PM
kostajh claimed this task.

Change 660334 merged by jenkins-bot:
[research/mwaddlink@main] load-datasets: Allow download/import of all published datasets

https://gerrit.wikimedia.org/r/660334

kostajh moved this task from Code Review to QA on the Growth-Team (Current Sprint) board.

The script is merged and is working in production \o/ I'm resolving this.