mwaddlink preprocesses standard wiki data, such as the links tables, into the datasets used by the recommender service. Preprocessing happens on the stats servers, but the live service should only communicate with production servers, so the datasets need to be moved; see T266826: Add Link engineering: Pipeline for moving MySQL database(s) from stats1008 to production MySQL server for more detail. Similarly, the datasets need to be copied when setting up the service in a local development environment. The datasets are published via the web, so fetching them is easy. Accessing the production database is harder, but MediaWiki maintenance scripts can already do that, so a maintenance script is the easiest approach.
We need a maintenance script that:
- can detect when the datasets have changed (probably via the hashes published alongside the full files; see the first sketch after this list)
- downloads the datasets from https://analytics.wikimedia.org/published/datasets/one-off/research-mwaddlink/ (these are large files, so the script should probably support resuming interrupted downloads)
- loads the datasets into the specified cluster/DB, presumably with some kind of locking mechanism, since the service can't operate while the data is half-loaded (see the second sketch below)
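
A minimal sketch of the change-detection and download steps, written in Python for illustration (the actual maintenance script would be PHP in MediaWiki). It assumes the `requests` library is available and that a checksum file named `<dataset>.sha256` containing a hex digest is published next to each dataset; the checksum file name and format are assumptions about the publishing pipeline, not confirmed details:

```python
import hashlib
import os
import requests

BASE_URL = "https://analytics.wikimedia.org/published/datasets/one-off/research-mwaddlink/"

def published_checksum(filename: str) -> str:
    """Fetch the checksum published alongside the dataset (assumed to be
    a <filename>.sha256 file whose first token is a hex digest)."""
    resp = requests.get(BASE_URL + filename + ".sha256", timeout=30)
    resp.raise_for_status()
    return resp.text.split()[0]

def local_checksum(path: str) -> str:
    """Hash the local copy so we can tell whether the published dataset
    has changed since the last import."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def download(filename: str, dest: str) -> None:
    """Download a dataset, resuming a partial file via an HTTP Range
    request if one is already on disk."""
    headers = {}
    mode = "wb"
    if os.path.exists(dest):
        headers["Range"] = f"bytes={os.path.getsize(dest)}-"
        mode = "ab"
    with requests.get(BASE_URL + filename, headers=headers, stream=True, timeout=30) as resp:
        if resp.status_code == 416:
            # Requested range not satisfiable: the local file is already complete.
            return
        resp.raise_for_status()
        if resp.status_code != 206:
            # Server ignored the Range header, so start the file over.
            mode = "wb"
        with open(dest, mode) as fh:
            for chunk in resp.iter_content(chunk_size=1 << 20):
                fh.write(chunk)

def refresh(filename: str, dest: str) -> bool:
    """Download the dataset only if the published checksum differs from
    the local copy; returns True when a new file was fetched."""
    remote = published_checksum(filename)
    if os.path.exists(dest) and local_checksum(dest) == remote:
        return False
    download(filename, dest)
    if local_checksum(dest) != remote:
        raise RuntimeError(f"checksum mismatch for {filename}")
    return True
```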
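
And a minimal sketch of the load step, again in Python (with PyMySQL) rather than PHP. The table and column names are hypothetical, and the staging-table-plus-atomic-rename approach backed by a MySQL advisory lock is one possible way to satisfy the locking requirement, not a decided design:

```python
import pymysql

def load_dataset(conn, rows):
    """rows: iterable of (anchor, target, score) tuples parsed from the dataset."""
    with conn.cursor() as cur:
        # Advisory lock so a second import run waits (or bails out)
        # instead of interleaving its writes with ours.
        cur.execute("SELECT GET_LOCK('mwaddlink_import', 600)")
        if cur.fetchone()[0] != 1:
            raise RuntimeError("another import is already running")
        try:
            # Rebuild the data in a staging table first...
            cur.execute("DROP TABLE IF EXISTS lr_anchors_staging")
            cur.execute("CREATE TABLE lr_anchors_staging LIKE lr_anchors")
            cur.executemany(
                "INSERT INTO lr_anchors_staging (anchor, target, score) "
                "VALUES (%s, %s, %s)",
                rows,
            )
            conn.commit()
            # ...then swap it in with a single atomic RENAME, so the
            # service never queries a half-loaded table.
            cur.execute(
                "RENAME TABLE lr_anchors TO lr_anchors_old, "
                "lr_anchors_staging TO lr_anchors"
            )
            cur.execute("DROP TABLE lr_anchors_old")
        finally:
            cur.execute("SELECT RELEASE_LOCK('mwaddlink_import')")

if __name__ == "__main__":
    # Hypothetical connection parameters; the real script would take the
    # target cluster/DB from its configuration.
    conn = pymysql.connect(host="127.0.0.1", user="mwaddlink",
                           password="secret", database="mwaddlink")
    load_dataset(conn, [("anchor text", "Target_page", 0.9)])
```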