# Goal
The goal is to adapt the airflow-dag for training the add-a-link model so that it writes its output in a format and location that the deployed add-a-link model can use directly. The `publish-datasets.sh` step could either become part of the airflow-dag itself or be executed separately once the airflow-dag pipeline has generated all files correctly.
# Details
We have moved the training pipeline for the add-a-link model to an airflow-dag T361926. However, the output of the training pipeline is not ready for use in production. Specifically, the airflow-dag saves all output as tables in HDFS. In contrast, for the trained model to be used in production, the models must be available in a specific format (mostly pickled dictionaries that are [[ https://github.com/wikimedia/research-mwaddlink/blob/main/run-pipeline.sh#L62 | saved as sqlite files ]] and then copied into [[ https://github.com/wikimedia/research-mwaddlink/blob/main/run-pipeline.sh#L65 | MySQL tables ]] ). The sqlite files are then published with [[ https://github.com/wikimedia/research-mwaddlink/blob/main/publish-datasets.sh | this script ]] to a location from which they can be used in production.
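To illustrate the gap between the two formats: the conversion step the airflow-dag is missing is essentially "take a key–value dataset and write it as a sqlite file of pickled values". The following is a minimal, hypothetical sketch of that step using only the Python standard library; the function names, the table name `model`, and the `(key, value)` schema are illustrative assumptions, not the actual schema used by research-mwaddlink.

```python
import pickle
import sqlite3

def export_model_dict(model: dict, path: str, table: str = "model") -> None:
    """Write a {key: value} model dictionary to a sqlite file,
    storing each value as a pickled BLOB (illustrative schema)."""
    conn = sqlite3.connect(path)
    conn.execute(
        f"CREATE TABLE IF NOT EXISTS {table} (key TEXT PRIMARY KEY, value BLOB)"
    )
    conn.executemany(
        f"INSERT OR REPLACE INTO {table} (key, value) VALUES (?, ?)",
        ((k, pickle.dumps(v)) for k, v in model.items()),
    )
    conn.commit()
    conn.close()

def load_model_dict(path: str, table: str = "model") -> dict:
    """Read the sqlite file back into a dictionary, unpickling each value."""
    conn = sqlite3.connect(path)
    rows = conn.execute(f"SELECT key, value FROM {table}").fetchall()
    conn.close()
    return {k: pickle.loads(v) for k, v in rows}
```

In the airflow-dag this would sit at the end of the pipeline: read the HDFS output tables (e.g. via Spark) into dictionaries, call something like `export_model_dict` per language, and hand the resulting sqlite files to the publish step.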
# Motivation
This work would substantially improve maintenance of the training pipeline for the add-a-link model.
* There is currently no established process for updating the models for the add-a-link model T276438. In fact, it is still done manually for every language on one of the stat-hosts, which is very inefficient.
* The lack of such a process is causing additional work in current efforts to retrain the models T385780. (see, e.g., T385781 or T387556)
* The updated airflow-dag would substantially reduce the maintenance cost, not only for training hundreds of models in different languages but also for regularly updating/re-training the models (say every month or so).