Once models are in production in liftwing, what should the process be for updating a model? This assumes that no code changes to the inference service are required.
The current workflow is manual:
- a researcher creates a model artifact, e.g. in a Jupyter notebook
- a Phabricator task (example for the revert risk model: T360423) tracks the re-deployment of the model to liftwing, which is done by an ML engineer
Research is moving towards making model training reproducible:
- there will be a pipeline that produces a model artifact
- model retraining pipelines can be scheduled (e.g. to produce a new model artifact every month)
- model training can be triggered on demand, e.g. when model drift is detected, or manually
- as the number of model pipelines grows (e.g. currently in progress: revert risk, add-a-link, reference quality), the overhead of a manual process might become tedious
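
The three ways a retraining run can start (schedule, drift, manual request) can be sketched as a single decision function. This is only an illustration: the function name, the 30-day default, the drift score and its threshold are all hypothetical and not part of any existing pipeline.

```python
from datetime import datetime, timedelta
from typing import Optional


def retrain_reason(
    last_trained: datetime,
    now: datetime,
    max_age: timedelta = timedelta(days=30),  # assumed monthly schedule
    drift_score: Optional[float] = None,      # hypothetical drift metric
    drift_threshold: float = 0.2,             # hypothetical threshold
    manual_request: bool = False,
) -> Optional[str]:
    """Return why a retraining run should start, or None if it should not."""
    if manual_request:
        return "manual"
    if drift_score is not None and drift_score > drift_threshold:
        return "drift"
    if now - last_trained >= max_age:
        return "schedule"
    return None
```

Returning the reason rather than a boolean lets the pipeline record why each artifact was produced, which may help when deciding whether it should be deployed.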
How should we handle a new model artifact? Some options:
- Continue the manual process. This is the current approach; the Airflow job can be configured to send an email notification requesting a deploy.
- Move the model version config to storage (e.g. in Swift alongside the model artifacts) instead of code (Helm chart). A pipeline would write a suitable artifact to that storage and update the config.
- A full-blown model registry, either homegrown or library-based.
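
To make the second option more concrete, here is a minimal sketch of a pipeline publishing an artifact and then updating a version config stored next to it. It uses a local directory as a stand-in for the object store; the layout (`<model>/<version>/model.bin`, `<model>/current.json`), the function names and the checksum field are assumptions for illustration, not an existing liftwing or Swift convention.

```python
import hashlib
import json
import os
import tempfile
from pathlib import Path


def publish_model(store: Path, model: str, version: str, artifact: bytes) -> dict:
    """Write a new artifact, then point the model's config at it."""
    artifact_path = store / model / version / "model.bin"
    artifact_path.parent.mkdir(parents=True, exist_ok=True)
    artifact_path.write_bytes(artifact)

    config = {
        "model": model,
        "version": version,
        "path": artifact_path.relative_to(store).as_posix(),
        "sha256": hashlib.sha256(artifact).hexdigest(),
    }
    # Write the config last and swap it in with an atomic rename, so a
    # reader never sees a config that points at a half-written artifact.
    # (This is a local-filesystem technique; an object store would need
    # its own equivalent.)
    config_path = store / model / "current.json"
    fd, tmp_name = tempfile.mkstemp(dir=config_path.parent, suffix=".tmp")
    with os.fdopen(fd, "w") as tmp:
        json.dump(config, tmp)
    os.replace(tmp_name, config_path)
    return config


def load_current(store: Path, model: str) -> bytes:
    """What an inference service might do at startup: resolve and verify."""
    config = json.loads((store / model / "current.json").read_text())
    artifact = (store / config["path"]).read_bytes()
    if hashlib.sha256(artifact).hexdigest() != config["sha256"]:
        raise ValueError(f"checksum mismatch for {model} {config['version']}")
    return artifact
```

Because older versions stay in place under their own prefix, rolling back in this sketch would only require rewriting `current.json`. What it leaves open is how a running inference service notices the change (restart, polling, or an explicit trigger).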