
Deployment of model updates
Open, Needs Triage, Public

Description

Once models are in production in liftwing, what should the process be for updating a model? This assumes that no code changes to the inference service are required.

The current workflow is manual:

  • a researcher creates a model artifact, e.g. in a Jupyter notebook
  • a Phabricator task (example for the revert risk model: T360423) tracks the re-deployment of the model to liftwing, which is done by an ML engineer

Research is moving towards making model training reproducible:

  • there will be a pipeline that produces a model artifact
  • model retraining pipelines can be scheduled (e.g. produce a new model artifact every month)
  • model training can also be triggered on demand, e.g. when model drift is detected, or manually
  • as the number of model pipelines grows (e.g. currently in progress: revert risk, add-a-link, reference quality), the overhead of a manual process might become tedious
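
To make the three triggers above concrete, here is a minimal sketch of how a pipeline could decide whether to retrain. Everything in it is an assumption for illustration: the function name, the 30-day interval standing in for "every month", and the drift score and threshold are hypothetical and do not come from an existing pipeline.

```python
from datetime import datetime, timedelta
from typing import Optional


def retrain_reason(
    last_trained: datetime,
    now: datetime,
    interval: timedelta = timedelta(days=30),  # assumption: "monthly" ~ 30 days
    drift_score: Optional[float] = None,       # hypothetical drift metric
    drift_threshold: float = 0.2,              # hypothetical threshold
    manual_request: bool = False,
) -> Optional[str]:
    """Return why a retraining run should start, or None if it should not."""
    if manual_request:
        return "manual"
    if drift_score is not None and drift_score > drift_threshold:
        return "drift"
    if now - last_trained >= interval:
        return "schedule"
    return None


if __name__ == "__main__":
    last = datetime(2024, 4, 1)
    print(retrain_reason(last, datetime(2024, 4, 10)))
    print(retrain_reason(last, datetime(2024, 4, 10), drift_score=0.35))
    print(retrain_reason(last, datetime(2024, 5, 2)))
```

Whatever the trigger, the output is the same: a new model artifact that then has to reach liftwing, which is the question below.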

How should we handle a new model artifact? Some options:

  • Continue the manual process. This is the current approach; the Airflow job can be configured to send an email notification requesting a deploy.
  • Move the model version config to storage (e.g. in Swift alongside the model artifacts) instead of code (Helm chart). A pipeline would write a suitable artifact to that storage and update the config.
  • A full-blown model registry, either homegrown or library-based.
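
As a rough illustration of the second option, the sketch below has a pipeline write a versioned artifact and then update a small version config, which the inference service reads to find the current artifact. It is only a sketch under stated assumptions: a local directory stands in for Swift (no Swift client is used), and the file name `model_versions.json`, the directory layout, and the function names are all hypothetical.

```python
import hashlib
import json
import os
import tempfile
from pathlib import Path

CONFIG_NAME = "model_versions.json"  # hypothetical config object name


def publish_model(store: Path, model_name: str, version: str, artifact: bytes) -> dict:
    """Pipeline side: write the artifact first, then point the config at it."""
    artifact_path = store / model_name / version / "model.bin"
    artifact_path.parent.mkdir(parents=True, exist_ok=True)
    artifact_path.write_bytes(artifact)

    config_path = store / CONFIG_NAME
    config = json.loads(config_path.read_text()) if config_path.exists() else {}
    config[model_name] = {
        "version": version,
        "path": artifact_path.relative_to(store).as_posix(),
        "sha256": hashlib.sha256(artifact).hexdigest(),
    }

    # Write to a temp file and rename, so a reader never sees a partial config.
    # This is a local-filesystem detail; an object store needs its own equivalent.
    fd, tmp_name = tempfile.mkstemp(dir=store, suffix=".tmp")
    with os.fdopen(fd, "w") as tmp:
        json.dump(config, tmp, indent=2, sort_keys=True)
    os.replace(tmp_name, config_path)
    return config[model_name]


def resolve_model(store: Path, model_name: str) -> Path:
    """Service side: look up the current artifact and verify its checksum."""
    entry = json.loads((store / CONFIG_NAME).read_text())[model_name]
    artifact_path = store / entry["path"]
    digest = hashlib.sha256(artifact_path.read_bytes()).hexdigest()
    if digest != entry["sha256"]:
        raise ValueError(f"checksum mismatch for {model_name} {entry['version']}")
    return artifact_path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp_dir:
        store = Path(tmp_dir)
        publish_model(store, "revertrisk", "2024-04", b"old weights")
        publish_model(store, "revertrisk", "2024-05", b"new weights")
        print(resolve_model(store, "revertrisk").relative_to(store))
```

Because older artifacts stay in place, rolling back in this sketch means pointing the config entry at a previous version again. How and when the inference service reloads the config (at startup, by polling, or on a redeploy) is left open here.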