Deployment of model updates
Open, Needs Triage, Public
Assigned To

None

Authored By

	fkaelin
	Mon, Jun 3, 7:47 PM

Description

Once models are in production in liftwing, what should the process be for updating a model? This assumes that no code changes to the inference service are required.

The current workflow is manual:

a researcher creates a model artifact, e.g. in a Jupyter notebook
a phab task (example for the revert risk model: T360423) is used to track the re-deployment of the model to liftwing, which is done by an ML engineer

Research is moving towards making model training reproducible:

there will be a pipeline that produces a model artifact
model retraining pipelines can be scheduled (e.g. to produce a new model artifact every month)
model training can also be triggered on demand, e.g. manually or when model drift is detected
as the number of model pipelines grows (e.g. currently in progress: revert risk, add-a-link, reference quality), the manual process might become tedious

How should we handle a new model artifact? Some options:

Continue the manual process. This is the current approach; the airflow job can be configured to send an email notification requesting a deploy.
Move the model version config to storage (e.g. in swift alongside the model artifacts) instead of code (helm chart). A pipeline would write a suitable artifact to that storage and update the config.
A full-blown model registry, either homegrown or library-based.
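
To make the second option (version config in storage) more concrete, here is a minimal sketch of what the pipeline side and the serving side could look like. Everything in it is an assumption for illustration: a local directory stands in for swift, and the names `model_versions.json`, `publish_model`, `resolve_model` and the `<model>/<version>/model.bin` layout are hypothetical, not an existing liftwing or pipeline convention.

```python
import hashlib
import json
import os
import tempfile
from datetime import datetime, timezone
from pathlib import Path

# Hypothetical name of the version config stored next to the artifacts.
CONFIG_NAME = "model_versions.json"


def publish_model(store: Path, model_name: str, artifact: bytes, version: str) -> dict:
    """Pipeline side: write a versioned artifact, then repoint the config to it."""
    artifact_path = store / model_name / version / "model.bin"
    artifact_path.parent.mkdir(parents=True, exist_ok=True)
    artifact_path.write_bytes(artifact)

    entry = {
        "version": version,
        "path": artifact_path.relative_to(store).as_posix(),
        "sha256": hashlib.sha256(artifact).hexdigest(),
        "published_at": datetime.now(timezone.utc).isoformat(),
    }

    config_path = store / CONFIG_NAME
    config = json.loads(config_path.read_text()) if config_path.exists() else {}
    config[model_name] = entry

    # Write to a temporary file and rename it, so a reader never sees a
    # half-written config. This holds for a local filesystem; an object
    # store would need its own way of swapping the config safely.
    fd, tmp_name = tempfile.mkstemp(dir=store, suffix=".tmp")
    with os.fdopen(fd, "w") as tmp_file:
        json.dump(config, tmp_file, indent=2)
    os.replace(tmp_name, config_path)
    return entry


def resolve_model(store: Path, model_name: str) -> Path:
    """Serving side: look up the current version and verify the artifact checksum."""
    config = json.loads((store / CONFIG_NAME).read_text())
    entry = config[model_name]
    artifact_path = store / entry["path"]
    actual = hashlib.sha256(artifact_path.read_bytes()).hexdigest()
    if actual != entry["sha256"]:
        raise ValueError(f"checksum mismatch for {model_name} {entry['version']}")
    return artifact_path
```

Because older versions stay in place under their own paths, rolling back in this sketch would only mean pointing the config entry at a previous version again. How and when the inference service picks up a changed config (on restart, by polling, or via a notification) is left open here.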