
Model monitoring
Open, Needs Triage · Public

Description

As an ML engineer,

I want to be able to publish metrics from model servers so that I can create dashboards and alerts based on them, giving me a clear overview of a model's performance and letting me observe changes in how the model is behaving (anomalous behavior, drift detection, etc.).
This covers both metrics related to the models themselves and system metrics. For example, in a binary classification model a metric could represent the predicted class or its probability.

We need to define a specific set of metrics so that all models/model servers follow the same conventions, e.g. 'predicted_class'.

Event Timeline

Some initial thoughts on this:
My suggestion is to use WMF's existing stack (Prometheus - Grafana - Alertmanager) for this kind of work. We could use the Prometheus Pushgateway and push a metric or set of metrics each time a prediction is made by a model server. These metrics can then be visualized in Grafana dashboards.
An example:

  • We have a binary classification model that predicts if a revision is an act of vandalism or not. This model is deployed as a model server on Lift Wing.
  • Each time a request is made and we run inference with this model, a metric named predicted_class is pushed (see the sketch after this list).
  • We use these metrics to create Grafana dashboard(s) that visualize the distribution of the predicted classes. This way, when we get a report like "the model seems to be performing strangely", users have something they can easily check to verify their initial hypothesis. The dashboard may also include additional panels related to that particular model; in this example it could show the total number of revisions, which would explain why we may be seeing many edits predicted as vandalism.
  • Finally, we can set an alert for this model based on the aforementioned metric. Such alerts should be chosen carefully to avoid an overwhelming number of false positives. An example here would be "fire an alert if the model has predicted only the same class during the last X minutes/hours".
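
As a rough illustration of the Pushgateway idea above, a per-prediction push could look like the sketch below. It assumes the prometheus_client Python library; the gateway address, job, metric and label names are placeholders rather than an agreed convention.

```
from prometheus_client import CollectorRegistry, Counter, push_to_gateway

PUSHGATEWAY = "prometheus-pushgateway.example:9091"  # placeholder address

registry = CollectorRegistry()
predictions_total = Counter(
    "model_predictions_total",
    "Number of predictions, labelled by model and predicted class",
    labelnames=["model", "predicted_class"],
    registry=registry,
)

def record_prediction(model: str, predicted_class: str) -> None:
    # Count this prediction and push the registry to the Pushgateway,
    # one push per inference request.
    predictions_total.labels(model=model, predicted_class=predicted_class).inc()
    push_to_gateway(PUSHGATEWAY, job=model, registry=registry)

# e.g. called from the model server's predict() path:
# record_prediction("revision-vandalism", "vandalism")
```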

It would be important to establish a convention for naming metrics so that we don't quickly end up with something unmanageable.

+1 I like the idea!

I'd avoid the Pushgateway if possible; we could try to use a simple Prometheus exporter for this job (maybe there is a way to expose metrics via KServe/FastAPI).
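
One possible shape for that exporter approach, assuming the model server is (or wraps) a FastAPI/ASGI app: mount the prometheus_client ASGI app at /metrics and let Prometheus scrape it, instead of pushing per request. The endpoint, metric and function names below are illustrative only.

```
from fastapi import FastAPI
from prometheus_client import Counter, make_asgi_app

app = FastAPI()

predictions_total = Counter(
    "model_predictions_total",
    "Number of predictions, labelled by predicted class",
    labelnames=["predicted_class"],
)

# Expose the default registry at /metrics for Prometheus to scrape.
app.mount("/metrics", make_asgi_app())

def run_inference(payload: dict) -> str:
    # Placeholder for the actual model call.
    return "not_vandalism"

@app.post("/v1/models/example:predict")
async def predict(payload: dict) -> dict:
    predicted_class = run_inference(payload)
    predictions_total.labels(predicted_class=predicted_class).inc()
    return {"predicted_class": predicted_class}
```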