Now that Lift Wing is starting to look production-ready, we should think about defining SLIs and SLOs for the inference service.
The end result of this task should be:
- a brief document on Wikitech in which we describe the SLIs, SLOs, etc.
- a Grafana dashboard to monitor things like the error budget over time. Example: https://grafana.wikimedia.org/d/iyumW7LGz/etcd-slos. Worth noting: https://wikitech.wikimedia.org/wiki/Grafana#Grizzly (IIUC there is an SLO dashboard template that we can use).
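
For reference, a minimal sketch of the arithmetic that an error budget panel would show, assuming a request-based availability SLI (good requests / total requests). The function names, the 99.9% target and the request counts are made-up placeholders for illustration, not proposed values for Lift Wing.

```python
def availability_sli(total_requests: int, failed_requests: int) -> float:
    """Fraction of requests served successfully in the reporting window."""
    if total_requests == 0:
        return 1.0  # no traffic means nothing failed
    return 1.0 - failed_requests / total_requests


def error_budget_remaining(total_requests: int, failed_requests: int,
                           slo_target: float) -> float:
    """Fraction of the error budget still unspent (negative if overspent).

    The budget is the number of failures the SLO tolerates in the window:
    (1 - slo_target) * total_requests.
    """
    if total_requests == 0:
        return 1.0
    allowed_failures = (1.0 - slo_target) * total_requests
    if allowed_failures == 0:
        # A 100% target leaves no budget at all.
        return 1.0 if failed_requests == 0 else float("-inf")
    return 1.0 - failed_requests / allowed_failures


if __name__ == "__main__":
    # Hypothetical quarter: 1M requests, 250 failures, 99.9% target.
    total, failed, target = 1_000_000, 250, 0.999
    print(f"SLI: {availability_sli(total, failed):.5f}")
    print(f"Budget remaining: {error_budget_remaining(total, failed, target):.1%}")
```

With these placeholder numbers the SLO tolerates about 1000 failed requests, so 250 failures consume roughly a quarter of the budget. A latency SLI would follow the same pattern, counting requests slower than a threshold as "bad" instead of failed ones.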