
Add prometheus metrics for KFServer-based model HTTP servers
Closed, Resolved (Public)

Description

The current way to publish ORES metrics seems to be the following:

  • Metrics are pushed to a statsd endpoint (which effectively supports Graphite only)
  • On every ORES node there is a local statsd endpoint provided by a Prometheus exporter, which in turn offers an HTTP API for collecting metrics (used by the Prometheus master nodes)
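
For reference, the push side of this pipeline amounts to sending small UDP datagrams in the statsd line format. A minimal sketch in Python, assuming a local statsd-compatible listener; the metric names, host, and port 8125 are placeholders, not the actual ORES configuration:

```python
import socket


def format_statsd_counter(name: str, value: int = 1) -> str:
    """Build a statsd counter line, e.g. 'ores.requests:1|c'."""
    return f"{name}:{value}|c"


def format_statsd_timing(name: str, millis: float) -> str:
    """Build a statsd timing line; the value is in milliseconds."""
    return f"{name}:{millis:g}|ms"


def send_statsd(line: str, host: str = "127.0.0.1", port: int = 8125) -> None:
    """Fire-and-forget UDP send; the client does not wait for a reply."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("ascii"), (host, port))


# Hypothetical usage (metric name is illustrative only):
# send_statsd(format_statsd_counter("ores.scores_request"))
```

The exporter on each node would receive lines like these and re-expose them over HTTP for Prometheus to scrape.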

With the transition to KFServer-based models, we don't really have any metrics. We should figure out whether it is possible to add Prometheus metrics, or something similar, so that we can replicate dashboards like https://grafana.wikimedia.org/d/HIRrxQ6mk/ores?orgId=1&refresh=1m
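
To make the target concrete: Prometheus scrapes a plain-text page where each metric family has `# HELP` and `# TYPE` lines followed by samples. A rough sketch of rendering one counter in that format, using only the standard library; the metric name, labels, and values are hypothetical, and this is not how KFServer itself exposes metrics:

```python
def render_counter(name: str, help_text: str, samples: dict) -> str:
    """Render one counter family in the Prometheus text exposition format.

    `samples` maps a tuple of (label_name, label_value) pairs to the
    counter value. Label values are assumed not to need escaping.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in samples.items():
        label_str = ",".join(f'{k}="{v}"' for k, v in labels)
        # Omit the braces entirely when there are no labels.
        selector = f"{{{label_str}}}" if label_str else ""
        lines.append(f"{name}{selector} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical example: request counts per model and HTTP status.
page = render_counter(
    "model_requests_total",
    "Requests served.",
    {(("model", "enwiki-goodfaith"), ("status", "200")): 3},
)
```

In practice a client library would handle this (plus escaping and other metric types); the sketch only shows what the scraped output looks like.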

Event Timeline

I believe our sandbox clusters use the prometheus operator: https://github.com/kserve/kserve/tree/master/docs/samples/metrics-and-monitoring

[...] access Prometheus metrics that are automatically generated by Knative's queue-proxy container for your KFServing models.

We may not even need to drill down into the HTTP servers: most of what we need to recreate the ORES dashboard should be available from Knative and/or pod-level metrics (CPU, memory, etc.).

elukey claimed this task.

Ah, interesting! The Knative queue-proxy container publishes a lot of metrics, so it should be a matter of adding support for Knative in our deployment settings. I'll work on it in T289841, so there's no need to keep another task open. Thanks!