
Get Kartotherian SLO metrics into Prometheus
Open, Medium, Public

Description

Kartotherian is currently monitored by statsd/Graphite; the SLO monitoring infrastructure is pointed at Prometheus. That means that out of the box, we can't create a Kartotherian SLO dashboard. There are a few ways we could resolve that:

  • We could put Envoy in front of Kartotherian, and use its telemetry for errors and latency instead of Kartotherian's.
  • We could update the SLO dashboard template to read from Graphite as well as from Prometheus.
  • We could use the Prometheus pushgateway to slurp those metrics over from statsd, although it isn't a great fit.

Presently, we're leaning toward Envoy -- we trust its reporting a little more, especially around timeouts, and it would also get us the other usual traffic-management benefits. If that turns out to be more complex than expected, we'll look into one of the other options.
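Whichever source ends up feeding the dashboard, the error and latency figures mentioned above reduce to simple ratios over request counters. The following is a minimal sketch only: the status-class keys, the bucket layout (cumulative counts keyed by upper bound, in the style of a Prometheus histogram) and both function names are assumptions made for illustration, not the actual SLO template.

```python
import math


def availability_sli(requests_by_class):
    """Fraction of requests that did not end in a 5xx response.

    `requests_by_class` maps a status class such as "2xx" or "5xx"
    to a request count for the window (hypothetical input shape).
    """
    total = sum(requests_by_class.values())
    if total == 0:
        return None  # no traffic in the window, so the ratio is undefined
    errors = requests_by_class.get("5xx", 0)
    return (total - errors) / total


def latency_sli(cumulative_buckets, threshold_seconds):
    """Fraction of requests served within `threshold_seconds`.

    `cumulative_buckets` maps an upper bound in seconds to the number of
    requests at or below that bound; math.inf holds the total.
    The threshold must be one of the bucket bounds.
    """
    total = cumulative_buckets[math.inf]
    if total == 0:
        return None
    return cumulative_buckets[threshold_seconds] / total
```

For example, 5 server errors out of 1000 requests gives an availability of 0.995, and 900 of 1000 requests under a 0.5 s bound gives a latency ratio of 0.9.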

The best option is for us to use the pending native Prometheus support within Kartotherian itself.

Event Timeline

RLazarus triaged this task as Medium priority. Oct 13 2022, 5:32 PM
RLazarus created this task.

With my Observability/Prometheus hat on: to bridge the statsd/Prometheus gap we've been deploying profile::prometheus::statsd_exporter, e.g. in swift and thumbor. Assuming the statsd metrics already contain what you are after, defining a profile::prometheus::statsd_exporter::mappings is relatively simple (there are examples in puppet and I'm happy to assist). As an added bonus, you can reuse the mapping when/if karto moves to k8s! HTH. This is a variation on using pushgateway (which I agree isn't a great fit).
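To illustrate what such a mapping does conceptually, here is a rough Python sketch of translating a dotted statsd name into a Prometheus-style metric name plus labels. It only mimics the general glob idea (each `*` captures one dot-separated component, `$N` refers to the N-th capture); the rule contents, the statsd metric names and the function names are all hypothetical, and this is neither the real exporter code nor the actual puppet configuration.

```python
import re

# Hypothetical rule: the statsd name layout and label names are assumptions.
MAPPINGS = [
    {
        "match": "kartotherian.req.*.*",
        "name": "kartotherian_requests",
        "labels": {"endpoint": "$1", "status": "$2"},
    },
]


def glob_to_regex(glob):
    """Turn a dotted glob into a regex; each '*' captures one component."""
    parts = [r"([^.]+)" if part == "*" else re.escape(part)
             for part in glob.split(".")]
    return re.compile(r"\.".join(parts))


def map_metric(statsd_name, mappings=MAPPINGS):
    """Return (metric_name, labels) for the first matching rule, else None."""
    for rule in mappings:
        match = glob_to_regex(rule["match"]).fullmatch(statsd_name)
        if match is None:
            continue
        labels = {
            key: re.sub(r"\$(\d+)",
                        lambda ref: match.group(int(ref.group(1))),
                        value)
            for key, value in rule["labels"].items()
        }
        return rule["name"], labels
    return None
```

With the rule above, `map_metric("kartotherian.req.tile.200")` yields the name `kartotherian_requests` with labels `endpoint="tile"` and `status="200"`, while a name that matches no rule returns `None`. Moving variable path components into labels like this is also why existing Graphite queries cannot be reused as-is.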

I'd be curious to hear @Jgiannelos's input on this one. If we don't want to bother rewriting Kartotherian to speak to Prometheus directly via the service_runner module upgrade, then I think using the statsd_exporter makes sense, as we can reuse the mapping logic as mentioned, which would be a nice time saver. Putting Kartotherian behind the service proxy fits with our overall model for services, but it will involve more substantial changes to the path through which users access maps, which could lead to more service outages or at least increase the time required to get us the metrics. I'm pretty easy with either approach.

In general I feel like updating the template to read from Graphite is a bit of a regression.

The effort required to configure service runner to migrate from statsd to Prometheus is not that much (it's abstracted, so it's a matter of configuration). That said, this involves more effort on rebuilding the existing Grafana charts, because the metrics/queries are going to be different.
On top of that, I am not very confident about the quality of the metrics instrumentation Kartotherian has, so I believe using a different source of metrics (e.g. Envoy) might be a good option at this point.

Overall, if we want to move Kartotherian to Prometheus metrics, it is an option that doesn't need much time investment at the codebase level.

I hadn't considered how we get traffic to Kartotherian. For the most part we just directly rewrite requests for maps.wikimedia.org to kartotherian.discovery.wmnet in Trafficserver. Given that we're not making cross-service requests via MediaWiki or similar, I don't know if we can easily integrate the services proxy into this path, so I think that decides the use of the statsd exporter for us for the short term. We will have to rebuild our existing dashboards/graphs either way, so we'll need to be careful.

Change 844494 had a related patch set uploaded (by Awight; author: Awight):

[mediawiki/services/kartotherian@master] Update deprecated metrics usages; speak prometheus

https://gerrit.wikimedia.org/r/844494

Change 852880 had a related patch set uploaded (by Awight; author: Awight):

[mediawiki/services/kartotherian@mapnik-3.1] Update deprecated metrics usages; speak prometheus

https://gerrit.wikimedia.org/r/852880

Change 844494 merged by jenkins-bot:

[mediawiki/services/kartotherian@master] Update deprecated metrics usages; speak prometheus

https://gerrit.wikimedia.org/r/844494

Change 852880 merged by jenkins-bot:

[mediawiki/services/kartotherian@mapnik-3.1] Update deprecated metrics usages; speak prometheus

https://gerrit.wikimedia.org/r/852880