
Keep calculating latencies for MediaWiki requests that happen on k8s
Open, Medium, Public

Description

Currently, we extract latency data from the apache logs using mtail. We want to keep doing so on k8s, but this might be challenging given how we will have to manage such logs (see the parent task).

What we have now is a latency histogram (a rough sketch of its shape follows the list below) divided by:

  • cluster
  • status code
  • request handler (might be superfluous)
  • request method
  • endpoint
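
As a reference for whichever option we pick, here is a minimal sketch of that metric shape, written with prometheus_client rather than the actual mtail program; the metric name, label names, buckets and the simplified log format are assumptions for illustration only.

```
# Sketch only: the latency histogram with the dimensions listed above,
# expressed with prometheus_client. Names, buckets and log format are
# assumptions, not the real mtail program.
import re
import sys

from prometheus_client import Histogram, start_http_server

MW_REQUEST_DURATION = Histogram(
    'mediawiki_http_request_duration_seconds',   # hypothetical metric name
    'MediaWiki request latency extracted from apache access logs',
    ['cluster', 'code', 'handler', 'method', 'endpoint'],
    buckets=(0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10),
)

# Hypothetical, simplified access-log line: "GET /w/api.php 200 php7 0.123"
LOG_RE = re.compile(
    r'(?P<method>\S+) (?P<endpoint>\S+) (?P<code>\d{3}) '
    r'(?P<handler>\S+) (?P<seconds>[\d.]+)'
)


def observe(line, cluster='appserver'):
    """Parse one access-log line and record its latency."""
    m = LOG_RE.match(line)
    if m is None:
        return
    MW_REQUEST_DURATION.labels(
        cluster=cluster,
        code=m.group('code'),
        handler=m.group('handler'),
        method=m.group('method'),
        endpoint=m.group('endpoint'),
    ).observe(float(m.group('seconds')))


if __name__ == '__main__':
    start_http_server(9100)   # expose /metrics for Prometheus to scrape
    for line in sys.stdin:
        observe(line)
```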

I don't have a full solution to this problem yet, but a few ideas towards one come to mind:

From logfiles on centrallog

  • Save the logs on centrallog with a short retention period,
  • run mtail on those logs

Modify mtail to be able to consume logs from kafka

  • In this idea, we'd be able to just consume a kafka topic directly from mtail
  • It probably requires more work than we actually want to commit to (a sketch of what the consuming side would have to do follows below)
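
To make the amount of work a bit more concrete, the consuming end would have to do roughly the following (shown here in Python with kafka-python against a hypothetical topic and broker, reusing the observe() helper from the sketch above); mtail would need an equivalent kafka input built in.

```
# Sketch of the consumer side only; topic name and broker are hypothetical.
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    'mediawiki.apache-access-log',                  # hypothetical topic
    bootstrap_servers=['kafka-logging1001:9092'],   # hypothetical broker
    group_id='mw-latency-metrics',
    value_deserializer=lambda v: v.decode('utf-8', errors='replace'),
)

for message in consumer:
    # observe() as in the earlier sketch: parse the line, update the histogram.
    observe(message.value)
```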

Use envoy to extract the same data

  • We currently lack some dimensions in the telemetry, like separation by HTTP verb.
  • We'd have to force an envoy configuration change just to separate the telemetry, but the effort might be worth it.

Event Timeline

Modify mtail to be able to consume logs from kafka

In this idea, we'd be able to just consume a kafka topic directly from mtail
It probably requires more work than we actually want to commit to

An alternative could also be to modify mtail to ship logs to kafka, while also supporting sampling. That way we could run mtail as a sidecar, have it consume a pipe (does it support that?), calculate metrics on the total number of requests per pod, and optionally sample and ship to kafka.
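
A rough sketch of what such a sidecar could look like (hypothetical pipe path, topic and sample rate; prometheus_client and kafka-python stand in for whatever mtail would grow): metrics are computed on the full stream, while only a sample of the raw lines is shipped to kafka.

```
# Sketch of the proposed sidecar; all names below are assumptions.
import random

from kafka import KafkaProducer
from prometheus_client import Counter, start_http_server

SAMPLE_RATE = 0.1                          # ship ~10% of the lines
PIPE_PATH = '/var/run/apache-access.pipe'  # hypothetical fifo written by apache

REQUESTS_TOTAL = Counter(
    'pod_http_requests_total',
    'Requests seen by this pod, counted before sampling',
)

producer = KafkaProducer(bootstrap_servers=['kafka-logging1001:9092'])

start_http_server(9100)             # expose /metrics for scraping
with open(PIPE_PATH) as pipe:       # blocks until the writer opens the fifo
    for line in pipe:
        REQUESTS_TOTAL.inc()        # metrics over the full, unsampled stream
        if random.random() < SAMPLE_RATE:
            producer.send('mediawiki.apache-access-log-sampled',
                          line.encode('utf-8'))
```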

Another possible solution is to extract the metrics via a sum aggregation query with prometheus-es-exporter. It's pretty easy to set up, but has the drawback that the logs must be indexed before they can be queried and exported.
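
For completeness, the aggregation such a query would run looks roughly like this (expressed with elasticsearch-py and prometheus_client rather than the exporter's own config; the index pattern, field names and endpoint are assumptions).

```
# Sketch of an equivalent sum aggregation over indexed logs; index pattern,
# field names and the Elasticsearch endpoint are assumptions.
from elasticsearch import Elasticsearch
from prometheus_client import Gauge

es = Elasticsearch(['https://logstash.example.org:9200'])   # hypothetical endpoint

LATENCY_SUM = Gauge(
    'mediawiki_request_time_sum_seconds',
    'Sum of request times over the last minute, by HTTP method',
    ['method'],
)


def collect():
    res = es.search(
        index='logstash-mediawiki-*',                        # hypothetical index pattern
        body={
            'size': 0,
            'query': {'range': {'@timestamp': {'gte': 'now-1m'}}},
            'aggs': {
                'by_method': {
                    'terms': {'field': 'http_method'},       # hypothetical field
                    'aggs': {'time_sum': {'sum': {'field': 'request_time'}}},
                },
            },
        },
    )
    for bucket in res['aggregations']['by_method']['buckets']:
        LATENCY_SUM.labels(method=bucket['key']).set(bucket['time_sum']['value'])
```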

My two cents: keeping the mtail (or similar, in the envoy case) processing on-host would be ideal, I think: it'll be simpler to scale (i.e. mtail doesn't have to deal with the full logs firehose) and the blast radius of "mtail down" is the host(s) involved, as opposed to all metrics. The trade-off is of course having the mtail process running/deployed on all hosts involved.

JMeybohm triaged this task as Medium priority. Mar 3 2021, 8:06 AM