
Improve grafana dashboard for monitoring Toolhub in production
Closed, Resolved · Public

Description

Yes, there is (now), at least for the dashboard part. I just added https://wikitech.wikimedia.org/wiki/Grafana.wikimedia.org#Pipeline and edited a few more pages to make this part more prominent and easier to find.

I've also gone ahead and done the process once more to make sure it's fine; the toolhub dashboard is at https://grafana.wikimedia.org/d/wJHvm8Ank/toolhub?orgId=1&refresh=1m

The structure we'd like to have is already in place; the data, not so much. Saturation-wise, both at the total and at the per-container level, the service is good to go, but we'll need to expose metrics for traffic, errors, and latencies as well. If there are any service-specific metrics that the service already emits, we can expose those too.

I've left the total saturation row expanded for now, but this is mostly because the other rows don't have anything in them yet, as they are geared towards showing service-runner services.

The traffic, errors, and latency signals on this dashboard are based on service_runner_* metrics which are produced by service-template-node. We need to find/create replacement metrics which will work for our Python+Django application instead.

The application exposes a number of custom metrics at https://toolhub.wikimedia.org/metrics, produced by https://github.com/korfuri/django-prometheus.
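
As a rough sketch (assuming the default django-prometheus metric names), the replacement signals could be driven by queries along the lines of:

  sum(rate(django_http_requests_total_by_view_transport_method_total{app="$service"}[5m]))

for traffic, with errors coming from django_http_responses_total_by_status_view_method_total and latency from the django_http_requests_latency_seconds_by_view_method_* histogram.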

Event Timeline

bd808 triaged this task as Medium priority. Apr 11 2022, 10:37 PM
bd808 moved this task from Backlog to Groomed/Ready on the Toolhub board.

As pointed out in T305902, those metrics have already been scraped by Prometheus for quite a long time now. What's left is to actually alter the dashboard to reference those metrics and not the service-runner ones.

> As pointed out in T305902, those metrics have already been scraped by Prometheus for quite a long time now. What's left is to actually alter the dashboard to reference those metrics and not the service-runner ones.

Understood.

bd808 renamed this task from Create/improve grafana dashboard for monitoring Toolhub in production to Improve grafana dashboard for monitoring Toolhub in production. Apr 15 2022, 11:38 PM

I have attempted to fix the Traffic and Errors rows on https://grafana.wikimedia.org/d/wJHvm8Ank/toolhub?orgId=1&refresh=1m. I'm honestly not convinced that I have the sum(rate(...)) things quite right. The shape of the curves all look reasonable, but the amplitude seems too high even after I filtered out all of the /healthz and /metrics requests from the time series queries.

These are the queries that I used:

  • Traffic / Total: sum(rate(django_http_requests_total_by_view_transport_method_total{app="$service",view!="healthz",view!="prometheus-django-metrics"}[5m]))
  • Traffic / by HTTP method: sum(rate(django_http_requests_total_by_view_transport_method_total{app="$service",view!="healthz",view!="prometheus-django-metrics"}[5m])) by (method)
  • Traffic / by endpoint: sum(rate(django_http_requests_total_by_view_transport_method_total{app="$service",view!="healthz",view!="prometheus-django-metrics"}[5m])) by (view)
  • Traffic / by HTTP status: sum(rate(django_http_responses_total_by_status_view_method_total{app="$service",view!="healthz",view!="prometheus-django-metrics"}[5m])) by (status)
  • Errors / HTTP errors: sum(rate(django_http_responses_total_by_status_view_method_total{app="$service",status=~"4..|5..",view!="healthz",view!="prometheus-django-metrics"}[5m])) by (status)
  • Errors / current error rate: sum(rate(django_http_responses_total_by_status_view_method_total{app="$service",status=~"4..|5..",view!="healthz",view!="prometheus-django-metrics"}[5m]))
  • Errors / current error %: sum(rate(django_http_responses_total_by_status_view_method_total{app="$service",status=~"4..|5..",view!="healthz",view!="prometheus-django-metrics"}[5m])) / sum(rate(django_http_responses_total_by_status_view_method_total{app="$service",view!="healthz",view!="prometheus-django-metrics"}[5m]))

The next bit to do is latency using the django_http_requests_latency_seconds_by_view_method_* metrics.
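
A sketch of what that could look like, assuming the usual Prometheus histogram _bucket series with a le label exported by django-prometheus and the same label filters as above (to be verified against the actual /metrics output):

  • Latency / p95: histogram_quantile(0.95, sum(rate(django_http_requests_latency_seconds_by_view_method_bucket{app="$service",view!="healthz",view!="prometheus-django-metrics"}[5m])) by (le))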

@bd808

I think that you are right. Those numbers don't look OK. And I do notice the following:

curl -v  http://10.64.64.13:8000/metrics
* Expire in 0 ms for 6 (transfer 0x55f5c7bab0f0)
*   Trying 10.64.64.13...
* TCP_NODELAY set
* Expire in 200 ms for 4 (transfer 0x55f5c7bab0f0)
* Connected to 10.64.64.13 (10.64.64.13) port 8000 (#0)
> GET /metrics HTTP/1.1
> Host: 10.64.64.13:8000
> User-Agent: curl/7.64.0
> Accept: */*
> 
< HTTP/1.1 301 Moved Permanently
< Date: Mon, 18 Apr 2022 14:21:59 GMT
< Server: WSGIServer/0.2 CPython/3.7.3
< Content-Type: text/html; charset=utf-8
< Location: https://toolhub.wikimedia.org/metrics
< X-Content-Type-Options: nosniff
< X-XSS-Protection: 1; mode=block
< X-Request-ID: f94727d19759410f8aba8f003e518c90
< Connection: close

which just doesn't look right. I wonder what prometheus does in this case.

> which just doesn't look right. I wonder what prometheus does in this case.

That 301 redirect is basically the same issue that we saw with /healthz checks in T294072. I'll put up a patch to make the same fix for /metrics.

> I have attempted to fix the Traffic and Errors rows on https://grafana.wikimedia.org/d/wJHvm8Ank/toolhub?orgId=1&refresh=1m. I'm honestly not convinced that I have the sum(rate(...)) things quite right. The shape of the curves all look reasonable, but the amplitude seems too high even after I filtered out all of the /healthz and /metrics requests from the time series queries.
>
> These are the queries that I used: […]

Your queries were mostly correct. What was missing was filtering on the site and the specific prometheus instance, so we were seeing aggregate metrics across the eqiad and codfw datacenters and across releases (staging, production). I've changed all expressions via a sed invocation and added site="$site", prometheus="$prometheus", e.g.:

sum(rate(django_http_requests_total_by_view_transport_method_total{app="$service",site="$site", prometheus="$prometheus", view!="healthz",view!="prometheus-django-metrics"}[5m]))

After doing that, the amplitude of the graphs has been more than halved.

I think that the view might still be skewed somewhat by the 301 redirect (the totals maybe not so much, but the per-instance view definitely); after T306352 is resolved we will know whether my hunch was correct or not.
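
A quick way to check that per-instance hunch, assuming the standard instance label that Prometheus attaches at scrape time, would be a breakdown such as:

  sum(rate(django_http_requests_total_by_view_transport_method_total{app="$service",site="$site",prometheus="$prometheus",view!="healthz",view!="prometheus-django-metrics"}[5m])) by (instance)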

One question that does still arise is for the Errors row. Specifically, how are errors defined? That is, is a 404 (and other 4xx) status code an error that you care about or not? Depending on the answer, status=~"4..|5.." might need to be changed to status=~"5..".

> The next bit to do is latency using the django_http_requests_latency_seconds_by_view_method_* metrics.

So, prometheus follows redirects by default. In https://github.com/prometheus/prometheus/commit/646556a2632700f7fca42cec51d0100294d43c52 support for disabling that functionality was added. I'd say that for production we want to default to not following redirects; it should help us avoid weird edge cases like this one.

> One question that does still arise is for the Errors row. Specifically, how are errors defined? That is, is a 404 (and other 4xx) status code an error that you care about or not? Depending on the answer, status=~"4..|5.." might need to be changed to status=~"5..".

401, 403, and 404 at least would be expected responses for a number of the API routes. Excluding 404 from the error calculation seems to be needed in order to make the metric make any sense as the "favorites" feature is likely to raise a 404 for the vast majority of authenticated viewers of a toolinfo detail record. I have updated the guards to use status=~"4.[^4]|5.." for now.
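
For illustration, the current error rate query with that guard and the site/prometheus filters added earlier would then look something like:

  sum(rate(django_http_responses_total_by_status_view_method_total{app="$service",site="$site",prometheus="$prometheus",status=~"4.[^4]|5..",view!="healthz",view!="prometheus-django-metrics"}[5m]))

Note that 4.[^4] also drops other 4x4 codes such as 414 or 424; adding status!~"404" alongside status=~"4..|5.." would be a stricter way to exclude only 404, should that distinction ever matter.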

>> which just doesn't look right. I wonder what prometheus does in this case.
>
> That 301 redirect is basically the same issue that we saw with /healthz checks in T294072. I'll put up a patch to make the same fix for /metrics.

After fixing the redirect, metrics dropped down to levels that make more logical sense (<1 rps). There must have been something very strange happening because of the redirect for the traffic to have been reported as 2000% higher than reality.

I think I've got the latency graphs working now. It took me a while to figure out that the template dashboard was using recording rules for some things that I needed to replace with reporting time queries for Toolhub. But TIL! :)

@akosiaris would you please take a look at https://grafana.wikimedia.org/d/wJHvm8Ank/toolhub and let me know if you see more things that I should fix up for the basic signals?
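
As an illustration of replacing a recording rule with a query-time expression (a sketch, not necessarily the exact panel queries now in the dashboard), a mean latency computed entirely at query time could be:

  sum(rate(django_http_requests_latency_seconds_by_view_method_sum{app="$service",site="$site",prometheus="$prometheus",view!="healthz",view!="prometheus-django-metrics"}[5m])) / sum(rate(django_http_requests_latency_seconds_by_view_method_count{app="$service",site="$site",prometheus="$prometheus",view!="healthz",view!="prometheus-django-metrics"}[5m]))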

> After fixing the redirect, metrics dropped down to levels that make more logical sense (<1 rps). There must have been something very strange happening because of the redirect for the traffic to have been reported as 2000% higher than reality.

I was pretty sure that 2000% must have been a typo. And then I noticed the following in the 90-day graph for traffic:

image.png (screenshot of the traffic graph, 68 KB)

OK, this is definitely weird. Very weird. If anyone has a theory as to why we seem to have had a feedback loop that ended up continuously increasing not just the prometheus counters, but the actual rate at which said counters were increasing, I am all ears.

> I think I've got the latency graphs working now. It took me a while to figure out that the template dashboard was using recording rules for some things that I needed to replace with reporting time queries for Toolhub. But TIL! :)
>
> @akosiaris would you please take a look at https://grafana.wikimedia.org/d/wJHvm8Ank/toolhub and let me know if you see more things that I should fix up for the basic signals?

LGTM. Fixed a minor typo that caught my eye in a formula (method vs mehtod). This looks pretty awesome. I think we can call this resolved. Many thanks!

bd808 claimed this task.
bd808 moved this task from Groomed/Ready to Review on the Toolhub board.

Someday™ it might be interesting to make a related dashboard graphing some of the database and cache data points, but honestly usage is so low right now that this doesn't seem a valuable use of time.
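
If that ever happens, django-prometheus also exports database and cache counters that could feed such panels. As a sketch (the metric names below assume the library's defaults and would need to be confirmed against the live /metrics output):

  • DB queries: sum(rate(django_db_execute_total{app="$service",site="$site",prometheus="$prometheus"}[5m])) by (vendor)
  • Cache hit ratio: sum(rate(django_cache_get_hits_total{app="$service",site="$site",prometheus="$prometheus"}[5m])) / (sum(rate(django_cache_get_hits_total{app="$service",site="$site",prometheus="$prometheus"}[5m])) + sum(rate(django_cache_get_misses_total{app="$service",site="$site",prometheus="$prometheus"}[5m])))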