
Improve grafana dashboard for monitoring Toolhub in production
Closed, Resolved · Public

Description

Yes, there is (now), at least for the dashboard part. I just added https://wikitech.wikimedia.org/wiki/Grafana.wikimedia.org#Pipeline and edited a few more pages to make this part more prominent and easier to find.

I've also gone ahead and done the process once more to make sure it's fine; the toolhub dashboard is at https://grafana.wikimedia.org/d/wJHvm8Ank/toolhub?orgId=1&refresh=1m

The structure we'd like to have is already in place; the data, not so much. Saturation-wise, both at the total and at the per-container level, the service is good to go, but we'll need to expose metrics for traffic, errors, and latencies as well. If there are any service-specific metrics that the service already emits, we can expose those too.

I've left the total saturation row expanded for now, but this is mostly because the other rows don't have anything in them yet, as they are geared towards showing service-runner services.

The traffic, errors, and latency signals on this dashboard are based on service_runner_* metrics which are produced by service-template-node. We need to find/create replacement metrics which will work for our Python+Django application instead.

The application exposes a number of custom metrics at https://toolhub.wikimedia.org/metrics, produced by https://github.com/korfuri/django-prometheus.
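
As a rough sketch (assuming the default django-prometheus metric names), the replacement signals could be driven by queries along the lines of:

  sum(rate(django_http_requests_total_by_view_transport_method_total{app="$service"}[5m]))

for traffic, with errors coming from django_http_responses_total_by_status_view_method_total and latency from the django_http_requests_latency_seconds_by_view_method_* histogram.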

Event Timeline

bd808 triaged this task as Medium priority. Apr 11 2022, 10:37 PM
bd808 moved this task from Backlog to Groomed/Ready on the Toolhub board.

As pointed out in T305902, those metrics have already been scraped by Prometheus for quite a long time now. What's left is to actually alter the dashboard to reference those metrics and not the service-runner ones.

> As pointed out in T305902, those metrics have already been scraped by Prometheus for quite a long time now. What's left is to actually alter the dashboard to reference those metrics and not the service-runner ones.

Understood.

bd808 renamed this task from Create/improve grafana dashboard for monitoring Toolhub in production to Improve grafana dashboard for monitoring Toolhub in production. Apr 15 2022, 11:38 PM

I have attempted to fix the Traffic and Errors rows on https://grafana.wikimedia.org/d/wJHvm8Ank/toolhub?orgId=1&refresh=1m. I'm honestly not convinced that I have the sum(rate(...)) things quite right. The shape of the curves all look reasonable, but the amplitude seems too high even after I filtered out all of the /healthz and /metrics requests from the time series queries.

These are the queries that I used:

  • Traffic / Total: sum(rate(django_http_requests_total_by_view_transport_method_total{app="$service",view!="healthz",view!="prometheus-django-metrics"}[5m]))
  • Traffic / by HTTP method: sum(rate(django_http_requests_total_by_view_transport_method_total{app="$service",view!="healthz",view!="prometheus-django-metrics"}[5m])) by (method)
  • Traffic / by endpoint: sum(rate(django_http_requests_total_by_view_transport_method_total{app="$service",view!="healthz",view!="prometheus-django-metrics"}[5m])) by (view)
  • Traffic / by HTTP status: sum(rate(django_http_responses_total_by_status_view_method_total{app="$service",view!="healthz",view!="prometheus-django-metrics"}[5m])) by (status)
  • Errors / HTTP errors: sum(rate(django_http_responses_total_by_status_view_method_total{app="$service",status=~"4..|5..",view!="healthz",view!="prometheus-django-metrics"}[5m])) by (status)
  • Errors / current error rate: sum(rate(django_http_responses_total_by_status_view_method_total{app="$service",status=~"4..|5..",view!="healthz",view!="prometheus-django-metrics"}[5m]))
  • Errors / current error %: sum(rate(django_http_responses_total_by_status_view_method_total{app="$service",status=~"4..|5..",view!="healthz",view!="prometheus-django-metrics"}[5m])) / sum(rate(django_http_responses_total_by_status_view_method_total{app="$service",view!="healthz",view!="prometheus-django-metrics"}[5m]))

The next bit to do is latency using the django_http_requests_latency_seconds_by_view_method_* metrics.
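
A sketch of what that could look like, assuming the usual Prometheus histogram _bucket series with a le label exported by django-prometheus and the same label filters as above (to be verified against the actual /metrics output):

  • Latency / p95: histogram_quantile(0.95, sum(rate(django_http_requests_latency_seconds_by_view_method_bucket{app="$service",view!="healthz",view!="prometheus-django-metrics"}[5m])) by (le))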

@bd808

I think that you are right. Those numbers don't look OK. And I do notice the following:

curl -v  http://10.64.64.13:8000/metrics
* Expire in 0 ms for 6 (transfer 0x55f5c7bab0f0)
*   Trying 10.64.64.13...
* TCP_NODELAY set
* Expire in 200 ms for 4 (transfer 0x55f5c7bab0f0)
* Connected to 10.64.64.13 (10.64.64.13) port 8000 (#0)
> GET /metrics HTTP/1.1
> Host: 10.64.64.13:8000
> User-Agent: curl/7.64.0
> Accept: */*
> 
< HTTP/1.1 301 Moved Permanently
< Date: Mon, 18 Apr 2022 14:21:59 GMT
< Server: WSGIServer/0.2 CPython/3.7.3
< Content-Type: text/html; charset=utf-8
< Location: https://toolhub.wikimedia.org/metrics
< X-Content-Type-Options: nosniff
< X-XSS-Protection: 1; mode=block
< X-Request-ID: f94727d19759410f8aba8f003e518c90
< Connection: close

which just doesn't look right. I wonder what prometheus does in this case.

> which just doesn't look right. I wonder what prometheus does in this case.

That 301 redirect is basically the same issue that we saw with /healthz checks in T294072. I'll put up a patch to make the same fix for /metrics.

> I have attempted to fix the Traffic and Errors rows on https://grafana.wikimedia.org/d/wJHvm8Ank/toolhub?orgId=1&refresh=1m. I'm honestly not convinced that I have the sum(rate(...)) things quite right. The shape of the curves all look reasonable, but the amplitude seems too high even after I filtered out all of the /healthz and /metrics requests from the time series queries.
>
> These are the queries that I used: […]

Your queries were mostly correct. What was missing was filtering on the site and the specific prometheus instance, so we were seeing aggregate metrics across the eqiad and codfw datacenters and across releases (staging, production). I've changed all expressions via a sed invocation and added site="$site", prometheus="$prometheus", e.g.:

sum(rate(django_http_requests_total_by_view_transport_method_total{app="$service",site="$site", prometheus="$prometheus", view!="healthz",view!="prometheus-django-metrics"}[5m]))

After doing that, the amplitude of the graphs has been more than halved.

I think that the view might still be skewed somewhat by the 301 redirect (the totals maybe not so much, but the per-instance view definitely); after T306352 is resolved we will know whether my hunch was correct or not.
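
A quick way to check that per-instance hunch, assuming the standard instance label that Prometheus attaches at scrape time, would be a breakdown such as:

  sum(rate(django_http_requests_total_by_view_transport_method_total{app="$service",site="$site",prometheus="$prometheus",view!="healthz",view!="prometheus-django-metrics"}[5m])) by (instance)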

One question that does still arise is for the Errors row. Specifically, how are errors defined? That is, is a 404 (and other 4xx) status code an error that you care about or not? Depending on the answer, status=~"4..|5.." might need to be changed to status=~"5..".

> The next bit to do is latency using the django_http_requests_latency_seconds_by_view_method_* metrics.

So, prometheus follows redirects by default. In https://github.com/prometheus/prometheus/commit/646556a2632700f7fca42cec51d0100294d43c52 support for disabling that functionality was added. I'd say that for production we want to default to not following redirects; it should help us avoid weird edge cases like this one.

> One question that does still arise is for the Errors row. Specifically, how are errors defined? That is, is a 404 (and other 4xx) status code an error that you care about or not? Depending on the answer, status=~"4..|5.." might need to be changed to status=~"5..".

401, 403, and 404 at least would be expected responses for a number of the API routes. Excluding 404 from the error calculation seems to be needed in order to make the metric make any sense as the "favorites" feature is likely to raise a 404 for the vast majority of authenticated viewers of a toolinfo detail record. I have updated the guards to use status=~"4.[^4]|5.." for now.
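
For illustration, the current error rate query with that guard and the site/prometheus filters added earlier would then look something like:

  sum(rate(django_http_responses_total_by_status_view_method_total{app="$service",site="$site",prometheus="$prometheus",status=~"4.[^4]|5..",view!="healthz",view!="prometheus-django-metrics"}[5m]))

Note that 4.[^4] also drops other 4x4 codes such as 414 or 424; adding status!~"404" alongside status=~"4..|5.." would be a stricter way to exclude only 404, should that distinction ever matter.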

>> which just doesn't look right. I wonder what prometheus does in this case.
>
> That 301 redirect is basically the same issue that we saw with /healthz checks in T294072. I'll put up a patch to make the same fix for /metrics.

After fixing the redirect, metrics dropped down to levels that make more logical sense (<1 rps). There must have been something very strange happening because of the redirect for the traffic to have been reported as 2000% higher than reality.

I think I've got the latency graphs working now. It took me a while to figure out that the template dashboard was using recording rules for some things that I needed to replace with reporting time queries for Toolhub. But TIL! :)

@akosiaris would you please take a look at https://grafana.wikimedia.org/d/wJHvm8Ank/toolhub and let me know if you see more things that I should fix up for the basic signals?
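
As an illustration of replacing a recording rule with a query-time expression (a sketch, not necessarily the exact panel queries now in the dashboard), a mean latency computed entirely at query time could be:

  sum(rate(django_http_requests_latency_seconds_by_view_method_sum{app="$service",site="$site",prometheus="$prometheus",view!="healthz",view!="prometheus-django-metrics"}[5m])) / sum(rate(django_http_requests_latency_seconds_by_view_method_count{app="$service",site="$site",prometheus="$prometheus",view!="healthz",view!="prometheus-django-metrics"}[5m]))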

> After fixing the redirect, metrics dropped down to levels that make more logical sense (<1 rps). There must have been something very strange happening because of the redirect for the traffic to have been reported as 2000% higher than reality.

I was pretty sure that 2000% must have been a typo. And then I noticed the following in the 90-day graph for traffic:

image.png (screenshot of the traffic graph, 68 KB)

OK, this is definitely weird. Very weird. If anyone has a theory as to why we seem to have had a feedback loop that ended up continuously increasing not just the prometheus counters, but the actual rate at which said counters were increasing, I am all ears.

> I think I've got the latency graphs working now. It took me a while to figure out that the template dashboard was using recording rules for some things that I needed to replace with reporting time queries for Toolhub. But TIL! :)
>
> @akosiaris would you please take a look at https://grafana.wikimedia.org/d/wJHvm8Ank/toolhub and let me know if you see more things that I should fix up for the basic signals?

LGTM. Fixed a minor typo that caught my eye in a formula (method vs mehtod). This looks pretty awesome. I think we can call this resolved. Many thanks!

bd808 claimed this task.
bd808 moved this task from Groomed/Ready to Review on the Toolhub board.

Someday™ it might be interesting to make a related dashboard graphing some of the database and cache data points, but honestly usage is so low right now that this doesn't seem a valuable use of time.
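
If that ever happens, django-prometheus also exports database and cache counters that could feed such panels. As a sketch (the metric names below assume the library's defaults and would need to be confirmed against the live /metrics output):

  • DB queries: sum(rate(django_db_execute_total{app="$service",site="$site",prometheus="$prometheus"}[5m])) by (vendor)
  • Cache hit ratio: sum(rate(django_cache_get_hits_total{app="$service",site="$site",prometheus="$prometheus"}[5m])) / (sum(rate(django_cache_get_hits_total{app="$service",site="$site",prometheus="$prometheus"}[5m])) + sum(rate(django_cache_get_misses_total{app="$service",site="$site",prometheus="$prometheus"}[5m])))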