Grafana alerts unreliable since 29 August 2023 due to DatasourceError
Closed, Resolved (Public)

Description

Since 29 August 2023, several alerts from the Perf Team (some of which are now under MediaWiki Platform Team, ref T345190) have been failing every other day with a DatasourceError.

Examples:

[Image: Screenshot 2023-09-22 at 19.02.13.png]
[Image: Screenshot 2023-09-22 at 19.01.36.png]

https://grafana-rw.wikimedia.org/alerting/grafana/dvAFSjJ4k/view

Sep 21, 2023
Labels
alertname = DatasourceError
__alert_rule_uid__ = dvAFSjJ4k
__contacts__ = "AlertManager","cxserver"
dashboard = https://grafana.wikimedia.org/d/000000402/resourceloader-alerts
drilldown = https://grafana.wikimedia.org/d/000000066/resourceloader
grafana_folder = MediaWiki Engineering Team
rule_uid = dvAFSjJ4k
rulename = resourceloader Backend Timing p75
severity = critical
team = mediawiki-platform
tool = resourceloader
Annotations
Error = [plugin.downstreamError] failed to query data: Post "https://graphite.wikimedia.org/render": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
__alertId__ = 114
__dashboardUid__ = 000000402
__orgId__ = 1
__panelId__ = 16
Aug 30, 2023
Labels
alertname = DatasourceError
__alert_rule_uid__ = US0KIC1Vk
__contacts__ = "AlertManager","cxserver"
dashboard = https://grafana.wikimedia.org/d/000000326/navigation-timing-alerts
grafana_folder = Performance Team
metric = loadeventend
rule_uid = US0KIC1Vk
rulename = navtiming event rate overall
severity = critical
team = perf
tool = rum
Annotations
Error = [plugin.downstreamError] failed to query data: Post "https://graphite.wikimedia.org/render": context deadline exceeded
__alertId__ = 118
__dashboardUid__ = 000000326
__orgId__ = 1
__panelId__ = 34

Event Timeline

This is very annoying. Can we turn off the alerts while they're not working correctly? I don't want to add a mail filter for myself to ignore them.

Change 966201 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] graphite: bump uWSGImaxVars

https://gerrit.wikimedia.org/r/966201

I'm looking into this and noticed mod_uwsgi errors in the Apache error log (they occur regularly, including around the time of the alert failures), for example:

[Tue Sep 19 22:29:29.556922 2023] [:error] [pid 1749049:tid 140555346503424] [client 2620:0:861:102:10:64:16:81:60842] uwsgi: max number of uwsgi variables reached. consider increasing it with uWSGImaxVars directive, referer: https://graphite.wikimedia.org/

I'm not 100% sure what's going on or why we're hitting the limit, but bumping it seems to fix things.
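
To check whether these errors line up with the alert failures, one option is to count them per hour in the Apache error log. The following is a minimal sketch, not the procedure actually used here: the function name is hypothetical, and it assumes log lines follow the timestamp format of the example above and that the system uses English day and month names.

```python
import re
from collections import Counter
from datetime import datetime

# Matches the timestamp prefix of an Apache error-log line, e.g.
# [Tue Sep 19 22:29:29.556922 2023]
TIMESTAMP_RE = re.compile(
    r"^\[(\w{3} \w{3} +\d{1,2} \d{2}:\d{2}:\d{2})\.\d+ (\d{4})\]"
)

# Substring taken from the mod_uwsgi error message quoted above.
MARKER = "max number of uwsgi variables reached"


def count_max_vars_errors(lines):
    """Return a Counter mapping (ISO date, hour) to the number of matching errors."""
    counts = Counter()
    for line in lines:
        if MARKER not in line:
            continue
        match = TIMESTAMP_RE.match(line)
        if match is None:
            # Skip lines whose timestamp does not have the assumed format.
            continue
        stamp = datetime.strptime(
            f"{match.group(1)} {match.group(2)}", "%a %b %d %H:%M:%S %Y"
        )
        counts[(stamp.date().isoformat(), stamp.hour)] += 1
    return counts
```

Passing an open log file (or any iterable of lines) to `count_max_vars_errors` yields per-hour counts that can be compared against the timestamps of the DatasourceError notifications.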

Change 966201 merged by Filippo Giunchedi:

[operations/puppet@production] graphite: bump uWSGImaxVars

https://gerrit.wikimedia.org/r/966201

Change 966202 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] idp: bump graphite uWSGImaxVars

https://gerrit.wikimedia.org/r/966202

Change 966202 merged by Filippo Giunchedi:

[operations/puppet@production] idp: bump graphite uWSGImaxVars

https://gerrit.wikimedia.org/r/966202

The change is live. I'm leaving the task open for a few days to confirm that things are working.

I got another alert last night, so it doesn't seem fixed.

The last round of alerts also includes:

failed to execute query A: Get "https://thanos-query.discovery.wmnet/api/v1/query_range?...": net/http: request canceled (Client.Timeout exceeded while awaiting headers)

The underlying issue for Graphite is the daily logrotate cron job, which restarts graphite-web and makes Graphite temporarily unavailable. We (observability) have discussed this internally and concluded that DatasourceError notifications should be disabled, since they are not actionable for alert notification recipients. I have updated the "known issues" Grafana page with instructions on how to do so: https://wikitech.wikimedia.org/wiki/Grafana#DatasourceError_notification_spam

Please take a look at the link and let us know if the instructions are clear on what to do!
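
For anyone triaging past notifications, the distinction above (a failed datasource query versus a genuine threshold breach) can be read from the `alertname` label. The sketch below is only an illustration for sorting saved notifications. It is not the Grafana-side change described on the linked wiki page, both function names are hypothetical, and it assumes labels are available as `key = value` lines like those in the task description.

```python
def parse_labels(text):
    """Parse 'key = value' lines into a dict, skipping lines without ' = '."""
    labels = {}
    for line in text.splitlines():
        # Split on the first ' = ' only, so values containing '=' stay intact.
        key, sep, value = line.partition(" = ")
        if sep:
            labels[key.strip()] = value.strip()
    return labels


def is_datasource_error(labels):
    """True when the notification reports a failed query rather than a real breach."""
    return labels.get("alertname") == "DatasourceError"


sample = """Labels
alertname = DatasourceError
rule_uid = dvAFSjJ4k
severity = critical"""

labels = parse_labels(sample)
print(labels["rule_uid"], is_datasource_error(labels))
```

Notifications for which `is_datasource_error` returns true indicate that the query to the datasource failed, so they say nothing about the underlying metric.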

Thanks for the instructions! I updated the alerts in the resourceloader-alerts dashboard.

fgiunchedi claimed this task.

You're welcome, @Tgr. I'm going to consider the task resolved, though feel free to reopen it if something is amiss or unclear.