Upgrade to Grafana 9
Closed, ResolvedPublic

Description

As per task title, we should be upgrading to Grafana 9

Event Timeline

colewhite triaged this task as Medium priority.

Change 886860 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/puppet@production] profile: disable grafana db sync ahead of 9.x upgrade

https://gerrit.wikimedia.org/r/886860

Change 886861 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/debs/grafana-plugins@master] Upgrade plugins

https://gerrit.wikimedia.org/r/886861

Change 886860 merged by Cwhite:

[operations/puppet@production] profile: disable grafana db sync ahead of 9.x upgrade

https://gerrit.wikimedia.org/r/886860

grafana-ldap-users-sync appears to work against Grafana 9

Change 890849 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/puppet@production] profile: Re-enable grafana db sync post 9.x upgrade

https://gerrit.wikimedia.org/r/890849

Change 886861 merged by Cwhite:

[operations/debs/grafana-plugins@master] Upgrade plugins

https://gerrit.wikimedia.org/r/886861

Mentioned in SAL (#wikimedia-operations) [2023-02-21T17:25:59Z] <cwhite> Grafana 9x upgrade in production complete T317887

Change 890849 merged by Cwhite:

[operations/puppet@production] profile: Re-enable grafana db sync post 9.x upgrade

https://gerrit.wikimedia.org/r/890849

Hi! I wonder if we need to do something on our side with the alerts in 9? We have a lot of fired alerts with "No data", but I can see that the data is there (or at least in the same interval as before), so I think there could be a difference in how "No data" is handled between the last and the current version. Do you know?

Yeah, one example for verbosity:

[14:20:01] <jinxer-wm>	 (DatasourceNoData) resolved: (2) <no value>   - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData

Happened to capture the dashboard when one of the alerts was going off:

2023-02-22-160231_692x225_scrot.png (225×692 px, 43 KB)

Hi! I wonder if we need to do something on our side with the alerts in 9? We have a lot of fired alerts with "No data", but I can see that the data is there (or at least in the same interval as before), so I think there could be a difference in how "No data" is handled between the last and the current version. Do you know?

Yes, the difference is Grafana's new alerting engine has become the default. It can be rolled back, but the legacy alerting engine is expected to be removed in Grafana 10.

Looks like alerts must now be told what to do when encountering a "No Data" state. They've removed the "keep last state" option, and the default now is to enter the no data state and fire an alert.
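
To make the behaviour change concrete, here is a toy Python model of the difference described above. It is only a sketch and not Grafana's code: the function name `evaluate`, the state strings, and the `keep_last_state` flag are all made up for illustration.

```python
from typing import Sequence


def evaluate(samples: Sequence[float], threshold: float,
             previous: str, keep_last_state: bool) -> str:
    """Toy model of one alert evaluation (hypothetical, not Grafana's code)."""
    if not samples:
        # With "keep last state", an empty query result changes nothing.
        # Without that option, the rule enters a no-data state, which notifies.
        return previous if keep_last_state else "NoData"
    return "Alerting" if max(samples) > threshold else "Normal"


# A gap in the data: the first rule stays quiet, the second one does not.
print(evaluate([], 100.0, "Normal", keep_last_state=True))
print(evaluate([], 100.0, "Normal", keep_last_state=False))
```

Under this model, any rule whose query returns an empty window (a gap in the series, or a metric that no longer exists) flips to the no-data state instead of silently holding its previous state.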

Spot checking a few panels, it looks to me like the alerts that have fired this way either have gaps in the data within their evaluation window, or are not checking anything at all (for example, an alert that has not been updated to reflect infrastructure changes).

To mitigate the noise in the short term, I've put in a silence (expires 2024-02) on alertname=(DatasourceNoData|DatasourceError).
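
For anyone wondering which alert names that matcher covers, here is a rough Python sketch. It assumes the silence is a regex matcher on `alertname` and that the pattern has to match the whole label value; the `is_silenced` helper is hypothetical and only mimics that behaviour.

```python
import re

# Pattern taken from the silence above; matching is anchored at both ends.
SILENCE_PATTERN = re.compile(r"(DatasourceNoData|DatasourceError)")


def is_silenced(alertname: str) -> bool:
    """Return True if the alert name is fully matched by the silence pattern."""
    return SILENCE_PATTERN.fullmatch(alertname) is not None


for name in ("DatasourceNoData", "DatasourceError", "SomeOtherAlert"):
    print(name, is_silenced(name))
```

Only the two Grafana-generated alert names are muted; alerts that fire under their own names are unaffected.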

Screenshot from 2023-02-22 18-42-54.png (804×1 px, 216 KB)

Adding for completeness, the update to 9.3.6 also addressed this 9.x-specific security issue (CVE-2023-22462)
https://github.com/grafana/grafana/security/advisories/GHSA-7rqg-hjwc-6mjf

8.x doesn't include the vulnerable code, so this never affected production.

I think something else broke with the alerts in the 9 upgrade. I checked one alert and I think the query somehow got corrupted when we converted it:

https://grafana-rw.wikimedia.org/alerting/IS0KIC14k/edit?returnTo=%2Fd%2F000000326%2Fnavigation-timing-alerts%3FforceLogin%26orgId%3D1%26refresh%3D5m%26from%3Dnow-30d%26to%3Dnow%26viewPanel%3D4%26editPanel%3D4%26tab%3Dalert

That alert query seems to have been converted wrongly, and when the query is broken we get "no data". The query looks like this:

alias(diffSeries(movingAverage(frontend.navtiming2.firstPaint.mobile.anonymous.p75, '24h'), timeShift(movingAverage(frontend.navtiming2.firstPaint.mobile.anonymous.p75, '24h'), '7d'))

There's a mismatch in the parentheses, so the alias is broken. If I fix that (remove the broken alias part), it works for me.

I'll fix that manually now and will go through the rest of our alerts during the day.
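
In case it helps when going through the other converted alerts, a quick way to spot this kind of breakage is to count unclosed parentheses in each query string. This is a minimal Python sketch, not a Graphite parser: `unclosed_parens` is a made-up helper, and it only checks bracket balance (skipping anything inside quotes).

```python
def unclosed_parens(target: str) -> int:
    """Count '(' that are never closed, ignoring parentheses inside quotes."""
    depth = 0
    quote = None  # the quote character we are currently inside, if any
    for ch in target:
        if quote is not None:
            if ch == quote:
                quote = None
        elif ch in "'\"":
            quote = ch
        elif ch == "(":
            depth += 1
        elif ch == ")":
            if depth == 0:
                raise ValueError("closing parenthesis without a matching opener")
            depth -= 1
    return depth


series = "frontend.navtiming2.firstPaint.mobile.anonymous.p75"
broken = (
    f"alias(diffSeries(movingAverage({series}, '24h'), "
    f"timeShift(movingAverage({series}, '24h'), '7d'))"
)
# Same expression with the dangling alias( wrapper removed.
fixed = broken[len("alias("):]

print(unclosed_parens(broken), unclosed_parens(fixed))
```

A non-zero result for a query means it was probably truncated or mangled during conversion and is worth opening in the editor.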

I also have problems with other alerts. There were two alerts in https://grafana-rw.wikimedia.org/d/000000318/browsertime-alerts that correctly fired because the limits were hit. I increased the limits and ran the alert queries in the GUI by clicking the preview button:

Screenshot 2023-03-08 at 09.55.36.png (788×2 px, 162 KB)

There it looks correct: I can see that the query is correct and generates data, and the status is "normal".

But when I go back to the dashboard where the alert was originally created, it shows "no data":

Screenshot 2023-03-08 at 09.54.08.png (406×2 px, 149 KB)

https://grafana-rw.wikimedia.org/alerting/eV0FSCJVz/edit?returnTo=%2Fd%2F000000318%2Fbrowsertime-alerts%3ForgId%3D1%26refresh%3D5m%26forceLogin%26editPanel%3D38%26tab%3Dalert

I waited longer than the interval at which it runs, but it seems to be stuck in that "no data" mode.

I've gone through all alerts on our side and made sure they do not fire (or at least I think I fixed them all). However, all of those that fired are still stuck in the old state or the "no data" state, even though when I run the alert preview I can see that they get data and are not firing.

Hi @Peter!

We were serving grafana from codfw March 6-8th. I saw that silences had not copied from eqiad->codfw for some unknown reason. We are back to serving grafana from eqiad as of March 8th 1800 UTC.

You may want to double-check these changes and make sure they made it back to the eqiad grafana instance.

Change 906537 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] aptrepo: go with Grafana 9 only

https://gerrit.wikimedia.org/r/906537

Change 906537 merged by Filippo Giunchedi:

[operations/puppet@production] aptrepo: go with Grafana 9 only

https://gerrit.wikimedia.org/r/906537

Updating quarter milestone as there is one lingering task to complete: resolve the root cause of DatasourceNoData and DatasourceError alerts prior to silence expiring on 2024-02-01.

lmata subscribed.

hi @colewhite, a friendly reminder that the silence is expiring on 2024-02-01.

Thanks for the reminder!

With the dust settled from the DatasourceError round, I think it's time to unsilence this as well. There are three valid alerts AFAICS.

Perf:

  1. Increased first visual change when searching on Obama at Google coming to Wikipedia on emulated mobile
  2. Increased first visual change when searching on Obama from the portal start page

Readers Web:

  1. Vector a11y Errors

These have been missing data for a long time and probably need to be either removed or fixed.

The remaining three alerts are unsilenced.

This task is finished!