Upgrade to Grafana 11
Closed, Resolved (Public)

Description

This task tracks the Grafana 11 upgrade. Rough steps, off the top of my head:

  • Validate that the new deb and Puppet keep working as expected in a testing environment (e.g. Pontoon)
  • Temporarily disable sync between the Grafana hosts in production and upgrade (dpkg -i) only grafana-next
  • Verify dashboards work as expected on grafana-next, and invite users to try it out too. Make sure T384831 is fixed as well.
  • Change reprepro to import grafana 11 debs
  • After verification period is over, upgrade grafana.w.o too
  • Re-enable sync between the Grafana hosts

Event Timeline

andrea.denisse changed the task status from Open to In Progress. Apr 23 2025, 8:11 PM
andrea.denisse claimed this task.

I'll test Grafana v11.6.1 (the latest release as of today).

Mentioned in SAL (#wikimedia-operations) [2025-04-29T20:02:25Z] <denisse> disabling Puppet on grafana2001 - T384841

Change #1140523 had a related patch set uploaded (by Andrea Denisse; author: Andrea Denisse):

[operations/puppet@production] grafana: Add conditional data sync via enable_sync hiera variable

https://gerrit.wikimedia.org/r/1140523

Change #1140760 had a related patch set uploaded (by Andrea Denisse; author: Andrea Denisse):

[operations/puppet@production] grafana: Toggle data sync using feature flag

https://gerrit.wikimedia.org/r/1140760

Change #1140760 merged by Andrea Denisse:

[operations/puppet@production] grafana: Add enable_dashboard_sync feature flag in hiera

https://gerrit.wikimedia.org/r/1140760

Change #1140523 merged by Andrea Denisse:

[operations/puppet@production] grafana: Toggle data sync using feature flag

https://gerrit.wikimedia.org/r/1140523

Mentioned in SAL (#wikimedia-operations) [2025-05-06T16:34:15Z] <denisse> enable Puppet on Grafana2001 - T384841

I found an issue: after disabling dashboard sync, the stunnel4 unit fails. The ports were still bound for rsync on both hosts, but there was no configuration in place any more, which caused the unit to fail.
Manually restarting the unit didn't work, as stunnel4 was still running even after the unit was stopped or restarted. I had to gracefully kill the process and start the unit again.
I'll work on a patch for this.
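
For reference, the manual recovery described above can be sketched as a small dry-run helper. This is only an illustration, not the patch mentioned in the comment: the function names are hypothetical, and it assumes the leftover process is named stunnel4 like the unit. It decides which commands are needed from the unit state and the list of surviving PIDs, and does not execute anything itself.

```python
import subprocess
from typing import List, Tuple

UNIT = "stunnel4"          # unit name taken from the comment above
PROCESS_NAME = "stunnel4"  # assumption: the process name matches the unit name


def recovery_commands(unit_active: bool, pids: List[int]) -> List[List[str]]:
    """Return the commands needed to bring the unit back, in order."""
    if unit_active:
        return []  # the unit is running, nothing to recover
    commands: List[List[str]] = []
    if pids:
        # A process outlived its stopped unit: ask it to exit with SIGTERM.
        commands.append(["kill", "-TERM", *[str(pid) for pid in pids]])
    commands.append(["systemctl", "start", UNIT])
    return commands


def current_state() -> Tuple[bool, List[int]]:
    """Query systemd and the process table (only works on a systemd host)."""
    active = subprocess.run(
        ["systemctl", "is-active", "--quiet", UNIT], check=False
    ).returncode == 0
    found = subprocess.run(
        ["pgrep", "-x", PROCESS_NAME], capture_output=True, text=True, check=False
    )
    return active, [int(pid) for pid in found.stdout.split()]
```

On an affected host, `recovery_commands(*current_state())` would list the commands to review and run by hand. The sketch does not wait for the old process to exit, so in practice the ports should be confirmed free before the unit is started.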

Change #1143660 had a related patch set uploaded (by Andrea Denisse; author: Andrea Denisse):

[operations/puppet@production] grafana: Enable dashboard sync between hosts

https://gerrit.wikimedia.org/r/1143660

Change #1143660 merged by Andrea Denisse:

[operations/puppet@production] grafana: Enable dashboard sync between hosts

https://gerrit.wikimedia.org/r/1143660

andrea.denisse updated the task description.

I'll create a separate task to track the stunnel4 issue.
The upgrade is complete now.

I've noticed some weird behavior on the LVS dashboard:

Rendering a graph as an image seems broken, the generation request ends up returning a 504 after 65289 ms.

Hi @Vgutierrez , thanks for the report. I can reproduce the issue (attaching a screenshot), I opened T394045 to track this.

img-2025-05-13-08-16-34.png (637×1 px, 124 KB)

Hi @Clement_Goubert , thanks for the bug report. Could you please share a dashboard where you're seeing this behavior along with the steps you took for it to happen? I'd like to reproduce it to investigate it further.

Sure thing.

It's not a guaranteed failure, and I think any dashboard would work, but let's take https://grafana.wikimedia.org/goto/FQCY_I-Hg?orgId=1 and https://grafana.wikimedia.org/goto/XyNV9S-HR?orgId=1 for example. Both of them have failed, and worked, in the past ten minutes.

  • Click on the graph hamburger menu
  • Select Share -> Share Link
  • Click Generate Image

Expected: An image of the graph for the selected time range is generated and displayed in the browser
Actual: Fails due to timeout. It does not happen every time, but the operation is always slow.

image.png (589×1 px, 52 KB)

Clicking "Shorten Link" doesn't change anything. The "Download Image" button is always greyed out.
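
Since the failure is intermittent, a small probe that repeats the request and records status and latency may help when reproducing it. The following is a minimal sketch under stated assumptions: the function names are hypothetical, and the URL to probe (for example, the image URL the browser requests after clicking Generate Image) has to be supplied by the caller, as it is not given in this task.

```python
import time
import urllib.error
import urllib.request
from typing import Callable, Dict, List, Optional, Tuple

Result = Tuple[Optional[int], float]  # (HTTP status or None, elapsed milliseconds)


def fetch_status(url: str, timeout: float) -> Optional[int]:
    """GET the URL and return its HTTP status, or None if no response arrived."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # e.g. 504 from a proxy in front of the renderer
    except (urllib.error.URLError, TimeoutError):
        return None  # connection problem or client-side timeout


def probe(
    url: str,
    attempts: int,
    fetch: Callable[[str, float], Optional[int]] = fetch_status,
    timeout: float = 90.0,
    clock: Callable[[], float] = time.monotonic,
) -> List[Result]:
    """Request the URL several times, timing each attempt."""
    results: List[Result] = []
    for _ in range(attempts):
        start = clock()
        status = fetch(url, timeout)
        results.append((status, (clock() - start) * 1000))
    return results


def summarize(results: List[Result]) -> Dict[str, float]:
    """Count non-200 attempts and report the slowest one."""
    failures = sum(1 for status, _ in results if status != 200)
    slowest = max((ms for _, ms in results), default=0.0)
    return {"attempts": len(results), "failures": failures, "slowest_ms": slowest}
```

Calling `summarize(probe(url, 10))` with a real image URL would show how often the request fails and how slow the worst attempt is. The default timeout is set above the roughly 65 seconds reported here so that a 504 from the server side is observed rather than a client-side timeout.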

Thank you @Clement_Goubert , I was able to reproduce the issue and created T394069 to track this.