
Upgrade Grafana hosts to Bookworm
Closed, Resolved (Public)

Description

Upgrade Grafana Instances to Debian Bookworm

Overview

This task tracks the upgrade of our Grafana instances to Debian Bookworm and details the upgrade steps.

  • Active Host: grafana1002
  • Standby Host: grafana2001

Package Upgrade Requirements

The following table lists the Grafana-related packages to be upgraded, including their current installed versions and the target versions available upstream:

Package          Installed Version  Upstream Version  Compatibility
grafana          v9.4.14            v10.2.3           Yes
grafana-loki     v2.5.0             v2.9.3            Yes
grafana-plugins  v0.6               N/A               Yes
grizzly          v0.1.0             v0.3.0            Yes
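
The installed versions in the table can be spot-checked against what apt offers on the hosts; for example, from cumin2002 (assuming the packages are installed as Debian packages on both hosts):

  • $ sudo cumin 'A:grafana' 'dpkg -l grafana grafana-loki grizzly'
  • $ sudo cumin 'A:grafana' 'apt-cache policy grafana grafana-loki'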

1. Prerequisites

  • Set up a Bookworm host in Pontoon.
  • Confirm the Puppet catalog compiles without errors and all packages are available.
  • Validate general functionality of Grafana services.
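
To cover the last two prerequisites on the Pontoon test host, the catalog can be compiled in no-op mode and Grafana's health endpoint checked locally; a minimal sketch (standard Puppet agent invocation and Grafana's default local port 3000 are assumed, exact wrappers and ports may differ):

  • $ sudo puppet agent --test --noop
  • $ curl -s http://localhost:3000/api/health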

2. Upgrade Steps

2.1 Reimage Standby Host (grafana2001)

  1. On cumin2002:
    • Reimage:
      • $ sudo cookbook sre.hosts.reimage --os bookworm -t T352665 grafana2001
    • Verify services:
      • $ sudo cumin 'grafana2001*' 'systemctl is-active grafana-server'
      • $ sudo cumin 'grafana2001*' 'systemctl is-active grafana-loki'
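    • Optionally confirm the new OS release before proceeding (any standard check works; /etc/os-release is always present):
      • $ sudo cumin 'grafana2001*' 'grep PRETTY_NAME /etc/os-release'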

2.2 Failover to grafana2001

  1. On cumin2002, stop services:
    • $ sudo cumin 'grafana2001*' 'systemctl stop grafana-server'
    • $ sudo cumin 'grafana2001*' 'systemctl stop grafana-loki'
  2. Sync data from active to passive host.
    • $ sudo cumin 'grafana2001*' 'sudo systemctl start rsync-var-lib-grafana'
    • $ sudo cumin 'grafana2001*' 'sudo systemctl start rsync-loki-data'
  3. On cumin2002, start services:
    • $ sudo cumin 'grafana2001*' 'systemctl start grafana-server'
    • $ sudo cumin 'grafana2001*' 'systemctl start grafana-loki'
  4. Merge patches for failover.
    1. grafana: Failover from grafana1002 to grafana2001 (Change 992710).
    2. grafana: Ensure user traffic goes to grafana2001 (Change 992719).
  5. Run puppet on the Grafana hosts and verify service status:
    • Run Puppet:
      • $ sudo cumin 'A:grafana' 'run-puppet-agent'
      • $ sudo cumin 'A:cp' 'run-puppet-agent'
    • Verify services:
      • $ sudo cumin 'A:grafana' 'systemctl is-active grafana-server'
      • $ sudo cumin 'A:grafana' 'systemctl is-active grafana-loki'
  6. Access Grafana via web browser to confirm functionality.
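
In addition to the browser check in step 6, Grafana's health endpoint can be queried on the new active host from cumin2002; a sketch, assuming Grafana listens on its default local port 3000 behind the web frontend:

  • $ sudo cumin 'grafana2001*' 'curl -s http://localhost:3000/api/health'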

2.3 Reimage Standby Host (grafana1002)

  1. Merge the following patch:
    • grafana: Create the grafana sysuser with a reserved UID/GID (Change 990795).
  2. On cumin2002:
    • Reimage:
      • $ sudo cookbook sre.hosts.reimage --os bookworm -t T352665 grafana1002
    • Verify services:
      • $ sudo cumin 'grafana1002*' 'systemctl is-active grafana-server'
      • $ sudo cumin 'grafana1002*' 'systemctl is-active grafana-loki'

2.4 Failover Back to grafana1002

  1. On cumin2002, stop services:
    • $ sudo cumin 'grafana1002*' 'systemctl stop grafana-server'
    • $ sudo cumin 'grafana1002*' 'systemctl stop grafana-loki'
  2. Sync data from active to passive host.
    • $ sudo cumin 'grafana1002*' 'sudo systemctl start rsync-var-lib-grafana'
    • $ sudo cumin 'grafana1002*' 'sudo systemctl start rsync-loki-data'
  3. On cumin2002, start services:
    • $ sudo cumin 'grafana1002*' 'systemctl start grafana-server'
    • $ sudo cumin 'grafana1002*' 'systemctl start grafana-loki'
  4. Merge the revert patches to fail back:
    1. Revert "grafana: Failover from grafana1002 to grafana2001" (Change 992710).
    2. Revert "grafana: Ensure user traffic goes to grafana2001" (Change 992719).
    3. Revert "hieradata: move grafana-next from codfw to eqiad" (Change 1002569, https://gerrit.wikimedia.org/r/c/operations/puppet/+/1002569).
  5. Run puppet on the Grafana hosts and verify service status:
    • Run Puppet:
      • $ sudo cumin 'A:grafana' 'run-puppet-agent'
      • $ sudo cumin 'A:cp' 'run-puppet-agent'
    • Verify services:
      • $ sudo cumin 'A:grafana' 'systemctl is-active grafana-server'
      • $ sudo cumin 'A:grafana' 'systemctl is-active grafana-loki'
  6. Access Grafana via web browser to confirm functionality.

3. Post-Upgrade Actions:

  • Document the failover procedure on Wikitech: failing over from the active to the passive host.
  • Re-enable stunnel for data migration.
  • Upgrade grafana-loki to the latest version.
  • Upgrade grafana to the latest version.
  • Upgrade grizzly to the latest version.

4. Additional Notes

  • Compatibility confirmed for all required packages on Debian Bookworm.
  • The rsync-var-lib-grafana.service unit failed on the standby host due to an SSL certificate chain verification error.
  • Reported a packaging issue upstream, with a proposed patch so the Debian package respects the GRAFANA_HOME variable.
  • Observed grafana-loki.service failure on grafana2001: T357026.

Event Timeline

andrea.denisse changed the task status from Open to In Progress. Jan 10 2024, 4:55 PM

Change 989989 had a related patch set uploaded (by Andrea Denisse; author: Andrea Denisse):

[operations/puppet@production] pontoon: Enroll pontoon-grafana-02

https://gerrit.wikimedia.org/r/989989

Change 989989 merged by Andrea Denisse:

[operations/puppet@production] pontoon: Enroll pontoon-grafana-02

https://gerrit.wikimedia.org/r/989989

Change 990795 had a related patch set uploaded (by Andrea Denisse; author: Andrea Denisse):

[operations/puppet@production] grafana: Create Grafana sysuser and home directory

https://gerrit.wikimedia.org/r/990795

andrea.denisse updated the task description.
andrea.denisse updated the task description.

Mentioned in SAL (#wikimedia-operations) [2024-01-16T09:04:20Z] <denisse> reprepro: Copy grafana v9.4.14 from buster to bookworm - T352665

Change 991386 had a related patch set uploaded (by Andrea Denisse; author: Andrea Denisse):

[operations/puppet@production] grafana: Ensure the grafana2001 hosts uses Puppet 7

https://gerrit.wikimedia.org/r/991386

Change 991386 merged by Andrea Denisse:

[operations/puppet@production] grafana: Ensure the grafana2001 hosts uses Puppet 7

https://gerrit.wikimedia.org/r/991386

Cookbook cookbooks.sre.hosts.reimage was started by denisse@cumin2002 for host grafana2001.codfw.wmnet with OS bookworm

Change 991391 had a related patch set uploaded (by Andrea Denisse; author: Andrea Denisse):

[operations/puppet@production] grafana: Ensure the grafana1002 host uses Puppet 7

https://gerrit.wikimedia.org/r/991391

Cookbook cookbooks.sre.hosts.reimage started by denisse@cumin2002 for host grafana2001.codfw.wmnet with OS bookworm completed:

  • grafana2001 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Set boot media to disk
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202401171702_denisse_1300406_grafana2001.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Change 991391 merged by Andrea Denisse:

[operations/puppet@production] grafana: Ensure the grafana1002 host uses Puppet 7

https://gerrit.wikimedia.org/r/991391

Post-reimage, the rsync job is failing on grafana2001:

Jan 18 09:20:05 grafana2001 systemd[1]: Starting rsync-var-lib-grafana.service - Transfer data periodically between hosts...
Jan 18 09:20:05 grafana2001 stunnel[99011]: LOG5[ui]: stunnel 5.68 on x86_64-pc-linux-gnu platform
Jan 18 09:20:05 grafana2001 stunnel[99011]: LOG5[ui]: Compiled with OpenSSL 3.0.9 30 May 2023
Jan 18 09:20:05 grafana2001 stunnel[99011]: LOG5[ui]: Running with OpenSSL 3.0.11 19 Sep 2023
Jan 18 09:20:05 grafana2001 stunnel[99011]: LOG5[ui]: Threading:PTHREAD Sockets:POLL,IPv6,SYSTEMD TLS:ENGINE,OCSP,PSK,SNI Auth:LIBWRAP
Jan 18 09:20:05 grafana2001 stunnel[99011]: LOG5[ui]: Reading configuration from file /tmp/sync-ssl-wrapper.stunnel.conf.G3PnwnjV
Jan 18 09:20:05 grafana2001 stunnel[99011]: LOG5[ui]: UTF-8 byte order mark not detected
Jan 18 09:20:05 grafana2001 stunnel[99011]: LOG5[ui]: FIPS mode disabled
Jan 18 09:20:05 grafana2001 stunnel[99011]: LOG4[ui]: Service [stunnel] uses "verifyChain" without subject checks
Jan 18 09:20:05 grafana2001 stunnel[99011]: LOG4[ui]: Use "checkHost" or "checkIP" to restrict trusted certificates
Jan 18 09:20:05 grafana2001 stunnel[99011]: LOG5[ui]: Configuration successful
Jan 18 09:20:05 grafana2001 stunnel[99011]: LOG5[0]: Service [stunnel] accepted connection from unnamed socket
Jan 18 09:20:05 grafana2001 stunnel[99011]: LOG5[0]: s_connect: connected 2620:0:861:101:10:64:0:119:1873
Jan 18 09:20:05 grafana2001 stunnel[99011]: LOG5[0]: Service [stunnel] connected remote server from 2620:0:860:101:10:192:0:160:56554
Jan 18 09:20:05 grafana2001 stunnel[99011]: LOG4[0]: CERT: Pre-verification error: self-signed certificate in certificate chain
Jan 18 09:20:05 grafana2001 stunnel[99011]: LOG4[0]: Rejected by CERT at depth=1: CN=Puppet CA: palladium.eqiad.wmnet
Jan 18 09:20:05 grafana2001 stunnel[99011]: LOG3[0]: SSL_connect: ../ssl/statem/statem_clnt.c:1889: error:0A000086:SSL routines::certificate verify failed
Jan 18 09:20:05 grafana2001 stunnel[99011]: LOG5[0]: Connection closed/reset: 0 byte(s) sent to TLS, 0 byte(s) sent to socket
Jan 18 09:20:05 grafana2001 sync-var-lib-grafana[99007]: rsync: did not see server greeting
Jan 18 09:20:05 grafana2001 sync-var-lib-grafana[99007]: rsync error: error starting client-server protocol (code 5) at main.c(1863) [Receiver=3.2.7]
Jan 18 09:20:05 grafana2001 systemd[1]: rsync-var-lib-grafana.service: Main process exited, code=exited, status=5/NOTINSTALLED
Jan 18 09:20:05 grafana2001 systemd[1]: rsync-var-lib-grafana.service: Failed with result 'exit-code'.
Jan 18 09:20:05 grafana2001 systemd[1]: Failed to start rsync-var-lib-grafana.service - Transfer data periodically between hosts.
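
The TLS failure can be reproduced outside the systemd unit to confirm it is a certificate-chain problem rather than an rsync one; a sketch, where the peer (presumably the active host, grafana1002) and port are taken from the log above, and the CA file path is an assumption based on the Debian Puppet agent layout:

$ openssl s_client -connect grafana1002.eqiad.wmnet:1873 -CAfile /var/lib/puppet/ssl/certs/ca.pem </dev/null

A "self-signed certificate in certificate chain" verify error here would point at the same CA mismatch that stunnel reports (chain signed by the old Puppet CA, CN=Puppet CA: palladium.eqiad.wmnet).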

Change 991542 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] grafana: temp disable rsync stunnel for puppet7 migration

https://gerrit.wikimedia.org/r/991542

Change 991542 merged by Filippo Giunchedi:

[operations/puppet@production] grafana: temp disable rsync stunnel for puppet7 migration

https://gerrit.wikimedia.org/r/991542

Change 991569 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] grafana: chown rsync'd files

https://gerrit.wikimedia.org/r/991569

Change 991569 merged by Filippo Giunchedi:

[operations/puppet@production] grafana: chown rsync'd files

https://gerrit.wikimedia.org/r/991569

Change 991573 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] grafana: deploy puppet dashboards as grafana/grafana

https://gerrit.wikimedia.org/r/991573

Change 991573 merged by Filippo Giunchedi:

[operations/puppet@production] grafana: deploy puppet dashboards as grafana/grafana

https://gerrit.wikimedia.org/r/991573

Change 992710 had a related patch set uploaded (by Andrea Denisse; author: Andrea Denisse):

[operations/puppet@production] grafana: Failover from grafana1002 to grafana2001

https://gerrit.wikimedia.org/r/992710

Change 992719 had a related patch set uploaded (by Andrea Denisse; author: Andrea Denisse):

[operations/puppet@production] grafana: Ensure user traffic goes to grafana2001

https://gerrit.wikimedia.org/r/992719

Currently used 'grafana' UIDs/GIDs:

Host         UID  GID
grafana1002  114  121
grafana2001  110  118

The UID/GID needs to be updated on the grafana hosts.

  1. Back up /etc/passwd in case something breaks, e.g. # cp /etc/passwd /etc/passwd.bak.
  2. Ensure the desired UID/GID is not already taken by any user or group, e.g. # grep 929 /etc/passwd and # grep 929 /etc/group must produce no output.
  3. Check which processes are run by the user whose ID will be modified, e.g. ps -u grafana.
  4. Stop the daemons run by that user, e.g. # systemctl stop grafana.
  5. Change the UID and GID respectively: # usermod -u 929 grafana and # groupmod -g 929 grafana.
  6. Ensure the user has the desired UID and GID, e.g. id grafana must print uid=929(grafana) gid=929(grafana) groups=929(grafana).
  7. Run the Puppet agent to update the UID and GID of the files managed by Puppet: # run-puppet-agent.
  8. Some files are not managed by Puppet; their UID/GID needs to be updated manually:
    • On grafana1002:
      • Change the UID of the dangling files, e.g. # find / -user 114 -exec chown --no-dereference grafana {} \;.
      • Change the GID of the dangling files, e.g. # find / -group 121 -exec chgrp --no-dereference grafana {} \;.
    • On grafana2001:
      • Change the UID of the dangling files, e.g. # find / -user 110 -exec chown --no-dereference grafana {} \;.
      • Change the GID of the dangling files, e.g. # find / -group 118 -exec chgrp --no-dereference grafana {} \;.
  9. Ensure the daemons stopped in step 4 are started again, e.g. # systemctl start grafana.
  10. Ensure all services are operational, e.g. # systemctl list-units --type=service.

Merging 990795 before reimaging grafana1002 removes the need to update UID/GID on 2 hosts.
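
For reference, the manual UID/GID change above could be consolidated into a small script. This is only a sketch, to be run as root on one host at a time, with OLD_UID/OLD_GID set per the table above (grafana1002: 114/121, grafana2001: 110/118); the target ID 929 and the grafana-server/grafana-loki service names are taken from elsewhere in this task:

#!/bin/bash
# Sketch: move the grafana user/group to a reserved UID/GID (target: 929).
set -euo pipefail
NEW_ID=929
OLD_UID=114   # 110 on grafana2001
OLD_GID=121   # 118 on grafana2001

cp /etc/passwd /etc/passwd.bak               # backup in case something breaks

# Abort if the target ID is already in use by a user or group.
getent passwd "$NEW_ID" && { echo "UID $NEW_ID already taken"; exit 1; }
getent group "$NEW_ID" && { echo "GID $NEW_ID already taken"; exit 1; }

systemctl stop grafana-server grafana-loki   # stop services running as grafana (check ps -u grafana first)
usermod -u "$NEW_ID" grafana
groupmod -g "$NEW_ID" grafana
run-puppet-agent                             # re-owns Puppet-managed files

# Re-own files that Puppet does not manage.
find / -user "$OLD_UID" -exec chown --no-dereference grafana {} \;
find / -group "$OLD_GID" -exec chgrp --no-dereference grafana {} \;

systemctl start grafana-server grafana-loki
id grafana                                   # expect uid=929(grafana) gid=929(grafana)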

Grafana Hosts Failover Process:

Failover from grafana1002 to grafana2001:

  1. Update UID/GID on grafana2001 based on the instructions detailed in Task T352665.
  2. Stop the grafana services on both hosts from cumin2002:
    • $ sudo cumin 'A:grafana' 'systemctl stop grafana-server'
    • $ sudo cumin 'A:grafana' 'systemctl stop grafana-loki'
  3. Synchronize data from the active to the passive host. From cumin2002:
    • $ sudo cumin 'grafana1002*' 'sudo systemctl restart rsync-var-lib-grafana.timer'
    • $ sudo cumin 'grafana1002*' 'sudo systemctl restart rsync-loki-data.timer'
  4. Merge the following patches:
    1. grafana: Failover from grafana1002 to grafana2001 (Gerrit Change 992710).
    2. grafana: Ensure user traffic goes to grafana2001 (Gerrit Change 992719).
  5. Run puppet on both hosts:
    • $ sudo cumin 'A:grafana' 'run-puppet-agent'
  6. Verify that services are running as expected:
    • $ sudo cumin 'A:grafana' 'systemctl is-active grafana-server'
    • $ sudo cumin 'A:grafana' 'systemctl is-active grafana-loki'
  7. Access Grafana via a web browser.

Reimage grafana1002:

  1. Merge the following patch:
  2. Reimage grafana1002 using the following command:
    • $ sudo cookbook sre.hosts.reimage --os bookworm -t T352665 grafana1002
  3. Ensure services are running as expected on grafana1002:
    • $ sudo cumin 'grafana1002*' 'systemctl is-active grafana-server'
    • $ sudo cumin 'grafana1002*' 'systemctl is-active grafana-loki'

Failover to grafana1002:

  1. Stop the grafana services on both hosts from cumin2002:
    • $ sudo cumin 'A:grafana' 'systemctl stop grafana-server'
    • $ sudo cumin 'A:grafana' 'systemctl stop grafana-loki'
  2. Synchronize data from the active to the passive host. From cumin2002:
    • $ sudo cumin 'grafana2001*' 'sudo systemctl restart rsync-var-lib-grafana.timer'
    • $ sudo cumin 'grafana2001*' 'sudo systemctl restart rsync-loki-data.timer'
  3. Merge the following patches:
    1. Revert for: grafana: Failover from grafana1002 to grafana2001 (Gerrit Change 992710).
    2. Revert for: grafana: Ensure user traffic goes to grafana2001 (Gerrit Change 992719).
    3. Revert for: grafana: temp disable rsync stunnel for puppet7 migration (Gerrit Change 991542).
    4. grafana: Enable stunnel for Loki data transfer (Gerrit Change 994999).
  4. Run puppet on both hosts:
    • $ sudo cumin 'A:grafana' 'run-puppet-agent'
  5. Verify that services are running as expected:
    • $ sudo cumin 'A:grafana' 'systemctl is-active grafana-server'
    • $ sudo cumin 'A:grafana' 'systemctl is-active grafana-loki'
  6. Access Grafana via a web browser.

Procedure overall LGTM, a few notes:

  • we'll need to do a final manual rsync before the flip to make sure we have the latest changes
  • my understanding is that we'll be losing Loki data since that is not rsync'd right now, and I think it should be
  • for the traffic change to be effective we'll have to either wait for a Puppet run on the cp hosts or force one
  • we're not flipping the Loki destination (profile::logstash::collector::output_public_loki_host) to grafana2001, and I believe we should
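
For the final manual rsync mentioned in the first note, the existing oneshot units can be triggered once more from cumin2002 right before the flip; a sketch, using the unit names from the task description and assuming the Grafana services are already stopped:

  • $ sudo cumin 'grafana2001*' 'systemctl start rsync-var-lib-grafana && systemctl start rsync-loki-data'
  • $ sudo cumin 'grafana2001*' 'journalctl -u rsync-var-lib-grafana -u rsync-loki-data -n 20 --no-pager'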

Change 994786 had a related patch set uploaded (by Andrea Denisse; author: Andrea Denisse):

[operations/puppet@production] grafana: Ensure Loki data is synchronized across instances

https://gerrit.wikimedia.org/r/994786

Change 994999 had a related patch set uploaded (by Andrea Denisse; author: Andrea Denisse):

[operations/puppet@production] grafana: Enable stunnel for Loki data transfer

https://gerrit.wikimedia.org/r/994999

Hello @fgiunchedi, I appreciate your input. I've successfully incorporated the rsync process into the upgrade procedure for both Grafana and Loki data.

I'm currently collaborating with the Traffic team to determine the necessary steps to implement these changes effectively within Trafficserver. I'll make sure to keep you updated on this and document the process.

Change 994786 merged by Andrea Denisse:

[operations/puppet@production] grafana: Ensure Loki data is synchronized across instances

https://gerrit.wikimedia.org/r/994786

Change 998285 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/puppet@production] profile: use undef default for active and standby host defaults

https://gerrit.wikimedia.org/r/998285

Change 998285 merged by Cwhite:

[operations/puppet@production] profile: use undef default for active and standby host defaults

https://gerrit.wikimedia.org/r/998285

Mentioned in SAL (#wikimedia-operations) [2024-02-12T14:36:24Z] <denisse> starting Upgrade Grafana hosts to Bookworm - T352665

Change 992710 merged by Andrea Denisse:

[operations/puppet@production] grafana: Failover from grafana1002 to grafana2001

https://gerrit.wikimedia.org/r/992710

Change 992719 merged by Andrea Denisse:

[operations/puppet@production] grafana: Ensure user traffic goes to grafana2001

https://gerrit.wikimedia.org/r/992719

Mentioned in SAL (#wikimedia-operations) [2024-02-12T14:47:49Z] <denisse> Completed failover to grafana2001 - T352665

Change 1002569 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] hieradata: move grafana-next from codfw to eqiad

https://gerrit.wikimedia.org/r/1002569

Change 1002569 merged by Filippo Giunchedi:

[operations/puppet@production] hieradata: move grafana-next from codfw to eqiad

https://gerrit.wikimedia.org/r/1002569

Mentioned in SAL (#wikimedia-operations) [2024-02-12T15:07:21Z] <denisse> Reimage Standby Host (grafana1002) - T352665

Cookbook cookbooks.sre.hosts.reimage was started by denisse@cumin2002 for host grafana1002.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by denisse@cumin2002 for host grafana1002.eqiad.wmnet with OS bookworm completed:

  • grafana1002 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Set boot media to disk
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202402121519_denisse_2414318_grafana1002.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Mentioned in SAL (#wikimedia-operations) [2024-02-12T15:36:34Z] <denisse> Failover Back to grafana1002 - T352665

Change 994999 merged by Andrea Denisse:

[operations/puppet@production] grafana: Enable stunnel for Loki data transfer

https://gerrit.wikimedia.org/r/994999

Change 990795 abandoned by Andrea Denisse:

[operations/puppet@production] grafana: Create the grafana sysuser with a reserved UID/GID

Reason:

Abandoning as it's no longer required.

https://gerrit.wikimedia.org/r/990795

Closing this task as resolved because the Bookworm upgrade is completed.

I created T357666 to track the package upgrades.