
Migrate role::graphite::production to Bullseye
Open, Medium, Public

Description

As per title, all hosts running Graphite should be on Bullseye

root@cumin1001:~# cumin 'P{C:graphite} and not P{F:lsbdistcodename = buster}'
2 hosts will be targeted:
graphite2003.codfw.wmnet,graphite1004.eqiad.wmnet
DRY-RUN mode enabled, aborting
  • Get the graphite::production role to work in Pontoon on Bullseye
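
The cumin selector above combines a Puppet class match (`C:graphite`) with a negated fact match (`F:lsbdistcodename = buster`). As a rough illustration of that boolean logic only, and not of cumin's own API, the Python sketch below filters a small in-memory inventory. The host names, the `classes`/`facts` field names and the fact values are all hypothetical and chosen for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Host:
    """Hypothetical inventory record: a name, applied Puppet classes, and facts."""
    name: str
    classes: set = field(default_factory=set)
    facts: dict = field(default_factory=dict)


def select(hosts, puppet_class, fact, excluded_value):
    """Return sorted names of hosts that have puppet_class applied
    and whose fact value differs from excluded_value.

    Note: in this sketch a host that lacks the fact entirely is also
    selected; a real inventory backend may treat that case differently.
    """
    return sorted(
        h.name
        for h in hosts
        if puppet_class in h.classes and h.facts.get(fact) != excluded_value
    )


# Invented inventory, for illustration only.
INVENTORY = [
    Host("graphite-a.example", {"graphite"}, {"lsbdistcodename": "stretch"}),
    Host("graphite-b.example", {"graphite"}, {"lsbdistcodename": "buster"}),
    Host("web-a.example", {"apache"}, {"lsbdistcodename": "stretch"}),
]

if __name__ == "__main__":
    print(select(INVENTORY, "graphite", "lsbdistcodename", "buster"))
```

Only hosts that carry the class and are not on the excluded release are returned, which mirrors the `and not` in the selector.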

Action plan for codfw:

The plan for eqiad is similar, with the addition of a failover to codfw as per https://wikitech.wikimedia.org/wiki/Graphite#Failover and fail back once things are working in eqiad.

Details

Project                        Branch      Lines +/-
operations/deployment-charts   master      +6 -1
operations/puppet              production  +1 -67
operations/mediawiki-config    master      +2 -2
operations/dns                 master      +3 -3
operations/puppet              production  +1 -1
operations/puppet              production  +3 -3
operations/dns                 master      +2 -2
operations/puppet              production  +6 -0
operations/puppet              production  +1 -0
operations/puppet              production  +16 -4
operations/puppet              production  +1 -0
operations/puppet              production  +1 -0
operations/puppet              production  +2 -0
operations/puppet              production  +7 -1
operations/puppet              production  +1 -1
operations/puppet              production  +103 -0
operations/puppet              production  +21 -11
operations/puppet              production  +0 -5
operations/puppet              production  +7 -4
operations/puppet              production  +5 -0

Event Timeline

Change 589576 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] graphite: django 2.2 compat

https://gerrit.wikimedia.org/r/589576

Change 589599 had a related patch set uploaded (by Muehlenhoff; owner: Muehlenhoff):
[operations/puppet@production] Fix installation of graphite-web on Buster

https://gerrit.wikimedia.org/r/589599

Change 589576 abandoned by Filippo Giunchedi:
graphite: django 2.2 compat

Reason:
Per Moritz, fixed by https://gerrit.wikimedia.org/r/c/operations/puppet/+/589599

https://gerrit.wikimedia.org/r/589576

Change 589599 merged by Muehlenhoff:
[operations/puppet@production] Fix installation of graphite-web on Buster

https://gerrit.wikimedia.org/r/589599

lmata renamed this task from Migrate role::graphite::production to Buster to Migrate role::graphite::production to Bullseye. Aug 31 2021, 4:19 PM
lmata triaged this task as Medium priority. Thu, Sep 30, 9:48 PM
lmata moved this task from Inbox to Up next on the SRE Observability (FY2021/2022-Q2) board.

Change 726612 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] graphite: add Bullseye support

https://gerrit.wikimedia.org/r/726612

Change 726613 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] graphite: add Bullseye version of graphite auth/index

https://gerrit.wikimedia.org/r/726613

Change 726614 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] graphite: stop using LVM for /srv in labs

https://gerrit.wikimedia.org/r/726614

Change 726614 merged by Filippo Giunchedi:

[operations/puppet@production] graphite: stop using LVM for /srv in labs

https://gerrit.wikimedia.org/r/726614

Change 726612 merged by Filippo Giunchedi:

[operations/puppet@production] graphite: add Bullseye support

https://gerrit.wikimedia.org/r/726612

Change 726613 merged by Filippo Giunchedi:

[operations/puppet@production] graphite: add Bullseye version of graphite auth/index

https://gerrit.wikimedia.org/r/726613

Change 726750 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] pontoon: use graphite-04 in o11y stack

https://gerrit.wikimedia.org/r/726750

Change 726750 merged by Filippo Giunchedi:

[operations/puppet@production] pontoon: use graphite-04 in o11y stack

https://gerrit.wikimedia.org/r/726750

A few roadblocks and bugs, but overall progress so far:

  • graphite-web isn't in stable, so I've imported the testing version into bullseye-wikimedia
  • carbon-c-relay has a CPU-hogging bug in stable, so I've imported the testing version into bullseye-wikimedia
  • statsite needed an update (new upstream version and python3). It is a local package, and the new version lives in bullseye-wikimedia

Change 727293 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] statsite: switch to python3 on Bullseye

https://gerrit.wikimedia.org/r/727293

Change 727294 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] graphite: set settings_module from uwsgi

https://gerrit.wikimedia.org/r/727294

Change 727295 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] statsite: log instance identifier

https://gerrit.wikimedia.org/r/727295

Change 727293 merged by Filippo Giunchedi:

[operations/puppet@production] statsite: switch to python3 on Bullseye

https://gerrit.wikimedia.org/r/727293

Change 727294 merged by Filippo Giunchedi:

[operations/puppet@production] graphite: set settings_module from uwsgi

https://gerrit.wikimedia.org/r/727294

Change 727295 merged by Filippo Giunchedi:

[operations/puppet@production] statsite: log instance identifier

https://gerrit.wikimedia.org/r/727295

Cookbook cookbooks.sre.hosts.reimage was started by filippo@cumin1001 for host graphite2003.codfw.wmnet

Change 729934 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] install_server: use standard recipe for graphite2003

https://gerrit.wikimedia.org/r/729934

Change 729934 merged by Filippo Giunchedi:

[operations/puppet@production] install_server: use standard recipe for graphite2003

https://gerrit.wikimedia.org/r/729934

Cookbook cookbooks.sre.hosts.reimage started by filippo@cumin1001 for host graphite2003.codfw.wmnet completed:

  • graphite2003 (PASS)
    • Downtimed on Icinga
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202110110913_filippo_4002_graphite2003.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by filippo@cumin1001 for host graphite2003.codfw.wmnet

Cookbook cookbooks.sre.hosts.reimage started by filippo@cumin1001 for host graphite2003.codfw.wmnet completed:

  • graphite2003 (PASS)
    • Downtimed on Icinga
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202110110950_filippo_29925_graphite2003.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Change 729968 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] graphite: disable tags support

https://gerrit.wikimedia.org/r/729968

Change 729975 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] graphite: move production to /srv/carbon as storage directory

https://gerrit.wikimedia.org/r/729975

Change 729975 merged by Filippo Giunchedi:

[operations/puppet@production] graphite: move production to /srv/carbon as storage directory

https://gerrit.wikimedia.org/r/729975

Change 730427 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] graphite: expire metric files not updated for 3y

https://gerrit.wikimedia.org/r/730427

Change 729968 merged by Filippo Giunchedi:

[operations/puppet@production] graphite: disable tags support

https://gerrit.wikimedia.org/r/729968

Change 730427 merged by Filippo Giunchedi:

[operations/puppet@production] graphite: expire metric files not updated for 3y

https://gerrit.wikimedia.org/r/730427
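
The change title describes expiring metric files that have not been updated for three years. How the Puppet change implements this is not shown in this task, so the following is only a minimal Python sketch of the general idea: walk a storage directory and pick files whose modification time is older than a cutoff. The function names, the `*.wsp` glob (the usual Whisper file extension) and the default to a dry run are assumptions made for illustration.

```python
import time
from pathlib import Path

# Three years expressed in seconds (ignoring leap days; close enough for a cutoff).
THREE_YEARS_SECONDS = 3 * 365 * 24 * 60 * 60


def stale_metric_files(root, max_age_seconds=THREE_YEARS_SECONDS, now=None):
    """Yield *.wsp files under root whose mtime is older than max_age_seconds."""
    now = time.time() if now is None else now
    cutoff = now - max_age_seconds
    for path in Path(root).rglob("*.wsp"):
        if path.is_file() and path.stat().st_mtime < cutoff:
            yield path


def expire(root, dry_run=True, **kwargs):
    """Return the sorted list of stale files; delete them unless dry_run is True."""
    stale = sorted(stale_metric_files(root, **kwargs))
    if not dry_run:
        for path in stale:
            path.unlink()
    return stale
```

Calling `expire("/srv/carbon")` would only list candidates; passing `dry_run=False` removes them. Starting with a dry run makes it possible to review the list before anything is deleted.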

Mentioned in SAL (#wikimedia-operations) [2021-10-18T09:38:04Z] <godog> sync metrics from graphite1004 to graphite2003 - T247963

Change 731433 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] statsd: failover writes to graphite2003

https://gerrit.wikimedia.org/r/731433

Change 731434 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] monitoring: check graphite2003 metrics

https://gerrit.wikimedia.org/r/731434

Change 731435 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/dns@master] discovery: move read traffic to graphite2003

https://gerrit.wikimedia.org/r/731435

Change 731436 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/dns@master] wmnet: move writes to graphite2003

https://gerrit.wikimedia.org/r/731436

Change 731435 merged by Filippo Giunchedi:

[operations/dns@master] discovery: move read traffic to graphite2003

https://gerrit.wikimedia.org/r/731435

Mentioned in SAL (#wikimedia-operations) [2021-10-19T08:50:22Z] <godog> point graphite.discovery.wmnet to graphite2003 - T247963
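
After a DNS change like the one logged above, it can be useful to confirm that the service name now resolves to the intended host. The following is a small Python sketch of such a check, not a tool used in this task: `points_at` and `resolve` are hypothetical helper names, and the resolver is injectable so the logic can be exercised without network access. Only IPv4 addresses are considered, since `socket.gethostbyname_ex` does not return IPv6 results.

```python
import socket


def resolve(name, resolver=socket.gethostbyname_ex):
    """Return the set of IPv4 addresses that name resolves to."""
    _hostname, _aliases, addresses = resolver(name)
    return set(addresses)


def points_at(service_name, target_host, resolver=socket.gethostbyname_ex):
    """True if every address of service_name is also an address of target_host."""
    service_addresses = resolve(service_name, resolver)
    return bool(service_addresses) and service_addresses <= resolve(target_host, resolver)


if __name__ == "__main__":
    # Names taken from the log entry above; running this needs access to
    # the internal resolvers, so outside that network it will raise an error.
    print(points_at("graphite.discovery.wmnet", "graphite2003.codfw.wmnet"))
```

Cached answers can make such a check lag behind the change until the record's TTL expires.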

Change 731434 merged by Filippo Giunchedi:

[operations/puppet@production] monitoring: check graphite2003 metrics

https://gerrit.wikimedia.org/r/731434

Mentioned in SAL (#wikimedia-operations) [2021-10-19T09:37:11Z] <godog> move graphite/statsd writes to graphite2003 - T247963

Change 731433 merged by Filippo Giunchedi:

[operations/puppet@production] statsd: failover writes to graphite2003

https://gerrit.wikimedia.org/r/731433

Change 731436 merged by Filippo Giunchedi:

[operations/dns@master] wmnet: move writes to graphite2003

https://gerrit.wikimedia.org/r/731436

Change 731917 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/deployment-charts@master] mwdebug: add graphite2003 to network policies

https://gerrit.wikimedia.org/r/731917

Change 731918 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/mediawiki-config@master] ProductionServices: use graphite2003 for statsd

https://gerrit.wikimedia.org/r/731918

Change 731918 merged by jenkins-bot:

[operations/mediawiki-config@master] ProductionServices: use graphite2003 for statsd

https://gerrit.wikimedia.org/r/731918

Mentioned in SAL (#wikimedia-operations) [2021-10-19T10:21:26Z] <oblivian@deploy1002> Synchronized wmf-config/ProductionServices.php: Config: [[gerrit:731918|ProductionServices: use graphite2003 for statsd (T247963)]] (duration: 00m 54s)

Mentioned in SAL (#wikimedia-operations) [2021-10-19T10:22:38Z] <oblivian@deploy1002> Synchronized tests/WmfConfigServicesTest.php: Config: [[gerrit:731918|ProductionServices: use graphite2003 for statsd (T247963)]] (duration: 00m 54s)

Mentioned in SAL (#wikimedia-operations) [2021-10-19T10:45:26Z] <godog> bounce superset on an-tool1010 to pick up statsd changes - T247963

Mentioned in SAL (#wikimedia-operations) [2021-10-19T10:45:37Z] <godog> bounce navtiming on webperf1001 to pick up statsd changes - T247963

Mentioned in SAL (#wikimedia-operations) [2021-10-19T10:50:05Z] <godog> bounce superset on an-tool1005 to pick up statsd changes - T247963

Change 732273 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] install_server: use standard recipe for all graphite hosts

https://gerrit.wikimedia.org/r/732273

Change 732273 merged by Filippo Giunchedi:

[operations/puppet@production] install_server: use standard recipe for all graphite hosts

https://gerrit.wikimedia.org/r/732273

Change 731917 merged by jenkins-bot:

[operations/deployment-charts@master] mwdebug: fix statsd network policy

https://gerrit.wikimedia.org/r/731917

Cookbook cookbooks.sre.hosts.reimage was started by filippo@cumin1001 for host graphite1004.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by filippo@cumin1001 for host graphite1004.eqiad.wmnet with OS bullseye completed:

  • graphite1004 (PASS)
    • Downtimed on Icinga
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202110210755_filippo_29322_graphite1004.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB