
Migrate role::graphite::production to Bullseye
Closed, Resolved (Public)

Description

As per the title, all hosts running Graphite should be on Bullseye.

root@cumin1001:~# cumin 'P{C:graphite} and not P{F:lsbdistcodename = buster}'
2 hosts will be targeted:
graphite2003.codfw.wmnet,graphite1004.eqiad.wmnet
DRY-RUN mode enabled, aborting
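The selection logic of the query above can be sketched in Python: keep hosts that have a given Puppet class applied and whose `lsbdistcodename` fact differs from a given codename. This is only an illustration of the boolean filter, not how cumin or PuppetDB implement it; the `Host` type, the `hosts_to_migrate` helper, and the example inventory are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Host:
    """Hypothetical in-memory stand-in for what an inventory knows about a host."""
    fqdn: str
    classes: set = field(default_factory=set)
    facts: dict = field(default_factory=dict)


def hosts_to_migrate(hosts, puppet_class, codename):
    """Return sorted FQDNs with puppet_class applied and a different distro codename."""
    return sorted(
        h.fqdn
        for h in hosts
        if puppet_class in h.classes
        and h.facts.get("lsbdistcodename") != codename
    )


# Made-up inventory, for illustration only.
inventory = [
    Host("graphite-a.example.org", {"graphite"}, {"lsbdistcodename": "stretch"}),
    Host("graphite-b.example.org", {"graphite"}, {"lsbdistcodename": "bullseye"}),
    Host("web-a.example.org", {"webserver"}, {"lsbdistcodename": "stretch"}),
]

pending = hosts_to_migrate(inventory, "graphite", "bullseye")
```

With this made-up inventory, only the Graphite host that is not yet on the target release is selected.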
  • Get the graphite::production role to work in Pontoon on Bullseye

Action plan for codfw:

The plan for eqiad is similar, with the addition of a failover to codfw as per https://wikitech.wikimedia.org/wiki/Graphite#Failover, followed by a failback once things are working in eqiad.
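A failover followed by a failback can be modelled as one ordered list of traffic moves, applied once in each direction. The sketch below is a minimal illustration under assumptions: the `Step` type, the helper names, and the step order are hypothetical, and the step labels are loosely based on the patch subjects recorded in this task rather than on the wikitech procedure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    """One traffic move from a source host to a target host (hypothetical model)."""
    what: str
    source: str
    target: str


# Assumed labels and order, for illustration only.
TRAFFIC_KINDS = ["read traffic", "monitoring checks", "statsd writes", "carbon writes"]


def failover_plan(active, standby):
    """Build the moves that shift every traffic kind from active to standby."""
    return [Step(kind, active, standby) for kind in TRAFFIC_KINDS]


def failback_plan(plan):
    """Build the moves that undo a failover, in the same order, with hosts swapped."""
    return [Step(step.what, step.target, step.source) for step in plan]


out = failover_plan("graphite1004", "graphite2003")
back = failback_plan(out)
```

Keeping both directions derived from the same list makes it harder to forget a traffic kind when failing back.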

Details

Repo                          Branch      Lines +/-
operations/puppet             production  +10 -0
operations/mediawiki-config   master      +2 -2
operations/dns                master      +3 -3
operations/puppet             production  +1 -1
operations/puppet             production  +3 -3
operations/dns                master      +2 -2
operations/puppet             production  +4 -6
operations/deployment-charts  master      +6 -1
operations/puppet             production  +1 -67
operations/mediawiki-config   master      +2 -2
operations/dns                master      +3 -3
operations/puppet             production  +1 -1
operations/puppet             production  +3 -3
operations/dns                master      +2 -2
operations/puppet             production  +6 -0
operations/puppet             production  +1 -0
operations/puppet             production  +16 -4
operations/puppet             production  +1 -0
operations/puppet             production  +1 -0
operations/puppet             production  +2 -0
operations/puppet             production  +7 -1
operations/puppet             production  +1 -1
operations/puppet             production  +103 -0
operations/puppet             production  +21 -11
operations/puppet             production  +0 -5
operations/puppet             production  +7 -4
operations/puppet             production  +5 -0

Event Timeline

lmata renamed this task from Migrate role::graphite::production to Buster to Migrate role::graphite::production to Bullseye. Aug 31 2021, 4:19 PM
lmata triaged this task as Medium priority. Sep 30 2021, 9:48 PM
lmata moved this task from Inbox to Up next on the SRE Observability (FY2021/2022-Q2) board.

Change 726612 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] graphite: add Bullseye support

https://gerrit.wikimedia.org/r/726612

Change 726613 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] graphite: add Bullseye version of graphite auth/index

https://gerrit.wikimedia.org/r/726613

Change 726614 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] graphite: stop using LVM for /srv in labs

https://gerrit.wikimedia.org/r/726614

Change 726614 merged by Filippo Giunchedi:

[operations/puppet@production] graphite: stop using LVM for /srv in labs

https://gerrit.wikimedia.org/r/726614

Change 726612 merged by Filippo Giunchedi:

[operations/puppet@production] graphite: add Bullseye support

https://gerrit.wikimedia.org/r/726612

Change 726613 merged by Filippo Giunchedi:

[operations/puppet@production] graphite: add Bullseye version of graphite auth/index

https://gerrit.wikimedia.org/r/726613

Change 726750 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] pontoon: use graphite-04 in o11y stack

https://gerrit.wikimedia.org/r/726750

Change 726750 merged by Filippo Giunchedi:

[operations/puppet@production] pontoon: use graphite-04 in o11y stack

https://gerrit.wikimedia.org/r/726750

A few roadblocks and bugs, but overall progress. So far:

  • graphite-web isn't in stable, so I've imported the testing version into bullseye-wikimedia
  • carbon-c-relay has a CPU-hogging bug in stable, so I've imported the testing version into bullseye-wikimedia
  • statsite needed an update (new upstream version and python3). It is a local package, and the new version lives in bullseye-wikimedia

Change 727293 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] statsite: switch to python3 on Bullseye

https://gerrit.wikimedia.org/r/727293

Change 727294 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] graphite: set settings_module from uwsgi

https://gerrit.wikimedia.org/r/727294

Change 727295 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] statsite: log instance identifier

https://gerrit.wikimedia.org/r/727295

Change 727293 merged by Filippo Giunchedi:

[operations/puppet@production] statsite: switch to python3 on Bullseye

https://gerrit.wikimedia.org/r/727293

Change 727294 merged by Filippo Giunchedi:

[operations/puppet@production] graphite: set settings_module from uwsgi

https://gerrit.wikimedia.org/r/727294

Change 727295 merged by Filippo Giunchedi:

[operations/puppet@production] statsite: log instance identifier

https://gerrit.wikimedia.org/r/727295

Cookbook cookbooks.sre.hosts.reimage was started by filippo@cumin1001 for host graphite2003.codfw.wmnet

Change 729934 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] install_server: use standard recipe for graphite2003

https://gerrit.wikimedia.org/r/729934

Change 729934 merged by Filippo Giunchedi:

[operations/puppet@production] install_server: use standard recipe for graphite2003

https://gerrit.wikimedia.org/r/729934

Cookbook cookbooks.sre.hosts.reimage started by filippo@cumin1001 for host graphite2003.codfw.wmnet completed:

  • graphite2003 (PASS)
    • Downtimed on Icinga
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202110110913_filippo_4002_graphite2003.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
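The reimage log file names recorded in this task appear to follow a `<timestamp>_<user>_<number>_<host>.out` pattern. The sketch below splits such a name into its parts. The layout is inferred only from the names shown here, the meaning of the numeric field is not stated in the source, and `parse_reimage_log_name` is a hypothetical helper, not part of spicerack.

```python
import re
from datetime import datetime

# Assumed layout: 12-digit timestamp, user, a number, host short name, ".out".
LOG_NAME = re.compile(
    r"^(?P<stamp>\d{12})_(?P<user>[^_]+)_(?P<num>\d+)_(?P<host>.+)\.out$"
)


def parse_reimage_log_name(name):
    """Split a reimage log file name into its (assumed) components."""
    match = LOG_NAME.match(name)
    if match is None:
        raise ValueError(f"unexpected log file name: {name!r}")
    return {
        "started": datetime.strptime(match["stamp"], "%Y%m%d%H%M"),
        "user": match["user"],
        "number": int(match["num"]),  # meaning not documented in this task
        "host": match["host"],
    }


info = parse_reimage_log_name("202110110913_filippo_4002_graphite2003.out")
```

Sorting parsed entries by `started` gives the order in which reimage runs for a host were logged.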

Cookbook cookbooks.sre.hosts.reimage was started by filippo@cumin1001 for host graphite2003.codfw.wmnet

Cookbook cookbooks.sre.hosts.reimage started by filippo@cumin1001 for host graphite2003.codfw.wmnet completed:

  • graphite2003 (PASS)
    • Downtimed on Icinga
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202110110950_filippo_29925_graphite2003.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Change 729968 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] graphite: disable tags support

https://gerrit.wikimedia.org/r/729968

Change 729975 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] graphite: move production to /srv/carbon as storage directory

https://gerrit.wikimedia.org/r/729975

Change 729975 merged by Filippo Giunchedi:

[operations/puppet@production] graphite: move production to /srv/carbon as storage directory

https://gerrit.wikimedia.org/r/729975

Change 730427 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] graphite: expire metric files not updated for 3y

https://gerrit.wikimedia.org/r/730427

Change 729968 merged by Filippo Giunchedi:

[operations/puppet@production] graphite: disable tags support

https://gerrit.wikimedia.org/r/729968

Change 730427 merged by Filippo Giunchedi:

[operations/puppet@production] graphite: expire metric files not updated for 3y

https://gerrit.wikimedia.org/r/730427
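The idea behind expiring metric files that have not been updated for three years can be sketched as a walk over the storage directory that compares each file's modification time with a cutoff. This is not the content of the Puppet change, which is not shown in this task; the function name, the `.wsp` suffix filter (Whisper's usual file extension), and the 365-day year are assumptions. The sketch only lists candidates and deletes nothing.

```python
import os
import time

# Assumption: a "year" is counted as 365 days.
THREE_YEARS_SECONDS = 3 * 365 * 24 * 60 * 60


def stale_metric_files(root, max_age_seconds=THREE_YEARS_SECONDS, now=None):
    """Return sorted paths of .wsp files under root not modified within max_age_seconds."""
    now = time.time() if now is None else now
    cutoff = now - max_age_seconds
    stale = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith(".wsp"):
                continue  # ignore anything that is not a Whisper file
            path = os.path.join(dirpath, name)
            if os.path.getmtime(path) < cutoff:
                stale.append(path)
    return sorted(stale)
```

Returning the list instead of deleting in place makes it possible to review the candidates before removing anything.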

Mentioned in SAL (#wikimedia-operations) [2021-10-18T09:38:04Z] <godog> sync metrics from graphite1004 to graphite2003 - T247963

Change 731433 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] statsd: failover writes to graphite2003

https://gerrit.wikimedia.org/r/731433

Change 731434 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] monitoring: check graphite2003 metrics

https://gerrit.wikimedia.org/r/731434

Change 731435 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/dns@master] discovery: move read traffic to graphite2003

https://gerrit.wikimedia.org/r/731435

Change 731436 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/dns@master] wmnet: move writes to graphite2003

https://gerrit.wikimedia.org/r/731436

Change 731435 merged by Filippo Giunchedi:

[operations/dns@master] discovery: move read traffic to graphite2003

https://gerrit.wikimedia.org/r/731435

Mentioned in SAL (#wikimedia-operations) [2021-10-19T08:50:22Z] <godog> point graphite.discovery.wmnet to graphite2003 - T247963

Change 731434 merged by Filippo Giunchedi:

[operations/puppet@production] monitoring: check graphite2003 metrics

https://gerrit.wikimedia.org/r/731434

Mentioned in SAL (#wikimedia-operations) [2021-10-19T09:37:11Z] <godog> move graphite/statsd writes to graphite2003 - T247963

Change 731433 merged by Filippo Giunchedi:

[operations/puppet@production] statsd: failover writes to graphite2003

https://gerrit.wikimedia.org/r/731433

Change 731436 merged by Filippo Giunchedi:

[operations/dns@master] wmnet: move writes to graphite2003

https://gerrit.wikimedia.org/r/731436

Change 731917 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/deployment-charts@master] mwdebug: add graphite2003 to network policies

https://gerrit.wikimedia.org/r/731917

Change 731918 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/mediawiki-config@master] ProductionServices: use graphite2003 for statsd

https://gerrit.wikimedia.org/r/731918

Change 731918 merged by jenkins-bot:

[operations/mediawiki-config@master] ProductionServices: use graphite2003 for statsd

https://gerrit.wikimedia.org/r/731918

Mentioned in SAL (#wikimedia-operations) [2021-10-19T10:21:26Z] <oblivian@deploy1002> Synchronized wmf-config/ProductionServices.php: Config: [[gerrit:731918|ProductionServices: use graphite2003 for statsd (T247963)]] (duration: 00m 54s)

Mentioned in SAL (#wikimedia-operations) [2021-10-19T10:22:38Z] <oblivian@deploy1002> Synchronized tests/WmfConfigServicesTest.php: Config: [[gerrit:731918|ProductionServices: use graphite2003 for statsd (T247963)]] (duration: 00m 54s)

Mentioned in SAL (#wikimedia-operations) [2021-10-19T10:45:26Z] <godog> bounce superset on an-tool1010 to pick up statsd changes - T247963

Mentioned in SAL (#wikimedia-operations) [2021-10-19T10:45:37Z] <godog> bounce navtiming on webperf1001 to pick up statsd changes - T247963

Mentioned in SAL (#wikimedia-operations) [2021-10-19T10:50:05Z] <godog> bounce superset on an-tool1005 to pick up statsd changes - T247963

Change 732273 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] install_server: use standard recipe for all graphite hosts

https://gerrit.wikimedia.org/r/732273

Change 732273 merged by Filippo Giunchedi:

[operations/puppet@production] install_server: use standard recipe for all graphite hosts

https://gerrit.wikimedia.org/r/732273

Change 731917 merged by jenkins-bot:

[operations/deployment-charts@master] mwdebug: fix statsd network policy

https://gerrit.wikimedia.org/r/731917

Cookbook cookbooks.sre.hosts.reimage was started by filippo@cumin1001 for host graphite1004.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by filippo@cumin1001 for host graphite1004.eqiad.wmnet with OS bullseye completed:

  • graphite1004 (PASS)
    • Downtimed on Icinga
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202110210755_filippo_29322_graphite1004.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Change 734224 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] graphite: bump fetch_timeout

https://gerrit.wikimedia.org/r/734224

Change 734225 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] graphite: set CLUSTER_SERVERS empty with no remote servers

https://gerrit.wikimedia.org/r/734225

Change 734277 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/dns@master] Revert "discovery: move read traffic to graphite2003"

https://gerrit.wikimedia.org/r/734277

Change 734278 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] Revert "statsd: failover writes to graphite2003"

https://gerrit.wikimedia.org/r/734278

Change 734279 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] Revert "monitoring: check graphite2003 metrics"

https://gerrit.wikimedia.org/r/734279

Change 734280 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/dns@master] Revert "wmnet: move writes to graphite2003"

https://gerrit.wikimedia.org/r/734280

Change 734281 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/mediawiki-config@master] Revert "ProductionServices: use graphite2003 for statsd"

https://gerrit.wikimedia.org/r/734281

Change 734225 merged by Filippo Giunchedi:

[operations/puppet@production] graphite: set CLUSTER_SERVERS empty with no remote servers

https://gerrit.wikimedia.org/r/734225
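`CLUSTER_SERVERS` is the graphite-web setting that lists remote webapps to query. The subject of this change suggests it should end up empty when a host has no remote peers. The sketch below shows one way such a value could be derived; what the patch actually changes is not shown in this task, and the `cluster_servers` helper, the host names, and the default port are hypothetical.

```python
def cluster_servers(all_hosts, local_host, port=80):
    """Return 'host:port' entries for every host except the local one.

    With no remote peers the result is an empty list, so the local
    webapp would not try to fan out queries to anyone.
    """
    return [f"{host}:{port}" for host in all_hosts if host != local_host]


# Made-up host names, for illustration only.
single = cluster_servers(["graphite-a.example.org"], "graphite-a.example.org")
paired = cluster_servers(
    ["graphite-a.example.org", "graphite-b.example.org"],
    "graphite-a.example.org",
)
```

The resulting list could then be assigned to `CLUSTER_SERVERS` in a graphite-web `local_settings.py`.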

Change 734277 merged by Filippo Giunchedi:

[operations/dns@master] Revert "discovery: move read traffic to graphite2003"

https://gerrit.wikimedia.org/r/734277

Mentioned in SAL (#wikimedia-operations) [2021-10-26T09:27:13Z] <godog> move read traffic back to graphite1004 - T247963

Change 734279 merged by Filippo Giunchedi:

[operations/puppet@production] Revert "monitoring: check graphite2003 metrics"

https://gerrit.wikimedia.org/r/734279

Change 734278 merged by Filippo Giunchedi:

[operations/puppet@production] Revert "statsd: failover writes to graphite2003"

https://gerrit.wikimedia.org/r/734278

Mentioned in SAL (#wikimedia-operations) [2021-10-26T09:40:19Z] <godog> flip back write traffic to graphite1004 (all but mediawiki) - T247963

Change 734280 merged by Filippo Giunchedi:

[operations/dns@master] Revert "wmnet: move writes to graphite2003"

https://gerrit.wikimedia.org/r/734280

Mentioned in SAL (#wikimedia-operations) [2021-10-26T09:47:13Z] <godog> bounce navtiming on webperf1001 to pick up statsd changes - T247963

Mentioned in SAL (#wikimedia-operations) [2021-10-26T09:49:16Z] <godog> bounce superset on an-tool1010 to pick up statsd changes - T247963

Mentioned in SAL (#wikimedia-operations) [2021-10-26T09:49:24Z] <godog> bounce superset on an-tool1005 to pick up statsd changes - T247963

Change 734281 merged by jenkins-bot:

[operations/mediawiki-config@master] Revert "ProductionServices: use graphite2003 for statsd"

https://gerrit.wikimedia.org/r/734281

fgiunchedi claimed this task.
fgiunchedi added a subscriber: Joe.

This is complete! Both graphite2003 and graphite1004 now run Bullseye, and the failover documentation is up to date. Thanks @Joe for the assistance with the mw config deploys.

Change 734224 abandoned by Filippo Giunchedi:

[operations/puppet@production] graphite: bump fetch_timeout

Reason:

Not needed

https://gerrit.wikimedia.org/r/734224

Mentioned in SAL (#wikimedia-operations) [2021-12-09T03:37:10Z] <cwhite> bounce superset on an-tool1010 and 1005 to pick up statsd changes T247963

Mentioned in SAL (#wikimedia-operations) [2022-11-30T09:30:06Z] <godog> bounce superset on an-tool1010 to pick up statsd changes - T247963

Mentioned in SAL (#wikimedia-operations) [2022-11-30T09:32:48Z] <godog> bounce superset on an-tool1005 to pick up statsd changes - T247963