Upgrade parsercache infra to Bullseye
Open, Medium, Public

Description

Let's upgrade the parsercache hosts to Bullseye.

pc1:

  • pc2014
  • pc2011
  • pc1014 (floating host)
  • pc1011

pc2:

  • pc2012
  • pc1012

pc3:

  • pc2013
  • pc1013
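
The host layout above can be written down as a small inventory, which makes it easy to group hosts by datacenter before planning reimages. This is only a sketch: `SECTIONS`, `datacenter()` and `hosts_by_datacenter()` are hypothetical names, and the mapping from the first digit of the host number to a datacenter (1 for eqiad, 2 for codfw) is an assumption based on the FQDNs used in the reimage runs on this task.

```python
from typing import Dict, List

# Hypothetical inventory of the parsercache sections listed in the description.
SECTIONS: Dict[str, List[str]] = {
    "pc1": ["pc2014", "pc2011", "pc1014", "pc1011"],  # pc1014 is the floating host
    "pc2": ["pc2012", "pc1012"],
    "pc3": ["pc2013", "pc1013"],
}

# Assumption: the first digit of the host number encodes the datacenter.
DC_BY_DIGIT = {"1": "eqiad", "2": "codfw"}


def datacenter(host: str) -> str:
    """Return the datacenter guessed from a host name such as 'pc2012'."""
    return DC_BY_DIGIT[host[2]]


def hosts_by_datacenter(sections: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Group every host in the inventory by datacenter, sorted by name."""
    grouped: Dict[str, List[str]] = {}
    for hosts in sections.values():
        for host in hosts:
            grouped.setdefault(datacenter(host), []).append(host)
    return {dc: sorted(hosts) for dc, hosts in grouped.items()}


if __name__ == "__main__":
    for dc, hosts in hosts_by_datacenter(SECTIONS).items():
        print(dc, hosts)
```

Grouping by datacenter rather than by section matches how the work was done here, with codfw handled before eqiad.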

Event Timeline

Marostegui changed the task status from Open to Stalled. Wed, Jan 12, 12:39 PM
Marostegui triaged this task as Medium priority.
Marostegui moved this task from Triage to Blocked on the DBA board.

On hold until we are happy with the performance of pc1011 (T295965).

Marostegui changed the task status from Stalled to Open. Fri, Jan 14, 6:59 AM
Marostegui moved this task from Blocked to In progress on the DBA board.

I haven't seen anything concerning performance-wise on pc1011, so I think it is OK to go ahead and migrate our parsercache infra to Bullseye.

Change 753874 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] pc2012: Disable notifications

https://gerrit.wikimedia.org/r/753874

Change 753874 merged by Marostegui:

[operations/puppet@production] pc2012: Disable notifications

https://gerrit.wikimedia.org/r/753874

Cookbook cookbooks.sre.hosts.reimage was started by marostegui@cumin1001 for host pc2012.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by marostegui@cumin1001 for host pc2012.codfw.wmnet with OS bullseye completed:

  • pc2012 (WARN)
    • Downtimed on Icinga
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202201140705_marostegui_5677_pc2012.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

Change 753912 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] pc2013: Disable notifications

https://gerrit.wikimedia.org/r/753912

Change 753912 merged by Marostegui:

[operations/puppet@production] pc2013: Disable notifications

https://gerrit.wikimedia.org/r/753912

Cookbook cookbooks.sre.hosts.reimage was started by marostegui@cumin1001 for host pc2013.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by marostegui@cumin1001 for host pc2013.codfw.wmnet with OS bullseye completed:

  • pc2013 (WARN)
    • Downtimed on Icinga
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202201140833_marostegui_17247_pc2013.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

Mentioned in SAL (#wikimedia-operations) [2022-01-14T09:11:17Z] <marostegui> Move pc1014 from pc1 to pc2 T299046

Change 753943 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] pc2011: Disable notifications

https://gerrit.wikimedia.org/r/753943

Change 753943 merged by Marostegui:

[operations/puppet@production] pc2011: Disable notifications

https://gerrit.wikimedia.org/r/753943

Cookbook cookbooks.sre.hosts.reimage was started by marostegui@cumin1001 for host pc2011.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by marostegui@cumin1001 for host pc2011.codfw.wmnet with OS bullseye completed:

  • pc2011 (WARN)
    • Downtimed on Icinga
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202201141259_marostegui_24589_pc2011.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

All codfw hosts are now on Bullseye.

Change 754784 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] pc1014: Disable notifications

https://gerrit.wikimedia.org/r/754784

Change 754784 merged by Marostegui:

[operations/puppet@production] pc1014: Disable notifications

https://gerrit.wikimedia.org/r/754784

Change 754805 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] pc1014: Move it to pc2

https://gerrit.wikimedia.org/r/754805

Change 754805 merged by Marostegui:

[operations/puppet@production] pc1014: Move it to pc2

https://gerrit.wikimedia.org/r/754805

Cookbook cookbooks.sre.hosts.reimage was started by marostegui@cumin1001 for host pc1014.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by marostegui@cumin1001 for host pc1014.eqiad.wmnet with OS bullseye executed with errors:

  • pc1014 (FAIL)
    • Downtimed on Icinga
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • The reimage failed, see the cookbook logs for the details

I have finished this reimage manually, but I am going to run it again to see why it failed, as it failed quickly at:

Running Puppet with args --quiet --attempts 30 on 1 hosts: alert1001.wikimedia.org
----- OUTPUT of 'run-puppet-agent...et --attempts 30' -----
================
PASS |                                                                                       |   0% (0/1) [00:09<?, ?hosts/s]
FAIL |███████████████████████████████████████████████████████████████████████████████| 100% (1/1) [00:09<00:00,  9.41s/hosts]
100.0% (1/1) of nodes failed to execute command 'run-puppet-agent...et --attempts 30': alert1001.wikimedia.org
0.0% (0/1) success ratio (< 100.0% threshold) for command: 'run-puppet-agent...et --attempts 30'. Aborting.
0.0% (0/1) success ratio (< 100.0% threshold) of nodes successfully executed all commands. Aborting.
Exception raised while executing cookbook sre.hosts.reimage:
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/spicerack/_menu.py", line 234, in run
    raw_ret = runner.run()
  File "/srv/deployment/spicerack/cookbooks/sre/hosts/reimage.py", line 466, in run
    self.downtime.get_runner(self.downtime.argument_parser().parse_args(downtime_args)).run()
  File "/srv/deployment/spicerack/cookbooks/sre/hosts/downtime.py", line 119, in run
    self.puppet.run(quiet=True, attempts=30)
  File "/usr/lib/python3/dist-packages/spicerack/puppet.py", line 193, in run
    self._remote_hosts.run_sync(Command(command, timeout=timeout), batch_size=batch_size)
  File "/usr/lib/python3/dist-packages/spicerack/remote.py", line 528, in run_sync
    print_progress_bars=print_progress_bars,
  File "/usr/lib/python3/dist-packages/spicerack/remote.py", line 720, in _execute
    raise RemoteExecutionError(ret, "Cumin execution failed")
spicerack.remote.RemoteExecutionError: Cumin execution failed (exit_code=2)
**The reimage failed, see the cookbook logs for the details**

Cookbook cookbooks.sre.hosts.reimage was started by marostegui@cumin1001 for host pc1014.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by marostegui@cumin1001 for host pc1014.eqiad.wmnet with OS bullseye executed with errors:

  • pc1014 (FAIL)
    • Downtimed on Icinga
    • Unable to disable Puppet, the host may have been unreachable
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • The reimage failed, see the cookbook logs for the details

This happened again, and it is because Puppet is failing on alert1001.
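
The abort in the log output follows from the success threshold: alert1001.wikimedia.org was the only host in that batch, its Puppet run failed, and 0/1 is below the required 100%. A minimal sketch of that kind of check is below. It is not the actual Cumin or Spicerack code; `check_success_ratio` and `ExecutionAborted` are hypothetical names used only for illustration.

```python
class ExecutionAborted(Exception):
    """Raised when too few hosts completed a command (hypothetical name)."""


def check_success_ratio(results, threshold=1.0):
    """Return the success ratio, or raise if it is below the threshold.

    `results` maps a host name to True (command succeeded) or False.
    """
    if not results:
        raise ValueError("no hosts to evaluate")
    succeeded = sum(1 for ok in results.values() if ok)
    ratio = succeeded / len(results)
    if ratio < threshold:
        failed = sorted(host for host, ok in results.items() if not ok)
        raise ExecutionAborted(
            f"{ratio:.1%} ({succeeded}/{len(results)}) success ratio "
            f"(< {threshold:.1%} threshold), failed: {', '.join(failed)}"
        )
    return ratio


if __name__ == "__main__":
    try:
        check_success_ratio({"alert1001.wikimedia.org": False})
    except ExecutionAborted as exc:
        print(f"Aborting: {exc}")
```

With a threshold of 100%, a single failing helper host is enough to abort the whole cookbook even when the reimaged host itself installed correctly, which is consistent with the reimage being finished manually.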

Cookbook cookbooks.sre.hosts.reimage was started by marostegui@cumin1001 for host pc1014.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by marostegui@cumin1001 for host pc1014.eqiad.wmnet with OS bullseye completed:

  • pc1014 (PASS)
    • Downtimed on Icinga
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202201180709_marostegui_3787_pc1014.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Change 754864 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/mediawiki-config@master] ProductionServices.php: Promote pc1014 to pc2 master

https://gerrit.wikimedia.org/r/754864

Change 754865 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] mariadb: Promote pc1014 to pc2 master

https://gerrit.wikimedia.org/r/754865

Change 754865 merged by Marostegui:

[operations/puppet@production] mariadb: Promote pc1014 to pc2 master

https://gerrit.wikimedia.org/r/754865

Change 754864 merged by jenkins-bot:

[operations/mediawiki-config@master] ProductionServices.php: Promote pc1014 to pc2 master

https://gerrit.wikimedia.org/r/754864

Mentioned in SAL (#wikimedia-operations) [2022-01-18T08:30:28Z] <marostegui@deploy1002> Synchronized wmf-config/ProductionServices.php: Promote pc1014 to master in pc2 T299046 (duration: 00m 51s)

Cookbook cookbooks.sre.hosts.reimage was started by marostegui@cumin1001 for host pc1012.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by marostegui@cumin1001 for host pc1012.eqiad.wmnet with OS bullseye completed:

  • pc1012 (PASS)
    • Downtimed on Icinga
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202201180832_marostegui_15413_pc1012.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

pc1012 has been reimaged. I have configured it to replicate from pc1014 so that it gets all the keys that were inserted during its reimage.
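
Pointing pc1012 at pc1014 so that it catches up on keys written during its reimage could look roughly like the following. This is a sketch only: the task does not record the actual commands, so the replication user, the binlog coordinates, and the `build_replication_sql` helper are all placeholders. Only the `CHANGE MASTER TO`, `STOP SLAVE` and `START SLAVE` statements themselves are standard MariaDB syntax.

```python
def build_replication_sql(master_host, user, log_file, log_pos, port=3306):
    """Render MariaDB statements that point a replica at `master_host`.

    All arguments are placeholders. The password is deliberately left out
    and would have to be supplied through a secure channel.
    """
    if not isinstance(log_pos, int) or log_pos < 0:
        raise ValueError("log_pos must be a non-negative integer")
    return [
        "STOP SLAVE;",
        "CHANGE MASTER TO "
        f"MASTER_HOST='{master_host}', "
        f"MASTER_PORT={port}, "
        f"MASTER_USER='{user}', "
        f"MASTER_LOG_FILE='{log_file}', "
        f"MASTER_LOG_POS={log_pos};",
        "START SLAVE;",
    ]


if __name__ == "__main__":
    # Placeholder values: the real user and coordinates are not in the task.
    for statement in build_replication_sql(
        "pc1014.eqiad.wmnet", "repl_user", "binlog.000001", 4
    ):
        print(statement)
```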

Mentioned in SAL (#wikimedia-operations) [2022-01-18T09:59:39Z] <marostegui@deploy1002> Synchronized wmf-config/ProductionServices.php: Revert: Promote pc1014 to master in pc2 T299046 (duration: 00m 50s)

Change 754871 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] pc1014: Move to pc3

https://gerrit.wikimedia.org/r/754871

Change 754871 merged by Marostegui:

[operations/puppet@production] pc1014: Move to pc3

https://gerrit.wikimedia.org/r/754871
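
The pc2 work in eqiad followed a pattern that the move of pc1014 to pc3 suggests will be repeated: move the floating host into the section, promote it, reimage the regular host, let it catch up, and revert the promotion. A minimal sketch of that sequence as a plan generator is below. `rotation_plan` and its `floating_upgraded` flag are hypothetical, and the step wording is paraphrased from the events on this task, not taken from any runbook.

```python
def rotation_plan(section, floating_host, regular_host, floating_upgraded=False):
    """Return the ordered steps for upgrading one section's regular host.

    If the floating host is already on the new OS (floating_upgraded=True),
    its own reimage step is skipped.
    """
    steps = [f"Move {floating_host} to {section}"]
    if not floating_upgraded:
        steps.append(f"Reimage {floating_host}")
    steps += [
        f"Promote {floating_host} to {section} master",
        f"Reimage {regular_host}",
        f"Replicate {regular_host} from {floating_host} to catch up on keys",
        f"Revert promotion of {floating_host} in {section}",
    ]
    return steps


if __name__ == "__main__":
    for step in rotation_plan("pc2", "pc1014", "pc1012"):
        print(step)
```

Calling `rotation_plan("pc3", "pc1014", "pc1013", floating_upgraded=True)` would give the same sequence for the next section without reimaging pc1014 again.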