Paste P53419

(An Untitled Masterwork)

Authored by fnegri on Nov 14 2023, 2:30 PM.
fnegri@cumin1001:~$ sudo cookbook sre.hosts.reimage --os bookworm -t T345811 cloudvirt1046
==> ATTENTION: destructive action for host: cloudvirt1046
Are you sure to proceed?
Type "go" to proceed or "abort" to interrupt the execution
> go
User input is: "go"
Management Password:
Running IPMI command: ipmitool -I lanplus -H cloudvirt1046.mgmt.eqiad.wmnet -U root -E chassis power status
Acquired lock for key /spicerack/locks/cookbooks/sre.hosts.reimage:cloudvirt1046: {'concurrency': 1, 'created': '2023-11-14 13:05:15.876438', 'owner': 'fnegri@cumin1001 [548633]', 'ttl': 3600}
START - Cookbook sre.hosts.reimage for host cloudvirt1046.eqiad.wmnet with OS bookworm
Updated Phabricator task T345811
Scheduling downtime on Icinga server alert1001.wikimedia.org for hosts: cloudvirt1046
[1/12, retrying in 10.00s] Unable to verify all hosts got downtimed: Some hosts are not yet downtimed: ['cloudvirt1046']
Created silence ID 4f862929-f1a9-4665-9914-360b826a7f65
Downtimed on Icinga/Alertmanager
Disabling Puppet with reason "Host reimage - fnegri@cumin1001 - T345811" on 1 hosts: cloudvirt1046.eqiad.wmnet
----- OUTPUT of 'disable-puppet "...n1001 - T345811"' -----
================
PASS |███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100% (1/1) [00:01<00:00, 1.78s/hosts]
FAIL | | 0% (0/1) [00:01<?, ?hosts/s]
100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'disable-puppet "...n1001 - T345811"'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
Disabled Puppet
----- OUTPUT of 'puppet node clea...1046.eqiad.wmnet' -----
Notice: Revoked certificate with serial 10338
Notice: Removing file Puppet::SSL::Certificate cloudvirt1046.eqiad.wmnet at '/var/lib/puppet/server/ssl/ca/signed/cloudvirt1046.eqiad.wmnet.pem'
cloudvirt1046.eqiad.wmnet
================
100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'puppet node clea...1046.eqiad.wmnet'.
----- OUTPUT of 'puppet node deac...1046.eqiad.wmnet' -----
Submitted 'deactivate node' for cloudvirt1046.eqiad.wmnet with UUID 11afec7a-5fbc-4019-9d9b-77d55c643711
================
100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'puppet node deac...1046.eqiad.wmnet'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
Removed from Puppet and PuppetDB if present and deleted any certificates
Removed host cloudvirt1046.eqiad.wmnet from Debmonitor
Removed from Debmonitor if present
Acquired lock for key /spicerack/locks/modules/spicerack.dhcp.DHCP:eqiad: {'concurrency': 1, 'created': '2023-11-14 13:05:35.729836', 'owner': 'fnegri@cumin1001 [548633]', 'ttl': 120}
----- OUTPUT of '/bin/echo 'Cmhvc...oudvirt1046.conf' -----
================
100.0% (1/1) success ratio (>= 100.0% threshold) for command: '/bin/echo 'Cmhvc...oudvirt1046.conf'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
----- OUTPUT of '/usr/local/sbin/...cludes -r commit' -----
2023-11-14 13:05:36,902 [INFO] Writing file /etc/dhcp/automation/proxies/ttyS0-115200.conf
2023-11-14 13:05:36,904 [INFO] Writing file /etc/dhcp/automation/proxies/ttyS1-115200.conf
2023-11-14 13:05:36,904 [INFO] Writing file /etc/dhcp/automation/proxies/mgmt-eqiad.conf
2023-11-14 13:05:36,905 [INFO] Writing file /etc/dhcp/automation/proxies/mgmt-ulsfo.conf
2023-11-14 13:05:36,906 [INFO] Writing file /etc/dhcp/automation/proxies/mgmt-codfw.conf
2023-11-14 13:05:36,907 [INFO] Writing file /etc/dhcp/automation/proxies/mgmt-esams.conf
2023-11-14 13:05:36,907 [INFO] Writing file /etc/dhcp/automation/proxies/mgmt-eqsin.conf
2023-11-14 13:05:36,908 [INFO] Writing file /etc/dhcp/automation/proxies/mgmt-drmrs.conf
Internet Systems Consortium DHCP Server 4.4.1
Copyright 2004-2018 Internet Systems Consortium.
All rights reserved.
For info, please visit https://www.isc.org/software/dhcp/
Config file: /etc/dhcp/dhcpd.conf
Database file: /var/lib/dhcp/dhcpd.leases
PID file: /var/run/dhcpd.pid
2023-11-14 13:05:36,948 [INFO] dhcp config test passed!
2023-11-14 13:05:39,093 [INFO] reloaded isc-dhcp-server
================
100.0% (1/1) success ratio (>= 100.0% threshold) for command: '/usr/local/sbin/...cludes -r commit'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
Released lock for key /spicerack/locks/modules/spicerack.dhcp.DHCP:eqiad: {'concurrency': 1, 'created': '2023-11-14 13:05:35.729836', 'owner': 'fnegri@cumin1001 [548633]', 'ttl': 120}
Running IPMI command: ipmitool -I lanplus -H cloudvirt1046.mgmt.eqiad.wmnet -U root -E chassis bootparam set bootflag force_pxe options=reset
Running IPMI command: ipmitool -I lanplus -H cloudvirt1046.mgmt.eqiad.wmnet -U root -E chassis bootparam get 5
Forced PXE for next reboot
Running IPMI command: ipmitool -I lanplus -H cloudvirt1046.mgmt.eqiad.wmnet -U root -E chassis power status
Running IPMI command: ipmitool -I lanplus -H cloudvirt1046.mgmt.eqiad.wmnet -U root -E chassis power cycle
Host rebooted via IPMI
[1/240, retrying in 10.00s] Attempt to run 'spicerack.remote.RemoteHosts.wait_reboot_since' raised: Reboot for cloudvirt1046.eqiad.wmnet not found yet, keep polling for it: uptime 23652256.08 > threshold 2.48
[2/240, retrying in 10.00s] Attempt to run 'spicerack.remote.RemoteHosts.wait_reboot_since' raised: Reboot for cloudvirt1046.eqiad.wmnet not found yet, keep polling for it: unable to get uptime
...
[18/240, retrying in 10.00s] Attempt to run 'spicerack.remote.RemoteHosts.wait_reboot_since' raised: Reboot for cloudvirt1046.eqiad.wmnet not found yet, keep polling for it: unable to get uptime
Caused by: Cumin execution failed (exit_code=2)
Found reboot since 2023-11-14 13:05:39.165522 for hosts cloudvirt1046.eqiad.wmnet
Host up (Debian installer)
Add puppet_version metadata to Debian installer
Running IPMI command: ipmitool -I lanplus -H cloudvirt1046.mgmt.eqiad.wmnet -U root -E chassis bootparam set bootflag none options=reset
Running IPMI command: ipmitool -I lanplus -H cloudvirt1046.mgmt.eqiad.wmnet -U root -E chassis bootparam get 5
Running IPMI command: ipmitool -I lanplus -H cloudvirt1046.mgmt.eqiad.wmnet -U root -E chassis bootparam get 5
Checked BIOS boot parameters are back to normal
[1/240, retrying in 10.00s] Attempt to run 'spicerack.remote.RemoteHosts.wait_reboot_since' raised: Reboot for cloudvirt1046.eqiad.wmnet not found yet, keep polling for it: uptime 86.39 > threshold 2.23
[2/240, retrying in 10.00s] Attempt to run 'spicerack.remote.RemoteHosts.wait_reboot_since' raised: Reboot for cloudvirt1046.eqiad.wmnet not found yet, keep polling for it: uptime 96.62 > threshold 12.46
...
[239/240, retrying in 10.00s] Attempt to run 'spicerack.remote.RemoteHosts.wait_reboot_since' raised: Reboot for cloudvirt1046.eqiad.wmnet not found yet, keep polling for it: unable to get uptime
Caused by: Cumin execution failed (exit_code=2)
Acquired lock for key /spicerack/locks/modules/spicerack.dhcp.DHCP:eqiad: {'concurrency': 1, 'created': '2023-11-14 14:27:29.169343', 'owner': 'fnegri@cumin1001 [548633]', 'ttl': 60}
----- OUTPUT of '/usr/local/sbin/...cludes -r commit' -----
2023-11-14 14:27:30,054 [INFO] Writing file /etc/dhcp/automation/proxies/ttyS0-115200.conf
2023-11-14 14:27:30,054 [INFO] Writing file /etc/dhcp/automation/proxies/ttyS1-115200.conf
2023-11-14 14:27:30,054 [INFO] Writing file /etc/dhcp/automation/proxies/mgmt-eqiad.conf
2023-11-14 14:27:30,055 [INFO] Writing file /etc/dhcp/automation/proxies/mgmt-ulsfo.conf
2023-11-14 14:27:30,055 [INFO] Writing file /etc/dhcp/automation/proxies/mgmt-codfw.conf
2023-11-14 14:27:30,055 [INFO] Writing file /etc/dhcp/automation/proxies/mgmt-esams.conf
2023-11-14 14:27:30,055 [INFO] Writing file /etc/dhcp/automation/proxies/mgmt-eqsin.conf
2023-11-14 14:27:30,056 [INFO] Writing file /etc/dhcp/automation/proxies/mgmt-drmrs.conf
Internet Systems Consortium DHCP Server 4.4.1
Copyright 2004-2018 Internet Systems Consortium.
All rights reserved.
For info, please visit https://www.isc.org/software/dhcp/
Config file: /etc/dhcp/dhcpd.conf
Database file: /var/lib/dhcp/dhcpd.leases
PID file: /var/run/dhcpd.pid
2023-11-14 14:27:30,069 [INFO] dhcp config test passed!
2023-11-14 14:27:32,149 [INFO] reloaded isc-dhcp-server
================
100.0% (1/1) success ratio (>= 100.0% threshold) for command: '/usr/local/sbin/...cludes -r commit'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
Released lock for key /spicerack/locks/modules/spicerack.dhcp.DHCP:eqiad: {'concurrency': 1, 'created': '2023-11-14 14:27:29.169343', 'owner': 'fnegri@cumin1001 [548633]', 'ttl': 60}
Exception raised while executing cookbook sre.hosts.reimage:
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/spicerack/remote.py", line 552, in wait_reboot_since
    uptimes = self.uptime(print_progress_bars=print_progress_bars)
  File "/usr/lib/python3/dist-packages/spicerack/remote.py", line 587, in uptime
    results = self.run_sync(
  File "/usr/lib/python3/dist-packages/spicerack/remote.py", line 496, in run_sync
    return self._execute(
  File "/usr/lib/python3/dist-packages/spicerack/remote.py", line 702, in _execute
    raise RemoteExecutionError(ret, "Cumin execution failed", worker.get_results())
spicerack.remote.RemoteExecutionError: Cumin execution failed (exit_code=2)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/spicerack/_menu.py", line 242, in _run
    raw_ret = runner.run()
  File "/srv/deployment/spicerack/cookbooks/sre/hosts/reimage.py", line 633, in run
    self._install_os()
  File "/srv/deployment/spicerack/cookbooks/sre/hosts/reimage.py", line 435, in _install_os
    self.remote_installer.wait_reboot_since(di_reboot_time, print_progress_bars=False)
  File "/usr/lib/python3/dist-packages/wmflib/decorators.py", line 210, in wrapper
    return func(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/spicerack/remote.py", line 556, in wait_reboot_since
    raise RemoteCheckError(
spicerack.remote.RemoteCheckError: Reboot for cloudvirt1046.eqiad.wmnet not found yet, keep polling for it: unable to get uptime
**The reimage failed, see the cookbook logs for the details**
Reimage executed with errors:
- cloudvirt1046 (**FAIL**)
  - Downtimed on Icinga/Alertmanager
  - Disabled Puppet
  - Removed from Puppet and PuppetDB if present and deleted any certificates
  - Removed from Debmonitor if present
  - Forced PXE for next reboot
  - Host rebooted via IPMI
  - Host up (Debian installer)
  - Add puppet_version metadata to Debian installer
  - Checked BIOS boot parameters are back to normal
  - **The reimage failed, see the cookbook logs for the details**
Updated Phabricator task T345811
Released lock for key /spicerack/locks/cookbooks/sre.hosts.reimage:cloudvirt1046: {'concurrency': 1, 'created': '2023-11-14 13:05:15.876438', 'owner': 'fnegri@cumin1001 [548633]', 'ttl': 3600}
END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1046.eqiad.wmnet with OS bookworm