Reimage x2 eqiad master
Closed, Resolved, Public

Description

This requires a temporary switchover of the x2 eqiad master from db1152 to db1151.

  • Note to self: check read-only when db1152 comes back
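
A minimal sketch of the read-only check noted above, assuming a Python DB-API style cursor already connected to db1152. The function name `is_read_only` and the `StubCursor` class are hypothetical and exist only for illustration; they are not part of any cookbook or dbctl tooling. The only server-side piece relied on is the `@@global.read_only` system variable.

```python
def is_read_only(cursor) -> bool:
    """Return True if the server reports the global read_only variable as ON.

    `cursor` is assumed to be any DB-API 2.0 cursor (execute/fetchone).
    """
    cursor.execute("SELECT @@global.read_only")
    row = cursor.fetchone()
    if row is None:
        raise RuntimeError("server returned no row for @@global.read_only")
    # The variable is returned as 0 (OFF) or 1 (ON).
    return int(row[0]) == 1


class StubCursor:
    """Hypothetical stand-in for a real cursor, used only to exercise the sketch."""

    def __init__(self, value):
        self._value = value
        self.last_query = None

    def execute(self, query):
        self.last_query = query

    def fetchone(self):
        return (self._value,)


if __name__ == "__main__":
    # With a real connection, pass connection.cursor() instead of the stub.
    print(is_read_only(StubCursor(1)))
```

Whether the host should report read-only at that point depends on where the switchover stands, so the sketch only reports the state and leaves the decision to the operator.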

Event Timeline

Marostegui triaged this task as Medium priority. Jun 5 2024, 9:33 AM
Marostegui moved this task from Triage to In progress on the DBA board.

Mentioned in SAL (#wikimedia-operations) [2024-06-05T09:34:20Z] <marostegui@cumin1002> START - Cookbook sre.hosts.downtime for 4:00:00 on 6 hosts with reason: Reimage x2 eqiad master T366677

Mentioned in SAL (#wikimedia-operations) [2024-06-05T09:34:38Z] <marostegui@cumin1002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on 6 hosts with reason: Reimage x2 eqiad master T366677

Mentioned in SAL (#wikimedia-operations) [2024-06-05T09:35:07Z] <marostegui@cumin1002> dbctl commit (dc=all): 'Promote db1151 to temp x2 eqiad master T366677', diff saved to https://phabricator.wikimedia.org/P64077 and previous config saved to /var/cache/conftool/dbconfig/20240605-093507-root.json

Change #1039183 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] db1152: Disable notifications

https://gerrit.wikimedia.org/r/1039183

Change #1039183 merged by Marostegui:

[operations/puppet@production] db1152: Disable notifications

https://gerrit.wikimedia.org/r/1039183

Cookbook cookbooks.sre.hosts.reimage was started by marostegui@cumin1002 for host db1152.eqiad.wmnet with OS bookworm

Mentioned in SAL (#wikimedia-operations) [2024-06-05T10:10:20Z] <marostegui@cumin1002> dbctl commit (dc=all): 'Promote db1152 back to x2 eqiad master T366677', diff saved to https://phabricator.wikimedia.org/P64086 and previous config saved to /var/cache/conftool/dbconfig/20240605-101019-root.json

Cookbook cookbooks.sre.hosts.reimage started by marostegui@cumin1002 for host db1152.eqiad.wmnet with OS bookworm completed:

  • db1152 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202406050954_marostegui_673539_db1152.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB