Install Debian Bookworm on a DB
Closed, Resolved (Public)

Event Timeline

Marostegui triaged this task as Medium priority.
Marostegui moved this task from Triage to Ready on the DBA board.

Cookbook cookbooks.sre.hosts.reimage was started by marostegui@cumin1001 for host db1124.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by marostegui@cumin1001 for host db1124.eqiad.wmnet with OS bookworm executed with errors:

  • db1124 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by marostegui@cumin1001 for host db1124.eqiad.wmnet with OS bullseye

There seem to be issues with the partitioning

Cookbook cookbooks.sre.hosts.reimage started by marostegui@cumin1001 for host db1124.eqiad.wmnet with OS bullseye executed with errors:

  • db1124 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202306190753_marostegui_2592974_db1124.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by marostegui@cumin1001 for host db1124.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by marostegui@cumin1001 for host db1124.eqiad.wmnet with OS bookworm executed with errors:

  • db1124 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details

Change 931232 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] d-i: Fix retrieval of reuse-parts.sh for bookworm

https://gerrit.wikimedia.org/r/931232

Change 931232 merged by Muehlenhoff:

[operations/puppet@production] d-i: Fix retrieval of reuse-parts.sh for bookworm

https://gerrit.wikimedia.org/r/931232

Cookbook cookbooks.sre.hosts.reimage was started by marostegui@cumin1001 for host db1124.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by marostegui@cumin1001 for host db1124.eqiad.wmnet with OS bookworm completed:

  • db1124 (WARN)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202306191004_marostegui_2619288_db1124.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

Mentioned in SAL (#wikimedia-operations) [2023-06-19T12:21:00Z] <moritzm> uploaded wmfmariadbpy 0.10+deb12u1 T339835

Change 931493 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] install_server: Reimage db1124

https://gerrit.wikimedia.org/r/931493

Change 931493 merged by Marostegui:

[operations/puppet@production] install_server: Reimage db1124

https://gerrit.wikimedia.org/r/931493

Cookbook cookbooks.sre.hosts.reimage was started by marostegui@cumin1001 for host db1119.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by marostegui@cumin1001 for host db1119.eqiad.wmnet with OS bookworm executed with errors:

  • db1119 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by marostegui@cumin1001 for host db1119.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by marostegui@cumin1001 for host db1119.eqiad.wmnet with OS bookworm completed:

  • db1119 (WARN)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run failed and logged in /var/log/spicerack/sre/hosts/reimage/202306200729_marostegui_2855360_db1119.out, asking the operator what to do
    • First Puppet run failed and logged in /var/log/spicerack/sre/hosts/reimage/202306200814_marostegui_2855360_db1119.out, asking the operator what to do
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202306200820_marostegui_2855360_db1119.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

I need to reclone db1119 into s1 so it can serve some traffic.

Change 953555 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] mariadb: Move db1119 to s1

https://gerrit.wikimedia.org/r/953555

Change 953555 merged by Marostegui:

[operations/puppet@production] mariadb: Move db1119 to s1

https://gerrit.wikimedia.org/r/953555

db1119 has been recloned and it is now on s1 (not serving traffic)

This has been done; the testing can be followed up on the parent task.

Change 966524 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] d-i: Fix retrieval of reuse-parts-test.sh for bookworm

https://gerrit.wikimedia.org/r/966524

Change 966524 merged by Elukey:

[operations/puppet@production] d-i: Fix retrieval of reuse-parts-test.sh for bookworm

https://gerrit.wikimedia.org/r/966524