
Upgrade pc2 to Debian Bookworm and MariaDB 10.6
Closed, Resolved, Public

Description

  • pc2012
  • pc1012

Event Timeline

Marostegui triaged this task as Medium priority. Mon, Nov 20, 8:02 AM
Marostegui moved this task from Triage to In progress on the DBA board.

Change 976383 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/mediawiki-config@master] ProductionServices.php: Promote pc2014 to pc2 master

https://gerrit.wikimedia.org/r/976383

Change 976383 merged by jenkins-bot:

[operations/mediawiki-config@master] ProductionServices.php: Promote pc2014 to pc2 master

https://gerrit.wikimedia.org/r/976383

Change 976384 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] pc2012: Disable notifications

https://gerrit.wikimedia.org/r/976384

Mentioned in SAL (#wikimedia-operations) [2023-11-22T06:15:27Z] <marostegui@deploy2002> Started scap: Backport for [[gerrit:976383|ProductionServices.php: Promote pc2014 to pc2 master (T351620)]]

Change 976384 merged by Marostegui:

[operations/puppet@production] pc2012: Disable notifications

https://gerrit.wikimedia.org/r/976384

Icinga downtime and Alertmanager silence (ID=58af808d-b84d-4626-a401-d36ece63d937) set by marostegui@cumin1001 for 2:00:00 on 4 host(s) and their services with reason: Switch

pc[2012,2014].codfw.wmnet,pc[1012,1014].eqiad.wmnet
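
The downtime entry above names "4 host(s)" but lists them in a compact bracket notation. As a reading aid, here is a minimal sketch that expands such an expression into individual hostnames. It is a hand-rolled parser written for this note, not the tooling that produced the log; the function name `expand_hosts` is hypothetical, and only comma-separated bracket groups as they appear in this task are handled (no numeric ranges).

```python
import re


def expand_hosts(expr: str) -> list[str]:
    """Expand a comma-separated host expression with [a,b] bracket groups.

    Assumption: at most one bracket group per host pattern, containing a
    plain comma-separated list (the only form seen in this task).
    """
    hosts = []
    # Split on commas that are not inside square brackets.
    for part in re.split(r",(?![^\[]*\])", expr):
        match = re.fullmatch(r"([^\[\]]*)\[([^\[\]]+)\]([^\[\]]*)", part)
        if match is None:
            hosts.append(part)  # plain hostname, no bracket group
            continue
        prefix, group, suffix = match.groups()
        hosts.extend(f"{prefix}{item}{suffix}" for item in group.split(","))
    return hosts


if __name__ == "__main__":
    for host in expand_hosts("pc[2012,2014].codfw.wmnet,pc[1012,1014].eqiad.wmnet"):
        print(host)
```

Applied to the expression in this entry, the sketch yields four hostnames, which matches the host count stated in the downtime message.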

Mentioned in SAL (#wikimedia-operations) [2023-11-22T06:16:50Z] <marostegui@deploy2002> marostegui: Backport for [[gerrit:976383|ProductionServices.php: Promote pc2014 to pc2 master (T351620)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)

Mentioned in SAL (#wikimedia-operations) [2023-11-22T06:22:56Z] <marostegui@deploy2002> Finished scap: Backport for [[gerrit:976383|ProductionServices.php: Promote pc2014 to pc2 master (T351620)]] (duration: 07m 28s)
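
The "Started scap" and "Finished scap" SAL entries above carry UTC timestamps, so the deployment window can be cross-checked against the duration that scap reports. The sketch below does that arithmetic; the helper names `elapsed_seconds` and `format_duration` are hypothetical, and the gap between two log entries is only an approximation of the tool's own measurement, so a difference of about a second is to be expected.

```python
from datetime import datetime

# Timestamp layout used in the SAL entries quoted in this task.
SAL_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def elapsed_seconds(start: str, end: str) -> int:
    """Return the whole seconds between two SAL timestamps."""
    started = datetime.strptime(start, SAL_FORMAT)
    finished = datetime.strptime(end, SAL_FORMAT)
    return int((finished - started).total_seconds())


def format_duration(seconds: int) -> str:
    """Render seconds in the 'MMm SSs' style seen in the scap messages."""
    minutes, secs = divmod(seconds, 60)
    return f"{minutes:02d}m {secs:02d}s"


if __name__ == "__main__":
    gap = elapsed_seconds("2023-11-22T06:15:27Z", "2023-11-22T06:22:56Z")
    print(format_duration(gap))
```

For the pc2014 promotion the two entries are 449 seconds apart, one second more than the "07m 28s" reported by scap.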

Cookbook cookbooks.sre.hosts.reimage was started by marostegui@cumin1001 for host pc2012.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by marostegui@cumin1001 for host pc2012.codfw.wmnet with OS bookworm completed:

  • pc2012 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202311220644_marostegui_1222028_pc2012.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
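
The first-Puppet-run step in the list above points at a log file whose name appears to encode when the reimage ran, who ran it, and for which host. The sketch below splits such a name into fields. The field layout is inferred only from the filenames quoted in this task; in particular, the meaning of the numeric third field is not stated anywhere in the log, so it is kept as an opaque `run_id`. The function name `parse_reimage_log_name` is hypothetical.

```python
import re
from datetime import datetime
from pathlib import PurePosixPath

# Assumed layout: <YYYYMMDDHHMM>_<user>_<number>_<host>.out
LOG_NAME = re.compile(r"(\d{12})_([^_]+)_(\d+)_(.+)\.out")


def parse_reimage_log_name(path: str) -> dict:
    """Split a reimage log path into its apparent components."""
    name = PurePosixPath(path).name
    match = LOG_NAME.fullmatch(name)
    if match is None:
        raise ValueError(f"unexpected log file name: {name}")
    stamp, user, run_id, host = match.groups()
    return {
        "started": datetime.strptime(stamp, "%Y%m%d%H%M"),
        "user": user,
        "run_id": run_id,  # meaning not documented in this task
        "host": host,
    }


if __name__ == "__main__":
    info = parse_reimage_log_name(
        "/var/log/spicerack/sre/hosts/reimage/202311220644_marostegui_1222028_pc2012.out"
    )
    print(info["host"], info["user"], info["started"].isoformat())
```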

Change 978365 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/mediawiki-config@master] ProductionServices.php: Promote pc1014 to pc2

https://gerrit.wikimedia.org/r/978365

Change 978366 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] pc1012: Disable notifications

https://gerrit.wikimedia.org/r/978366

Icinga downtime and Alertmanager silence (ID=e9c495e7-c71c-488c-84a5-80cbeeab64f2) set by marostegui@cumin1001 for 2:00:00 on 3 host(s) and their services with reason: Switch

pc2012.codfw.wmnet,pc[1012,1014].eqiad.wmnet

Change 978366 merged by Marostegui:

[operations/puppet@production] pc1012: Disable notifications

https://gerrit.wikimedia.org/r/978366

Change 978365 merged by jenkins-bot:

[operations/mediawiki-config@master] ProductionServices.php: Promote pc1014 to pc2

https://gerrit.wikimedia.org/r/978365

Mentioned in SAL (#wikimedia-operations) [2023-11-29T07:15:52Z] <marostegui@deploy2002> Started scap: Backport for [[gerrit:978365|ProductionServices.php: Promote pc1014 to pc2 (T351620)]]

Mentioned in SAL (#wikimedia-operations) [2023-11-29T07:17:17Z] <marostegui@deploy2002> marostegui: Backport for [[gerrit:978365|ProductionServices.php: Promote pc1014 to pc2 (T351620)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)

Mentioned in SAL (#wikimedia-operations) [2023-11-29T07:25:18Z] <marostegui@deploy2002> Finished scap: Backport for [[gerrit:978365|ProductionServices.php: Promote pc1014 to pc2 (T351620)]] (duration: 09m 25s)

Cookbook cookbooks.sre.hosts.reimage was started by marostegui@cumin1001 for host pc1012.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by marostegui@cumin1001 for host pc1012.eqiad.wmnet with OS bookworm completed:

  • pc1012 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202311290743_marostegui_1468539_pc1012.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB
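
The two reimage summaries in this task differ only in their overall status (PASS for pc2012, WARN for pc1012) and in the Icinga step near the end. A minimal sketch for pulling the host, status, and step list out of such a summary is shown below. It assumes the bullet layout exactly as pasted here, is not part of the cookbook itself, and uses the hypothetical name `parse_summary`.

```python
import re


def parse_summary(text: str) -> dict:
    """Parse a pasted reimage summary into host, status and steps.

    Assumption: the first non-empty line is '<host> (<STATUS>)' and every
    following non-empty line is one step, each prefixed with a bullet.
    """
    lines = [
        line.strip().lstrip("•").strip()
        for line in text.splitlines()
        if line.strip()
    ]
    if not lines:
        raise ValueError("empty summary")
    header = re.fullmatch(r"(\S+) \((\w+)\)", lines[0])
    if header is None:
        raise ValueError(f"unexpected summary header: {lines[0]}")
    return {
        "host": header.group(1),
        "status": header.group(2),
        "steps": lines[1:],
    }


if __name__ == "__main__":
    sample = """
      • pc1012 (WARN)
        • Forced a re-check of all Icinga services for the host
        • Icinga status is not optimal, downtime not removed
    """
    summary = parse_summary(sample)
    if summary["status"] != "PASS":
        print(f"{summary['host']} needs follow-up: {summary['steps'][-1]}")
```

With the pc1012 summary, the non-PASS status singles out the Icinga step, which is the one place where this run departs from the pc2012 run.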