Upgrade pc1 to Debian Bookworm and MariaDB 10.6
Closed, Resolved | Public

Description

  • pc2011
  • pc1011

Event Timeline

Marostegui triaged this task as Medium priority. Nov 22 2023, 7:47 AM
Marostegui moved this task from Triage to In progress on the DBA board.

Change 981710 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/mediawiki-config@master] ProductionServices.php: Promote pc1014 to pc1

https://gerrit.wikimedia.org/r/981710

Mentioned in SAL (#wikimedia-operations) [2023-12-11T05:34:46Z] <marostegui@cumin1001> START - Cookbook sre.hosts.downtime for 2:00:00 on pc[2011,2014].codfw.wmnet,pc[1011,1014].eqiad.wmnet with reason: pc1 master switch T351787

Mentioned in SAL (#wikimedia-operations) [2023-12-11T05:35:13Z] <marostegui@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on pc[2011,2014].codfw.wmnet,pc[1011,1014].eqiad.wmnet with reason: pc1 master switch T351787

Change 981710 merged by jenkins-bot:

[operations/mediawiki-config@master] ProductionServices.php: Promote pc1014 to pc1

https://gerrit.wikimedia.org/r/981710

Mentioned in SAL (#wikimedia-operations) [2023-12-11T05:37:14Z] <marostegui@deploy2002> Started scap: Backport for [[gerrit:981710|ProductionServices.php: Promote pc1014 to pc1 (T351787)]]

Mentioned in SAL (#wikimedia-operations) [2023-12-11T05:46:43Z] <marostegui@deploy2002> marostegui: Backport for [[gerrit:981710|ProductionServices.php: Promote pc1014 to pc1 (T351787)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)

Mentioned in SAL (#wikimedia-operations) [2023-12-11T05:54:09Z] <marostegui@deploy2002> Finished scap: Backport for [[gerrit:981710|ProductionServices.php: Promote pc1014 to pc1 (T351787)]] (duration: 16m 54s)

Cookbook cookbooks.sre.hosts.reimage was started by marostegui@cumin1001 for host pc1011.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by marostegui@cumin1001 for host pc1011.eqiad.wmnet with OS bookworm completed:

  • pc1011 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202312110610_marostegui_649898_pc1011.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

Mentioned in SAL (#wikimedia-operations) [2023-12-12T05:47:09Z] <marostegui@cumin1001> START - Cookbook sre.hosts.downtime for 2:00:00 on pc[2011,2014].codfw.wmnet,pc[1011,1014].eqiad.wmnet with reason: pc1 master switch T351787

Change 982206 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/mediawiki-config@master] ProductionServices.php: Promote pc2014 as master of pc1

https://gerrit.wikimedia.org/r/982206

Mentioned in SAL (#wikimedia-operations) [2023-12-12T05:47:27Z] <marostegui@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on pc[2011,2014].codfw.wmnet,pc[1011,1014].eqiad.wmnet with reason: pc1 master switch T351787

Change 982207 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] pc2011: Disable notifications

https://gerrit.wikimedia.org/r/982207

Change 982206 merged by jenkins-bot:

[operations/mediawiki-config@master] ProductionServices.php: Promote pc2014 as master of pc1

https://gerrit.wikimedia.org/r/982206

Change 982207 merged by Marostegui:

[operations/puppet@production] pc2011: Disable notifications

https://gerrit.wikimedia.org/r/982207

Mentioned in SAL (#wikimedia-operations) [2023-12-12T05:50:59Z] <marostegui@deploy2002> Started scap: Backport for [[gerrit:982206|ProductionServices.php: Promote pc2014 as master of pc1 (T351787)]]

Mentioned in SAL (#wikimedia-operations) [2023-12-12T05:52:24Z] <marostegui@deploy2002> marostegui: Backport for [[gerrit:982206|ProductionServices.php: Promote pc2014 as master of pc1 (T351787)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)

Mentioned in SAL (#wikimedia-operations) [2023-12-12T05:59:34Z] <marostegui@deploy2002> Finished scap: Backport for [[gerrit:982206|ProductionServices.php: Promote pc2014 as master of pc1 (T351787)]] (duration: 08m 35s)

Cookbook cookbooks.sre.hosts.reimage was started by marostegui@cumin1001 for host pc2011.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by marostegui@cumin1001 for host pc2011.codfw.wmnet with OS bookworm completed:

  • pc2011 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202312120621_marostegui_1465917_pc2011.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Marostegui updated the task description.

This is all done.
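
For anyone auditing a timeline like this one, the SAL mentions can be parsed to check how long each step took. The following is only a rough sketch: the regular expression is an assumption based on the entries quoted in this task, and `parse_sal` is a hypothetical helper, not part of any Wikimedia tooling.

```python
import re
from datetime import datetime

# Assumed shape of a SAL mention: "[<UTC timestamp>] <user@host> message".
# This pattern is inferred from the entries in this task, nothing more.
SAL_RE = re.compile(r"\[(?P<ts>[^\]]+)\] <(?P<who>[^>]+)> (?P<msg>.*)")


def parse_sal(line):
    """Return (timestamp, user@host, message) for a SAL mention, or None."""
    match = SAL_RE.search(line)
    if match is None:
        return None
    ts = datetime.strptime(match.group("ts"), "%Y-%m-%dT%H:%M:%SZ")
    return ts, match.group("who"), match.group("msg")


# The two scap entries for the pc2014 promotion, copied from the timeline.
started_line = (
    "Mentioned in SAL (#wikimedia-operations) [2023-12-12T05:50:59Z] "
    "<marostegui@deploy2002> Started scap: Backport for "
    "[[gerrit:982206|ProductionServices.php: Promote pc2014 as master of pc1 (T351787)]]"
)
finished_line = (
    "Mentioned in SAL (#wikimedia-operations) [2023-12-12T05:59:34Z] "
    "<marostegui@deploy2002> Finished scap: Backport for "
    "[[gerrit:982206|ProductionServices.php: Promote pc2014 as master of pc1 (T351787)]] "
    "(duration: 08m 35s)"
)

started, who, _ = parse_sal(started_line)
finished, _, _ = parse_sal(finished_line)
elapsed = finished - started

minutes, seconds = divmod(int(elapsed.total_seconds()), 60)
print(f"{who}: scap backport took {minutes:02d}m {seconds:02d}s")
```

For these two entries the computed gap agrees with the `08m 35s` duration that scap itself reported.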