Upgrade es3 to Bullseye
Closed, Resolved (Public)

Description

  • es2034
  • es2029
  • es2027
  • es1034
  • es1031
  • es1028

Event Timeline

Marostegui triaged this task as Medium priority.
Marostegui moved this task from Triage to In progress on the DBA board.

Change 756565 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[operations/puppet@production] es2034: Disable notifications

https://gerrit.wikimedia.org/r/756565

Change 756586 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[operations/puppet@production] es2029: Disable notifications

https://gerrit.wikimedia.org/r/756586

Change 756565 merged by Ladsgroup:

[operations/puppet@production] es2034: Disable notifications

https://gerrit.wikimedia.org/r/756565

Mentioned in SAL (#wikimedia-operations) [2022-01-24T13:06:00Z] <ladsgroup@cumin1001> START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on es2034.codfw.wmnet with reason: reimage for upgrade - T299911

Icinga downtime set by ladsgroup@cumin1001 for 1 day, 0:00:00 1 host(s) and their services with reason: reimage for upgrade - T299911

es2034.codfw.wmnet

Mentioned in SAL (#wikimedia-operations) [2022-01-24T13:06:03Z] <ladsgroup@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2034.codfw.wmnet with reason: reimage for upgrade - T299911

Cookbook cookbooks.sre.hosts.reimage was started by ladsgroup@cumin1001 for host es2034.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by ladsgroup@cumin1001 for host es2034.codfw.wmnet with OS bullseye executed with errors:

  • es2034 (FAIL)
    • Downtimed on Icinga
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by ladsgroup@cumin1001 for host es2034.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by ladsgroup@cumin1001 for host es2034.codfw.wmnet with OS bullseye completed:

  • es2034 (WARN)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202201241400_ladsgroup_1151_es2034.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

Change 756586 merged by Ladsgroup:

[operations/puppet@production] es2029: Disable notifications

https://gerrit.wikimedia.org/r/756586

Mentioned in SAL (#wikimedia-operations) [2022-01-25T10:00:36Z] <ladsgroup@cumin1001> START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on es2029.codfw.wmnet with reason: reimage for upgrade - T299911

Icinga downtime set by ladsgroup@cumin1001 for 1 day, 0:00:00 1 host(s) and their services with reason: reimage for upgrade - T299911

es2029.codfw.wmnet

Mentioned in SAL (#wikimedia-operations) [2022-01-25T10:00:45Z] <ladsgroup@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2029.codfw.wmnet with reason: reimage for upgrade - T299911

Cookbook cookbooks.sre.hosts.reimage was started by ladsgroup@cumin1001 for host es2029.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by ladsgroup@cumin1001 for host es2029.codfw.wmnet with OS bullseye completed:

  • es2029 (WARN)
    • Downtimed on Icinga
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202201251001_ladsgroup_9377_es2029.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

Change 756960 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[operations/puppet@production] es2027: Disable notifications

https://gerrit.wikimedia.org/r/756960

Change 756960 merged by Ladsgroup:

[operations/puppet@production] es2027: Disable notifications

https://gerrit.wikimedia.org/r/756960

Mentioned in SAL (#wikimedia-operations) [2022-01-25T10:52:55Z] <ladsgroup@cumin1001> START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on es2027.codfw.wmnet with reason: reimage for upgrade - T299911

Icinga downtime set by ladsgroup@cumin1001 for 1 day, 0:00:00 1 host(s) and their services with reason: reimage for upgrade - T299911

es2027.codfw.wmnet

Mentioned in SAL (#wikimedia-operations) [2022-01-25T10:52:59Z] <ladsgroup@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2027.codfw.wmnet with reason: reimage for upgrade - T299911

Cookbook cookbooks.sre.hosts.reimage was started by ladsgroup@cumin1001 for host es2027.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by ladsgroup@cumin1001 for host es2027.codfw.wmnet with OS bullseye completed:

  • es2027 (WARN)
    • Downtimed on Icinga
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202201251053_ladsgroup_26115_es2027.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

Mentioned in SAL (#wikimedia-operations) [2022-01-25T12:33:03Z] <ladsgroup@cumin1001> dbctl commit (dc=all): 'Depool es1031 (T299911)', diff saved to https://phabricator.wikimedia.org/P19136 and previous config saved to /var/cache/conftool/dbconfig/20220125-123303-ladsgroup.json

Change 756971 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[operations/puppet@production] es1031: Disable notifications

https://gerrit.wikimedia.org/r/756971

Change 756971 merged by Ladsgroup:

[operations/puppet@production] es1031: Disable notifications

https://gerrit.wikimedia.org/r/756971

Mentioned in SAL (#wikimedia-operations) [2022-01-25T13:06:33Z] <ladsgroup@cumin1001> START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on es1031.eqiad.wmnet with reason: reimage for upgrade - T299911

Icinga downtime set by ladsgroup@cumin1001 for 1 day, 0:00:00 1 host(s) and their services with reason: reimage for upgrade - T299911

es1031.eqiad.wmnet

Mentioned in SAL (#wikimedia-operations) [2022-01-25T13:06:37Z] <ladsgroup@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es1031.eqiad.wmnet with reason: reimage for upgrade - T299911

Cookbook cookbooks.sre.hosts.reimage was started by ladsgroup@cumin1001 for host es1031.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by ladsgroup@cumin1001 for host es1031.eqiad.wmnet with OS bullseye completed:

  • es1031 (WARN)
    • Downtimed on Icinga
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202201251324_ladsgroup_5946_es1031.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

Mentioned in SAL (#wikimedia-operations) [2022-01-25T14:15:39Z] <ladsgroup@cumin1001> dbctl commit (dc=all): 'Repooling after maintenance es1031 (T299911)', diff saved to https://phabricator.wikimedia.org/P19163 and previous config saved to /var/cache/conftool/dbconfig/20220125-141538-ladsgroup.json

Change 757008 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[operations/puppet@production] es1034: Disable notifications

https://gerrit.wikimedia.org/r/757008

Change 757008 merged by Ladsgroup:

[operations/puppet@production] es1034: Disable notifications

https://gerrit.wikimedia.org/r/757008

Mentioned in SAL (#wikimedia-operations) [2022-01-25T15:00:53Z] <ladsgroup@cumin1001> dbctl commit (dc=all): 'Repooling after maintenance es1031 (T299911)', diff saved to https://phabricator.wikimedia.org/P19175 and previous config saved to /var/cache/conftool/dbconfig/20220125-150052-ladsgroup.json

Mentioned in SAL (#wikimedia-operations) [2022-01-25T15:02:57Z] <ladsgroup@cumin1001> dbctl commit (dc=all): 'Depooling es1034 (T299911)', diff saved to https://phabricator.wikimedia.org/P19176 and previous config saved to /var/cache/conftool/dbconfig/20220125-150256-ladsgroup.json

Cookbook cookbooks.sre.hosts.reimage was started by ladsgroup@cumin1001 for host es1034.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by ladsgroup@cumin1001 for host es1034.eqiad.wmnet with OS bullseye completed:

  • es1034 (WARN)
    • Downtimed on Icinga
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202201251520_ladsgroup_28989_es1034.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

Mentioned in SAL (#wikimedia-operations) [2022-01-25T15:56:04Z] <ladsgroup@cumin1001> dbctl commit (dc=all): 'Repooling after maintenance es1034 (T299911)', diff saved to https://phabricator.wikimedia.org/P19193 and previous config saved to /var/cache/conftool/dbconfig/20220125-155604-ladsgroup.json

Mentioned in SAL (#wikimedia-operations) [2022-01-25T16:41:19Z] <ladsgroup@cumin1001> dbctl commit (dc=all): 'Repooling after maintenance es1034 (T299911)', diff saved to https://phabricator.wikimedia.org/P19204 and previous config saved to /var/cache/conftool/dbconfig/20220125-164118-ladsgroup.json

Mentioned in SAL (#wikimedia-operations) [2022-01-25T16:43:25Z] <ladsgroup@cumin1001> dbctl commit (dc=all): 'Make es1031 master of es3 T299911', diff saved to https://phabricator.wikimedia.org/P19206 and previous config saved to /var/cache/conftool/dbconfig/20220125-164324-ladsgroup.json

Change 757032 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[operations/puppet@production] es1028: Disable notifications

https://gerrit.wikimedia.org/r/757032

Change 757032 merged by Ladsgroup:

[operations/puppet@production] es1028: Disable notifications

https://gerrit.wikimedia.org/r/757032

Mentioned in SAL (#wikimedia-operations) [2022-01-25T16:49:00Z] <ladsgroup@cumin1001> dbctl commit (dc=all): 'Depooling es1028 (T299911)', diff saved to https://phabricator.wikimedia.org/P19208 and previous config saved to /var/cache/conftool/dbconfig/20220125-164900-ladsgroup.json

Cookbook cookbooks.sre.hosts.reimage was started by ladsgroup@cumin1001 for host es1028.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by ladsgroup@cumin1001 for host es1028.eqiad.wmnet with OS bullseye completed:

  • es1028 (WARN)
    • Downtimed on Icinga
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202201251744_ladsgroup_24763_es1028.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

Mentioned in SAL (#wikimedia-operations) [2022-01-25T18:24:36Z] <ladsgroup@cumin1001> dbctl commit (dc=all): 'Repooling after maintenance es1028 (T299911)', diff saved to https://phabricator.wikimedia.org/P19215 and previous config saved to /var/cache/conftool/dbconfig/20220125-182435-ladsgroup.json

Mentioned in SAL (#wikimedia-operations) [2022-01-25T19:09:50Z] <ladsgroup@cumin1001> dbctl commit (dc=all): 'Repooling after maintenance es1028 (T299911)', diff saved to https://phabricator.wikimedia.org/P19220 and previous config saved to /var/cache/conftool/dbconfig/20220125-190949-ladsgroup.json

Mentioned in SAL (#wikimedia-operations) [2022-01-25T19:12:39Z] <ladsgroup@cumin1001> dbctl commit (dc=all): 'Make es1028 master of es3 T299911', diff saved to https://phabricator.wikimedia.org/P19221 and previous config saved to /var/cache/conftool/dbconfig/20220125-191238-ladsgroup.json

Ladsgroup updated the task description.
Ladsgroup moved this task from Incoming to Done on the User-Ladsgroup board.
Ladsgroup moved this task from In progress to Done on the DBA board.
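
The dbctl entries in the timeline above share a regular shape (timestamp, user, commit message), so the time a host spent depooled can be read off mechanically. The following is a minimal sketch of that, assuming the SAL line format stays exactly as it appears in this task. The names `SAL_DBCTL`, `parse_dbctl_entries` and `depool_window` are hypothetical helpers written for this illustration; they are not part of dbctl, conftool or Spicerack. The two sample lines are copied from the es1031 entries above.

```python
import re
from datetime import datetime, timedelta

# Matches the "Mentioned in SAL" dbctl commit entries as they appear in this task.
# The pattern is an assumption based on these log lines only.
SAL_DBCTL = re.compile(
    r"\[(?P<ts>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})Z\] "
    r"<(?P<user>[^>]+)> dbctl commit \(dc=all\): '(?P<msg>[^']*)'"
)


def parse_dbctl_entries(lines):
    """Return (timestamp, user, message) tuples for each dbctl commit entry."""
    entries = []
    for line in lines:
        match = SAL_DBCTL.search(line)
        if match:
            ts = datetime.strptime(match.group("ts"), "%Y-%m-%dT%H:%M:%S")
            entries.append((ts, match.group("user"), match.group("msg")))
    return entries


def depool_window(entries, host):
    """Time between the first depool commit and the first later repool commit for host."""
    depool = next(
        ts for ts, _, msg in entries
        if host in msg and msg.lower().startswith("depool")
    )
    repool = next(
        ts for ts, _, msg in entries
        if host in msg and msg.lower().startswith("repool") and ts > depool
    )
    return repool - depool


# Two entries copied from the timeline above (es1031 depool and first repool step).
SAMPLE = [
    "Mentioned in SAL (#wikimedia-operations) [2022-01-25T12:33:03Z] <ladsgroup@cumin1001> "
    "dbctl commit (dc=all): 'Depool es1031 (T299911)', diff saved to "
    "https://phabricator.wikimedia.org/P19136 and previous config saved to "
    "/var/cache/conftool/dbconfig/20220125-123303-ladsgroup.json",
    "Mentioned in SAL (#wikimedia-operations) [2022-01-25T14:15:39Z] <ladsgroup@cumin1001> "
    "dbctl commit (dc=all): 'Repooling after maintenance es1031 (T299911)', diff saved to "
    "https://phabricator.wikimedia.org/P19163 and previous config saved to "
    "/var/cache/conftool/dbconfig/20220125-141538-ladsgroup.json",
]

if __name__ == "__main__":
    parsed = parse_dbctl_entries(SAMPLE)
    print(depool_window(parsed, "es1031"))  # prints 1:42:36
```

The window is measured to the first repool commit only. The timeline shows repooling happening in more than one commit per host, so the time until a host is back at full weight is longer than this figure.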