Upgrade es5 to Bullseye
Closed, Resolved · Public

Description

Let's upgrade es5 to Bullseye.
es5 is RW not RO, so it does require a proper DB switchover.

  • es2025
  • es2024
  • es2023
  • es1025
  • es1024
  • es1023
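As a rough illustration of the plan, the sketch below expands the short host names into FQDNs and puts the replicas ahead of the primaries, so that the host needing a switchover comes last. It only approximates the order actually followed in the timeline. The helper names (`fqdn`, `reimage_order`) are hypothetical, the digit-to-datacenter mapping is inferred from the host names in the timeline (es1xxx in eqiad, es2xxx in codfw), and treating es2023 and es1024 as the primaries is likewise an inference from the log entries, not something the description states.

```python
import re

# Assumption: the first digit of the host number selects the datacenter,
# as seen in the timeline (es1025.eqiad.wmnet, es2025.codfw.wmnet).
DATACENTERS = {"1": "eqiad", "2": "codfw"}


def fqdn(host: str) -> str:
    """Expand a short host name such as 'es2025' into its FQDN."""
    match = re.fullmatch(r"[a-z]+(\d)\d+", host)
    if match is None or match.group(1) not in DATACENTERS:
        raise ValueError(f"unrecognised host name: {host}")
    return f"{host}.{DATACENTERS[match.group(1)]}.wmnet"


def reimage_order(hosts, primaries):
    """Return replicas first and primaries last, preserving the given order."""
    replicas = [h for h in hosts if h not in primaries]
    masters = [h for h in hosts if h in primaries]
    return replicas + masters


HOSTS = ["es2025", "es2024", "es2023", "es1025", "es1024", "es1023"]
# Assumption: inferred from the timeline, not stated in the description.
PRIMARIES = {"es2023", "es1024"}

if __name__ == "__main__":
    for host in reimage_order(HOSTS, PRIMARIES):
        role = "primary" if host in PRIMARIES else "replica"
        print(f"{fqdn(host)} ({role})")
```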

Event Timeline

Restricted Application added a subscriber: Aklapper.
Marostegui renamed this task from Upgrade es4 to Bullseye to Upgrade es5 to Bullseye. Jan 25 2022, 9:55 AM
Marostegui triaged this task as Medium priority.
Marostegui moved this task from Triage to Ready on the DBA board.

Mentioned in SAL (#wikimedia-operations) [2022-01-26T10:25:45Z] <ladsgroup@cumin1001> dbctl commit (dc=all): 'Depooling es2025 (T300006)', diff saved to https://phabricator.wikimedia.org/P19264 and previous config saved to /var/cache/conftool/dbconfig/20220126-102445-ladsgroup.json

Change 757396 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[operations/puppet@production] es2025: Disable notifications

https://gerrit.wikimedia.org/r/757396

Change 757396 merged by Ladsgroup:

[operations/puppet@production] es2025: Disable notifications

https://gerrit.wikimedia.org/r/757396

Cookbook cookbooks.sre.hosts.reimage was started by ladsgroup@cumin1001 for host es2025.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by ladsgroup@cumin1001 for host es2025.codfw.wmnet with OS bullseye completed:

  • es2025 (WARN)
    • Downtimed on Icinga
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202201261028_ladsgroup_3126_es2025.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

Mentioned in SAL (#wikimedia-operations) [2022-01-26T11:24:39Z] <ladsgroup@cumin1001> dbctl commit (dc=all): 'Repooling after maintenance es2025 (T300006)', diff saved to https://phabricator.wikimedia.org/P19280 and previous config saved to /var/cache/conftool/dbconfig/20220126-112439-ladsgroup.json

Mentioned in SAL (#wikimedia-operations) [2022-01-26T11:27:20Z] <ladsgroup@cumin1001> dbctl commit (dc=all): 'Repooling after maintenance es2025 (T300006)', diff saved to https://phabricator.wikimedia.org/P19281 and previous config saved to /var/cache/conftool/dbconfig/20220126-112719-ladsgroup.json

Change 757417 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[operations/puppet@production] es2024: Disable notifications

https://gerrit.wikimedia.org/r/757417

Change 757417 merged by Ladsgroup:

[operations/puppet@production] es2024: Disable notifications

https://gerrit.wikimedia.org/r/757417

Mentioned in SAL (#wikimedia-operations) [2022-01-26T11:36:26Z] <ladsgroup@cumin1001> dbctl commit (dc=all): 'Depooling es2024 (T300006)', diff saved to https://phabricator.wikimedia.org/P19284 and previous config saved to /var/cache/conftool/dbconfig/20220126-113626-ladsgroup.json

Cookbook cookbooks.sre.hosts.reimage was started by ladsgroup@cumin1001 for host es2024.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by ladsgroup@cumin1001 for host es2024.codfw.wmnet with OS bullseye completed:

  • es2024 (WARN)
    • Downtimed on Icinga
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202201261141_ladsgroup_29141_es2024.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

Mentioned in SAL (#wikimedia-operations) [2022-01-26T12:38:40Z] <ladsgroup@cumin1001> dbctl commit (dc=all): 'Repooling after maintenance es2024 (T300006)', diff saved to https://phabricator.wikimedia.org/P19301 and previous config saved to /var/cache/conftool/dbconfig/20220126-123839-ladsgroup.json

Change 757446 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] es1025: Disable notifications

https://gerrit.wikimedia.org/r/757446

Change 757446 merged by Marostegui:

[operations/puppet@production] es1025: Disable notifications

https://gerrit.wikimedia.org/r/757446

Cookbook cookbooks.sre.hosts.reimage was started by marostegui@cumin1001 for host es1025.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by marostegui@cumin1001 for host es1025.eqiad.wmnet with OS bullseye completed:

  • es1025 (WARN)
    • Downtimed on Icinga
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202201261507_marostegui_30725_es1025.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

Mentioned in SAL (#wikimedia-operations) [2022-01-27T10:47:36Z] <ladsgroup@cumin1001> START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on es[2024-2025].codfw.wmnet with reason: Reimage of the master T300006

Mentioned in SAL (#wikimedia-operations) [2022-01-27T10:47:41Z] <ladsgroup@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es[2024-2025].codfw.wmnet with reason: Reimage of the master T300006

Cookbook cookbooks.sre.hosts.reimage was started by ladsgroup@cumin1001 for host es2023.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by ladsgroup@cumin1001 for host es2023.codfw.wmnet with OS bullseye completed:

  • es2023 (WARN)
    • Downtimed on Icinga
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202201271050_ladsgroup_17459_es2023.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

Change 757638 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[operations/puppet@production] es1023: Disable notifications

https://gerrit.wikimedia.org/r/757638

Change 757638 merged by Ladsgroup:

[operations/puppet@production] es1023: Disable notifications

https://gerrit.wikimedia.org/r/757638

Mentioned in SAL (#wikimedia-operations) [2022-01-27T12:06:49Z] <ladsgroup@cumin1001> dbctl commit (dc=all): 'Depooling es1023 (T300006)', diff saved to https://phabricator.wikimedia.org/P19444 and previous config saved to /var/cache/conftool/dbconfig/20220127-120648-ladsgroup.json

Cookbook cookbooks.sre.hosts.reimage was started by ladsgroup@cumin1001 for host es1023.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by ladsgroup@cumin1001 for host es1023.eqiad.wmnet with OS bullseye completed:

  • es1023 (WARN)
    • Downtimed on Icinga
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202201271345_ladsgroup_2000_es1023.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

Mentioned in SAL (#wikimedia-operations) [2022-01-27T14:25:18Z] <ladsgroup@cumin1001> dbctl commit (dc=all): 'Repooling after maintenance es1023 (T300006)', diff saved to https://phabricator.wikimedia.org/P19475 and previous config saved to /var/cache/conftool/dbconfig/20220127-142517-ladsgroup.json

Mentioned in SAL (#wikimedia-operations) [2022-01-27T15:10:32Z] <ladsgroup@cumin1001> dbctl commit (dc=all): 'Repooling after maintenance es1023 (T300006)', diff saved to https://phabricator.wikimedia.org/P19483 and previous config saved to /var/cache/conftool/dbconfig/20220127-151032-ladsgroup.json

Now only the master is left; maybe we will do that next week?

Take a look at the very same thing I have to do for es4: T300127

Mentioned in SAL (#wikimedia-operations) [2022-02-15T09:49:31Z] <ladsgroup@cumin1001> START - Cookbook sre.hosts.downtime for 1:00:00 on 6 hosts with reason: Primary switchover es5 T300006

Mentioned in SAL (#wikimedia-operations) [2022-02-15T09:49:36Z] <ladsgroup@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 6 hosts with reason: Primary switchover es5 T300006

Mentioned in SAL (#wikimedia-operations) [2022-02-15T10:02:53Z] <ladsgroup@cumin1001> dbctl commit (dc=all): 'Set es1023 with weight 0 T300006', diff saved to https://phabricator.wikimedia.org/P20772 and previous config saved to /var/cache/conftool/dbconfig/20220215-100253-ladsgroup.json

Mentioned in SAL (#wikimedia-operations) [2022-02-15T10:10:21Z] <Amir1> Starting es5 eqiad failover from es1024 to es1023 - T300006

Mentioned in SAL (#wikimedia-operations) [2022-02-15T10:14:13Z] <ladsgroup@cumin1001> dbctl commit (dc=all): 'Promote es1023 to es5 primary and set section read-write T300006', diff saved to https://phabricator.wikimedia.org/P20776 and previous config saved to /var/cache/conftool/dbconfig/20220215-101412-root.json

Mentioned in SAL (#wikimedia-operations) [2022-02-15T10:18:18Z] <ladsgroup@cumin1001> dbctl commit (dc=all): 'Setting weight to es1023 T300006', diff saved to https://phabricator.wikimedia.org/P20777 and previous config saved to /var/cache/conftool/dbconfig/20220215-101817-root.json
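For anyone wanting to time the switchover from these SAL entries, here is a minimal sketch that pulls the timestamp, author and message out of a "Mentioned in SAL" line and measures the gap between two entries. The regular expression and the `parse_sal` helper are assumptions based only on the line format visible in this task, and the second sample line is shortened (the "diff saved to" tail is dropped). The result is the time between two log entries, not a measurement of how long the section was read-only.

```python
import re
from datetime import datetime

# Assumption: a SAL mention looks like "... [<UTC timestamp>] <author> message",
# which is the shape of the entries in this task.
SAL_RE = re.compile(r"\[(?P<ts>[^\]]+)\] <(?P<who>[^>]+)> (?P<msg>.*)")


def parse_sal(line: str):
    """Return (timestamp, author, message) for one SAL mention line."""
    match = SAL_RE.search(line)
    if match is None:
        raise ValueError("not a SAL entry")
    ts = datetime.strptime(match.group("ts"), "%Y-%m-%dT%H:%M:%SZ")
    return ts, match.group("who"), match.group("msg")


start = parse_sal(
    "Mentioned in SAL (#wikimedia-operations) [2022-02-15T10:10:21Z] "
    "<Amir1> Starting es5 eqiad failover from es1024 to es1023 - T300006"
)
promoted = parse_sal(
    "Mentioned in SAL (#wikimedia-operations) [2022-02-15T10:14:13Z] "
    "<ladsgroup@cumin1001> dbctl commit (dc=all): "
    "'Promote es1023 to es5 primary and set section read-write T300006'"
)

# Seconds between the "Starting ... failover" entry and the promotion commit.
elapsed = (promoted[0] - start[0]).total_seconds()

if __name__ == "__main__":
    print(f"{start[1]} -> {promoted[1]}: {elapsed:.0f} s")
```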

Change 762783 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[operations/puppet@production] es1024: Disable notifications

https://gerrit.wikimedia.org/r/762783

Change 762783 merged by Ladsgroup:

[operations/puppet@production] es1024: Disable notifications

https://gerrit.wikimedia.org/r/762783

Mentioned in SAL (#wikimedia-operations) [2022-02-15T11:04:20Z] <ladsgroup@cumin1001> dbctl commit (dc=all): 'Depooling es1024 (T300006)', diff saved to https://phabricator.wikimedia.org/P20781 and previous config saved to /var/cache/conftool/dbconfig/20220215-110420-ladsgroup.json

Cookbook cookbooks.sre.hosts.reimage was started by ladsgroup@cumin1001 for host es1024.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by ladsgroup@cumin1001 for host es1024.eqiad.wmnet with OS bullseye completed:

  • es1024 (WARN)
    • Downtimed on Icinga
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202202151516_ladsgroup_21612_es1024.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

Mentioned in SAL (#wikimedia-operations) [2022-02-15T16:00:55Z] <ladsgroup@cumin1001> dbctl commit (dc=all): 'Repooling after maintenance es1024 (T300006)', diff saved to https://phabricator.wikimedia.org/P20813 and previous config saved to /var/cache/conftool/dbconfig/20220215-160055-ladsgroup.json

Mentioned in SAL (#wikimedia-operations) [2022-02-15T16:46:11Z] <ladsgroup@cumin1001> dbctl commit (dc=all): 'Repooling after maintenance es1024 (T300006)', diff saved to https://phabricator.wikimedia.org/P20819 and previous config saved to /var/cache/conftool/dbconfig/20220215-164611-ladsgroup.json

Ladsgroup updated the task description.
Ladsgroup moved this task from Blocked to Done on the DBA board.