Migrate lvs101[345] to lvs101[789]
Closed, Resolved · Public

Description

The replacement LVSes were racked in T295804. This task is about going through the migration process to replace each of the old LVSes with a new one, which includes moving physical cables over for the non-primary interfaces.

lvs1016 was already migrated to lvs1020 before the holidays (its patches are in the earlier-linked task). We'll do the other three under this ticket.

  • lvs1013 -> lvs1017 (high-traffic1)
  • lvs1014 -> lvs1018 (high-traffic2)
  • lvs1015 -> lvs1019 (low-traffic)
  • lvs1016 -> lvs1020 (secondary; done in other task)
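For reference, the bracket shorthand in the task title can be expanded into the same old/new pairs listed above. This is only a minimal sketch: `expand`, `TRAFFIC_CLASS`, and `MIGRATIONS` are hypothetical names introduced for illustration, and the helper handles just a single bracket set of single characters rather than any full host-range syntax.

```python
import re


def expand(pattern: str) -> list[str]:
    """Expand one bracket set of single characters, e.g. 'lvs101[345]'."""
    match = re.fullmatch(r"(.*)\[(\w+)\](.*)", pattern)
    if match is None:
        # No bracket set present: the pattern is already a single host name.
        return [pattern]
    prefix, chars, suffix = match.groups()
    return [f"{prefix}{char}{suffix}" for char in chars]


# Traffic classes copied from the list in the task description.
TRAFFIC_CLASS = {
    "lvs1013": "high-traffic1",
    "lvs1014": "high-traffic2",
    "lvs1015": "low-traffic",
}

# Pair old and new hosts in order, as the title "lvs101[345] to lvs101[789]" implies.
MIGRATIONS = dict(zip(expand("lvs101[345]"), expand("lvs101[789]")))

for old, new in MIGRATIONS.items():
    print(f"{old} -> {new} ({TRAFFIC_CLASS[old]})")
```

lvs1016 -> lvs1020 is left out of the sketch because it was handled in the other task.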

Event Timeline

BBlack updated the task description.

Change 761697 had a related patch set uploaded (by BBlack; author: BBlack):

[operations/puppet@production] lvs1017 interface/role setup

https://gerrit.wikimedia.org/r/761697

Mentioned in SAL (#wikimedia-operations) [2022-02-10T18:43:53Z] <bblack> lvs1013 - stopping puppet+pybal for move to lvs1017, high-traffic1 traffic fails over to lvs1020 for now - T301142

Change 761697 merged by BBlack:

[operations/puppet@production] lvs1017 interface/role setup

https://gerrit.wikimedia.org/r/761697

Mentioned in SAL (#wikimedia-operations) [2022-02-10T19:11:04Z] <bblack> lvs1017 rebooting for sanity-check after prod config - T301142

Change 761704 had a related patch set uploaded (by BBlack; author: BBlack):

[operations/puppet@production] lvs1017: fix public1-b/c interface ordering

https://gerrit.wikimedia.org/r/761704

Change 761704 merged by BBlack:

[operations/puppet@production] lvs1017: fix public1-b/c interface ordering

https://gerrit.wikimedia.org/r/761704

Mentioned in SAL (#wikimedia-operations) [2022-02-10T19:25:37Z] <bblack> lvs1017 reboot again for clean network config - T301142

Change 761709 had a related patch set uploaded (by BBlack; author: BBlack):

[operations/homer/public@master] Add lvs1017 to pybal neighbors

https://gerrit.wikimedia.org/r/761709

Change 761709 merged by BBlack:

[operations/homer/public@master] Add lvs1017 to pybal neighbors

https://gerrit.wikimedia.org/r/761709

Change 761725 had a related patch set uploaded (by BBlack; author: BBlack):

[operations/puppet@production] lvs1016: clean up unused hieradata

https://gerrit.wikimedia.org/r/761725

Change 761726 had a related patch set uploaded (by BBlack; author: BBlack):

[operations/puppet@production] lvs1013: deconfigure towards spare::system

https://gerrit.wikimedia.org/r/761726

Change 761725 merged by BBlack:

[operations/puppet@production] lvs1016: clean up unused hieradata

https://gerrit.wikimedia.org/r/761725

Change 761726 merged by BBlack:

[operations/puppet@production] lvs1013: deconfigure towards spare::system

https://gerrit.wikimedia.org/r/761726

Change 761728 had a related patch set uploaded (by BBlack; author: BBlack):

[operations/homer/public@master] Remove lvs1013 from pybal neighbors

https://gerrit.wikimedia.org/r/761728

Change 761728 merged by BBlack:

[operations/homer/public@master] Remove lvs1013 from pybal neighbors

https://gerrit.wikimedia.org/r/761728

Cookbook cookbooks.sre.hosts.reimage was started by bblack@cumin1001 for host lvs1013.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage started by bblack@cumin1001 for host lvs1013.eqiad.wmnet with OS buster completed:

  • lvs1013 (PASS)
    • Downtimed on Icinga
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202202102203_bblack_353_lvs1013.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

lvs1013 -> lvs1017 is complete, including cleanup.

Since the process is tricky to get right and is a corner case for so much of our automation, I've documented it loosely in etherpad for now, to help with the next two: https://etherpad.wikimedia.org/p/LVS-Migration

Note that LVS replacements are rare, but they do happen. We might want to clean this up further on the next two runs, and then put it on Wikitech for whichever poor soul has to do the next one (e.g. the codfw hw refresh).
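As a rough illustration of the order of operations, the sketch below lists the steps as they appear in this task's timeline for lvs1013 -> lvs1017. It is reconstructed from the log entries above, not from the etherpad, so it may miss steps documented there. The function name `migration_steps` is hypothetical, and the position of the cable move in the sequence is an assumption (the description mentions it, but the timeline does not say when it happened).

```python
def migration_steps(old: str, new: str, secondary: str = "lvs1020") -> list[str]:
    """Return the per-host steps in the order seen in this task's timeline."""
    return [
        f"{old}: stop puppet + pybal (traffic fails over to {secondary})",
        # Assumption: the timeline does not record when the cables were moved.
        f"{old} -> {new}: move cables for the non-primary interfaces",
        f"{new}: merge interface/role setup in operations/puppet",
        f"{new}: reboot to sanity-check the production config",
        f"{new}: add to pybal neighbors in operations/homer/public",
        f"{old}: deconfigure towards spare::system in operations/puppet",
        f"{old}: remove from pybal neighbors in operations/homer/public",
        f"{old}: reimage with the sre.hosts.reimage cookbook",
    ]


for number, step in enumerate(migration_steps("lvs1013", "lvs1017"), start=1):
    print(f"{number}. {step}")
```

Calling `migration_steps("lvs1014", "lvs1018")` or `migration_steps("lvs1015", "lvs1019")` gives the same checklist for the remaining two hosts.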

Change 762852 had a related patch set uploaded (by BBlack; author: BBlack):

[operations/puppet@production] lvs1018 interface/role setup

https://gerrit.wikimedia.org/r/762852

Mentioned in SAL (#wikimedia-operations) [2022-02-15T16:26:07Z] <bblack> lvs1014 - downtimed - stopping puppet+pybal to fail traffic over to lvs1020 - T301142

Change 762852 merged by BBlack:

[operations/puppet@production] lvs1018 interface/role setup

https://gerrit.wikimedia.org/r/762852

Mentioned in SAL (#wikimedia-operations) [2022-02-15T17:08:53Z] <bblack> cr[12]-eqiad: manual edit static fallback route for high-traffic2 from lvs1014 to lvs1018 - T301142

Change 762874 had a related patch set uploaded (by BBlack; author: BBlack):

[operations/homer/public@master] Remove lvs1014 from pybal neighbors

https://gerrit.wikimedia.org/r/762874

Change 762876 had a related patch set uploaded (by BBlack; author: BBlack):

[operations/puppet@production] lvs1014: unconfigure towards spare::system

https://gerrit.wikimedia.org/r/762876

Change 762874 merged by BBlack:

[operations/homer/public@master] Remove lvs1014 from pybal neighbors

https://gerrit.wikimedia.org/r/762874

Change 762876 merged by BBlack:

[operations/puppet@production] lvs1014: unconfigure towards spare::system

https://gerrit.wikimedia.org/r/762876

Cookbook cookbooks.sre.hosts.reimage was started by bblack@cumin1001 for host lvs1014.eqiad.wmnet with OS buster

Change 762888 had a related patch set uploaded (by BBlack; author: BBlack):

[operations/puppet@production] lvs1019 interface/role setup

https://gerrit.wikimedia.org/r/762888

Mentioned in SAL (#wikimedia-operations) [2022-02-15T18:05:41Z] <bblack> lvs1015 - stopping puppet+pybal to begin transition to lvs1019 - T301142

Change 762888 merged by BBlack:

[operations/puppet@production] lvs1019 interface/role setup

https://gerrit.wikimedia.org/r/762888

Cookbook cookbooks.sre.hosts.reimage started by bblack@cumin1001 for host lvs1014.eqiad.wmnet with OS buster completed:

  • lvs1014 (PASS)
    • Downtimed on Icinga
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202202151748_bblack_5466_lvs1014.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Mentioned in SAL (#wikimedia-operations) [2022-02-15T18:41:52Z] <bblack> lvs1019 - disable puppet/pybal, reboot - T301142

Change 762903 had a related patch set uploaded (by BBlack; author: BBlack):

[operations/homer/public@master] Add lvs1019 to pybal neighbors

https://gerrit.wikimedia.org/r/762903

Change 762903 merged by BBlack:

[operations/homer/public@master] Add lvs1019 to pybal neighbors

https://gerrit.wikimedia.org/r/762903

Change 762930 had a related patch set uploaded (by BBlack; author: BBlack):

[operations/homer/public@master] Remove lvs1015 from pybal neighbors

https://gerrit.wikimedia.org/r/762930

Change 762931 had a related patch set uploaded (by BBlack; author: BBlack):

[operations/puppet@production] lvs1015: unconfigure towards spare::system

https://gerrit.wikimedia.org/r/762931

Change 762930 merged by BBlack:

[operations/homer/public@master] Remove lvs1015 from pybal neighbors

https://gerrit.wikimedia.org/r/762930

Change 762931 merged by BBlack:

[operations/puppet@production] lvs1015: unconfigure towards spare::system

https://gerrit.wikimedia.org/r/762931

Cookbook cookbooks.sre.hosts.reimage was started by bblack@cumin1001 for host lvs1015.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage started by bblack@cumin1001 for host lvs1015.eqiad.wmnet with OS buster completed:

  • lvs1015 (PASS)
    • Downtimed on Icinga
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202202151929_bblack_4611_lvs1015.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Change 762941 had a related patch set uploaded (by BBlack; author: BBlack):

[operations/puppet@production] Fix section label in smokeping config

https://gerrit.wikimedia.org/r/762941

Change 762941 merged by BBlack:

[operations/puppet@production] Fix section label in smokeping config

https://gerrit.wikimedia.org/r/762941