
Move lvs2011 from private1-a-codfw (row) to private1-a2-codfw (rack) vlan
Closed, Resolved, Public

Description

Once we've completed T352912 we can move lvs2011 from the legacy per-row vlan private1-a-codfw to the rack-specific one, private1-a2-codfw.

This will also change the host from BGP peering with the two codfw core routers to peering with the leaf switch in rack A2.

The (updated) process will be roughly:

  1. Netbox: Reserve new IPs from private1-a2-codfw for lvs2011
  2. Puppet: Create patch that will (don't merge yet):
    1. Move previous primary IP to new vlan sub-interface of primary link
    2. Change the BGP peering for PyBal on the host to peer with the top-of-rack switch instead of the CRs
    3. Add the newly-assigned primary IP to hierdata/common.yaml
  3. Downtime lvs2011, CRs and lsw1-a2-codfw
  4. Disable puppet on lvs2011
  5. Manually stop PyBal service on lvs2011
  6. Check grafana traffic graphs / connections to validate that traffic has moved to the backup, lvs2014, and that everything looks ok after 15-20 mins
  7. Netbox:
    1. Attach reserved IPs to primary interface of lvs, removing old ones
    2. Change untagged vlan for switch port from old to new vlan
    3. Add old vlan to list of tagged vlans on switch port
  8. Homer: Push new switch config
    1. lvs will now be unreachable via SSH - use idrac console if needed
  9. Run sre.dns.netbox cookbook to update primary DNS for lvs2011
  10. Run sre.dns.wipe-cache cookbook for lvs2011
  11. Manually disable BGP peering to lvs on lsw1-a2-codfw
  12. Merge patch created in step 2
  13. Start reimage of lvs2011
  14. Homer: run against CRs to remove old BGP peering
  15. When the host is re-imaged it should come back up on the new primary IP
    1. Check reachability is ok over new primary IP (v4/v6)
    2. Check reachability is ok to old private1-a-codfw vlan (v4/v6)
  16. Manually enable BGP peering to lvs on lsw1-a2-codfw
  17. Check BGP establishes, watch grafana connection graphs to see traffic flip back from lvs2014, test connections
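
The reachability checks in step 15 could be scripted roughly as below. This is a minimal sketch, not the tooling actually used for this move: it assumes that a TCP connection to port 22 (SSH) is an acceptable liveness probe, and the function name `probe` and its injectable `resolve` / `connect` parameters are hypothetical, added only so the logic can be exercised without touching the network.

```python
import socket

# Address families to test, labelled the way the checklist refers to them.
FAMILIES = {"v4": socket.AF_INET, "v6": socket.AF_INET6}


def probe(host, port=22, timeout=3.0,
          resolve=socket.getaddrinfo, connect=socket.create_connection):
    """Return {"v4": bool, "v6": bool}, True where a TCP connect succeeded.

    `resolve` and `connect` default to the standard library functions and
    can be replaced with fakes for offline testing (hypothetical hook).
    """
    results = {}
    for label, family in FAMILIES.items():
        try:
            # Resolve only addresses of the family under test.
            infos = resolve(host, port, family, socket.SOCK_STREAM)
        except OSError:
            # No A / AAAA record, or resolution failed for this family.
            results[label] = False
            continue
        ok = False
        for *_, sockaddr in infos:
            try:
                # sockaddr is (ip, port) for v4 and (ip, port, flow, scope)
                # for v6; the first two fields are enough to connect.
                conn = connect(sockaddr[:2], timeout)
                conn.close()
                ok = True
                break
            except OSError:
                continue
        results[label] = ok
    return results
```

Calling `probe("lvs2011.codfw.wmnet")` from a host with access to the private network would return a dict such as `{"v4": ..., "v6": ...}`; a `False` for either family after the reimage would be a reason to hold off on re-enabling BGP in step 16.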

Event Timeline

Change 980954 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/puppet@production] Move lvs2011 from private1-a-codfw to private1-a2-codfw vlan

https://gerrit.wikimedia.org/r/980954

Change 980954 abandoned by Cathal Mooney:

[operations/puppet@production] Move lvs2011 from private1-a-codfw to private1-a2-codfw vlan

Reason:

will submit new patch

https://gerrit.wikimedia.org/r/980954

Change rGERRIT1007703d0399 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/puppet@production] Move lvs2011 from private1-a-codfw to private1-a2-codfw vlan

https://gerrit.wikimedia.org/r/1007703

Icinga downtime and Alertmanager silence (ID=6010131f-b756-49c6-8082-62badba41292) set by cmooney@cumin1002 for 2:00:00 on 1 host(s) and their services with reason: Move lvs2011 from private1-a-codfw to private1-a2-codfw vlan

lvs2011.codfw.wmnet

Icinga downtime and Alertmanager silence (ID=c0fe6035-a553-49f8-8b94-3d7840e51e64) set by cmooney@cumin1002 for 2:00:00 on 3 host(s) and their services with reason: moving lvs2011 which will disrupt bgp

cr[1-2]-codfw,lsw1-a2-codfw.mgmt

Mentioned in SAL (#wikimedia-operations) [2024-03-05T17:10:15Z] <topranks> disabling pybal on lvs2011 (traffic will move to lvs2014) in advance of reimage T352920

Change rGERRIT1007703d0399 merged by Cathal Mooney:

[operations/puppet@production] Move lvs2011 from private1-a-codfw to private1-a2-codfw vlan

https://gerrit.wikimedia.org/r/1007703

Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin1002 for host lvs2011.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by cmooney@cumin1002 for host lvs2011.codfw.wmnet with OS bullseye completed:

  • lvs2011 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Unable to disable Puppet, the host may have been unreachable
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202403051813_cmooney_432966_lvs2011.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (EVPN Switch)

Reimage looks good, BGP up and lvs2011 handling traffic again:

cmooney@cumin1002:~$ sudo traceroute -I 208.80.153.224 
traceroute to 208.80.153.224 (208.80.153.224), 30 hops max, 60 byte packets
 1  ae4-1020.cr2-eqiad.wikimedia.org (10.64.48.3)  0.475 ms  0.444 ms  0.535 ms
 2  ae0.cr1-eqiad.wikimedia.org (208.80.154.193)  0.428 ms  0.532 ms  0.526 ms
 3  et-1-0-2.cr1-codfw.wikimedia.org (208.80.153.221)  30.613 ms  30.607 ms  30.601 ms
 4  irb-100.ssw1-a1-codfw.codfw.wmnet (10.192.254.5)  35.888 ms  35.882 ms  35.876 ms
 5  irb-2017.lsw1-a2-codfw.codfw.wmnet (10.192.0.106)  31.317 ms  31.309 ms  31.303 ms
 6  text-lb.codfw.wikimedia.org (208.80.153.224)  30.481 ms  30.302 ms  30.258 ms
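
The same check can be made mechanically by parsing the traceroute output and confirming that the last hop before the service IP is the rack A2 leaf switch. The following is a rough sketch under stated assumptions: the helper names `parse_hops` and `penultimate_hop_matches` are hypothetical, the parser only understands lines in the `N  hostname (ip)  ...` form shown above, and hops that time out (`* * *`) are simply skipped.

```python
import re

# Matches lines such as " 5  irb-2017.lsw1-a2-codfw.codfw.wmnet (10.192.0.106)  31.317 ms ..."
HOP_RE = re.compile(r"^\s*(\d+)\s+(\S+)\s+\(([^)]+)\)")

# Output copied from the traceroute above.
SAMPLE = """\
traceroute to 208.80.153.224 (208.80.153.224), 30 hops max, 60 byte packets
 1  ae4-1020.cr2-eqiad.wikimedia.org (10.64.48.3)  0.475 ms  0.444 ms  0.535 ms
 2  ae0.cr1-eqiad.wikimedia.org (208.80.154.193)  0.428 ms  0.532 ms  0.526 ms
 3  et-1-0-2.cr1-codfw.wikimedia.org (208.80.153.221)  30.613 ms  30.607 ms  30.601 ms
 4  irb-100.ssw1-a1-codfw.codfw.wmnet (10.192.254.5)  35.888 ms  35.882 ms  35.876 ms
 5  irb-2017.lsw1-a2-codfw.codfw.wmnet (10.192.0.106)  31.317 ms  31.309 ms  31.303 ms
 6  text-lb.codfw.wikimedia.org (208.80.153.224)  30.481 ms  30.302 ms  30.258 ms
"""


def parse_hops(output):
    """Return a list of (hop_number, hostname, ip) tuples.

    The "traceroute to ..." header and unanswered hops do not match
    HOP_RE and are ignored.
    """
    hops = []
    for line in output.splitlines():
        match = HOP_RE.match(line)
        if match:
            hops.append((int(match.group(1)), match.group(2), match.group(3)))
    return hops


def penultimate_hop_matches(output, expected_substring):
    """True if the hop just before the destination contains expected_substring."""
    hops = parse_hops(output)
    return len(hops) >= 2 and expected_substring in hops[-2][1]


# Traffic to text-lb should arrive via the leaf switch in rack A2.
via_leaf = penultimate_hop_matches(SAMPLE, "lsw1-a2-codfw")
```

With the sample above, `via_leaf` is `True`: hop 5 is `irb-2017.lsw1-a2-codfw.codfw.wmnet`, which is consistent with lvs2011 now announcing the service IP to the leaf switch rather than to the core routers.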