
Move lvs2012 from private1-b-codfw (row) to private1-b2-codfw (rack) vlan
Closed, Resolved (Public)

Description

Once we've completed T352909 we can move lvs2012 from the legacy per-row vlan private1-b-codfw to the rack-specific one, private1-b2-codfw.

This will also change the host's BGP peering: instead of peering with the two codfw core routers (CRs), it will peer with the leaf switch in rack B2 (lsw1-b2-codfw).

The (updated) process will be roughly:

  1. Netbox: Reserve new IPs from private1-b2-codfw for lvs2012
  2. Puppet: Create patch that will (don't merge yet):
    1. Move previous primary IP to new vlan sub-interface of primary link
    2. Change the BGP peering for Pybal on the host to peer with the top-of-rack switch instead of the CRs
    3. Add the newly-assigned primary IP to hieradata/common.yaml
  3. Downtime lvs2012, CRs and lsw1-b2-codfw
  4. Disable puppet on lvs2012
  5. Manually stop PyBal service on lvs2012
  6. Check grafana traffic graphs / connections to validate that traffic has moved to the backup, lvs2014, and that everything still looks ok after 15-20 mins
  7. Netbox:
    1. Attach reserved IPs to primary interface of lvs, removing old ones
    2. Change untagged vlan for switch port from old to new vlan
    3. Add old vlan to list of tagged vlans on switch port
  8. Homer: Push new switch config
    1. lvs will now be unreachable via SSH - use idrac console if needed
  9. Run sre.dns.netbox cookbook to update primary DNS for lvs2012
  10. Run sre.dns.wipe-cache cookbook for lvs2012
  11. Manually disable BGP peering to lvs on lsw1-b2-codfw
  12. Merge patch created in step 2
  13. Start reimage of lvs2012
  14. Homer: run against CRs to remove old BGP peering
  15. When the host is re-imaged it should come back up on the new primary IP
    1. Check reachability is ok over new primary IP (v4/v6)
    2. Check reachability is ok to old private1-b-codfw vlan (v4/v6)
  16. Manually enable BGP peering to lvs on lsw1-b2-codfw
  17. Check BGP establishes, watch grafana connection graphs to see traffic flip back from lvs2014, test connections
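
The switch-port change in step 7 (new vlan becomes untagged, old vlan stays available as tagged) can be sketched as below. This is only an illustration of the intended end state: the SwitchPort class and move_to_rack_vlan function are hypothetical names, not part of Netbox or Homer, and only the two vlan names come from this task.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SwitchPort:
    """Hypothetical, minimal model of a switch port's vlan assignment."""
    untagged_vlan: str
    tagged_vlans: tuple = ()


def move_to_rack_vlan(port, old_vlan, new_vlan):
    """Return a new port config: new_vlan untagged, old_vlan kept as tagged."""
    if port.untagged_vlan != old_vlan:
        raise ValueError(f"expected untagged vlan {old_vlan}, found {port.untagged_vlan}")
    # The new vlan must not be both tagged and untagged on the same port.
    tagged = tuple(v for v in port.tagged_vlans if v != new_vlan)
    # Keep the old row-wide vlan reachable via a tagged sub-interface on the host.
    if old_vlan not in tagged:
        tagged += (old_vlan,)
    return SwitchPort(untagged_vlan=new_vlan, tagged_vlans=tagged)


before = SwitchPort(untagged_vlan="private1-b-codfw")
after = move_to_rack_vlan(before, "private1-b-codfw", "private1-b2-codfw")
print(after)
```

Keeping the old vlan tagged is what lets the host carry its previous row-wide connectivity on a vlan sub-interface once the primary address moves.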

Event Timeline

Change 980948 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/puppet@production] Move lvs2012 from private1-b-codfw to private1-b2-codfw vlan

https://gerrit.wikimedia.org/r/980948

Icinga downtime and Alertmanager silence (ID=38432fab-1dd6-4ffe-a093-648c38675985) set by cmooney@cumin1002 for 2:00:00 on 5 host(s) and their services with reason: moving lvs hosts codfw T352784

cr[1-2]-codfw,cr[1-2]-codfw IPv6,lvs2013

Mentioned in SAL (#wikimedia-operations) [2024-01-16T15:55:18Z] <cmooney@cumin1002> START - Cookbook sre.hosts.downtime for 2:00:00 on re0.cr[1-2]-codfw.mgmt with reason: moving lvs hosts codfw T352784 T352918

Mentioned in SAL (#wikimedia-operations) [2024-01-16T15:55:32Z] <cmooney@cumin1002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on re0.cr[1-2]-codfw.mgmt with reason: moving lvs hosts codfw T352784 T352918

Mentioned in SAL (#wikimedia-operations) [2024-01-16T17:11:07Z] <cmooney@cumin1002> START - Cookbook sre.hosts.downtime for 2:00:00 on lvs2012.codfw.wmnet with reason: moving lvs hosts codfw T352784 T352918

Mentioned in SAL (#wikimedia-operations) [2024-01-16T17:11:23Z] <cmooney@cumin1002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs2012.codfw.wmnet with reason: moving lvs hosts codfw T352784 T352918

Change 980948 abandoned by Cathal Mooney:

[operations/puppet@production] Move lvs2012 from private1-b-codfw to private1-b2-codfw vlan

Reason:

git has me confused!

https://gerrit.wikimedia.org/r/980948

Ok I abandoned my previous change as my git skills weren't up to it.

Prepping a new patch now, plan will be:

  1. Assign new IP from private1-b2-codfw to lvs2012 and attach to primary interface in Netbox
  2. Merge patch to
    1. Move previous primary IP to new vlan sub-interface of primary link
    2. Change the BGP peering for Pybal on the host to peer with the top-of-rack switch instead of the CRs
    3. Add the newly-assigned primary IP to hieradata/common.yaml
  3. Downtime lvs2012
  4. Disable puppet on lvs2012
  5. Manually stop PyBal service on lvs2012
  6. Check grafana traffic graphs / connections to validate that traffic has moved to the backup, lvs2014, and that everything still looks ok after 15-20 mins
  7. On the lsw1-b2-codfw port facing lvs2012, change the untagged vlan to the rack-specific one and add the old row-wide one to the list of tagged vlans
  8. Push new switch config with Homer
    1. lvs will now be unreachable via SSH directly - use idrac console if needed
  9. Run sre.dns.netbox cookbook to update primary DNS for lvs2012
  10. Run Homer against CRs to remove old BGP peering
  11. Merge patch to move connection to private1-b-codfw to a sub-interface, and modify BGP peer for lvs2012 to use switch IP
  12. Start reimage of lvs2012
  13. Manually disable BGP peering to lvs on lsw1-b2-codfw
  14. When re-imaged it should come back up on the new primary IP
    1. Check reachability is ok over new primary IP (v4/v6)
    2. Check reachability is ok to old private1-b-codfw vlan (v4/v6)
  15. Manually enable BGP peering to lvs on lsw1-b2-codfw
  16. Check BGP establishes, watch grafana connection graphs to see traffic flip back from lvs2014, test connections
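
For the v4/v6 reachability check after the reimage, a minimal sketch along these lines could be used. It is an assumption that a TCP connect to SSH (port 22) is an acceptable reachability test; the function names are hypothetical and this is not an existing cookbook.

```python
import socket


def tcp_reachable(host, port, family, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds over the given address family."""
    try:
        infos = socket.getaddrinfo(host, port, family, socket.SOCK_STREAM)
    except socket.gaierror:
        # No address of this family could be resolved for the host.
        return False
    for fam, socktype, proto, _canonname, sockaddr in infos:
        try:
            with socket.socket(fam, socktype, proto) as sock:
                sock.settimeout(timeout)
                sock.connect(sockaddr)
                return True
        except OSError:
            # Refused, timed out or unroutable: try the next resolved address.
            continue
    return False


def check_dual_stack(host, port=22):
    """Check reachability separately over IPv4 and IPv6 (port 22 is an assumption)."""
    return {
        "v4": tcp_reachable(host, port, socket.AF_INET),
        "v6": tcp_reachable(host, port, socket.AF_INET6),
    }
```

Calling check_dual_stack("lvs2012.codfw.wmnet") from a host with access to the private network would return a dict with one boolean per address family.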

Change 1007660 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/puppet@production] Move lvs2012 from private1-b-codfw to private1-b2-codfw vlan

https://gerrit.wikimedia.org/r/1007660

Icinga downtime and Alertmanager silence (ID=2521164e-4d59-47cb-8d79-8dd925725b87) set by cmooney@cumin1002 for 2:00:00 on 1 host(s) and their services with reason: Moving lvs2012 primary interface from private1-b-codfw to private1-b2-codfw

lvs2012.codfw.wmnet

Mentioned in SAL (#wikimedia-operations) [2024-02-29T18:37:15Z] <topranks> disabling PyBal on lvs2012 to move traffic to lvs2014 ahead of reimage T352918

Icinga downtime and Alertmanager silence (ID=1baa9b0f-d917-4da6-83db-5cb28b50c97a) set by cmooney@cumin1002 for 2:00:00 on 2 host(s) and their services with reason: lvs moves to per-rack vlans

cr[1-2]-codfw
cmooney updated the task description.

Change 1007660 merged by Cathal Mooney:

[operations/puppet@production] Move lvs2012 from private1-b-codfw to private1-b2-codfw vlan

https://gerrit.wikimedia.org/r/1007660

Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin1002 for host lvs2012.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by cmooney@cumin1002 for host lvs2012.codfw.wmnet with OS bullseye completed:

  • lvs2012 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Unable to disable Puppet, the host may have been unreachable
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202402291935_cmooney_3706646_lvs2012.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (EVPN Switch)

The host has moved to the new vlan, and BGP is now established between the server and the switch.

cmooney@lvs2012:/etc/pybal$ sudo ss -tulpna | grep ":179" | column -t 
tcp  LISTEN  0  50  10.192.11.8:179    0.0.0.0:*        users:(("pybal",pid=39657,fd=19))
tcp  ESTAB   0  0   10.192.11.8:60844  10.192.11.1:179  users:(("pybal",pid=39657,fd=18))
cmooney@lsw1-b2-codfw> show route table PRODUCTION.inet.0 receive-protocol bgp 10.192.11.8 terse    

PRODUCTION.inet.0: 1332 destinations, 2999 routes (1332 active, 0 holddown, 0 hidden)
  Prefix		  Nexthop	       MED     Lclpref    AS path
* 208.80.153.240/32       10.192.11.8          0                  64600 I
* 208.80.153.241/32       10.192.11.8          0                  64600 I
* 208.80.153.252/32       10.192.11.8          0                  64600 I
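
The two outputs above could be checked programmatically with something like the following sketch. The parsing assumes exactly the column layout shown in this comment (ss piped through column -t, and Junos terse route output); the function names are hypothetical.

```python
import re

SS_OUTPUT = """\
tcp  LISTEN  0  50  10.192.11.8:179    0.0.0.0:*        users:(("pybal",pid=39657,fd=19))
tcp  ESTAB   0  0   10.192.11.8:60844  10.192.11.1:179  users:(("pybal",pid=39657,fd=18))
"""

ROUTE_OUTPUT = """\
PRODUCTION.inet.0: 1332 destinations, 2999 routes (1332 active, 0 holddown, 0 hidden)
  Prefix                  Nexthop              MED     Lclpref    AS path
* 208.80.153.240/32       10.192.11.8          0                  64600 I
* 208.80.153.241/32       10.192.11.8          0                  64600 I
* 208.80.153.252/32       10.192.11.8          0                  64600 I
"""

BGP_PORT = "179"
# Active routes in the terse output start with "*", then the prefix and next hop.
ROUTE_RE = re.compile(r"^\*\s+(\S+/\d+)\s+(\S+)")


def established_bgp_peers(ss_output):
    """Return peer IPs of ESTAB sockets where either end uses TCP port 179 (IPv4 only)."""
    peers = []
    for line in ss_output.splitlines():
        fields = line.split()
        if len(fields) < 6 or fields[1] != "ESTAB":
            continue
        local_port = fields[4].rsplit(":", 1)[-1]
        peer_ip, peer_port = fields[5].rsplit(":", 1)
        if BGP_PORT in (local_port, peer_port):
            peers.append(peer_ip)
    return peers


def received_prefixes(route_output):
    """Map each active prefix to its next hop."""
    routes = {}
    for line in route_output.splitlines():
        match = ROUTE_RE.match(line)
        if match:
            routes[match.group(1)] = match.group(2)
    return routes


print(established_bgp_peers(SS_OUTPUT))
print(received_prefixes(ROUTE_OUTPUT))
```

With the data pasted here, the first call should list the switch address 10.192.11.1 as the only established peer, and the second should map the three /32 prefixes to next hop 10.192.11.8.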

Routing looks ok from elsewhere:

cmooney@cumin1002:~$ sudo traceroute -I 208.80.153.240 
traceroute to 208.80.153.240 (208.80.153.240), 30 hops max, 60 byte packets
 1  ae4-1020.cr2-eqiad.wikimedia.org (10.64.48.3)  0.796 ms  0.737 ms  0.732 ms
 2  ae0.cr1-eqiad.wikimedia.org (208.80.154.193)  0.623 ms  0.619 ms  0.718 ms
 3  et-1-0-2.cr1-codfw.wikimedia.org (208.80.153.221)  30.613 ms  30.609 ms  30.604 ms
 4  irb-100.ssw1-a1-codfw.codfw.wmnet (10.192.254.5)  37.343 ms  37.339 ms  37.335 ms
 5  irb-2029.lsw1-b2-codfw.codfw.wmnet (10.192.11.1)  36.848 ms  36.844 ms  36.839 ms
 6  upload-lb.codfw.wikimedia.org (208.80.153.240)  30.495 ms  30.165 ms  30.151 ms
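
A rough way to confirm from the traceroute that the service IP is now reached through the rack B2 leaf switch is sketched below. It assumes the output format shown above, with one responding address per hop; parse_hops and last_hop_before are hypothetical helper names.

```python
import re

TRACEROUTE_OUTPUT = """\
traceroute to 208.80.153.240 (208.80.153.240), 30 hops max, 60 byte packets
 1  ae4-1020.cr2-eqiad.wikimedia.org (10.64.48.3)  0.796 ms  0.737 ms  0.732 ms
 2  ae0.cr1-eqiad.wikimedia.org (208.80.154.193)  0.623 ms  0.619 ms  0.718 ms
 3  et-1-0-2.cr1-codfw.wikimedia.org (208.80.153.221)  30.613 ms  30.609 ms  30.604 ms
 4  irb-100.ssw1-a1-codfw.codfw.wmnet (10.192.254.5)  37.343 ms  37.339 ms  37.335 ms
 5  irb-2029.lsw1-b2-codfw.codfw.wmnet (10.192.11.1)  36.848 ms  36.844 ms  36.839 ms
 6  upload-lb.codfw.wikimedia.org (208.80.153.240)  30.495 ms  30.165 ms  30.151 ms
"""

# Hop lines look like: "<n>  <name> (<ip>)  <rtt> ms ..."; the header line does not match.
HOP_RE = re.compile(r"^\s*(\d+)\s+(\S+)\s+\(([^)]+)\)")


def parse_hops(output):
    """Return a list of (hop_number, name, ip) tuples."""
    hops = []
    for line in output.splitlines():
        match = HOP_RE.match(line)
        if match:
            hops.append((int(match.group(1)), match.group(2), match.group(3)))
    return hops


def last_hop_before(hops, destination_ip):
    """Return the hop immediately preceding destination_ip, or None if not found."""
    for index, (_number, _name, ip) in enumerate(hops):
        if ip == destination_ip and index > 0:
            return hops[index - 1]
    return None


hops = parse_hops(TRACEROUTE_OUTPUT)
print(last_hop_before(hops, "208.80.153.240"))
```

In the trace pasted here the hop before the service IP is irb-2029.lsw1-b2-codfw (10.192.11.1), the same address the host peers with over BGP.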

Traffic is also going out to the realservers on the old vlan via the new sub-interface (vlan2018):

cmooney@lvs2012:/etc/pybal$ sudo tcpdump -i vlan2018 -l -p -nn -c 10 
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on vlan2018, link-type EN10MB (Ethernet), snapshot length 262144 bytes
20:59:24.872463 IP6 2603:8080:a406:7e8b:19ee:6138:fdef:e64a.54111 > 2620:0:860:ed1a::2:b.443: Flags [.], ack 4091538197, win 2048, options [nop,nop,TS val 4007338374 ecr 2443369631], length 0
20:59:24.872485 IP 159.118.228.122.42262 > 208.80.153.240.443: Flags [P.], seq 2796093617:2796093725, ack 1655924934, win 24569, options [nop,nop,TS val 5895331 ecr 1732233365], length 108
20:59:24.872512 IP 200.79.177.45.51817 > 208.80.153.240.443: Flags [P.], seq 2966277514:2966277549, ack 3810833920, win 8258, length 35
20:59:24.872513 IP 200.79.177.45.51817 > 208.80.153.240.443: Flags [P.], seq 35:70, ack 1, win 8258, length 35
20:59:24.872523 IP 181.50.102.115.4703 > 208.80.153.240.443: Flags [P.], seq 2702401834:2702402971, ack 283635665, win 157, options [nop,nop,TS val 2529029596 ecr 3010347040], length 1137
20:59:24.872540 IP 181.50.102.115.4703 > 208.80.153.240.443: Flags [P.], seq 1137:1168, ack 2897, win 168, options [nop,nop,TS val 2529029596 ecr 3010347051], length 31
20:59:24.872544 IP 187.173.247.242.62692 > 208.80.153.240.443: Flags [.], ack 2095834158, win 1993, options [nop,nop,TS val 2543516028 ecr 1960766974], length 0
20:59:24.872545 IP 189.90.55.78.62135 > 208.80.153.240.443: Flags [.], ack 1058445504, win 1026, length 0
20:59:24.872545 IP 187.173.247.242.62692 > 208.80.153.240.443: Flags [.], ack 1, win 2048, options [nop,nop,TS val 2543516028 ecr 1960766974], length 0
20:59:24.872586 IP6 2603:8080:a406:7e8b:19ee:6138:fdef:e64a.54111 > 2620:0:860:ed1a::2:b.443: Flags [.], ack 2655, win 2006, options [nop,nop,TS val 4007338379 ecr 2443369635], length 0
10 packets captured
1493 packets received by filter
0 packets dropped by kernel
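
To summarise a capture like this one, the destinations can be tallied with a short sketch such as the following. It assumes tcpdump's default one-line format with -nn as shown above; the sample holds three lines copied from the capture, and count_destinations is a hypothetical helper name.

```python
import re
from collections import Counter

TCPDUMP_SAMPLE = """\
20:59:24.872463 IP6 2603:8080:a406:7e8b:19ee:6138:fdef:e64a.54111 > 2620:0:860:ed1a::2:b.443: Flags [.], ack 4091538197, win 2048, options [nop,nop,TS val 4007338374 ecr 2443369631], length 0
20:59:24.872512 IP 200.79.177.45.51817 > 208.80.153.240.443: Flags [P.], seq 2966277514:2966277549, ack 3810833920, win 8258, length 35
20:59:24.872545 IP 189.90.55.78.62135 > 208.80.153.240.443: Flags [.], ack 1058445504, win 1026, length 0
"""

# "<time> IP|IP6 <src>.<sport> > <dst>.<dport>: ..."
PACKET_RE = re.compile(r"^\S+\s+(IP6?)\s+(\S+)\s+>\s+(\S+):\s")


def count_destinations(capture):
    """Count packets per (destination address, destination port)."""
    counts = Counter()
    for line in capture.splitlines():
        match = PACKET_RE.match(line)
        if not match:
            continue
        # With -nn the port is always the last dot-separated field of the endpoint.
        address, port = match.group(3).rsplit(".", 1)
        counts[(address, port)] += 1
    return counts


for (address, port), packets in count_destinations(TCPDUMP_SAMPLE).items():
    print(address, port, packets)
```

Every packet in the capture above is addressed to port 443 on either 208.80.153.240 or 2620:0:860:ed1a::2:b, so a tally like this should only show those two destinations.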