codfw: move public baremetal servers to per rack vlan
Open, Needs Triage, Public

Description

Below is the list of baremetal servers still in the row-wide vlans (public1-a/b/c/d-codfw).
As we're moving away from those vlans towards per-rack vlans, the listed hosts will need to be moved and renumbered into one of the following vlans:

  • public1-b3-codfw : 208.80.153.128/28 - 2620:0:860:6::/64
  • public1-d3-codfw : 208.80.153.144/28 - 2620:0:860:7::/64
  • public1-e5-codfw : 208.80.153.160/28 - 2620:0:860:8::/64 - the first (and only) public vlan in pod EF
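As a rough capacity check, the sketch below counts how many IPv4 addresses each of the /28 prefixes above could offer to servers. The prefixes come from the list; reserving a single address per vlan for the gateway is an assumption made for illustration (the real reservation may differ), and addresses already in use are not accounted for.

```python
import ipaddress

# Per-rack public vlans and their IPv4 prefixes, copied from the list above.
VLANS = {
    "public1-b3-codfw": "208.80.153.128/28",
    "public1-d3-codfw": "208.80.153.144/28",
    "public1-e5-codfw": "208.80.153.160/28",
}

# Assumption (not from the task): one address per vlan is used by the gateway.
RESERVED_FOR_GATEWAY = 1


def assignable_hosts(prefix: str, reserved: int = RESERVED_FOR_GATEWAY) -> int:
    """Return how many IPv4 addresses remain for servers in a prefix."""
    network = ipaddress.ip_network(prefix)
    # hosts() skips the network and broadcast addresses of the prefix.
    usable = sum(1 for _ in network.hosts())
    return usable - reserved


capacity = {name: assignable_hosts(prefix) for name, prefix in VLANS.items()}
for name, free in capacity.items():
    print(f"{name}: up to {free} servers")
```

A /28 has 14 usable host addresses, so under this assumption each vlan has room for at most 13 servers before counting anything already numbered there.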

Ideally this happens through a re-image (for the re-IP), but if that's not practical, we can assist with manually updating the host's IPs.
If the hosts are going to be replaced soon, they can probably be ignored.
If those hosts can be moved to a private IP and fronted with the CDN, that would be even better.

  • alert2002.wikimedia.org @andrea.denisse
  • bast2003.wikimedia.org @MoritzMuehlenhoff
  • cloudweb2002-dev.wikimedia.org @taavi
  • contint2002.wikimedia.org - scheduled for refresh in FY2627-Q4
  • contint2003.wikimedia.org
  • dns[2004-2006].wikimedia.org - Need special care to not cause traffic imbalance @ssingh
  • gerrit2002.wikimedia.org - scheduled for refresh in FY2627-Q4 @Jelto
  • gerrit2003.wikimedia.org @Jelto
  • gitlab2002.wikimedia.org - scheduled for refresh in FY2627-Q4 @Jelto
  • lists2001.wikimedia.org @Ladsgroup
  • netmon2002.wikimedia.org @andrea.denisse

Event Timeline

dns[2004-2006].wikimedia.org - Need special care to not cause traffic imbalance @ssingh

Do all three need to happen at the same time? That would be a problem, since ns1 is so far announced only from codfw. I presume not, but I'm checking. Other than that, if we do one host at a time, there are no concerns at all. We can even do two hosts at a time, as long as at least one stays up and announces the ns1 IP. There are no concerns around rps and load.

@ssingh For the DNS servers, the ones peering with the core routers will be preferred (shorter AS path) over the ones peering with the ToR switches. So if one host can handle all the load, we're good, as we won't have any redundancy issue.
We can use AS-path prepending to fine-tune it, but it would be better to avoid it.

Plan could be:

  • Move dns2004 - internet load will be shared on 2005/2006 (making 2004 roughly a hot standby, while still receiving 10.3.0.1 queries from its pod)
  • Test that 2004 works well
  • Move dns2005 - all the internet load will go to 2006
  • Shortly after, move dns2006 - all the internet load will be balanced between the 3 hosts again
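The effect of this ordering can be sketched with a small model. It assumes, as a simplification, that internet traffic goes only to hosts still peering with the core routers while at least one such host exists, and is otherwise shared by all hosts. The function name and the `core`/`tor` labels are hypothetical. The model ignores the time a host is offline during its own move, and the 10.3.0.1 queries a moved host still receives from its pod.

```python
from typing import Dict, List

CORE, TOR = "core", "tor"  # hypothetical labels for where a host peers


def internet_load_carriers(peering: Dict[str, str]) -> List[str]:
    """Return the hosts that receive internet traffic in this simple model.

    While any host still peers with the core routers, its announcement is
    preferred and ToR-peered hosts get no internet load. Once every host
    peers with a ToR switch, all of them share the load again.
    """
    preferred = sorted(h for h, p in peering.items() if p == CORE)
    return preferred or sorted(peering)


# Starting state: all three hosts peer with the core routers.
peering = {"dns2004": CORE, "dns2005": CORE, "dns2006": CORE}
timeline = [("start", internet_load_carriers(peering))]

# Move the hosts one by one, in the order proposed in the plan.
for host in ["dns2004", "dns2005", "dns2006"]:
    peering[host] = TOR
    timeline.append((f"after moving {host}", internet_load_carriers(peering)))

for step, carriers in timeline:
    print(f"{step}: {', '.join(carriers)}")
```

In this model the only single-host window is between moving dns2005 and dns2006, which is why the plan suggests moving dns2006 shortly after dns2005.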

Cloudweb hosts are in an interesting state: T411783 proposes moving those to the cloud racks (and so getting rid of the public IP requirement), while T392478 proposes getting rid of them entirely (and replacing them with VMs that would need public IPs).

Yes, thanks, @ayounsi. One should be able to handle the load just fine. And we can quickly depool everything between them, so the above sounds good. Thanks for checking!