
Fix static IP fallbacks to Pybal LVS routes
Closed, Resolved · Public

Description

During the cr1-eqiad JunOS upgrade, it was discovered that the static IP fallbacks to the Pybal BGP-learned LVS service IPs were stale and obsolete (e.g. they included mobile-lb, did not include a few newer ones or any of the 10/8 service IPs).

The stale routes are currently deactivated on cr1-eqiad and active (but wrong) on cr2-eqiad. The routes at the other sites (codfw/esams/ulsfo) are likely also stale.

We should immediately fix these to restore our protection from, e.g., a Pybal software bug. Moreover, we should fix a routing bug by adding those static routes as non-redistributed routes (while making sure that the supernets are redistributed, so that reachability from other sites still works).

Finally, these have become obsolete a few times, so we should be doing a better job, either process-wise, implementation-wise or both, to make sure this doesn't happen again (and that the fallback isn't broken when we need it). @BBlack was investigating whether we could use the service IP supernets instead (and last I heard, he said it's possible with minimal amounts of reshuffling). The longer-term plan for this may include Pybal modifications and/or LVS rearchitecturing, and end up in a different task altogether; let's use this for now, though.

Event Timeline


Yes - I think in eqiad we only need to reshuffle git-ssh.wikimedia.org, ocg.svc.eqiad.wmnet, and our internal recdns IPs. I think it's likely that in the other DCs the situation is at least that good if not better, and that no changes are needed to major public production IPs carrying heavy traffic. This would effectively allow us to define (going forward) that high-traffic1 == first half of public LVS subnet, high-traffic2 == second half of public LVS subnet, low-traffic == relevant 10/8 service networks.

Next step here is actually auditing this at all DCs and making a list of re-shuffling steps we need to step through before we can change the router configs.

Proposed subnet mapping:

  • eqiad
    • high-traffic1 (lvs1001 + lvs1004)
      • 208.80.154.224/28 (224-239)
      • 2620:0:861:ed1a::0:0/111 (::0:0 - ::1:ffff)
    • high-traffic2 (lvs1002 + lvs1005)
      • 208.80.154.240/28 (240-255)
      • 2620:0:861:ed1a::2:0/111 (::2:0 - ::3:ffff)
    • low-traffic (lvs1003 + lvs1006)
      • 10.2.2.0/24
  • codfw
    • high-traffic1 (lvs2001 + lvs2004)
      • 208.80.153.224/28 (224-239)
      • 2620:0:860:ed1a::0:0/111 (::0:0 - ::1:ffff)
    • high-traffic2 (lvs2002 + lvs2005)
      • 208.80.153.240/28 (240-255)
      • 2620:0:860:ed1a::2:0/111 (::2:0 - ::3:ffff)
    • low-traffic (lvs2003 + lvs2006)
      • 10.2.1.0/24
  • esams
    • high-traffic1 (lvs3001 + lvs3003)
      • 91.198.174.192/28 (192-207)
      • 2620:0:862:ed1a::0:0/111 (::0:0 - ::1:ffff)
    • high-traffic2 (lvs3002 + lvs3004)
      • 91.198.174.208/28 (208-223)
      • 2620:0:862:ed1a::2:0/111 (::2:0 - ::3:ffff)
  • ulsfo
    • high-traffic1 (lvs4001 + lvs4003)
      • 198.35.26.96/28 (96-111)
      • 2620:0:863:ed1a::0:0/111 (::0:0 - ::1:ffff)
    • high-traffic2 (lvs4002 + lvs4004)
      • 198.35.26.112/28 (112-127)
      • 2620:0:863:ed1a::2:0/111 (::2:0 - ::3:ffff)
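
As an illustration only, the mapping above can be written down as data and used to look up which traffic class a service IP belongs to. This is a minimal sketch: the `TRAFFIC_CLASSES` table and the `classify` helper are hypothetical names introduced here, not part of any existing tooling, and the addresses in the usage note are arbitrary samples from within the listed ranges.

```python
import ipaddress

# Proposed mapping from the list above: site -> traffic class -> subnets.
TRAFFIC_CLASSES = {
    "eqiad": {
        "high-traffic1": ["208.80.154.224/28", "2620:0:861:ed1a::0:0/111"],
        "high-traffic2": ["208.80.154.240/28", "2620:0:861:ed1a::2:0/111"],
        "low-traffic": ["10.2.2.0/24"],
    },
    "codfw": {
        "high-traffic1": ["208.80.153.224/28", "2620:0:860:ed1a::0:0/111"],
        "high-traffic2": ["208.80.153.240/28", "2620:0:860:ed1a::2:0/111"],
        "low-traffic": ["10.2.1.0/24"],
    },
    "esams": {
        "high-traffic1": ["91.198.174.192/28", "2620:0:862:ed1a::0:0/111"],
        "high-traffic2": ["91.198.174.208/28", "2620:0:862:ed1a::2:0/111"],
    },
    "ulsfo": {
        "high-traffic1": ["198.35.26.96/28", "2620:0:863:ed1a::0:0/111"],
        "high-traffic2": ["198.35.26.112/28", "2620:0:863:ed1a::2:0/111"],
    },
}


def classify(address):
    """Return (site, traffic_class) for a service IP, or None if unmapped."""
    ip = ipaddress.ip_address(address)
    for site, classes in TRAFFIC_CLASSES.items():
        for traffic_class, subnets in classes.items():
            for subnet in subnets:
                net = ipaddress.ip_network(subnet)
                # Only compare addresses of the same IP version.
                if ip.version == net.version and ip in net:
                    return site, traffic_class
    return None
```

For example, `classify("208.80.154.230")` falls in the eqiad high-traffic1 range and `classify("10.2.1.7")` in codfw low-traffic, while an address outside every listed subnet returns `None`.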

I could find out, but since you've already done the investigation: do we need to renumber or relocate any IPs for this scheme to work? If so, which?

Audit result: all datacenters already obey the mapping above, except for three exceptions in eqiad:

  • ocg.svc.eqiad.wmnet - currently in high-traffic2, should be in low-traffic
  • git-ssh.eqiad.wikimedia.org - currently in low-traffic, should be in high-traffic2
  • dns-rec-lb.eqiad.wikimedia.org - currently in high-traffic2, should be in high-traffic1 (or change its IP)

Change 315927 had a related patch set uploaded (by BBlack):
eqiad recdns IP fix: add new to DNS

https://gerrit.wikimedia.org/r/315927

Change 315928 had a related patch set uploaded (by BBlack):
eqiad recdns IP fix: remove old from DNS

https://gerrit.wikimedia.org/r/315928

Change 315929 had a related patch set uploaded (by BBlack):
eqiad recdns IP fix: add new address (.254)

https://gerrit.wikimedia.org/r/315929

Change 315930 had a related patch set uploaded (by BBlack):
eqiad recdns IP fix: switch in puppet

https://gerrit.wikimedia.org/r/315930

Change 315931 had a related patch set uploaded (by BBlack):
eqiad recdns IP fix: remove old from LVS

https://gerrit.wikimedia.org/r/315931

I cleaned up the old, stale backup routes and re-added backup routes for all of the subnets mentioned above across core routers in all datacenters. The backup routes always point to the first of an LVS pair (lvs1001, lvs2001, etc.).

I also double-checked the exceptions (for which I did *not* add backup routes) and can confirm the three @BBlack mentioned (ocg, git-ssh, dns-rec-lb). For this reason alone, I'm not resolving this task yet (@BBlack, if you think this work is going to be more of a long-term thing, we should probably open a different task or three tracking it).

Moreover, there are additionally these exceptions to the rule: ns0 (IPv4/IPv6), ns1 (IPv4/IPv6), ns2 (IPv6). These already have more-specific static routes on the router level which override the backup routes, but they're still worth mentioning. Perhaps we might want to fix these (by passing them through LVS) even before the anycast work.
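
To illustrate why a supernet backup route and a more-specific route can coexist, here is a minimal Python sketch of longest-prefix-match selection. It assumes that Pybal announces each service IP as a host route (/32), which is an assumption on my part; the service IP and the next-hop labels are hypothetical, and the sketch ignores route preference between protocols for equal-length prefixes.

```python
import ipaddress


def best_route(routes, destination):
    """Pick the matching route with the longest prefix, as a router would."""
    dest = ipaddress.ip_address(destination)
    matches = [
        (ipaddress.ip_network(prefix), next_hop)
        for prefix, next_hop in routes
        if dest in ipaddress.ip_network(prefix)
    ]
    if not matches:
        return None
    # The most specific (longest) prefix wins.
    return max(matches, key=lambda match: match[0].prefixlen)[1]


# Hypothetical service IP inside the eqiad high-traffic1 subnet.
SERVICE_IP = "208.80.154.230"

# Static backup route for the whole subnet, pointing at the first LVS of the pair.
static_backup = [("208.80.154.224/28", "lvs1001 (static backup)")]
# Assumed host route learned from Pybal over BGP while it is healthy.
bgp_learned = [("208.80.154.230/32", "active LVS (learned via BGP)")]

# While the BGP route exists, it is more specific and is preferred.
print(best_route(static_backup + bgp_learned, SERVICE_IP))
# If the BGP route is withdrawn, traffic falls back to the static supernet.
print(best_route(static_backup, SERVICE_IP))
```

The same rule explains the NS exceptions: their more-specific static routes win over the supernet backup routes for those addresses.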

Besides the three + NS, I can confirm there aren't currently any other exceptions. Any ideas on how to make sure that we will never deviate from that subnet allocation in the future? Would puppet comments be enough? We also need to remember to change those routes when we (finally) replace the LVS servers in eqiad.

> Moreover, there are additionally these exceptions to the rule: ns0 (IPv4/IPv6), ns1 (IPv4/IPv6), ns2 (IPv6). These already have more-specific static routes on the router level which override the backup routes, but they're still worth mentioning. Perhaps we might want to fix these (by passing them through LVS) even before the anycast work.

Yes: T101525

> Besides the three + NS, I can confirm there aren't currently any other exceptions. Any ideas on how to make sure that we will never deviate from that subnet allocation in the future? Would puppet comments be enough? We also need to remember to change those routes when we (finally) replace the LVS servers in eqiad.

Could we (after the exceptions are fixed) limit the subnets the routers will accept from PyBal's BGP advertisements, such that trying to configure a service on the "wrong" class would fail?
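
Roughly, the check I have in mind looks like the following Python sketch. It is only a model of the logic, not router configuration: the real filter would live in the routers' BGP import policy, and the `ALLOWED` table and `accept_advertisement` function are hypothetical names. The subnets are the eqiad ones from the proposed mapping.

```python
import ipaddress

# Subnets each eqiad traffic class would be allowed to announce,
# taken from the proposed mapping.
ALLOWED = {
    "high-traffic1": ["208.80.154.224/28", "2620:0:861:ed1a::0:0/111"],
    "high-traffic2": ["208.80.154.240/28", "2620:0:861:ed1a::2:0/111"],
    "low-traffic": ["10.2.2.0/24"],
}


def accept_advertisement(traffic_class, prefix):
    """Accept a prefix only if it lies inside a subnet of the peer's class."""
    announced = ipaddress.ip_network(prefix)
    for allowed in ALLOWED.get(traffic_class, []):
        net = ipaddress.ip_network(allowed)
        # subnet_of() requires both networks to be the same IP version.
        if announced.version == net.version and announced.subnet_of(net):
            return True
    return False
```

With this, a high-traffic1 LVS announcing `208.80.154.230/32` would be accepted, while a low-traffic LVS announcing the same prefix would be rejected, so a service configured in the "wrong" class would never become reachable.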

@faidon - re: eqiad recdns IPv4 - I've uploaded DNS and puppet patches to switch that IP (by turning on the new IP first in parallel with the old, then changing config everywhere). The downside of this approach is that we have to wait a while in the middle of the transition before everything is switched over (probably with salt commands to fix unpuppetized install-time config, too?), and monitor/fix any hosts that are stuck on the old recdns IP (probably including manually configured recdns in eqiad network/power/mgmt devices?)

The alternative is we leave the IP as it is and move it to a different traffic class (move it from one LVS box to another). The ugly things about that solution are:

  1. We'll probably need to hack static routes in the routers to appropriate lvs instances temporarily and do some crazy manual deploy process to make the switch of the current IP to new LVS boxes without local recdns downtime in eqiad for all the clients.
  2. We'll end up in a state where codfw+esams have their recdns in high-traffic2 and eqiad has it in high-traffic1 (not necessarily a problem, just an annoying inconsistency).

Thoughts?

Oh, one other minor thing: the eqiad recdns IPv6 address is already in the correct subnet for the LVS it's currently on (which matches the first solution, changing the IPv4 address as the patches do). If we go with leaving the IP alone and moving LVSes, we'll have to change its IPv6 address too (but that should be easy; I don't think much, if anything, is actually hitting recdns over IPv6).

The network hardware I can configure in one swoop, so don't worry about that. Not sure if the PDUs/iDRACs/iLOs have any DNS configured whatsoever, but we could find out with a tcpdump on the recursors for hits coming from the mgmt space.

Purely for consistency reasons, I'd move it to high-traffic2, but I have no strong preference, either is fine with me.

Change 315929 merged by BBlack:
eqiad recdns IP fix: add new address (.254)

https://gerrit.wikimedia.org/r/315929

Change 316920 had a related patch set uploaded (by BBlack):
LVS: move ocg to low-traffic set

https://gerrit.wikimedia.org/r/316920

Change 315930 merged by BBlack:
eqiad recdns IP fix: switch in puppet

https://gerrit.wikimedia.org/r/315930

Change 315927 merged by BBlack:
eqiad recdns IP fix: add new to DNS

https://gerrit.wikimedia.org/r/315927

Change 315928 merged by BBlack:
eqiad recdns IP fix: remove old from DNS

https://gerrit.wikimedia.org/r/315928

Change 315931 merged by BBlack:
eqiad recdns IP fix: remove old from LVS

https://gerrit.wikimedia.org/r/315931

The recdns case is fully fixed now (the old/bad IP is no longer present or functional anywhere).

Change 318081 had a related patch set uploaded (by BBlack):
revdns: document LVS traffic classes

https://gerrit.wikimedia.org/r/318081

Change 318081 merged by BBlack:
revdns: document LVS traffic classes

https://gerrit.wikimedia.org/r/318081

Change 316920 merged by BBlack:
LVS: move ocg to low-traffic set

https://gerrit.wikimedia.org/r/316920

Mentioned in SAL (#wikimedia-operations) [2016-10-26T12:05:09Z] <bblack> moving ocg LVS from high-traffic2 -> low-traffic - T143915

Mentioned in SAL (#wikimedia-operations) [2016-10-26T12:19:40Z] <bblack> moving git-ssh LVS from low-traffic -> high-traffic2 - T143915

ocg and git-ssh are fixed as well!

Change 318083 had a related patch set uploaded (by BBlack):
LVS: document subnets in balancer assignment

https://gerrit.wikimedia.org/r/318083

Change 318083 merged by BBlack:
LVS: document subnets in balancer assignment

https://gerrit.wikimedia.org/r/318083

BBlack claimed this task.