
Remove static routes for LVS VIPs from core routers
Open, LowPublic

Description

Our LVS servers announce /32 and /128 IPs for services via BGP to our core routers, using PyBal.

At this point this is a well understood and robust setup. However, our core routers (for instance cr1-eqiad in the example below) still have static routes for the aggregate blocks we use for LVS service IPs:

cmooney@re0.cr1-eqiad> show configuration routing-options static | display set | match "10.64.1.13|10.64.17.14|10.64.33.15"    
set routing-options static route 208.80.154.224/28 next-hop 10.64.1.13
set routing-options static route 208.80.154.240/28 next-hop 10.64.17.14
set routing-options static route 10.2.2.0/24 next-hop 10.64.33.15
cmooney@re0.cr1-eqiad> show configuration routing-options rib inet6.0 static | display set | match "10.64.1.13|10.64.17.14|10.64.33.15" 
set routing-options rib inet6.0 static route 2620:0:861:ed1a::0:0/111 next-hop 2620:0:861:101:10:64:1:13
set routing-options rib inet6.0 static route 2620:0:861:ed1a::2:0/111 next-hop 2620:0:861:102:10:64:17:14

To my knowledge these have existed since PyBal was initially deployed, and are intended to act as "backup" routes should BGP die on all the available LVS machines.

Since that time we have had a lot of experience with PyBal and know it to be robust and the overall LVS setup to work well. With that in mind there is an open question as to whether these static routes are needed at all.

@ayounsi and I spoke about it and are of the opinion that they can be safely removed. We see that as a good move because:

  • Their presence in the config adds unnecessary complication.
  • They have made automating the static route configuration difficult.
  • There is a risk that, as we move / migrate LVS servers, these are not updated and some unexpected edge case occurs.
  • There is no compelling story about what type of scenario these protect against; it does not seem like they have ever "saved" us in an incident.

@BBlack interested to hear your thoughts on this (or anyone else who may feel they are a good idea to keep). Thanks!

  • eqiad:
  • codfw: removed
  • esams: nothing to do
  • ulsfo: removed
  • eqsin: removed
  • drmrs: nothing to do
  • magru: nothing to do

Event Timeline

cmooney triaged this task as Low priority.
Restricted Application added a subscriber: Aklapper.

I can fill in the scenario/story part a bit! For background:

  • Technically, LVS and pybal are separate things running on the same server. LVS is the kernel mechanism for routing the actual traffic, and pybal is more like command and control software which takes config + etcd + healthchecks as inputs and then controls the runtime configuration of the LVS routing, and also does the runtime BGP adverts of the configured services (at specified MED).
  • If pybal crashes/stops/dies, BGP adverts die with it, but the LVS config remains in its last-known-good state: it will continue routing any traffic received, and we've merely lost the further ability to automatically change backend server pooling based on etcd/healthchecks.
  • Without static routes, if pybal suddenly stops/dies on a single LVS, traffic fails over to the remaining alternate LVS which is still advertising the same IPs at a less-preferred MED (the secondary LVS in current config, which is the shared backup of the other primary LVSes, since MED is set per-daemon rather than per-service).
  • So based on this, the state of affairs in which static routes help is when both of the possible pybal advertisers are down, but the static-route still points at an otherwise live server with a working (if possibly slightly-stale) LVS runtime config for routing the traffic.
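The failover order described above can be sketched as a small model (a simplification for illustration only; the function name and values are mine, not PyBal's or Junos'):

```python
# Sketch (simplified model, illustrative names) of how the router picks a
# next-hop for an LVS service IP: prefer the BGP advert with the lowest MED;
# fall back to the static route only when no pybal is advertising at all.

def select_next_hop(bgp_adverts, static_next_hop=None):
    """bgp_adverts: list of (next_hop, med) tuples from live pybal sessions."""
    if bgp_adverts:
        # Lower MED wins, i.e. the primary LVS for this traffic class.
        return min(bgp_adverts, key=lambda a: a[1])[0]
    # The backup plan of the backup plan: the static route, if configured.
    return static_next_hop

# Normal operation: primary (MED 0) preferred over the shared backup (MED 100).
assert select_next_hop([("10.64.1.13", 0), ("10.64.48.72", 100)]) == "10.64.1.13"
# Primary pybal down: traffic fails over to the backup LVS.
assert select_next_hop([("10.64.48.72", 100)]) == "10.64.48.72"
# Both pybals down: only now does the static route matter.
assert select_next_hop([], static_next_hop="10.64.1.13") == "10.64.1.13"
```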

The realistic stories leading to that state (where the static route usefully saves us an outage) would be things like:

  • If there's an error (human or otherwise, or I suppose bad timing) in the LVS service deployment steps, which involve stopping and starting both of the pybals for a given service. Usually the process is roughly "disable puppet on both LVSes; merge change; puppet backup LVS; restart pybal on backup LVS (flaps backup BGP route); puppet primary LVS; restart pybal on primary LVS (flaps primary BGP route, temporarily blipping traffic over to secondary)". This whole sequence critically relies on both of the pybal restarts going smoothly, and on us noticing any failure and backing out carefully. If the first pybal restart fails completely and nobody notices, or if it takes an inordinate amount of time to re-establish BGP, and especially if the second pybal restart doesn't go smoothly afterwards, we could end up in a scenario where there's no live BGP advert and the static route keeps the traffic alive until everyone's noticed the various alerts and sorted things out or rolled back the change, etc.
  • During normal runtime (no service changes being deployed) - we could face multiple pybals "organically" failing completely at roughly the same time, and thus losing all adverts to the otherwise-working LVS routing. Possibly imaginary scenarios would include things like:
    • Some kind of Y2K-like scenario where pybal or some dependency falls over dead on multiple servers as the clock rolls to some special value (seems unlikely!)
    • A bug in pybal's etcd integration, as etcd itself is a shared dependency? We could imagine a software change (etcd version update), a state-change (some confctl command), or some un-predicted meta-state-change (in the state of affairs of the etcd database/API) that triggers a bug in both pybals' etcd code, causing them to crash and then fail to restart successfully (due to the same bug). This one seems a little more likely than the last.
    • [probably some other similar scenarios if you think about it hard enough, but they'd all come down to a pybal bug/crash induced by an event/input shared by both LVSes]
  • During a DoS-like event with semi-overwhelmed LVSes, it's possible that the pybal BGP adverts get torn down (because it's falling behind on healthchecks due to CPU/network issues?) at a lower threshold than the LVS traffic routing itself dies. In these scenarios the site's traffic is semi-broken anyways, but it's entirely possible the static routes save some windows of some of the traffic from total failure. Also sometimes in these scenarios we have routes flapping back and forth between primary/secondary LVSes as various things temporarily-fail, and static routes may help cover some cracks in the timing of BGP adverts flapping, and might save some of the legit traffic that would otherwise have no route.
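The "bad timing" variant of the deployment story can be made concrete with a toy model (entirely illustrative; intervals and the function are mine): the dangerous case is when the two pybal restart windows overlap, leaving a period with no BGP advert at all.

```python
# Sketch (illustrative model) of the overlapping-restart failure mode: if the
# two pybal restarts overlap, there is a window with no BGP advert at all,
# and only the static route (if present and correct) keeps traffic flowing.

def no_advert_window(backup_down, primary_down):
    """Each argument is a (start, end) interval in seconds during which that
    pybal's BGP session is down. Returns the overlap during which *neither*
    LVS is advertising, or None if the restarts were safely staggered."""
    start = max(backup_down[0], primary_down[0])
    end = min(backup_down[1], primary_down[1])
    return (start, end) if start < end else None

# Safely staggered: backup pybal is back up before the primary restarts.
assert no_advert_window((0, 30), (60, 90)) is None
# Restarts only a few seconds apart: a 25-second window with no BGP advert,
# during which only the static route covers the traffic.
assert no_advert_window((0, 30), (5, 40)) == (5, 30)
```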

I don't have any specific memory of a past incident in which the static route saved us (but my memory is often faulty!). I wouldn't be surprised if it was a temporarily-helpful factor in some (but probably didn't completely prevent the incident). If it's been covering for us in some mis-timed/managed LVS service deploys, we probably wouldn't be aware that we had relied on it, assuming the normal BGP adverts eventually recovered on their own in reasonable time.

Weighing this against the costs of maintaining them properly, that's the big question here.

I can fill in the scenario/story part a bit! For background:

  • Without static routes, if pybal suddenly stops/dies on a single LVS, traffic fails over to the remaining alternate LVS which is still advertising the same IPs at a less-preferred MED (the secondary LVS in current config, which is the shared backup of the other primary LVSes, since MED is set per-daemon rather than per-service).

Just being pedantic here, the above is true with or without the static routes. The static routes are the backup plan of the backup plan as Brandon very nicely says below.

  • So based on this, the state of affairs in which static routes help is when both of the possible pybal advertisers are down, but the static-route still points at an otherwise live server with a working (if possibly slightly-stale) LVS runtime config for routing the traffic.

I don't have any specific memory of a past incident in which the static route saved us (but my memory is often faulty!). I wouldn't be surprised if it was a temporarily-helpful factor in some (but probably didn't completely prevent the incident). If it's been covering for us in some mis-timed/managed LVS service deploys, we probably wouldn't be aware that we had relied on it, assuming the normal BGP adverts eventually recovered on their own in reasonable time.

I have a memory of such an incident, probably because I was the cause of it. Years ago (don't judge me, I was ignorant and naive), I restarted both pybals, on the primary and secondary LVSes, with a rather short time window (a few secs) between the 2 restarts. BGP on the routers went temporarily into a somewhat hairy state while reconverging. We did have elevated errors for a short amount of time, but the static routes also appeared to have saved us from "some" (don't ask me to quantify this, I am unable to) of the aftermath. This falls under the "there's an error (human or otherwise, or I suppose bad timing)" story above, specifically a version of the *bad timing* part.

For the human-generated part, that seems easy to prevent by automating the process via a cookbook that can have all the checks and fail-safes needed. My 2 cents.

For the human-generated part, that seems easy to prevent by automating the process via a cookbook that can have all the checks and fail-safes needed. My 2 cents.

Absolutely! It's actually a pretty good candidate for a cookbook. LVS restarts are a daunting task for new hires; they do them living in fear that they will break everything. Providing a less error-prone process would go a long way toward alleviating that issue.

Re-reading my reply, I realized I may appear pro having those static routes (I am actually not) whereas my intent was to just provide a data point.

As Service Operations SREs, we do our share of pybal restarts whenever we introduce a new service to the infrastructure (or remove one), which is a few (1-3?) times per quarter. So, as far as day-to-day goes, to use the terminology of the Chicken and the Pig fable[1], we are involved but not committed, and thus wouldn't want to impose designs or solutions on anyone. What we care about at the end of the day is that incoming traffic flows adequately to the various clusters.

As far as the static routes themselves go, aside from the incident I outlined above, which again, I can't quantify, I don't have any other recollection where they were actively helpful in the last 8 years.

[1] https://en.wikipedia.org/wiki/The_Chicken_and_the_Pig

Thanks for the feedback!

Weighing this against the costs of maintaining them properly, that's the big question here.

Indeed :)

I opened T334166: Abstract LVS restart using cookbook based on the comments above.

If I understand correctly, LVS re-images are coming up (cc @ssingh). While they're not changing IPs, it's probably a good opportunity to audit the static routes and maybe improve them if we're not getting rid of them.

Picking codfw randomly:

ipv6
static {
    /* high-traffic1 - backup route */
    route 2620:0:860:ed1a::0:0/111 {
        next-hop 2620:0:860:101:10:192:1:7;
        readvertise;
        no-resolve;
    }
    /* high-traffic2 - backup route */
    route 2620:0:860:ed1a::2:0/111 {
        next-hop 2620:0:860:102:10:192:49:7;
        readvertise;
        no-resolve;
    }
}

Here 2620:0:860:102:10:192:49:7 isn't a valid next-hop; lvs2010 is 2620:0:860:104:10:192:49:7.

We should also merge those into a single /110 pointing to the backup LVS.

ipv4
/* high-traffic1 - backup route */
route 208.80.153.224/28 {
    next-hop 10.192.1.7;
    readvertise;
    no-resolve;                     
}
/* high-traffic2 - backup route */
route 208.80.153.240/28 {
    next-hop 10.192.49.7;
    readvertise;
    no-resolve;
}
/* low-traffic - backup route */
route 10.2.1.0/24 {
    next-hop 10.192.33.7;
    readvertise;
    no-resolve;
}

Similarly here, merging the two /28s into a /27 and having them all (including low-traffic) point to the backup LVS.
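A quick way to sanity-check these aggregations is the stdlib `ipaddress` module: the adjacent prefixes really do collapse as suggested, the two /28s into a /27 and the two /111s into a /110.

```python
# Verify the suggested aggregations of the codfw backup routes using the
# standard library: adjacent prefixes collapse into the next shorter mask.
import ipaddress

v4 = [ipaddress.ip_network("208.80.153.224/28"),
      ipaddress.ip_network("208.80.153.240/28")]
merged_v4 = list(ipaddress.collapse_addresses(v4))
assert merged_v4 == [ipaddress.ip_network("208.80.153.224/27")]

v6 = [ipaddress.ip_network("2620:0:860:ed1a::0:0/111"),
      ipaddress.ip_network("2620:0:860:ed1a::2:0/111")]
merged_v6 = list(ipaddress.collapse_addresses(v6))
assert merged_v6 == [ipaddress.ip_network("2620:0:860:ed1a::/110")]
```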

If we keep them it would be useful here to find how to cleanly implement them in Homer (they're still manual so far).
The prefixes are in Netbox, but the LVS primary/secondary/backup roles are (afaik) only in Puppet. Re-declaring them in Homer's YAML file might be a (less preferred) option too.

That codfw error is interesting actually; it makes me wonder why we have the "no-resolve" option on those routes.

Without it, the error would cause the route to be considered invalid due to the next-hop being unreachable at layer 2 (no ND). But with that option, the route goes into the routing table anyway.

In practical terms it probably makes no difference (whether the route is not in the table, or it is but the next-hop doesn't exist, traffic won't work if there is no BGP route). But I can't imagine a reason we need that option here.

If we keep them it would be useful here to find how to cleanly implement them in Homer (they're still manual so far).

This may not be a bad idea. Obviously statics are not really something we want to have, but I think inevitably you sometimes need them, so working on some generic structure to automate statics might be a good idea.

Hi folks: Picking this ticket up as part of Traffic cleaning up our stuff.

It seems the current static routes, even if they still matter, are incorrect, so I am wondering whether we should just remove them, or figure out a better way to keep them updated. I am all ears, but it's been a while since we discussed it (two years?) so I am just checking if things have changed.

cr1-eqiad:

/* high-traffic1 - backup route */
route 208.80.154.224/28 {
    next-hop 10.64.0.80;
    readvertise;
    no-resolve;
}
/* high-traffic2 - backup route */
route 208.80.154.240/28 {
    next-hop 10.64.16.60;
    readvertise;
    no-resolve;
}
sukhe@cumin1003:~$ dig -x 10.64.0.80 +short
aux-k8s-worker1008.eqiad.wmnet.
sukhe@cumin1003:~$ dig -x 10.64.16.60 +short
lvs1018.eqiad.wmnet.

So the high-traffic1 backup route is not correct. Furthermore,

sukhe@re0.cr1-eqiad> show route 208.80.154.224 

inet.0: 990543 destinations, 4261213 routes (989819 active, 0 holddown, 1765 hidden)
Restart Complete
+ = Active Route, - = Last Active, * = Both

208.80.154.224/32  *[BGP/170] 10w1d 18:41:09, MED 0, localpref 100
                      AS path: 64600 I, validation-state: unverified
                    >  to 10.64.0.136 via ae1.1017
                    [BGP/170] 10w1d 18:41:08, MED 0, localpref 100, from 208.80.154.197
                      AS path: 64600 I, validation-state: unverified
                    >  to 208.80.154.194 via ae0.0
                       to 185.15.59.145 via xe-3/0/7.13
                    [BGP/170] 7w0d 22:17:51, MED 100, localpref 70
                      AS path: 64600 I, validation-state: unverified
                    >  to 10.64.48.72 via ae4.1020

mgmt_junos.inet.0: 3 destinations, 3 routes (3 active, 0 holddown, 0 hidden)
+ = Active Route, - = Last Active, * = Both

0.0.0.0/0          *[Static/5] 15w1d 15:00:02
                    >  to 10.65.0.1 via fxp0.0

{master}

I am not even sure where the static routes fit in here, but I leave that to you; otherwise I will spend a long time trying to fight Junos and Homer :)

Similarly for codfw,

/* high-traffic1 - backup route */
route 208.80.153.224/28 {
    next-hop 10.192.0.29;
    readvertise;
    no-resolve;
}
/* high-traffic2 - backup route */
route 208.80.153.240/28 {
    next-hop 10.192.16.140;
    readvertise;
    no-resolve;
}

Shouldn't these be pointing to the primary IPs? Thanks, and this is not urgent.

It seems the current static routes, even if they still matter, are incorrect, so I am wondering whether we should just remove them

Heh, good catch! To me that is just one more example of why we should not use them.

The only scenario in which they would kick in is if both the primary and backup LB are down (or basically if none of the /32s are announced in BGP at all). You can probably come up with some imaginative scenario in which that would happen while the load balancers were otherwise working fine, but it seems vanishingly unlikely.

I am not even sure where static routes fit in here but I leave that to you otherwise I will spend a long time trying to fight Junos and homer :)

It will show you only if you look for the /28 itself and use the 'exact' keyword:

cmooney@re0.cr1-eqiad> show route table inet.0 exact 208.80.154.224/28 active-path 

inet.0: 990546 destinations, 4261483 routes (989819 active, 2 holddown, 1903 hidden)
Restart Complete
+ = Active Route, - = Last Active, * = Both

208.80.154.224/28  *[Static/5] 40w6d 21:19:41
                    >  to 10.64.0.80 via ae1.1017
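The reason the static /28 is hidden behind the BGP routes in a normal lookup is plain longest-prefix match, which can be sketched as follows (a simplified model, not router code; the labels are illustrative):

```python
# Sketch (simplified model) of why the static /28 never shows up in a normal
# lookup: longest-prefix match means any BGP /32 for the VIP shadows the
# static /28, which only carries traffic once every /32 is withdrawn.
import ipaddress

def lookup(dest, routes):
    """routes: {network_str: label}. Returns the label of the most specific
    route covering dest (classic longest-prefix match), or None."""
    dest = ipaddress.ip_address(dest)
    matches = [(ipaddress.ip_network(n), label) for n, label in routes.items()
               if dest in ipaddress.ip_network(n)]
    return max(matches, key=lambda m: m[0].prefixlen)[1] if matches else None

table = {
    "208.80.154.224/32": "BGP via lvs primary",
    "208.80.154.224/28": "Static via 10.64.0.80",
}
# The /32 BGP advert always wins while any pybal is advertising.
assert lookup("208.80.154.224", table) == "BGP via lvs primary"
# Only with all /32 adverts withdrawn does the static /28 match.
assert lookup("208.80.154.224", {"208.80.154.224/28": "Static via 10.64.0.80"}) \
    == "Static via 10.64.0.80"
```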

Shouldn't these be pointing to the primary IPs? Thanks, and this is not urgent.

Yes. But as per my comment above, the scenario in which they'd get used is not clear, so I agree it's not urgent. We could update them, it is not much work, but my vote is to remove them instead.

Hi Netops folks: Thanks for your feedback.

Following up again after discussing this with Traffic. We decided that we will do away with the static routes in the edge sites (not all have them, but some do), but keep the static routes updated and correct in the core sites. Once eqiad/codfw transition to Liberica, we will do away with the static routes there as well. That of course will happen when T352956 is resolved.

The reason for keeping the static routes in the core sites (and updated) is PyBal: we believe there can be a situation in which it stops working/dies and takes the BGP advertisement with it, and the idea is that static routes should help save us in that situation. I am not aware of this ever having happened, but we are going with the possibility that it may, so please correct us if we are wrong in our understanding that the static routes will indeed save us there. We believe this to be less of a risk with Liberica, though, since it not only gets respawned by systemd if it dies, but gobgpd also does not withdraw the BGP advertisements.

I am happy taking care of the manual updates to the core sites CRs but let me know if there is any feedback/concerns from your end about this plan.

the idea is that static routes should help save us in that situation

That would only be the case for lvs1016, 1018, 1019 and 1020, as they're still in the "legacy" design. For codfw and the other eqiad LVSes, more than static routes on the core routers is needed: static routes on the switch fabrics too, because the LVSes are no longer in a VLAN directly connected to the core routers.
For anything else, it's doable but a bit more work is needed.


Thanks. I will do the eqiad LVS then and we can leave codfw as it is. I will also work on removing the statics from the edges; let me know if you want to take that on or if I should take care of it. Thanks!

Mentioned in SAL (#wikimedia-operations) [2025-09-02T14:09:17Z] <XioNoX> eqsin: remove lvs static routes - T300877

Mentioned in SAL (#wikimedia-operations) [2025-09-02T14:15:24Z] <XioNoX> ulsfo: remove lvs static routes - T300877

Mentioned in SAL (#wikimedia-operations) [2025-09-02T14:26:26Z] <XioNoX> codfw: remove lvs static routes - T300877

Thanks for taking care of this @ayounsi! We will update this task when we are ready to remove the eqiad ones.