15:04:29 <bblack> so, we'd have to configure such a service in pybal itself (with no real backends to monitor or use, which would probably break some stuff), or just pick one of the existing service IPs that we think is long-term stable, one of the two.
15:04:43 <bblack> like appservers.svc
15:05:18 <bblack> basically template the local appservers.svc IP into the "instrumentation_ips", and have any cluster that needs this monitoring data use http://appservers.svc.${dc}.wmnet:9090/...
15:07:09 <bblack> (LVS only lvs-routes the specific ports we actually use on the appservers IP. So the local :9090 listener should work fine from lvs itself)
15:08:29 <bblack> either that, or patch pybal code to configure+advertise a special IP for monitoring that's in the right subnet, but not an existing service IP
15:08:34 <bblack> that's cleaner, but a lot more work
Related Objects
- Mentioned Here
- P46678 restart stacktraces
Event Timeline
For future reference, this left 89 out of 280 appservers and 9 out of 20 parsoid servers depooled in codfw until this EU morning, when we started getting alerts for overloaded parsoid servers.
I've prepared this CR to have an alert when we have less than 80% of appservers in a site pooled for more than 1h.
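As a rough illustration of that threshold (not the actual alert rule in the CR, which isn't reproduced here), the pooled fraction for the numbers above can be checked with a few lines of Python. The function names and the 0.8 default are assumptions made for this sketch, and the "for more than 1h" duration is not modeled.

```python
def pooled_fraction(total: int, depooled: int) -> float:
    """Return the fraction of servers that are still pooled."""
    if total <= 0:
        raise ValueError("total must be positive")
    return (total - depooled) / total


def below_threshold(total: int, depooled: int, threshold: float = 0.8) -> bool:
    """True when the pooled fraction is under the alerting threshold."""
    return pooled_fraction(total, depooled) < threshold


# Numbers reported for codfw during the incident.
print(round(pooled_fraction(280, 89), 3))  # appservers
print(round(pooled_fraction(20, 9), 3))    # parsoid
print(below_threshold(280, 89))
```

With the reported figures, roughly 68% of appservers and 55% of parsoid servers were still pooled, so both groups would have been under an 80% threshold.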
Mentioned in SAL (#wikimedia-operations) [2023-04-17T14:40:05Z] <ladsgroup@deploy2002> Locking from deployment [ALL REPOSITORIES]: LVS Maint - Outage (T334703)
So, the solution quoted from my IRC chat above: that's about making the depool verification code actually track the currently-live "low-traffic" (applayer/internal) LVS routing, as opposed to what it's doing now (which I think checks the primary+secondary for the role as-configured in puppet, which doesn't account for any failure/depool/etc at the LVS layer).
There should probably be another fix as well, somewhere in the deployment tooling, so that it doesn't carry on with the carnage if it's unable to make contact with LVS for verification. Maybe also an option to ignore LVS state and operate blindly, which could be used during an emergency deploy while LVS is misbehaving.
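A minimal sketch of that "fail closed, with an explicit override" behaviour might look like the following. This is not scap's or safe-service-restart's actual code: `VerifierUnreachable`, `should_proceed`, `fetch_lvs_state` and the `ignore_lvs` flag are all hypothetical names used only to illustrate the idea.

```python
class VerifierUnreachable(Exception):
    """Raised when the LVS state cannot be fetched at all."""


def should_proceed(fetch_lvs_state, ignore_lvs: bool = False) -> bool:
    """Decide whether the next restart batch may continue.

    fetch_lvs_state is a hypothetical callable that returns True when LVS
    confirms the expected pooled state, and raises VerifierUnreachable when
    LVS cannot be contacted.
    """
    if ignore_lvs:
        # Emergency mode: operate blindly and skip verification entirely.
        return True
    try:
        return bool(fetch_lvs_state())
    except VerifierUnreachable:
        # Fail closed: no answer from LVS means stop, not carry on.
        return False
```

The point of the design is that an unreachable verifier is treated as a hard stop by default, and proceeding without verification requires a deliberate opt-in.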
Mentioned in SAL (#wikimedia-operations) [2023-04-17T14:53:45Z] <ladsgroup@deploy2002> Unlocked for deployment [ALL REPOSITORIES]: LVS Maint - Outage (T334703) (duration: 13m 39s)
Probably needs subtasks for two things:
- Fix "safe-service-restart.py" being unsafe (either it or its caller is failing to propagate an error upstream to stop the carnage, and it is also leaving a node depooled when the error happens between the depool and repool operations. At least one of those needs fixing, if not both).
- The whole 'template the local appservers.svc IP into the "instrumentation_ips"' thing at the pybal level, plus whatever changes are needed to use it from the scap side of things (so that it only checks one local pybal, and it's the correct one by current pooling).
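For the first subtask, the two failure modes (node left depooled, error not propagated) could be addressed along these lines. This is only a sketch under stated assumptions, not the real safe-service-restart.py: `depool`, `restart` and `repool` are hypothetical callables standing in for the real operations.

```python
import sys


def safe_restart(depool, restart, repool) -> None:
    """Depool, restart, and always attempt to repool.

    Any exception raised by restart() still propagates to the caller
    after the repool attempt has run.
    """
    depool()
    try:
        restart()
    finally:
        # Runs on success and on failure, so the node is not left depooled.
        repool()


def main(depool, restart, repool) -> int:
    """Return a process exit code: 0 on success, 1 on any failure."""
    try:
        safe_restart(depool, restart, repool)
    except Exception as exc:
        print(f"restart failed: {exc}", file=sys.stderr)
        return 1  # a non-zero exit lets the caller stop the rollout
    return 0
```

Whether a node whose restart failed should be repooled at all is a separate judgment call; the sketch only shows that the repool step and the error reporting do not have to be mutually exclusive.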
I'm frankly not sure how checking appservers.svc.eqiad.wmnet:9090 from an appserver would work - that IP resolves locally to the loopback interface on any appserver. We'd need to pick another internal IP.
Also, pybal's HTTP monitoring is, for security reasons, only bound to local IPs; we should evaluate whether that could be a risk on the backup LVS.
Change 923389 had a related patch set uploaded (by BBlack; author: BBlack):
[operations/debs/pybal@1.15-stretch] pybal: add support for advertised instrumentation
Change 923389 merged by Ssingh:
[operations/debs/pybal@1.15-stretch] pybal: add support for advertised instrumentation
Change 923399 had a related patch set uploaded (by Ssingh; author: Ssingh):
[operations/debs/pybal@1.15-stretch] Release 1.15.12
Change 923404 had a related patch set uploaded (by BBlack; author: BBlack):
[operations/debs/pybal@1.15-stretch] pybal: quick bugfix for advertised instrumentation
Change 923404 merged by Ssingh:
[operations/debs/pybal@1.15-stretch] pybal: quick bugfix for advertised instrumentation
Change 923405 had a related patch set uploaded (by Ssingh; author: Ssingh):
[operations/debs/pybal@1.15-stretch] Release 1.15.13
Change 923414 had a related patch set uploaded (by BBlack; author: BBlack):
[operations/dns@master] Add pybal-low-traffic.svc.codfw.wmnet
Change 923414 merged by BBlack:
[operations/dns@master] Add pybal-low-traffic.svc.codfw.wmnet
Change 923598 had a related patch set uploaded (by BBlack; author: BBlack):
[operations/dns@master] Add pybal-low-traffic.svc.eqiad.wmnet
Change 923598 merged by BBlack:
[operations/dns@master] Add pybal-low-traffic.svc.eqiad.wmnet
As you seem to be working on this, I'm bluntly assigning it to you as part of the incident follow-up.
Seems fine for now, thanks!
This is a good point. We were trying to avoid picking another IP because it involves either a pybal code change or some kind of dummy-service. We went with the pybal code change path (already merged and packaged, but not yet globally deployed + configured), and I've now defined floating instrumentation IPs for all the traffic classes at all the sites, which you can see in netbox at https://netbox.wikimedia.org/ipam/ip-addresses/?q=Failover
Also, pybal's HTTP monitoring is, for security reasons, only bound to local IPs; we should evaluate whether that could be a risk on the backup LVS.
Our border-in policies on the hardware routers block port 9090 to the LVS service ranges, so we should be ok there, even for the public-subnet LVS classes (cf https://gerrit.wikimedia.org/r/plugins/gitiles/operations/homer/public/+/refs/heads/master/policies/cr-border-in.pol#114 ; PyBal is defined elsewhere as 9090/tcp).
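For illustration, a client could derive the per-site instrumentation endpoint from the pybal-low-traffic.svc DNS names added above and the 9090/tcp port. The helper name below is hypothetical, and no request path is included because the exact instrumentation paths are not spelled out in this task.

```python
def instrumentation_base_url(dc: str, port: int = 9090) -> str:
    """Build the base URL for the low-traffic pybal instrumentation VIP.

    dc is a site name such as "codfw" or "eqiad".
    """
    if not dc.isalnum():
        raise ValueError(f"unexpected site name: {dc!r}")
    return f"http://pybal-low-traffic.svc.{dc}.wmnet:{port}"


print(instrumentation_base_url("codfw"))
```

Because the name resolves to a floating IP advertised by whichever LVS is currently live for that traffic class, the client only needs to query this one endpoint rather than the primary and secondary hosts configured in puppet.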
Change 924593 had a related patch set uploaded (by BBlack; author: BBlack):
[operations/puppet@production] [WIP] pybal: configure failover i13n IPs
Change 924596 had a related patch set uploaded (by BBlack; author: BBlack):
[operations/puppet@production] [WIP] safe-service-restart: use failover i13n
We've got a pair of patches to review now which configure this on the pybal and safe-service-restart sides. We could especially use serviceops input on the latter. None of it's particularly pretty, but at least it's fairly succinct and seems to do the job!
pybal part:
https://gerrit.wikimedia.org/r/c/operations/puppet/+/924593/9
safe-service-restart part:
https://gerrit.wikimedia.org/r/c/operations/puppet/+/924596/3
Change 924974 had a related patch set uploaded (by BBlack; author: BBlack):
[operations/puppet@production] safe-service-restart: pre-verify the verifier
Mentioned in SAL (#wikimedia-operations) [2023-06-01T15:15:33Z] <bblack> lvs400[89]: upgrade pybal to 1.15.13 - T334703
Mentioned in SAL (#wikimedia-operations) [2023-06-01T16:32:29Z] <bblack> lvs400[89]: upgrade pybal to 1.15.13 - T334703 (round 2!)
Mentioned in SAL (#wikimedia-operations) [2023-06-01T16:35:17Z] <bblack> lvs5* (eqsin): upgrade pybal to 1.15.13 - T334703
Mentioned in SAL (#wikimedia-operations) [2023-06-01T16:42:30Z] <bblack> lvs2* (codfw): upgrade pybal to 1.15.13 - T334703
Mentioned in SAL (#wikimedia-operations) [2023-06-01T18:33:29Z] <bblack> lvs3* (esams): upgrade pybal to 1.15.13 - T334703
Mentioned in SAL (#wikimedia-operations) [2023-06-01T18:45:27Z] <bblack> lvs6* (drmrs): upgrade pybal to 1.15.13 - T334703
Mentioned in SAL (#wikimedia-operations) [2023-06-01T19:09:11Z] <bblack> lvs1* (eqiad): upgrade pybal to 1.15.13 - T334703
Mentioned in SAL (#wikimedia-traffic) [2023-06-05T12:15:36Z] <bblack> lvs*: disabling puppet to roll out new LVS IPs in https://gerrit.wikimedia.org/r/c/operations/puppet/+/924593 - T334703
Mentioned in SAL (#wikimedia-operations) [2023-06-05T12:15:55Z] <bblack> lvs*: disabling puppet to roll out new LVS IPs in https://gerrit.wikimedia.org/r/c/operations/puppet/+/924593 - T334703
Change 924593 merged by BBlack:
[operations/puppet@production] pybal: configure failover i13n IPs
Change 927168 had a related patch set uploaded (by BBlack; author: BBlack):
[operations/puppet@production] pybal: configure advertised_instrumentation_ips
Change 927168 merged by BBlack:
[operations/puppet@production] pybal: configure advertised_instrumentation_ips
Mentioned in SAL (#wikimedia-operations) [2023-06-05T13:09:45Z] <bblack> lvs4* (ulsfo) - restart pybal for T334703 IPs
Mentioned in SAL (#wikimedia-operations) [2023-06-05T13:15:30Z] <bblack> lvs6* (drmrs) - restart pybal for T334703 IPs
Mentioned in SAL (#wikimedia-operations) [2023-06-05T13:19:21Z] <bblack> lvs5* (eqsin) - restart pybal for T334703 IPs
Mentioned in SAL (#wikimedia-operations) [2023-06-05T13:25:00Z] <bblack> lvs3* (esams) - restart pybal for T334703 IPs
Mentioned in SAL (#wikimedia-operations) [2023-06-05T13:29:56Z] <bblack> lvs2* (codfw) - restart pybal for T334703 IPs
Mentioned in SAL (#wikimedia-operations) [2023-06-05T13:32:31Z] <bblack> lvs1* (eqiad) - restart pybal for T334703 IPs
Change 927200 had a related patch set uploaded (by BBlack; author: BBlack):
[operations/puppet@production] wikidata maxlag maint script: use new pybal VIPs
Change 927200 merged by BBlack:
[operations/puppet@production] wikidata maxlag maint script: use new pybal VIPs
Change 924596 merged by BBlack:
[operations/puppet@production] safe-service-restart: use failover i13n