Pybal maintenances break safe-service-restart.py (and thus prevent scap deploys of mediawiki)
Closed, Resolved (Public)

Description

18:31:24 /usr/bin/sudo -u root -- /usr/local/sbin/restart-php-fpm-all php7.4-fpm 9223372036854775807 (ran as mwdeploy@mw2364.codfw.wmnet) returned [2]: 2023-04-13 18:30:49,384 [INFO] Depooling currently pooled services
2023-04-13 18:30:54,469 [WARNING] Issues connecting to lvs2009:9090: HTTPConnectionPool(host='lvs2009', port=9090): Max retries exceeded with url: /pools/api-https_443/mw2364.codfw.wmnet (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f77702a4630>: Failed to establish a new connection: [Errno 111] Connection refused'))
2023-04-13 18:30:59,477 [WARNING] Issues connecting to lvs2009:9090: HTTPConnectionPool(host='lvs2009', port=9090): Max retries exceeded with url: /pools/api-https_443/mw2364.codfw.wmnet (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f77702401d0>: Failed to establish a new connection: [Errno 111] Connection refused'))
2023-04-13 18:31:02,550 [WARNING] Issues connecting to lvs2009:9090: HTTPConnectionPool(host='lvs2009', port=9090): Max retries exceeded with url: /pools/api-https_443/mw2364.codfw.wmnet (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f77702408d0>: Failed to establish a new connection: [Errno 111] Connection refused'))
2023-04-13 18:31:05,622 [WARNING] Issues connecting to lvs2009:9090: HTTPConnectionPool(host='lvs2009', port=9090): Max retries exceeded with url: /pools/api-https_443/mw2364.codfw.wmnet (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f7770240fd0>: Failed to establish a new connection: [Errno 111] Connection refused'))
2023-04-13 18:31:08,694 [WARNING] Issues connecting to lvs2009:9090: HTTPConnectionPool(host='lvs2009', port=9090): Max retries exceeded with url: /pools/api-https_443/mw2364.codfw.wmnet (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f77702402b0>: Failed to establish a new connection: [Errno 111] Connection refused'))
2023-04-13 18:31:11,766 [WARNING] Issues connecting to lvs2009:9090: HTTPConnectionPool(host='lvs2009', port=9090): Max retries exceeded with url: /pools/api-https_443/mw2364.codfw.wmnet (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f7770107828>: Failed to establish a new connection: [Errno 111] Connection refused'))
2023-04-13 18:31:14,838 [WARNING] Issues connecting to lvs2009:9090: HTTPConnectionPool(host='lvs2009', port=9090): Max retries exceeded with url: /pools/api-https_443/mw2364.codfw.wmnet (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f7770107320>: Failed to establish a new connection: [Errno 111] Connection refused'))
2023-04-13 18:31:17,910 [WARNING] Issues connecting to lvs2009:9090: HTTPConnectionPool(host='lvs2009', port=9090): Max retries exceeded with url: /pools/api-https_443/mw2364.codfw.wmnet (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f7770249710>: Failed to establish a new connection: [Errno 111] Connection refused'))
2023-04-13 18:31:20,982 [WARNING] Issues connecting to lvs2009:9090: HTTPConnectionPool(host='lvs2009', port=9090): Max retries exceeded with url: /pools/api-https_443/mw2364.codfw.wmnet (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f7770249e10>: Failed to establish a new connection: [Errno 111] Connection refused'))
2023-04-13 18:31:24,054 [WARNING] Issues connecting to lvs2009:9090: HTTPConnectionPool(host='lvs2009', port=9090): Max retries exceeded with url: /pools/api-https_443/mw2364.codfw.wmnet (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f7770244550>: Failed to establish a new connection: [Errno 111] Connection refused'))
2023-04-13 18:31:24,054 [ERROR] Error depooling the servers: Never successfully retrieved http://lvs2009:9090/pools/api-https_443/mw2364.codfw.wmnet
2023-04-13 18:31:24,054 [ERROR] Error running command with poolcounter: ('Failed executing ServiceRunner.run, return code %d', 127)
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/poolcounter/client.py", line 377, in run
    callback(*args)
  File "/usr/local/bin/safe-service-restart", line 151, in run_and_raise
    raise RuntimeError("Failed executing ServiceRunner.run, return code %d", rc)
RuntimeError: ('Failed executing ServiceRunner.run, return code %d', 127)
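
A side note on the last log lines: the error message shows up as a tuple with a literal `%d` because the format string and the return code were handed to `RuntimeError` as two separate arguments, so no interpolation happened. A minimal Python sketch of the difference (the variable names are illustrative and not taken from safe-service-restart):

```python
rc = 127

# Two positional arguments: both are stored in e.args as a tuple and the
# "%d" placeholder is never filled in.
unformatted = RuntimeError("Failed executing ServiceRunner.run, return code %d", rc)

# Interpolating before constructing the exception gives the intended message.
formatted = RuntimeError("Failed executing ServiceRunner.run, return code %d" % rc)

print(str(unformatted))
print(str(formatted))
```

This is cosmetic and unrelated to the depool failure itself, but it would make the log easier to read.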

15:04:29	<bblack>	so, we'd have to configure such a service in pybal itself (with no real backends to monitor or use, which would probably break some stuff), or just pick one of the existing service IPs that we think is long-term stable, one of the two.
15:04:43	<bblack>	like appservers.svc
15:05:18	<bblack>	basically template the local appservers.svc IP into the "instrumentation_ips", and have any cluster that needs this monitoring data use http://appservers.svc.${dc}.wmnet:9090/...
15:07:09	<bblack>	(LVS only lvs-routes the specific ports we actually use on the appservers IP.  So the local :9090 listener should work fine from lvs itself)
15:08:29	<bblack>	either that, or patch pybal code to configure+advertise a special IP for monitoring that's in the right subnet, but not an existing service IP
15:08:34	<bblack>	that's cleaner, but a lot more work

Event Timeline

For future reference, this left 89 out of 280 appservers and 9 out of 20 parsoid servers depooled in codfw until this morning (EU time), when we started getting alerts for overloaded parsoid servers.
I've prepared this CR to have an alert when fewer than 80% of the appservers in a site are pooled for more than 1h.
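
To make the threshold concrete, here is a rough Python sketch of the arithmetic behind such an alert. It is not the actual alert rule from the CR, and the function and parameter names are made up for illustration:

```python
def pooled_fraction(total, depooled):
    """Fraction of servers still pooled."""
    return (total - depooled) / total


def should_alert(total, depooled, seconds_below, threshold=0.80, min_duration=3600):
    """True when the pooled fraction is under the threshold and has been
    for longer than min_duration seconds (1h by default)."""
    return pooled_fraction(total, depooled) < threshold and seconds_below > min_duration


# Numbers from this incident: 89 of 280 appservers depooled in codfw.
appserver_fraction = pooled_fraction(280, 89)
alert = should_alert(280, 89, seconds_below=2 * 3600)
print(round(appserver_fraction, 3), alert)
```

With the incident's numbers roughly 68% of the codfw appservers were pooled, so a rule like this would have fired once the hour had elapsed.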

Mentioned in SAL (#wikimedia-operations) [2023-04-17T14:40:05Z] <ladsgroup@deploy2002> Locking from deployment [ALL REPOSITORIES]: LVS Maint - Outage (T334703)

So, the solution quoted from my IRC chat above is about making the depool verification code actually track the currently-live "low-traffic" (applayer/internal) LVS routing, as opposed to what it's doing now (which I think checks the primary+secondary for the role as configured in puppet, which doesn't account for any failure/depool/etc. at the LVS layer).

There should probably be another fix as well, somewhere in the deployment tooling, so that it doesn't carry on with the carnage if it's unable to make contact with LVS for verification. Maybe also an option to ignore LVS state and operate blindly, which could be used during an emergency deploy while LVS is misbehaving.
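
As a sketch of what "don't carry on if LVS is unreachable" could look like, the Python below checks every load balancer before anything is depooled and has an explicit opt-out for emergencies. Everything here is hypothetical: `precheck_lvs`, the `probe` callable, the `ignore_lvs` flag, and the second host name are illustrative and not part of the existing tooling.

```python
class LVSUnreachable(RuntimeError):
    """Raised when a load balancer cannot be queried before depooling."""


def precheck_lvs(lvs_hosts, probe, ignore_lvs=False):
    """Return the hosts that answered; raise if any did not.

    `probe` is a hypothetical callable that takes a host name and returns
    True when that host's instrumentation endpoint answers. With
    ignore_lvs=True the check is skipped (the "operate blindly" mode).
    """
    if ignore_lvs:
        return []
    unreachable = [h for h in lvs_hosts if not probe(h)]
    if unreachable:
        raise LVSUnreachable("cannot reach: " + ", ".join(unreachable))
    return list(lvs_hosts)


# Fake probe in which lvs2009 refuses connections, as in the log above.
# "lvs2010" is just a placeholder name for a second load balancer.
def fake_probe(host):
    return host != "lvs2009"


try:
    precheck_lvs(["lvs2009", "lvs2010"], fake_probe)
    outcome = "proceed"
except LVSUnreachable as exc:
    outcome = "abort: %s" % exc

print(outcome)
```

Running the check before the depool step means a failure leaves the host pooled and untouched, which is the cheap case to recover from.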

Mentioned in SAL (#wikimedia-operations) [2023-04-17T14:53:45Z] <ladsgroup@deploy2002> Unlocked for deployment [ALL REPOSITORIES]: LVS Maint - Outage (T334703) (duration: 13m 39s)

@BBlack and @CDanis: Could the ticket title/description be updated with a more specific actionable? If this is more involved it's probably better to split into multiple single-task tickets. Thanks!

Probably needs subtasks for two things:

  1. Fix "safe-service-restart.py" being unsafe: either it or its caller is failing to propagate an error upstream to stop the carnage, and it is also leaving a node depooled when the error happens between the depool and repool operations. At least one of those needs fixing, if not both.
  2. The whole 'template the local appservers.svc IP into the "instrumentation_ips"' thing at the pybal level, plus whatever changes are needed to use it from the scap side of things (so that it only checks one local pybal, and it's the correct one by current pooling).
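
For the first item, one possible shape of the fix is sketched below in Python: repool in a `finally` block once the depool has been issued, and turn any failure into a non-zero exit status that the caller can act on. This is an illustration under assumptions, not the real safe-service-restart code. The four callables are hypothetical stand-ins for the actual operations, and whether repooling after a failed verification is the right policy is a judgment call.

```python
import sys


def safe_restart(depool, verify_depooled, restart, repool):
    """Depool, verify, restart; always repool once depool was issued."""
    depooled = False
    try:
        depool()
        depooled = True
        verify_depooled()
        restart()
    finally:
        if depooled:
            repool()


def main(depool, verify_depooled, restart, repool):
    """Return a process exit status so callers can stop on failure."""
    try:
        safe_restart(depool, verify_depooled, restart, repool)
    except Exception as exc:
        print("restart failed: %s" % exc, file=sys.stderr)
        return 1
    return 0


# Simulate the incident: the depool is issued but verification fails.
calls = []


def failing_verify():
    raise RuntimeError("never retrieved pool state")


rc = main(
    depool=lambda: calls.append("depool"),
    verify_depooled=failing_verify,
    restart=lambda: calls.append("restart"),
    repool=lambda: calls.append("repool"),
)
print(rc, calls)
```

In this simulation the service is never restarted, the node ends up pooled again, and the status is non-zero, so a wrapper such as the deploy tooling has something to stop on.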

I'm frankly not sure how checking appservers.svc.eqiad.wmnet:9090 from an appserver would work - that IP resolves locally to the loopback interface on any appserver. We'd need to pick another internal IP.

Also, pybal's http monitoring is, for security reasons, only bound to local IPs; we should evaluate if that could be a risk on the backup LVS.

Change 923389 had a related patch set uploaded (by BBlack; author: BBlack):

[operations/debs/pybal@1.15-stretch] pybal: add support for advertised instrumentation

https://gerrit.wikimedia.org/r/923389

Change 923389 merged by Ssingh:

[operations/debs/pybal@1.15-stretch] pybal: add support for advertised instrumentation

https://gerrit.wikimedia.org/r/923389

Change 923399 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/debs/pybal@1.15-stretch] Release 1.15.12

https://gerrit.wikimedia.org/r/923399

Change 923399 merged by Ssingh:

[operations/debs/pybal@1.15-stretch] Release 1.15.12

https://gerrit.wikimedia.org/r/923399

Change 923404 had a related patch set uploaded (by BBlack; author: BBlack):

[operations/debs/pybal@1.15-stretch] pybal: quick bugfix for advertised instrumentation

https://gerrit.wikimedia.org/r/923404

Change 923404 merged by Ssingh:

[operations/debs/pybal@1.15-stretch] pybal: quick bugfix for advertised instrumentation

https://gerrit.wikimedia.org/r/923404

Change 923405 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/debs/pybal@1.15-stretch] Release 1.15.13

https://gerrit.wikimedia.org/r/923405

Change 923405 merged by Ssingh:

[operations/debs/pybal@1.15-stretch] Release 1.15.13

https://gerrit.wikimedia.org/r/923405

Change 923414 had a related patch set uploaded (by BBlack; author: BBlack):

[operations/dns@master] Add pybal-low-traffic.svc.codfw.wmnet

https://gerrit.wikimedia.org/r/923414

Change 923414 merged by BBlack:

[operations/dns@master] Add pybal-low-traffic.svc.codfw.wmnet

https://gerrit.wikimedia.org/r/923414

Change 923598 had a related patch set uploaded (by BBlack; author: BBlack):

[operations/dns@master] Add pybal-low-traffic.svc.eqiad.wmnet

https://gerrit.wikimedia.org/r/923598

Change 923598 merged by BBlack:

[operations/dns@master] Add pybal-low-traffic.svc.eqiad.wmnet

https://gerrit.wikimedia.org/r/923598

JMeybohm added a subscriber: JMeybohm.

As you seem to be working on this I'm bluntly assigning to you as part of the incident followup.

Seems fine for now, thanks!

I'm frankly not sure how checking appservers.svc.eqiad.wmnet:9090 from an appserver would work - that IP resolves locally to the loopback interface on any appserver. We'd need to pick another internal IP.

This is a good point. We were trying to avoid picking another IP because it involves either a pybal code change or some kind of dummy-service. We went with the pybal code change path (already merged and packaged, but not yet globally deployed + configured), and I've now defined floating instrumentation IPs for all the traffic classes at all the sites, which you can see in netbox at https://netbox.wikimedia.org/ipam/ip-addresses/?q=Failover
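
For illustration, a client could derive the URL to query from the site name instead of a hard-coded LVS host. The Python sketch below assumes the floating name serves the same `/pools/<pool>/<host>` path seen in the log in the task description, and that names for other traffic classes follow the `pybal-<class>.svc.<site>.wmnet` pattern of the two DNS records added above. Both are assumptions, and `pool_status_url` is a made-up helper.

```python
def pool_status_url(dc, pool, host, traffic_class="low-traffic", port=9090):
    """Build the instrumentation URL for one backend in one pool."""
    return "http://pybal-{tc}.svc.{dc}.wmnet:{port}/pools/{pool}/{host}".format(
        tc=traffic_class, dc=dc, port=port, pool=pool, host=host
    )


url = pool_status_url("codfw", "api-https_443", "mw2364.codfw.wmnet")
print(url)
```

Because the name follows whichever LVS currently advertises the instrumentation IP, the client only ever talks to one pybal, and it is the one actually routing traffic.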

Also, pybal's http monitoring is, for security reasons, only bound to local IPs; we should evaluate if that could be a risk on the backup LVS.

Our border-in policies on the hardware routers block port 9090 to the LVS service ranges, so we should be ok there, even for the public-subnet LVS classes (cf https://gerrit.wikimedia.org/r/plugins/gitiles/operations/homer/public/+/refs/heads/master/policies/cr-border-in.pol#114 ; PyBal is defined elsewhere as 9090/tcp).

Change 924593 had a related patch set uploaded (by BBlack; author: BBlack):

[operations/puppet@production] [WIP] pybal: configure failover i13n IPs

https://gerrit.wikimedia.org/r/924593

Change 924596 had a related patch set uploaded (by BBlack; author: BBlack):

[operations/puppet@production] [WIP] safe-service-restart: use failover i13n

https://gerrit.wikimedia.org/r/924596

We've got a pair of patches to review now which configure this on the pybal and safe-service-restart sides. We could especially use serviceops input on the latter. None of it's particularly pretty, but at least it's fairly succinct and seems to do the job!

pybal part:
https://gerrit.wikimedia.org/r/c/operations/puppet/+/924593/9

safe-service-restart part:
https://gerrit.wikimedia.org/r/c/operations/puppet/+/924596/3

Change 924974 had a related patch set uploaded (by BBlack; author: BBlack):

[operations/puppet@production] safe-service-restart: pre-verify the verifier

https://gerrit.wikimedia.org/r/924974

Mentioned in SAL (#wikimedia-operations) [2023-06-01T15:15:33Z] <bblack> lvs400[89]: upgrade pybal to 1.15.13 - T334703

Mentioned in SAL (#wikimedia-operations) [2023-06-01T16:32:29Z] <bblack> lvs400[89]: upgrade pybal to 1.15.13 - T334703 (round 2!)

Mentioned in SAL (#wikimedia-operations) [2023-06-01T16:35:17Z] <bblack> lvs5* (eqsin): upgrade pybal to 1.15.13 - T334703

Mentioned in SAL (#wikimedia-operations) [2023-06-01T16:42:30Z] <bblack> lvs2* (codfw): upgrade pybal to 1.15.13 - T334703

Mentioned in SAL (#wikimedia-operations) [2023-06-01T18:33:29Z] <bblack> lvs3* (esams): upgrade pybal to 1.15.13 - T334703

Mentioned in SAL (#wikimedia-operations) [2023-06-01T18:45:27Z] <bblack> lvs6* (drmrs): upgrade pybal to 1.15.13 - T334703

Mentioned in SAL (#wikimedia-operations) [2023-06-01T19:09:11Z] <bblack> lvs1* (eqiad): upgrade pybal to 1.15.13 - T334703

Change 924593 merged by BBlack:

[operations/puppet@production] pybal: configure failover i13n IPs

https://gerrit.wikimedia.org/r/924593

Change 927168 had a related patch set uploaded (by BBlack; author: BBlack):

[operations/puppet@production] pybal: configure advertised_instrumentation_ips

https://gerrit.wikimedia.org/r/927168

Change 927168 merged by BBlack:

[operations/puppet@production] pybal: configure advertised_instrumentation_ips

https://gerrit.wikimedia.org/r/927168

Mentioned in SAL (#wikimedia-operations) [2023-06-05T13:09:45Z] <bblack> lvs4* (ulsfo) - restart pybal for T334703 IPs

Mentioned in SAL (#wikimedia-operations) [2023-06-05T13:15:30Z] <bblack> lvs6* (drmrs) - restart pybal for T334703 IPs

Mentioned in SAL (#wikimedia-operations) [2023-06-05T13:19:21Z] <bblack> lvs5* (eqsin) - restart pybal for T334703 IPs

Mentioned in SAL (#wikimedia-operations) [2023-06-05T13:25:00Z] <bblack> lvs3* (esams) - restart pybal for T334703 IPs

Mentioned in SAL (#wikimedia-operations) [2023-06-05T13:29:56Z] <bblack> lvs2* (codfw) - restart pybal for T334703 IPs

Mentioned in SAL (#wikimedia-operations) [2023-06-05T13:32:31Z] <bblack> lvs1* (eqiad) - restart pybal for T334703 IPs

Change 927200 had a related patch set uploaded (by BBlack; author: BBlack):

[operations/puppet@production] wikidata maxlag maint script: use new pybal VIPs

https://gerrit.wikimedia.org/r/927200

Change 927200 merged by BBlack:

[operations/puppet@production] wikidata maxlag maint script: use new pybal VIPs

https://gerrit.wikimedia.org/r/927200

Change 924596 merged by BBlack:

[operations/puppet@production] safe-service-restart: use failover i13n

https://gerrit.wikimedia.org/r/924596

AIUI this is now resolved