For T337446 (Rebuild sanitarium hosts) we need to run some maintenance work on the wiki replicas, including dbproxy1018/dbproxy1019, but this apparently has an unwanted impact on pybal.
Per IRC chat:
| # | IRC message |
|---|---|
| 1 | 16:52 <vgutierrez> marostegui, we've been investigating a pybal issue that apparently is related to dbproxy1018 |
| 2 | 16:52 <vgutierrez> https://grafana.wikimedia.org/goto/vSRg8PQVk?orgId=1 |
| 3 | 16:53 <vgutierrez> IdleConnection seemed to be flapping a lot (aka connecting/disconnecting from dbproxy too quickly) from ~08:00 to ~14:00 today |
| 4 | 16:54 <vgutierrez> it roughly matches your ack on icinga |
| 5 | 16:54 <vgutierrez> and it completely matches the icinga alert |
| 6 | 16:54 <@marostegui> vgutierrez: I had no idea dbproxy1018 (wmcs proxies) had any implication on pybal |
| 7 | 16:55 <@marostegui> but yes, it is part of the outage at https://phabricator.wikimedia.org/T337446 |
| 8 | 16:55 <vgutierrez> PROBLEM - haproxy failover on dbproxy1018 is CRITICAL: CRITICAL check_failover servers up 14 down 1: https://wikitech.wikimedia.org/wiki/HAProxy at 08:08 UTC |
| 9 | 16:55 <@marostegui> yes, I am aware of that alert |
| 10 | 16:55 <@marostegui> But there's not much I can do if I want to get this fixed |
| 11 | 16:55 <vgutierrez> marostegui: dbproxy1018 is exposed via high-traffic2 LVS through the wikireplicas service |
| 12 | 16:56 <@marostegui> vgutierrez: but it only affects wmcs users, right? |
| 13 | 16:56 <vgutierrez> wikireplicas maybe, high-traffic2 handles upload.wikimedia.org traffic as well |
| 14 | 16:56 <@marostegui> but does that issue affect upload.wikimedia.org too? |
| 15 | 16:57 <vgutierrez> potentially it could impact inbound traffic on upload.wm.o in eqiad yes |
| 16 | 16:57 <@marostegui> Then I have no idea what to do, because there will be more of those in the next few days |
| 17 | 16:57 <@marostegui> There is no other way for me to get this fixed |
| 18 | 16:58 — vgutierrez reading the task |
| 19 | 16:58 <@marostegui> vgutierrez: Not much to read, I basically have to stop two clouddb* hosts for a few hours to reclone them |
| 20 | 16:59 <@marostegui> And they are behind dbproxy1018 and dbproxy1019, which are wmcs proxies |
| 21 | 16:59 <vgutierrez> why that would impact haproxy ability of having a TCP connection open? |
| 22 | 16:59 <@marostegui> That I don't know |
| 23 | 16:59 <vgutierrez> lack of backend servers? |
| 24 | 16:59 <@marostegui> I guess |
| 25 | 16:59 <@marostegui> I don't know, I don't own this service at all |
| 26 | 17:01 <vgutierrez> could we set dbproxy1018 as inactive during that maintenance window? dcaro, arturo? |
| 27 | 17:01 <@marostegui> There is also dbproxy1019 involved on all this |
| 28 | 17:01 <@marostegui> In case it matters |
| 29 | 17:04 <vgutierrez> yep.. actually both hosts were impacted |
| 30 | 17:04 <vgutierrez> (from pybal's PoV) |
| 31 | 17:05 <vgutierrez> so assuming that during the maintenance window the dbproxies are unable to process incoming requests we would like to flag them as inactive to prevent pybal from healthchecking them |
| 32 | 17:07 <volans> but still pybal should not choke if a backend is not healthy/unable to respond to healthchecks |
| 33 | 17:08 <vgutierrez> volans: yep |
| 34 | 17:08 <vgutierrez> totally agree with that |
| 35 | 17:53 <_joe_> vgutierrez: pybal will choke worse if it has no backends defined in a pool |
| 36 | 17:54 <_joe_> I would suggest to remove healthchecks for that service |
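
To make the two ideas from the chat concrete (flagging a host as inactive so it is not healthchecked, and a checker that does not choke on an unresponsive backend), here is a toy model. It is only a sketch and not pybal's actual code: the `Backend` class, the `run_healthchecks` function, the `"active"`/`"inactive"` state names, the `fake_probe` helper and the `dbproxy-example` host are all made up for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Backend:
    name: str
    state: str = "active"  # hypothetical states: "active" or "inactive"


def run_healthchecks(backends: List[Backend],
                     probe: Callable[[str], bool]) -> Dict[str, bool]:
    """Probe every backend that is not flagged inactive.

    A probe that raises OSError (e.g. connection refused) marks that
    backend as down instead of aborting the whole check run.
    """
    results: Dict[str, bool] = {}
    for backend in backends:
        if backend.state == "inactive":
            # Flagged for maintenance: skip it, no connection attempts.
            continue
        try:
            results[backend.name] = bool(probe(backend.name))
        except OSError:
            results[backend.name] = False
    return results


def fake_probe(name: str) -> bool:
    # Stand-in for a real TCP check; pretends dbproxy1019 is unreachable.
    if name == "dbproxy1019":
        raise ConnectionRefusedError("backend down")
    return True


pool = [
    Backend("dbproxy1018", state="inactive"),  # flagged for the maintenance
    Backend("dbproxy1019"),                    # still checked, but failing
    Backend("dbproxy-example"),                # hypothetical healthy host
]
print(run_healthchecks(pool, fake_probe))
```

In this model dbproxy1018 is never probed, and the failing dbproxy1019 is reported as down without affecting the remaining host. It does not cover the case _joe_ raises, where a pool ends up with no backends defined at all.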