Wiki-replicas: investigate why some maintenance operations can cause unwanted pybal impact
Closed, Invalid · Public

Description

For T337446 (Rebuild sanitarium hosts) we need to run some maintenance work on the wiki replicas, including the hosts behind dbproxy1018/dbproxy1019, but this apparently has some unwanted pybal impact.

Per IRC chat:

16:52 <vgutierrez> marostegui, we've been investigating a pybal issue that apparently is related to dbproxy1018
16:52 <vgutierrez> https://grafana.wikimedia.org/goto/vSRg8PQVk?orgId=1
16:53 <vgutierrez> IdleConnection seemed to be flapping a lot (aka connecting/disconnecting from dbproxy too quickly) from ~08:00 to ~14:00 today
16:54 <vgutierrez> it roughly matches your ack on icinga
16:54 <vgutierrez> and it completely matches the icinga alert
16:54 <@marostegui> vgutierrez: I had no idea dbproxy1018 (wmcs proxies) had any implication on pybal
16:55 <@marostegui> but yes, it is part of the outage at https://phabricator.wikimedia.org/T337446
16:55 <vgutierrez> PROBLEM - haproxy failover on dbproxy1018 is CRITICAL: CRITICAL check_failover servers up 14 down 1: https://wikitech.wikimedia.org/wiki/HAProxy at 08:08 UTC
16:55 <@marostegui> yes, I am aware of that alert
16:55 <@marostegui> But there's not much I can do if I want to get this fixed
16:55 <vgutierrez> marostegui: dbproxy1018 is exposed via high-traffic2 LVS through the wikireplicas service
16:56 <@marostegui> vgutierrez: but it only affects wmcs users, right?
16:56 <vgutierrez> wikireplicas maybe, high-traffic2 handles upload.wikimedia.org traffic as well
16:56 <@marostegui> but does that issue affects upload.wikimedia.org too?
16:57 <vgutierrez> potentially it could impact inbound traffic on upload.wm.o in eqiad yes
16:57 <@marostegui> Then I have no idea what to do, because there will be more of those in the next few days
16:57 <@marostegui> There is no other way for me to get this fixed
16:58 * vgutierrez reading the task
16:58 <@marostegui> vgutierrez: Not much to read, I basically have to stop two clouddb* hosts for a few hours to reclone them
16:59 <@marostegui> And they are behind dbproxy1018 and dbproxy1019, which are wmcs proxies
16:59 <vgutierrez> why that would impact haproxy ability of having a TCP connection open?
16:59 <@marostegui> That I don't know
16:59 <vgutierrez> lack of backend servers?
16:59 <@marostegui> I guess
16:59 <@marostegui> I don't know, I don't own this service at all
17:01 <vgutierrez> could we set dbproxy1018 as inactive during that maintenance window? dcaro, arturo?
17:01 <@marostegui> There is also dbproxy1019 involved on all this
17:01 <@marostegui> In case it matters
17:04 <vgutierrez> yep.. actually both hosts were impacted
17:04 <vgutierrez> (from pybal's PoV)
17:05 <vgutierrez> so assuming that during the maintenance window the dbproxies are unable to process incoming requests we would like to flag them as inactive to prevent pybal from healthhecking them
17:07 <volans> but still pybal should not choke if a backend is not healthy/unable to respond to healthchecks
17:08 <vgutierrez> volans: yep
17:08 <vgutierrez> totally agree with that
17:53 <_joe_> vgutierrez: pybal will chocke worse if it has no backends defined in a pool
17:54 <_joe_> I would suggest to remove healthchecks for that service
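
Illustrative note on the flapping described in the chat: the sketch below is a toy model, not pybal's actual code. It assumes (this is only the guess raised in the chat, not a confirmed diagnosis) that a proxy with no usable backend accepts the TCP connection and then closes it right away, and that the idle-connection check uses a capped exponential backoff that is reset after every successful connect. The function name `reconnect_delays` and all parameter values are hypothetical.

```python
def reconnect_delays(events, initial=0.5, factor=2.0, max_delay=30.0):
    """Return the wait (in seconds) chosen before each reconnect attempt.

    events is a sequence of strings:
      "refused" - the connect attempt failed outright
      "dropped" - the connect succeeded, then the peer closed the connection

    Hypothetical model only: the backoff parameters are made up for the
    example and are not taken from any real pybal configuration.
    """
    delay = initial
    delays = []
    for event in events:
        if event == "dropped":
            # A successful connect resets the backoff, so the next attempt
            # is scheduled after the shortest possible wait.
            delay = initial
            delays.append(delay)
        elif event == "refused":
            # A failed connect waits the current delay, then backs off further.
            delays.append(delay)
            delay = min(delay * factor, max_delay)
        else:
            raise ValueError(f"unknown event: {event!r}")
    return delays


if __name__ == "__main__":
    # Backend fully down: attempts become progressively rarer.
    print(reconnect_delays(["refused"] * 6))
    # Backend accepts then drops: every attempt is retried at the minimum delay.
    print(reconnect_delays(["dropped"] * 6))
```

Under these assumptions, a host that refuses connections is probed less and less often, while a host that accepts and immediately drops them is retried at the minimum interval for as long as the condition lasts, which would look like the rapid connect/disconnect pattern reported above.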

Event Timeline

Restricted Application added a subscriber: Aklapper.
12:48 <marostegui> ok, s3 needs to be depooled entirely
12:48 <marostegui> so that means clouddb1013 and clouddb1017

Ok I think I have a plan:

  • to work around this LVS/pybal interaction, let's drop the wikireplica S3 definition from hiera/puppet entirely
  • then @Marostegui can perform the maintenance operation
  • we can redefine them in LVS/pybal when the operation is completed and move on to the next.

I'll craft a patch with this.

Change 924481 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] lvs: remove wikireplicas S3 definition

https://gerrit.wikimedia.org/r/924481

Change 924481 abandoned by Arturo Borrero Gonzalez:

[operations/puppet@production] lvs: remove wikireplicas S3 definition

Reason:

another solution was found.

https://gerrit.wikimedia.org/r/924481

@BBlack the fix at T337446#8888642 can now be reverted as everything is stable.

taavi subscribed.

Pybal is no longer used here.