
Roll restart haproxy to apply updated configuration
Closed, Resolved · Public

Description

I have two patches out for haproxy, needed for Bullseye support and tech debt reduction respectively: https://gerrit.wikimedia.org/r/c/operations/puppet/+/708105 and https://gerrit.wikimedia.org/r/c/operations/puppet/+/708108.

The hosts running the haproxy module are:

  • cloudcontrol[2001,2003-2004]-dev.wikimedia.org
  • cloudcontrol[1003-1005].wikimedia.org
  • dbproxy[2001-2003].codfw.wmnet T287574#7248065
  • dbproxy[1012-1021].eqiad.wmnet T287574#7248065
  • thumbor[2001-2004].codfw.wmnet
  • thumbor[1001-1004].eqiad.wmnet

I know how to handle the thumbor hosts and will do those myself, but I need help with the other hosts. The patches will restart haproxy (the restart should be fast, though I think not hitless from the client's POV), so I'm assuming we'll need a coordinated rollout.

Note this is not urgent, but we should do it. I'm not the haproxy owner/maintainer, but I needed it on Bullseye, hence the patches. Thanks!
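For the thumbor side, the depool → restart → repool cycle could be scripted roughly as follows. This is a hedged sketch: `depool`/`pool` are the standard conftool wrappers on WMF hosts and `run-puppet-agent` the usual puppet wrapper, but the `run` dry-run helper and the exact host loop are illustrative assumptions, not tooling from this task.

```shell
#!/bin/sh
# Dry-run sketch of a per-host rollout: drain, apply, re-pool.
# DRY_RUN=1 only prints the commands instead of executing them.
DRY_RUN=1

run() {
  if [ "$DRY_RUN" -eq 1 ]; then
    echo "would run: $*"
  else
    "$@"
  fi
}

for host in thumbor1001 thumbor1002 thumbor1003 thumbor1004; do
  run ssh "$host.eqiad.wmnet" depool            # drain client traffic first
  run ssh "$host.eqiad.wmnet" run-puppet-agent  # apply the change; restarts haproxy
  run ssh "$host.eqiad.wmnet" pool              # back into rotation
done
```

Flipping `DRY_RUN` to 0 would execute the commands for real, one host at a time, so only one backend is ever out of rotation.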

Event Timeline

Marostegui added a subscriber: Marostegui.

dbproxy2* can be done anytime.
dbproxy1018 and dbproxy1019 are owned by the cloud services team.
The other dbproxy hosts are owned by the DBAs.
The active ones are:

  • dbproxy1014
  • dbproxy1013

I can take care of failing those over just in case.
Can this wait until we are done with the network maintenance that happens tomorrow?

Thank you @Marostegui for the info! Yes, this can wait until next week or the week after, no problem.

@fgiunchedi we'd need to coordinate this, as the change will reach all hosts as soon as puppet runs.
My idea would be to stop puppet on the active dbproxies (2 for us), let it run on the standby ones, and then fail over to them. This operation shouldn't take longer than 30 minutes or so (the DNS TTL for these is 5 minutes, but let's be gentle just in case).
Regarding the dbproxy hosts owned by WMCS, that's up to them; I'd guess a few seconds of downtime (the restart time) should be OK, but they need to confirm.

Let me know your thoughts!
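The dbproxy sequence above could be sketched as a dry run. `cumin`, `disable-puppet`, and `run-puppet-agent` are the standard WMF tooling, but the exact host selectors and reason string below are my assumptions, and the `plan` helper only prints each step rather than executing anything.

```shell
#!/bin/sh
# Print-only sketch of the dbproxy rollout order described above.
plan() { echo "step: $*"; }

# 1. Freeze the active proxies so the merged change doesn't reach them.
plan sudo cumin 'dbproxy1013* or dbproxy1014*' 'disable-puppet "T287574 haproxy restart"'
# 2. Let the standby proxies pick up the new config (restarts haproxy there).
plan sudo cumin 'dbproxy1012* or dbproxy1015*' run-puppet-agent
# 3. Fail DNS over to the standby proxies, then wait out the 5-minute TTL.
plan merge the m1/m2/m3 DNS failover patch
plan sleep 300
```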


Plan SGTM! I'll be doing a similar thing on the thumbor hosts, but with depool/repool.

Restarting haproxies in WMCS is fairly harmless, just ping us when it's time.

@fgiunchedi with the switches maintenance finished, we can proceed with this whenever you like. Just give me a 24h heads-up and it should be fine.
Thanks!

Thanks all for your help! Let's go for Tuesday next week (i.e. Aug 3rd). Easiest would be around 9 UTC; does that work on your end, @Andrew?

Active hosts that would need puppet stopped and failed over before applying the change:

  • dbproxy1014 (m1)
  • dbproxy1013 (m2)
  • dbproxy1020 (m3)

Fine with some interruption:

  • dbproxy1018 (WMCS - fine to have a brief restart per T287574#7246686)
  • dbproxy1019 (WMCS - fine to have a brief restart per T287574#7246686)

Standby proxies can be done anytime; once done, m1, m2 and m3 need to become active:

  • dbproxy1012 (m1)
  • dbproxy1015 (m2)
  • dbproxy1016 (m3)
  • dbproxy1017 (m5)
  • dbproxy1021 (m5)

Passive proxies:

  • dbproxy2001
  • dbproxy2002
  • dbproxy2003


That's the middle of the night for me but might suit @dcaro -- I've made a calendar entry.

Excellent, thank you @Andrew and @dcaro !

We're on for Tues Aug 3rd at 9 UTC

I have disabled puppet on the active dbproxies:

  • dbproxy1013
  • dbproxy1014
  • dbproxy1020

Change 709591 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/dns@master] wmnet: Failover m1, m2 and m3-master

https://gerrit.wikimedia.org/r/709591

The above patch is ready to be merged and deployed once the standby dbproxies are done.
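For context, the failover patch presumably flips CNAME-style records in the wmnet zone so the mN-master names point at the standby proxies. A sketch of what the change might look like, with record names, targets, and the 5-minute TTL all inferred from this thread rather than taken from the actual patch:

```
; hypothetical wmnet zone fragment -- names and targets are assumptions
m1-master  5M  IN  CNAME  dbproxy1012.eqiad.wmnet.
m2-master  5M  IN  CNAME  dbproxy1015.eqiad.wmnet.
m3-master  5M  IN  CNAME  dbproxy1016.eqiad.wmnet.
```

The short 5M TTL is what keeps the failover window small: clients re-resolve the mN-master name within five minutes of the change being deployed.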

Thank you @Marostegui !

To recap, here's my plan:

  1. stop puppet on C:haproxy
  2. merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/708108 and https://gerrit.wikimedia.org/r/c/operations/puppet/+/708105
  3. depool thumbor1001.eqiad.wmnet, enable puppet on it, and check everything is working as expected
  4. re-enable puppet on the remaining thumbor* hosts
  5. check with the service owners for the remaining hosts and re-enable puppet as needed
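The "check everything is working" part of step 3 could look something like the following print-only sketch. `haproxy -c -f` is haproxy's config-check mode and `show info` is a standard runtime API query, but the stats socket path and the `check` helper are assumptions here, not details from this task.

```shell
#!/bin/sh
# Print-only sketch of post-restart health checks on the depooled canary.
check() { echo "check: $*"; }

check systemctl is-active haproxy                      # service came back up
check haproxy -c -f /etc/haproxy/haproxy.cfg           # config parses cleanly
check 'echo "show info" | socat stdio /run/haproxy/admin.sock'  # uptime/stats
```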

Change 709591 merged by Marostegui:

[operations/dns@master] wmnet: Failover m1, m2 and m3-master

https://gerrit.wikimedia.org/r/709591

Mentioned in SAL (#wikimedia-operations) [2021-08-03T09:18:57Z] <marostegui> Failover m1, m2 and m3-master T287574

The proxies were failed over, and the previously active ones had puppet re-enabled and run; everything is OK.
Closing this as per our IRC chat.