
Roll restart haproxy to apply updated configuration
Closed, Resolved · Public

Description

I have two patches out for haproxy, needed for Bullseye support and tech debt reduction respectively: https://gerrit.wikimedia.org/r/c/operations/puppet/+/708105 and https://gerrit.wikimedia.org/r/c/operations/puppet/+/708108.

The hosts running the haproxy module are:

  • cloudcontrol[2001,2003-2004]-dev.wikimedia.org
  • cloudcontrol[1003-1005].wikimedia.org
  • dbproxy[2001-2003].codfw.wmnet T287574#7248065
  • dbproxy[1012-1021].eqiad.wmnet T287574#7248065
  • thumbor[2001-2004].codfw.wmnet
  • thumbor[1001-1004].eqiad.wmnet

I know how to handle the thumbor hosts and will do those myself, but I need help with the other hosts. The patches will restart haproxy (the restart should be fast, though I think not hitless from the client's POV), so I'm assuming we'll need a coordinated rollout.

Note this is not urgent, but we should do it. I'm not the haproxy owner/maintainer, but I needed it on Bullseye, hence the patches. Thanks!
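For the thumbor side, the depool → restart → repool cycle could be scripted roughly as follows. This is a hedged sketch: `depool`/`pool` are the standard conftool wrappers on WMF hosts and `run-puppet-agent` the usual puppet wrapper, but the `run` dry-run helper and the exact host loop are illustrative assumptions, not tooling from this task.

```shell
#!/bin/sh
# Dry-run sketch of a per-host rollout: drain, apply, re-pool.
# DRY_RUN=1 only prints the commands instead of executing them.
DRY_RUN=1

run() {
  if [ "$DRY_RUN" -eq 1 ]; then
    echo "would run: $*"
  else
    "$@"
  fi
}

for host in thumbor1001 thumbor1002 thumbor1003 thumbor1004; do
  run ssh "$host.eqiad.wmnet" depool            # drain client traffic first
  run ssh "$host.eqiad.wmnet" run-puppet-agent  # apply the change; restarts haproxy
  run ssh "$host.eqiad.wmnet" pool              # back into rotation
done
```

Flipping `DRY_RUN` to 0 would execute the commands for real, one host at a time, so only one backend is ever out of rotation.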

Event Timeline

Marostegui added a subscriber: Marostegui.

dbproxy2* can be done anytime.
dbproxy1018 and dbproxy1019 are owned by the cloud services team.
The other dbproxy hosts are owned by the DBAs.
The active ones are:

  • dbproxy1014
  • dbproxy1013

I can take care of failing those over just in case.
Can this wait until we are done with the network maintenance that happens tomorrow?

Thank you @Marostegui for the info! Yes, this can wait until next week or the week after, no problem.

@fgiunchedi we'd need to coordinate this, as the change will reach all hosts as soon as puppet runs.
My idea would be to stop puppet on the active dbproxies (2 for us), let it run on the standby ones, and then fail over to them. This operation shouldn't take longer than 30 minutes or so (the DNS TTL for these is 5 minutes, but let's be gentle just in case).
Regarding the dbproxy hosts owned by WMCS, that's up to them; I'd guess a few seconds of downtime (the restart time) should be OK, but they need to confirm.

Let me know your thoughts!
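The dbproxy sequence above could be sketched as a dry run. `cumin`, `disable-puppet`, and `run-puppet-agent` are the standard WMF tooling, but the exact host selectors and reason string below are my assumptions, and the `plan` helper only prints each step rather than executing anything.

```shell
#!/bin/sh
# Print-only sketch of the dbproxy rollout order described above.
plan() { echo "step: $*"; }

# 1. Freeze the active proxies so the merged change doesn't reach them.
plan sudo cumin 'dbproxy1013* or dbproxy1014*' 'disable-puppet "T287574 haproxy restart"'
# 2. Let the standby proxies pick up the new config (restarts haproxy there).
plan sudo cumin 'dbproxy1012* or dbproxy1015*' run-puppet-agent
# 3. Fail DNS over to the standby proxies, then wait out the 5-minute TTL.
plan merge the m1/m2/m3 DNS failover patch
plan sleep 300
```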


Plan SGTM! I'll be doing a similar thing on the thumbor hosts, but with depool/repool.

Restarting haproxies in WMCS is fairly harmless, just ping us when it's time.

@fgiunchedi with the switches maintenance finished, we can proceed with this whenever you like. Just give me a 24h heads-up and it should be fine.
Thanks!

Thanks all for your help! Let's go for Tuesday next week (i.e. Aug 3rd). Easiest would be around 9 UTC; does that work on your end, @Andrew?

Active hosts that would need puppet stopped and failed over before applying the change:

  • dbproxy1014 (m1)
  • dbproxy1013 (m2)
  • dbproxy1020 (m3)

Fine with some interruption:

  • dbproxy1018 (WMCS - fine to have a brief restart per T287574#7246686)
  • dbproxy1019 (WMCS - fine to have a brief restart per T287574#7246686)

Standby proxies can be done anytime; once done, m1, m2 and m3 need to become active:

  • dbproxy1012 (m1)
  • dbproxy1015 (m2)
  • dbproxy1016 (m3)
  • dbproxy1017 (m5)
  • dbproxy1021 (m5)

Passive proxies:

  • dbproxy2001
  • dbproxy2002
  • dbproxy2003


That's the middle of the night for me but might suit @dcaro -- I've made a calendar entry.

Excellent, thank you @Andrew and @dcaro !

We're on for Tues Aug 3rd at 9 UTC

I have disabled puppet on the active dbproxies:

  • dbproxy1013
  • dbproxy1014
  • dbproxy1020

Change 709591 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/dns@master] wmnet: Failover m1, m2 and m3-master

https://gerrit.wikimedia.org/r/709591

The above patch is ready to be merged and deployed once the standby dbproxies are done.
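For context, the failover patch presumably flips CNAME-style records in the wmnet zone so the mN-master names point at the standby proxies. A sketch of what the change might look like, with record names, targets, and the 5-minute TTL all inferred from this thread rather than taken from the actual patch:

```
; hypothetical wmnet zone fragment -- names and targets are assumptions
m1-master  5M  IN  CNAME  dbproxy1012.eqiad.wmnet.
m2-master  5M  IN  CNAME  dbproxy1015.eqiad.wmnet.
m3-master  5M  IN  CNAME  dbproxy1016.eqiad.wmnet.
```

The short 5M TTL is what keeps the failover window small: clients re-resolve the mN-master name within five minutes of the change being deployed.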

Thank you @Marostegui !

To recap, here's my plan:

  1. stop puppet on C:haproxy
  2. merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/708108 and https://gerrit.wikimedia.org/r/c/operations/puppet/+/708105
  3. depool thumbor1001.eqiad.wmnet, enable puppet on it, and check everything is working as expected
  4. re-enable puppet on the remaining thumbor* hosts
  5. check with the service owners for the remaining hosts and re-enable puppet as needed
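The "check everything is working" part of step 3 could look something like the following print-only sketch. `haproxy -c -f` is haproxy's config-check mode and `show info` is a standard runtime API query, but the stats socket path and the `check` helper are assumptions here, not details from this task.

```shell
#!/bin/sh
# Print-only sketch of post-restart health checks on the depooled canary.
check() { echo "check: $*"; }

check systemctl is-active haproxy                      # service came back up
check haproxy -c -f /etc/haproxy/haproxy.cfg           # config parses cleanly
check 'echo "show info" | socat stdio /run/haproxy/admin.sock'  # uptime/stats
```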

Change 709591 merged by Marostegui:

[operations/dns@master] wmnet: Failover m1, m2 and m3-master

https://gerrit.wikimedia.org/r/709591

Mentioned in SAL (#wikimedia-operations) [2021-08-03T09:18:57Z] <marostegui> Failover m1, m2 and m3-master T287574

The proxies were failed over, and the previously active ones had puppet re-enabled and run; everything is OK.
Closing this as per our IRC chat.