Migrate Cloudelastic load balancing to IPIP encapsulation (LVS)
Closed, Resolved, Public

Description

Migration steps:

  • depool cloudelastic (but keep monitoring its cluster health)
  • stop puppet on cloudelastic, lvs1020 and lvs1018
  • apply puppet on cloudelastic (one server at a time, watching cluster health checks) and validate that IPIP is working as expected [we use https://gitlab.wikimedia.org/-/snippets/107 for that]
  • apply puppet on lvs1020 and lvs1018
  • rolling restart of pybal on the impacted LVS servers (via the sre.loadbalancer.restart-pybal cookbook?)
  • validate from the bastion hosts that traffic towards the VIP works as expected
  • repool cloudelastic
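
The "one server at a time, watching cluster health checks" part of the steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the tooling used for this migration: `rolling_apply` and its three callables (`apply_change`, `cluster_is_healthy`, `validate_host`) are hypothetical names. In practice they would wrap a puppet run, a cluster health check, and the IPIP validation snippet linked above.

```python
from typing import Callable, Iterable, List


def rolling_apply(
    hosts: Iterable[str],
    apply_change: Callable[[str], None],
    cluster_is_healthy: Callable[[], bool],
    validate_host: Callable[[str], bool],
) -> List[str]:
    """Apply a change to one host at a time, stopping at the first problem.

    All three callables are hypothetical placeholders:
      apply_change(host)    -- e.g. run puppet on the host
      cluster_is_healthy()  -- e.g. check the cluster health status
      validate_host(host)   -- e.g. confirm IPIP traffic reaches the host

    Returns the hosts that were migrated and validated, in order.
    """
    done: List[str] = []
    for host in hosts:
        # Never touch the next host while the cluster is degraded.
        if not cluster_is_healthy():
            raise RuntimeError(
                f"cluster unhealthy before touching {host}; migrated so far: {done}"
            )
        apply_change(host)
        # Stop immediately if the freshly migrated host fails validation.
        if not validate_host(host):
            raise RuntimeError(
                f"validation failed on {host}; migrated so far: {done}"
            )
        done.append(host)
    return done


if __name__ == "__main__":
    # Dry run with stub callables; nothing is executed on real hosts.
    hosts = [f"cloudelastic{n}.eqiad.wmnet" for n in range(1005, 1011)]
    migrated = rolling_apply(
        hosts,
        apply_change=lambda host: print(f"would apply puppet on {host}"),
        cluster_is_healthy=lambda: True,
        validate_host=lambda host: True,
    )
    print(f"{len(migrated)} host(s) processed")
```

Raising on the first failed check leaves the remaining servers untouched, so at most one host is in an unvalidated state at any time.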

Event Timeline

Gehel triaged this task as Medium priority. Jun 18 2024, 8:46 AM
Gehel moved this task from Incoming to Scratch on the Data-Platform-SRE board.
Gehel moved this task from Scratch to Infrastructure on the Data-Platform-SRE board.

Change #1043302 had a related patch set uploaded (by Gehel; author: Bking):

[operations/puppet@production] cloudelastic: enable IPIP for LVS

https://gerrit.wikimedia.org/r/1043302

bking updated the task description.

Icinga downtime and Alertmanager silence (ID=81308338-4f99-4220-b27b-84a614289b52) set by bking@cumin2002 for 2:00:00 on 6 host(s) and their services with reason: IPIP migration

cloudelastic[1005-1010].eqiad.wmnet

Change #1043302 merged by Bking:

[operations/puppet@production] cloudelastic: enable IPIP for LVS

https://gerrit.wikimedia.org/r/1043302

Mentioned in SAL (#wikimedia-operations) [2024-06-20T14:47:44Z] <vgutierrez> rolling restart of pybal on lvs1020 and lvs1018 - T367511

I'm happy to report the migration completed successfully. As such, I'm closing out this ticket.

However, if you are noticing issues post-migration, please do re-open.

bking claimed this task.
bking moved this task from Backlog to Done on the Data-Platform-SRE (2024.06.17 - 2024.07.07) board.

Did this change fix T365154? video2commons came back to life, to general surprise, around the time this change was applied.

I very much doubt this is related. I can't imagine a link between video2commons and cloudelastic (though I might lack imagination), and this change did not alter anything functionally. It only changed the very low-level way in which traffic flows into cloudelastic.

Yes, taavi pointed us towards the change that repaired video2commons: https://phabricator.wikimedia.org/T365154#9915037