
Consider migrating Search Platform-owned clusters to IPIP encapsulation (LVS)
Closed, Resolved (Public)

Description

Per T363694, we had some issues with pybal silently failing when its backends are not connected at layer 2. As pybal is unlikely to be updated, we can work around this by enabling IPIP encapsulation for our LVS pools (see T357257 and related tickets).

Creating this ticket to:

  • Consult with the Traffic team on our options.
  • Decide whether or not to migrate.

If we decide to migrate, we'll create a separate ticket for the migration plan.

Event Timeline

Gehel triaged this task as Low priority. May 24 2024, 8:43 AM
Gehel moved this task from Incoming to Infrastructure on the Data-Platform-SRE board.
bking changed the task status from Open to Stalled. Jun 4 2024, 1:53 PM
bking added a project: Traffic.
bking added a subscriber: Vgutierrez.

Per an IRC conversation with @Vgutierrez, this feature is not yet available. Tagging Traffic so they can ping us when it's ready.

Now that T365689 has been completed, we can discuss tackling this one, @bking.

In terms of the puppet repo, we have the following requirements:

  1. the service catalog needs to include an ipip_encapsulation key under the lvs stanza of the desired service. (https://gerrit.wikimedia.org/r/c/operations/puppet/+/1038291/2/hieradata/common/service.yaml is a nice example)
  2. realservers for that service need to include profile::lvs::realserver::ipip [https://gerrit.wikimedia.org/r/c/operations/puppet/+/1038294/4/modules/role/manifests/cache/text.pp] and set the following hiera keys [https://gerrit.wikimedia.org/r/c/operations/puppet/+/1038294/4 is a good example]:
    • profile::lvs::realserver::ipip::ipv4_mss [MSS clamping to be applied to IPv4 endpoints]
    • profile::lvs::realserver::ipip::ipv6_mss [MSS clamping to be applied to IPv6 endpoints]
    • profile::lvs::realserver::ipip::enabled: true [used to signal that IPIP is enabled on that host; this needs to be applied cluster/DC wide as pybal doesn't support mixed realservers]
    • profile::base::enable_rp_filter: false [we need to disable rp_filter because request traffic will reach the realserver via ipip0 or ipip60 and the response will go out using the primary NIC interface]
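The task does not state which values to use for the two MSS keys. As a rough sketch of the arithmetic behind MSS clamping, the following assumes a 1500-byte path MTU, IPv4-in-IPv4 and IPv6-in-IPv6 tunnels, and no IPv4 options or IPv6 extension headers. All of these are assumptions, so check the linked Gerrit changes for the values actually deployed.

```python
# Sketch: derive MSS clamp values for IPIP-encapsulated LVS traffic.
# Assumptions (not taken from the task): 1500-byte path MTU, no IPv4
# options, no IPv6 extension headers, same address family for the
# inner and outer IP headers.

IPV4_HEADER = 20  # bytes, minimum IPv4 header
IPV6_HEADER = 40  # bytes, fixed IPv6 header
TCP_HEADER = 20   # bytes, TCP header without options


def clamped_mss(mtu: int, ip_header: int) -> int:
    """Largest TCP payload that still fits after adding one outer IP header."""
    outer_header = ip_header  # encapsulation adds a second header of the same family
    return mtu - outer_header - ip_header - TCP_HEADER


ipv4_mss = clamped_mss(1500, IPV4_HEADER)
ipv6_mss = clamped_mss(1500, IPV6_HEADER)
print(ipv4_mss, ipv6_mss)  # prints "1440 1400"
```

Without clamping, full-size segments would exceed the MTU once the load balancer adds the outer header, which is why the realservers advertise a smaller MSS.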

We already included a sane default for profile::lvs::realserver::ipip::interfaces in common/profile/lvs/realservers/ipip.yaml, so assuming that your realservers only use $facts[interface_primary] to handle LVS traffic, you won't need to override it.
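Before merging, the hiera requirements above could be checked with something like the sketch below. It assumes the per-host hiera data has already been loaded into plain dictionaries; the `check_cluster` helper and the host names in the usage note are hypothetical and not part of the puppet repo.

```python
# Sketch: check that every realserver in a cluster carries the hiera keys
# listed above and that IPIP is enabled uniformly (pybal doesn't support
# mixed realservers). check_cluster is a hypothetical helper.

ENABLED_KEY = "profile::lvs::realserver::ipip::enabled"
RP_FILTER_KEY = "profile::base::enable_rp_filter"
REQUIRED_KEYS = (
    "profile::lvs::realserver::ipip::ipv4_mss",
    "profile::lvs::realserver::ipip::ipv6_mss",
    ENABLED_KEY,
    RP_FILTER_KEY,
)


def check_cluster(hiera_by_host: dict) -> list:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    for host, data in sorted(hiera_by_host.items()):
        missing = [key for key in REQUIRED_KEYS if key not in data]
        if missing:
            problems.append(f"{host}: missing {', '.join(missing)}")
        # rp_filter must be explicitly disabled on IPIP realservers.
        if RP_FILTER_KEY in data and data[RP_FILTER_KEY] is not False:
            problems.append(f"{host}: {RP_FILTER_KEY} must be false")
    # The enabled flag has to be identical across the whole cluster/DC.
    states = {data.get(ENABLED_KEY) is True for data in hiera_by_host.values()}
    if len(states) > 1:
        problems.append("cluster mixes IPIP and non-IPIP realservers")
    return problems
```

For example, calling `check_cluster` with two made-up hosts where only one sets `profile::lvs::realserver::ipip::enabled: true` reports the missing keys on the other host plus the mixed-cluster problem.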

Deployment would work like this:

  1. depool the impacted cluster/DC
  2. merge the puppet changes for service.yaml and for the realservers together
  3. apply puppet on realservers and validate that IPIP is working as expected [we use https://gitlab.wikimedia.org/-/snippets/107 for that]
  4. apply puppet on the involved LVS servers
  5. rolling restart of pybal on the impacted LVS servers
  6. validate from the bastion hosts that traffic towards the VIP works as expected
  7. repool the impacted cluster/DC
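For step 6, a very basic reachability check from a bastion host could look like the sketch below. It only confirms that a TCP handshake with the VIP completes; it is not the validation snippet linked in step 3, and the function name, host and port are placeholders chosen for illustration.

```python
# Sketch: confirm that a TCP connection to a service VIP can be established.
# vip_accepts_tcp is a hypothetical helper; the host and port passed to it
# are placeholders, not values from the task.
import socket


def vip_accepts_tcp(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP handshake with (host, port) completes in time."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


# Example call, with a placeholder VIP name and port:
#   vip_accepts_tcp("my-service.svc.example.invalid", 443)
```

A completed handshake shows that requests reach a realserver through the tunnel and that replies make it back, but an application-level request is still needed to confirm that the service itself answers correctly.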

@Vgutierrez Awesome, thank you for the comprehensive plan of action. I'll get to work on the puppet patches. Once the patches are ready, we can figure out a maintenance window if that works for you.

bking changed the task status from Stalled to In Progress. Jun 13 2024, 10:54 PM
bking claimed this task.

Change #1043302 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] cloudelastic: enable IPIP for LVS

https://gerrit.wikimedia.org/r/1043302

Gehel moved this task from Backlog to Done on the Data-Platform-SRE (2024.06.17 - 2024.07.07) board.
Gehel subscribed.

Decision is made; implementation follows in T367511.