cloudgw: develop HA setup
Closed, Resolved (Public)

Description

Introduce keepalived + conntrackd to the setup, document failover procedures.

This development is possible now that we have both cloudgw2001-dev (T271519) and cloudgw2002-dev (T271590) racked and ready to use.

Details

Repo               Branch      Lines +/-
operations/puppet  production  +2 -0
operations/puppet  production  +2 -0
operations/puppet  production  +32 -4
operations/puppet  production  +15 -5
operations/puppet  production  +26 -13
operations/puppet  production  +5 -0
operations/puppet  production  +6 -1
operations/puppet  production  +3 -3
operations/puppet  production  +2 -2
operations/puppet  production  +2 -24
operations/puppet  production  +1 -1
operations/puppet  production  +1 -1
operations/puppet  production  +1 -1
operations/puppet  production  +172 -1
operations/puppet  production  +82 -164
operations/puppet  production  +9 -1
operations/puppet  production  +115 -6

Event Timeline

aborrero triaged this task as Medium priority. Jan 26 2021, 10:33 AM
aborrero created this task.
aborrero moved this task from Inbox to Soon! on the cloud-services-team (Kanban) board.

Change 663799 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] cloudgw2002-dev: give it proper puppet role

https://gerrit.wikimedia.org/r/663799

Change 663799 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] cloudgw2002-dev: give it proper puppet role

https://gerrit.wikimedia.org/r/663799

Script wmf-auto-reimage was launched by aborrero on cumin2001.codfw.wmnet for hosts:

cloudgw2002-dev.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/202102121057_aborrero_28819_cloudgw2002-dev_codfw_wmnet.log.

Completed auto-reimage of hosts:

['cloudgw2002-dev.codfw.wmnet']

and were ALL successful.

Change 663801 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] keepalived: add support for custom template

https://gerrit.wikimedia.org/r/663801

Change 663801 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] keepalived: add support for custom template

https://gerrit.wikimedia.org/r/663801

Change 663823 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] cloudgw: introduce HA by using keepalived/VRRP

https://gerrit.wikimedia.org/r/663823

Change 664241 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] cloudgw: move common hiera into proper file

https://gerrit.wikimedia.org/r/664241

Change 664241 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] cloudgw: move common hiera into proper file

https://gerrit.wikimedia.org/r/664241

Change 663823 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] cloudgw: introduce HA by using keepalived/VRRP

https://gerrit.wikimedia.org/r/663823

Mentioned in SAL (#wikimedia-cloud) [2021-02-15T15:45:29Z] <arturo> [codfw1dev] connect virtual router cloudinstances2b-gw to vlan cloud-gw-transport-codfw (185.15.57.10) (T272963)

Mentioned in SAL (#wikimedia-cloud) [2021-02-15T15:45:54Z] <arturo> [codfw1dev] drop subnet definition for cloud-instances-transport1-b-codfw (T272963)

Change 664255 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] Revert "cloud: hiera: add vlan 2120 back into the neutron bridge"

https://gerrit.wikimedia.org/r/664255

Change 664255 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] Revert "cloud: hiera: add vlan 2120 back into the neutron bridge"

https://gerrit.wikimedia.org/r/664255

Change 664256 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] Revert "cloud: hiera: connect cloudnet servers back to vlan 2120"

https://gerrit.wikimedia.org/r/664256

Change 664257 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] Revert "cloud: hiera: enable back neutron hacks in codfw1dev"

https://gerrit.wikimedia.org/r/664257

Change 664256 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] Revert "cloud: hiera: connect cloudnet servers back to vlan 2120"

https://gerrit.wikimedia.org/r/664256

Change 664257 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] Revert "cloud: hiera: enable back neutron hacks in codfw1dev"

https://gerrit.wikimedia.org/r/664257

Change 664307 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] cloudgw: stop setting up VIP addresses that are now handle via keepalived/VRRP

https://gerrit.wikimedia.org/r/664307

Change 664307 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] cloudgw: stop setting up VIP addresses that are now handle via keepalived/VRRP

https://gerrit.wikimedia.org/r/664307

Mentioned in SAL (#wikimedia-cloud) [2021-02-15T16:25:24Z] <arturo> [codfw1dev] rebooting all cloudgw200x-dev / cloudnet200x-dev servers (T272963)

Change 664311 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] cloudgw: switch data place interface config modes to manual

https://gerrit.wikimedia.org/r/664311

Change 664311 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] cloudgw: switch data place interface config modes to manual

https://gerrit.wikimedia.org/r/664311

Change 664317 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] cloudgw: interfaces: relax check on routing setup by using 'onlink'

https://gerrit.wikimedia.org/r/664317

Change 664317 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] cloudgw: interfaces: relax check on routing setup by using 'onlink'

https://gerrit.wikimedia.org/r/664317

Change 664521 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] conntrackd: also install the conntrack tool

https://gerrit.wikimedia.org/r/664521

Change 664521 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] conntrackd: also install the conntrack tool

https://gerrit.wikimedia.org/r/664521

Change 664538 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] openstack: cloudgw: allow incoming conntrackd TCP connection

https://gerrit.wikimedia.org/r/664538

Change 664538 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] openstack: cloudgw: allow incoming conntrackd TCP connection

https://gerrit.wikimedia.org/r/664538

Change 664549 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] cloudgw: use address per interface in the cloud-instance-transport subnet

https://gerrit.wikimedia.org/r/664549

Change 664549 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] cloudgw: use address per interface in the cloud-instance-transport subnet

https://gerrit.wikimedia.org/r/664549

Change 664603 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] cloudgw: let keepalived track static routes

https://gerrit.wikimedia.org/r/664603

Change 664603 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] cloudgw: let keepalived track additional static routes

https://gerrit.wikimedia.org/r/664603

Change 664785 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] cloudgw: set up conntrack sysctl parameters

https://gerrit.wikimedia.org/r/664785

Change 664785 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] cloudgw: set up conntrack sysctl parameters

https://gerrit.wikimedia.org/r/664785

Change 664789 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] cloudgw: keepalived: use nopreempt option

https://gerrit.wikimedia.org/r/664789

Change 664789 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] cloudgw: keepalived: use nopreempt option

https://gerrit.wikimedia.org/r/664789

Change 664800 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] cloudgw: start conntrackd before keepalived

https://gerrit.wikimedia.org/r/664800

Change 664800 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] cloudgw: refresh conntrackd service dependencies

https://gerrit.wikimedia.org/r/664800

This is in very good shape. I tested several failover scenarios:

  • manually stopping keepalived on the primary VRRP node
  • rebooting the primary VRRP node
  • flapping (backup -> primary -> backup -> primary)
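
For context on the flapping case: change 664789 above enables keepalived's nopreempt option, so a node that comes back should not take the VIP away from the node currently holding it. The toy model below sketches only that election rule. It is not keepalived's implementation, and the node names and priority values are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    name: str        # hypothetical node name, not a real cloudgw host
    priority: int    # hypothetical VRRP priority; higher wins an election
    alive: bool = True


def elect(nodes: List[Node], current: Optional[str], nopreempt: bool) -> Optional[str]:
    """Return the name of the node that should hold the VIP, or None if all are down."""
    alive = [n for n in nodes if n.alive]
    if not alive:
        return None
    # With nopreempt, a healthy current holder keeps the VIP even if a
    # higher-priority node has come back.
    if nopreempt and current is not None and any(n.name == current for n in alive):
        return current
    # Otherwise the highest-priority live node takes over.
    return max(alive, key=lambda n: n.priority).name


if __name__ == "__main__":
    a, b = Node("gw-a", 100), Node("gw-b", 50)
    holder = elect([a, b], None, nopreempt=True)    # initial election
    a.alive = False
    holder = elect([a, b], holder, nopreempt=True)  # failover to gw-b
    a.alive = True
    holder = elect([a, b], holder, nopreempt=True)  # gw-a is back, gw-b keeps the VIP
    print(holder)
```

In this model, each reboot of the current primary causes exactly one role change, rather than one on failure and a second one on recovery.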

How I tested this:

  • ssh tools-codfw1dev-k8s-worker-1.tools-codfw1dev.codfw1dev.wikimedia.cloud
  • aborrero@tools-codfw1dev-k8s-worker-1:~$ wget https://network-tests.toolforge.org/files/1GB.bin -O /dev/null
  • ssh cloudgw2001-dev.codfw.wmnet --> reboot it if it is the primary
  • ssh cloudgw2002-dev.codfw.wmnet --> if it is the new primary, watch the traffic flowing through it
  • confirm the wget download keeps flowing despite several failovers
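
The last check above was done by eye. A minimal sketch of how it could be made measurable: stream the same test file, record when bytes arrive, and report any gap longer than a threshold. The function names and the 2-second threshold are assumptions for illustration, not part of the original test.

```python
import time
import urllib.request
from typing import Iterable, Iterator, List, Tuple

# (monotonic timestamp in seconds, cumulative bytes received)
Sample = Tuple[float, int]


def sample_download(url: str, chunk_size: int = 1 << 20) -> Iterator[Sample]:
    """Stream `url`, discard the data, and yield one sample per chunk read."""
    total = 0
    with urllib.request.urlopen(url) as response:
        while True:
            chunk = response.read(chunk_size)
            if not chunk:
                break
            total += len(chunk)
            yield (time.monotonic(), total)


def find_stalls(samples: Iterable[Sample], max_gap: float) -> List[Tuple[float, float]]:
    """Return (start, end) intervals in which no new bytes arrived for more than max_gap seconds.

    Only gaps that end with the transfer resuming are reported; a transfer
    that never resumes does not produce a closing sample.
    """
    stalls: List[Tuple[float, float]] = []
    last_time = None
    last_total = None
    for timestamp, total in samples:
        if last_total is None:
            last_time, last_total = timestamp, total
            continue
        if total > last_total:  # progress was made since the last sample
            if timestamp - last_time > max_gap:
                stalls.append((last_time, timestamp))
            last_time, last_total = timestamp, total
    return stalls


if __name__ == "__main__":
    url = "https://network-tests.toolforge.org/files/1GB.bin"
    for start, end in find_stalls(sample_download(url), max_gap=2.0):
        print(f"no data for {end - start:.1f}s")
```

Run from the test VM while rebooting the primary cloudgw node, this would print one line per interruption longer than the threshold, and nothing if the failover is seamless at that granularity.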