
Move eqiad frack to new infra
Closed, Resolved · Public

Description

Planning the switchover from pfw-eqiad to pfw3-eqiad/fasw-eqiad

Aiming for September 26th at 15:00 UTC / 11:00 EDT / 08:00 PDT. The window should take at most 5h, or 6h if a rollback is needed.

During the window (2h):

  • cr1/cr2-eqiad:
delete interfaces xe-3/1/7 disable
set interfaces xe-3/3/2 disable
  • cr1-eqiad: activate protocols bgp group Fundraising neighbor 208.80.154.201
  • cr2-eqiad: activate protocols bgp group Fundraising neighbor 208.80.154.203
  • pfw3-codfw (verification sketch after the repatch table below):
delete security ike gateway ike-gateway-eqiad address 208.80.154.218
set security ike gateway ike-gateway-eqiad address 208.80.154.219
set security address-book global address pfw-eqiad 208.80.154.219/32
  • Repatch servers to new switch stack
| Hostname | Old port | New port | New device |
| --- | --- | --- | --- |
| indium | pfw1:ge-2/0/0 | ge-0/0/0 | fasw-c1a |
| payments1 | pfw1:ge-2/0/1 | ge-0/0/1 | fasw-c1a |
| payments3 | pfw1:ge-2/0/2 | ge-0/0/2 | fasw-c1a |
| frav1001 | pfw1:ge-2/0/3 | ge-0/0/3 | fasw-c1a |
| pay-lvs1001 | pfw1:ge-2/0/4 | ge-0/0/4 | fasw-c1a |
| frdev1001 | pfw1:ge-2/0/5 | ge-0/0/5 | fasw-c1a |
| tellurium | pfw1:ge-2/0/6 | ge-0/0/6 | fasw-c1a |
| frpm1001 | pfw1:ge-2/0/7 | ge-0/0/7 | fasw-c1a |
| frlog1001 | pfw1:ge-2/0/8 | ge-0/0/8 | fasw-c1a |
| frauth1001 | pfw1:ge-2/0/9 | ge-0/0/9 | fasw-c1a |
| americium | pfw1:ge-2/0/10 | ge-0/0/10 | fasw-c1a |
| frqueue1001 | pfw1:ge-2/0/11 | ge-0/0/11 | fasw-c1a |
| frdb1002 | pfw1:ge-2/0/14 | ge-0/0/12 | fasw-c1a |
| payments2 | pfw2:ge-11/0/0 | ge-1/0/13 | fasw-c1b |
| payments4 | pfw2:ge-11/0/1 | ge-1/0/14 | fasw-c1b |
| pay-lvs1002 | pfw2:ge-11/0/3 | ge-1/0/15 | fasw-c1b |
| samarium | pfw2:ge-11/0/5 | ge-1/0/16 | fasw-c1b |
| thulium | pfw2:ge-11/0/7 | ge-1/0/17 | fasw-c1b |
| bismuth | pfw2:ge-11/0/8 | ge-1/0/18 | fasw-c1b |
| aluminium | pfw2:ge-11/0/9 | ge-1/0/19 | fasw-c1b |
| civi1001 | pfw2:ge-11/0/10 | ge-1/0/20 | fasw-c1b |
| frdb1001 | pfw2:ge-11/0/11 | ge-1/0/21 | fasw-c1b |
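
As a sanity check (not in the original runbook), the window changes above can be verified right after commit. On cr1/cr2-eqiad, xe-3/1/7 should come up and xe-3/3/2 should go admin-down:

show interfaces xe-3/1/7 terse
show interfaces xe-3/3/2 terse

On pfw3-codfw, the IKE and IPsec SAs should re-establish against the new eqiad endpoint 208.80.154.219:

show configuration security ike gateway ike-gateway-eqiad
show security ike security-associations
show security ipsec security-associations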

After the migration, testing (2h):

  • Verify monitoring is green
  • Verify BGP sessions are UP (incl. PyBal)
  • Do failover tests (unplug each device and each core link, verify failover time/behavior)
  • Verify NAT
  • Verify cross DC syncs
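
A possible set of spot checks for the list above (a sketch under the plan's device names, not a record of the exact commands used):

On cr1-eqiad / cr2-eqiad, the Fundraising neighbors (208.80.154.201 / 208.80.154.203) should show as Established:

show bgp summary

On pfw3-eqiad, check BGP (including the PyBal peerings, wherever they terminate), source NAT and flow sessions, and the cross-DC IPsec tunnel toward pfw3-codfw:

show bgp summary
show security nat source summary
show security flow session summary
show security ipsec security-associations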

Rollback decision

  • Move mgmt to the mgmt switch (cf. T156397)
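
The mechanics of the rollback itself aren't spelled out here; one common Junos safety net for this kind of window (an assumption, not something documented in this task) is to stage the changes with an automatic rollback timer and only confirm once the checks pass:

commit confirmed 10     (candidate config reverts automatically after 10 minutes unless confirmed)
commit                  (issued after the verification checks pass, to keep the changes)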

Cleanup

  • cr1-eqiad:
delete protocols bgp group Fundraising neighbor 208.80.154.217
delete protocols bgp group Fundraising multipath
delete interfaces xe-3/3/2
  • cr2-eqiad:
delete protocols bgp group Fundraising neighbor 208.80.154.221
delete protocols bgp group Fundraising multipath
delete interfaces xe-3/3/2
  • pfw3-codfw: delete firewall family inet filter loopback4 term allow_codfw from source-address 208.80.154.218/32
  • Remove dns entries
  • Remove rancid config
  • Remove from Icinga
  • Remove from LibreNMS
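
To confirm the BGP cleanup stuck (a sketch), the old 208.80.154.217 / 208.80.154.221 neighbors should be gone from the Fundraising group and nothing should be left in Idle/Active on cr1/cr2-eqiad:

show configuration protocols bgp group Fundraising
show bgp summary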

Unracking and the final part of the rack elevation update go back to T169644.

Event Timeline

Restricted Application added a subscriber: Aklapper.
ayounsi updated the task description.

Mentioned in SAL (#wikimedia-operations) [2017-09-26T15:18:42Z] <XioNoX> starting eqiad frack switch to new infra - T174218

Failover tests

Rate: 1 ping per second

| Tested item | Ping from bast1001 to tellurium | Ping from external to pfw3 | Ping from rigel to tellurium | Notes |
| --- | --- | --- | --- | --- |
| cr1 – pfw3a link | no loss | no loss | no loss | |
| cr2 – pfw3b link | no loss | no loss | no loss | |
| pfw3a node (secondary) | no loss | no loss | no loss | Icinga alert about cr interfaces down; ~6min to fully up |
| pfw3b node (primary) | 169 pings lost | 30 pings lost | 179 pings lost | ~6min to fully up; longer ping loss than expected, to be investigated |
| RG0 manual failover, node0 to node1 | 14 pings lost | 7 pings lost | 41 pings lost | |
| RG1 manual failover, node1 to node0 | 0 pings lost | 0 pings lost | 0 pings lost | |
| pfw3a – fasw-c8a link | 3 pings lost | 1 ping lost | 2 pings lost | |
| pfw3b – fasw-c8b link | 0 pings lost | 0 pings lost | 0 pings lost | |
| pfw3a – pfw3b control link | 25 pings lost | 24 pings lost | 3 pings lost | Consistent with https://kb.juniper.net/InfoCenter/index?page=content&id=KB22717 |
| pfw3a – pfw3b data link | 100% loss | 100% loss | 100% loss | To be escalated to JTAC |
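
For reference, the RG0/RG1 manual failovers in the table correspond to the standard SRX chassis-cluster operations (a sketch, not a transcript of the exact commands run; the reset clears the manual-failover flag afterwards):

show chassis cluster status
request chassis cluster failover redundancy-group 0 node 1
request chassis cluster failover reset redundancy-group 0
show chassis cluster status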

To be escalated to JTAC

JTAC noticed that the control link went down at the same time as the data/fabric link because of missed heartbeats, which shouldn't happen, and if it does happen it should recover automatically.

To troubleshoot it further, JTAC needs to reproduce the issue and do testing while the system is in fault mode, which means taking another (~30min) outage, then an upgrade later on if a real issue is found.
@Jgreen This failure scenario has a very low likelihood of happening (and an even lower likelihood of taking down the whole cluster again), but let me know if we can/should dig into it further.
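
If this gets picked up again, the usual starting points on the SRX side for the control/fabric heartbeat question are the standard chassis-cluster commands (nothing JTAC-specific):

show chassis cluster interfaces       (control and fabric link state)
show chassis cluster statistics       (heartbeat counters on both links)
show chassis cluster information      (redundancy-group failover history)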

Based on a little more info from our side discussion, I think we should defer further testing at eqiad until January. I think it's fine to test at codfw, since we're running the same new software rev. there, to look for a regression.

Test in codfw was successful, no packet loss/issue.