
Move eqiad frack to new infra
Closed, Resolved · Public

Description

Planning the switchover from pfw-eqiad to pfw3-eqiad/fasw-eqiad

Aiming for September 26th at 15:00 UTC / 11:00 EDT / 08:00 PDT. The window should take at most 5h, or 6h if a rollback is needed.

During the window (2h):

  • cr1/cr2-eqiad:
delete interfaces xe-3/1/7 disable
set interfaces xe-3/3/2 disable
  • cr1-eqiad: activate protocols bgp group Fundraising neighbor 208.80.154.201
  • cr2-eqiad: activate protocols bgp group Fundraising neighbor 208.80.154.203
  • pfw3-codfw (verification sketch after the repatch table below):
delete security ike gateway ike-gateway-eqiad address 208.80.154.218
set security ike gateway ike-gateway-eqiad address 208.80.154.219
set security address-book global address pfw-eqiad 208.80.154.219/32
  • Repatch servers to new switch stack
| Hostname | Old port | New port | New device |
| --- | --- | --- | --- |
| indium | pfw1:ge-2/0/0 | ge-0/0/0 | fasw-c1a |
| payments1 | pfw1:ge-2/0/1 | ge-0/0/1 | fasw-c1a |
| payments3 | pfw1:ge-2/0/2 | ge-0/0/2 | fasw-c1a |
| frav1001 | pfw1:ge-2/0/3 | ge-0/0/3 | fasw-c1a |
| pay-lvs1001 | pfw1:ge-2/0/4 | ge-0/0/4 | fasw-c1a |
| frdev1001 | pfw1:ge-2/0/5 | ge-0/0/5 | fasw-c1a |
| tellurium | pfw1:ge-2/0/6 | ge-0/0/6 | fasw-c1a |
| frpm1001 | pfw1:ge-2/0/7 | ge-0/0/7 | fasw-c1a |
| frlog1001 | pfw1:ge-2/0/8 | ge-0/0/8 | fasw-c1a |
| frauth1001 | pfw1:ge-2/0/9 | ge-0/0/9 | fasw-c1a |
| americium | pfw1:ge-2/0/10 | ge-0/0/10 | fasw-c1a |
| frqueue1001 | pfw1:ge-2/0/11 | ge-0/0/11 | fasw-c1a |
| frdb1002 | pfw1:ge-2/0/14 | ge-0/0/12 | fasw-c1a |
| payments2 | pfw2:ge-11/0/0 | ge-1/0/13 | fasw-c1b |
| payments4 | pfw2:ge-11/0/1 | ge-1/0/14 | fasw-c1b |
| pay-lvs1002 | pfw2:ge-11/0/3 | ge-1/0/15 | fasw-c1b |
| samarium | pfw2:ge-11/0/5 | ge-1/0/16 | fasw-c1b |
| thulium | pfw2:ge-11/0/7 | ge-1/0/17 | fasw-c1b |
| bismuth | pfw2:ge-11/0/8 | ge-1/0/18 | fasw-c1b |
| aluminium | pfw2:ge-11/0/9 | ge-1/0/19 | fasw-c1b |
| civi1001 | pfw2:ge-11/0/10 | ge-1/0/20 | fasw-c1b |
| frdb1001 | pfw2:ge-11/0/11 | ge-1/0/21 | fasw-c1b |
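
As a sanity check (not in the original runbook), the window changes above can be verified right after commit. On cr1/cr2-eqiad, xe-3/1/7 should come up and xe-3/3/2 should go admin-down:

show interfaces xe-3/1/7 terse
show interfaces xe-3/3/2 terse

On pfw3-codfw, the IKE and IPsec SAs should re-establish against the new eqiad endpoint 208.80.154.219:

show configuration security ike gateway ike-gateway-eqiad
show security ike security-associations
show security ipsec security-associations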

After the migration, testing (2h):

  • Verify monitoring is green
  • Verify BGP sessions are UP (incl. PyBal)
  • Do failover tests (unplug each device and each core link, verify failover time/behavior)
  • Verify NAT
  • Verify cross DC syncs
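
A possible set of spot checks for the list above (a sketch under the plan's device names, not a record of the exact commands used):

On cr1-eqiad / cr2-eqiad, the Fundraising neighbors (208.80.154.201 / 208.80.154.203) should show as Established:

show bgp summary

On pfw3-eqiad, check BGP (including the PyBal peerings, wherever they terminate), source NAT and flow sessions, and the cross-DC IPsec tunnel toward pfw3-codfw:

show bgp summary
show security nat source summary
show security flow session summary
show security ipsec security-associations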

Rollback decision

  • Move mgmt to the mgmt switch (cf. T156397)
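
The mechanics of the rollback itself aren't spelled out here; one common Junos safety net for this kind of window (an assumption, not something documented in this task) is to stage the changes with an automatic rollback timer and only confirm once the checks pass:

commit confirmed 10     (candidate config reverts automatically after 10 minutes unless confirmed)
commit                  (issued after the verification checks pass, to keep the changes)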

Cleanup

  • cr1-eqiad:
delete protocols bgp group Fundraising neighbor 208.80.154.217
delete protocols bgp group Fundraising multipath
delete interfaces xe-3/3/2
  • cr2-eqiad:
delete protocols bgp group Fundraising neighbor 208.80.154.221
delete protocols bgp group Fundraising multipath
delete interfaces xe-3/3/2
  • pfw3-codfw: delete firewall family inet filter loopback4 term allow_codfw from source-address 208.80.154.218/32
  • Remove dns entries
  • Remove rancid config
  • Remove from Icinga
  • Remove from LibreNMS
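
To confirm the BGP cleanup stuck (a sketch), the old 208.80.154.217 / 208.80.154.221 neighbors should be gone from the Fundraising group and nothing should be left in Idle/Active on cr1/cr2-eqiad:

show configuration protocols bgp group Fundraising
show bgp summary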

Unracking and the final part of the rack elevation update go back to T169644.

Event Timeline

Restricted Application added a subscriber: Aklapper.
ayounsi updated the task description.

Mentioned in SAL (#wikimedia-operations) [2017-09-26T15:18:42Z] <XioNoX> starting eqiad frack switch to new infra - T174218

Failover tests

Rate: 1 ping per second

| Tested item | Ping from bast1001 to tellurium | Ping from external to pfw3 | Ping from rigel to tellurium | Notes |
| --- | --- | --- | --- | --- |
| cr1 – pfw3a link | no loss | no loss | no loss | |
| cr2 – pfw3b link | no loss | no loss | no loss | |
| pfw3a node (secondary) | no loss | no loss | no loss | Icinga alert about cr interfaces down; ~6min to fully up |
| pfw3b node (primary) | 169 pings lost | 30 pings lost | 179 pings lost | ~6min to fully up; longer ping loss than expected, to be investigated |
| RG0 manual failover, node0 to node1 | 14 pings lost | 7 pings lost | 41 pings lost | |
| RG1 manual failover, node1 to node0 | 0 pings lost | 0 pings lost | 0 pings lost | |
| pfw3a – fasw-c8a link | 3 pings lost | 1 ping lost | 2 pings lost | |
| pfw3b – fasw-c8b link | 0 pings lost | 0 pings lost | 0 pings lost | |
| pfw3a – pfw3b control link | 25 pings lost | 24 pings lost | 3 pings lost | Consistent with https://kb.juniper.net/InfoCenter/index?page=content&id=KB22717 |
| pfw3a – pfw3b data link | 100% loss | 100% loss | 100% loss | To be escalated to JTAC |
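
For reference, the RG0/RG1 manual failovers in the table correspond to the standard SRX chassis-cluster operations (a sketch, not a transcript of the exact commands run; the reset clears the manual-failover flag afterwards):

show chassis cluster status
request chassis cluster failover redundancy-group 0 node 1
request chassis cluster failover reset redundancy-group 0
show chassis cluster status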

To be escalated to JTAC

JTAC noticed that the control link went down at the same time as the data/fabric link because of missed heartbeats, which shouldn't happen, and if it does happen it should recover automatically.

To troubleshoot it further, JTAC needs to reproduce the issue and do testing while the system is in fault mode, which means taking another (~30min) outage, then an upgrade later on if a real issue is found.
@Jgreen This failure scenario has a very low likelihood of happening (and an even lower likelihood of taking down the whole cluster again), but let me know if we can/should dig into it further.
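
If this gets picked up again, the usual starting points on the SRX side for the control/fabric heartbeat question are the standard chassis-cluster commands (nothing JTAC-specific):

show chassis cluster interfaces       (control and fabric link state)
show chassis cluster statistics       (heartbeat counters on both links)
show chassis cluster information      (redundancy-group failover history)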

Based on a little more info from our side discussion, I think we should defer further testing at eqiad until January. I think it's fine to test at codfw, since we're running the same new software rev. there, to look for a regression.

Test in codfw was successful, no packet loss/issue.