
asw1-eqsin: VC mastership change
Closed, Resolved · Public

Description

The failover was triggered by connecting a new host to an unused port, and caused an outage until the mastership change completed.

Nov 15 04:50:52  asw1-eqsin /kernel: xe-1/0/20: get tlv ppfeid 0
Nov 15 04:50:52  asw1-eqsin rpd[1915]: ifl_delete: ifl 0x95c2f88 xe-1/0/20.16386 index 571 refcount 1 irb_bdid 32768 8acc174 8acef60 8ace386 8ace3b3 8a22688 8a115cf 8a11629 8b1172c 8b0c096 81f700b 81f6759
Nov 15 04:50:52  asw1-eqsin rpd[1915]: EVENT <Delete> xe-1/0/20.16386 index 571 <Broadcast Multicast> address #0 58.0.bb.2a.7e.77
Nov 15 04:50:52  asw1-eqsin rpd[1915]: EVENT <Delete> xe-1/0/20 index 698 <Broadcast Multicast> address #0 58.0.bb.2a.7e.77
Nov 15 04:50:52  asw1-eqsin rpd[1915]: STP handler: IFD =NULL, op=delete, state=Discarding, Topo change generation=0
Nov 15 04:50:52  asw1-eqsin rpd[1915]: *STP Delete*, notify to other modules
Nov 15 04:50:54  asw1-eqsin /kernel: tcp_timer_keep: Dropping socket connection due to keepalivetimer expiration, idle/intvl/cnt: 300/300/5
Nov 15 04:50:54  asw1-eqsin /kernel: tcp_timer_keep: fport/lport: 33015/62939 faddr/laddr: 128.0.0.1/128.0.0.16
Nov 15 04:50:58  asw1-eqsin /kernel: tcp_timer_keep: Dropping socket connection due to keepalivetimer expiration, idle/intvl/cnt: 1000/1000/5
Nov 15 04:50:58  asw1-eqsin /kernel: tcp_timer_keep: fport/lport: 6157/59631 faddr/laddr: 128.0.0.1/128.0.0.16
Nov 15 04:50:58  asw1-eqsin /kernel: tcp_timer_keep: Dropping socket connection due to keepalivetimer expiration, idle/intvl/cnt: 1000/1000/5
Nov 15 04:50:58  asw1-eqsin /kernel: tcp_timer_keep: fport/lport: 6166/64609 faddr/laddr: 128.0.0.1/128.0.0.16
Nov 15 04:50:58  asw1-eqsin /kernel: tcp_timer_keep: Dropping socket connection due to keepalivetimer expiration, idle/intvl/cnt: 1000/1000/5
Nov 15 04:50:58  asw1-eqsin /kernel: tcp_timer_keep: fport/lport: 6343/63578 faddr/laddr: 128.0.0.1/128.0.0.16
Nov 15 04:50:58  asw1-eqsin /kernel: tcp_timer_keep: Dropping socket connection due to keepalivetimer expiration, idle/intvl/cnt: 1000/1000/5
Nov 15 04:50:58  asw1-eqsin /kernel: tcp_timer_keep: fport/lport: 6987/55654 faddr/laddr: 128.0.0.1/128.0.0.16
Nov 15 04:50:58  asw1-eqsin /kernel: tcp_timer_keep: Dropping socket connection due to keepalivetimer expiration, idle/intvl/cnt: 1000/1000/5
Nov 15 04:50:58  asw1-eqsin /kernel: tcp_timer_keep: fport/lport: 6985/54948 faddr/laddr: 128.0.0.1/128.0.0.16
Nov 15 04:50:58  asw1-eqsin ksyncd[1917]: ksyncd_ipc_dispatch: shutting down connection with master: Operation timed out
Nov 15 04:50:58  asw1-eqsin /kernel: tcp_timer_keep: Dropping socket connection due to keepalivetimer expiration, idle/intvl/cnt: 1000/1000/5
Nov 15 04:50:58  asw1-eqsin /kernel: tcp_timer_keep: fport/lport: 6234/64031 faddr/laddr: 128.0.0.1/128.0.0.16
Nov 15 04:50:58  asw1-eqsin /kernel: tcp_timer_keep: Dropping socket connection due to keepalivetimer expiration, idle/intvl/cnt: 1000/1000/5
Nov 15 04:50:58  asw1-eqsin /kernel: tcp_timer_keep: fport/lport: 6234/52289 faddr/laddr: 128.0.0.1/128.0.0.5
Nov 15 04:50:58  asw1-eqsin ksyncd[1917]: closing connection to master
Nov 15 04:51:03  asw1-eqsin ksyncd[1917]: ksyncd_select_control_plane_proto: rhost_sysctlbyname_get: (master:1) No such file or directory
Nov 15 04:51:04  asw1-eqsin /kernel: tcp_timer_keep: Dropping socket connection due to keepalivetimer expiration, idle/intvl/cnt: 600/600/20
Nov 15 04:51:04  asw1-eqsin /kernel: tcp_timer_keep: fport/lport: 6011/51970 faddr/laddr: 128.0.0.1/128.0.0.16
Nov 15 04:51:09  asw1-eqsin ksyncd[1917]: ksyncd_select_control_plane_proto: rhost_sysctlbyname_get: (master:1) No such file or directory
Nov 15 04:51:12  asw1-eqsin /kernel: tcp_timer_keep: Dropping socket connection due to keepalivetimer expiration, idle/intvl/cnt: 1000/1000/20
Nov 15 04:51:12  asw1-eqsin /kernel: tcp_timer_keep: fport/lport: 7000/64945 faddr/laddr: 128.0.0.1/128.0.0.16
Nov 15 04:51:12  asw1-eqsin dc-pfe: CMLC: Going disconnected; Routing engine chassis socket closed abruptly
Nov 15 04:51:22  asw1-eqsin bdbrepd: Subscriber Management is not ready for GRES
Nov 15 04:51:22  asw1-eqsin vccpd[1742]: VCCPD_PROTOCOL_ADJDOWN: Lost adjacency to 5800.bb2a.7e60 on vcp-255/0/24,
Nov 15 04:51:22  asw1-eqsin rpd[1915]: EVENT <UpDown> vcp-255/0/24.32768 index 64 <Broadcast Multicast>
Nov 15 04:51:22  asw1-eqsin vccpd[1742]: VCCPD_PROTOCOL_ADJDOWN: Lost adjacency to 5800.bb2a.7e60 on vcp-255/0/25,
Nov 15 04:51:22  asw1-eqsin rpd[1915]: EVENT <UpDown> vcp-255/0/24 index 128 <Broadcast Multicast>
Nov 15 04:51:22  asw1-eqsin chassisd[1734]: CHASSISD_VCHASSIS_MEMBER_UPDATE_NOTICE: Membership update: Member 0->0, Mode Backup->Backup, 1M 0B, Master Unchanged, Members Unchanged
Nov 15 04:51:22  asw1-eqsin rpd[1915]: EVENT <UpDown> vcp-255/0/25.32768 index 65 <Broadcast Multicast>
Nov 15 04:51:22  asw1-eqsin rpd[1915]: EVENT <UpDown> vcp-255/0/25 index 129 <Broadcast Multicast>
Nov 15 04:51:22  asw1-eqsin vccpd[1742]: interface vcp-255/0/24 went down
Nov 15 04:51:22  asw1-eqsin vccpd[1742]: interface vcp-255/0/25 went down
Nov 15 04:51:22  asw1-eqsin chassisd[1734]: self ID 0 self mode 0 master ch 0 master de 0 back ch 0 back de 0 memb ch 0
Nov 15 04:51:22  asw1-eqsin init: can not access /usr/sbin/gstatd: No such file or directory
Nov 15 04:51:22  asw1-eqsin init: gstatd (PID 0) started
Nov 15 04:51:22  asw1-eqsin vccpd[1742]: Member 0, interface vcp-255/0/25 went down
Nov 15 04:51:22  asw1-eqsin vccpd[1742]: Member 0, interface vcp-255/0/24 went down
Nov 15 04:51:22  asw1-eqsin chassisd[1734]: CHASSISD_VCHASSIS_MEMBER_UPDATE_NOTICE: Membership update: Member 0->0, Mode Backup->Backup, -1M 0B, Master Changed, Members Changed
Nov 15 04:51:22  asw1-eqsin chassisd[1734]: self ID 0 self mode 0 master ch 1 master de 0 back ch 0 back de 0 memb ch 1
Nov 15 04:51:22  asw1-eqsin chassisd[1734]: CHASSISD_VCHASSIS_MEMBER_LIST_NOTICE: Members: 0B
Nov 15 04:51:22  asw1-eqsin chassisd[1734]: CHASSISD_VCHASSIS_MEMBER_OP_NOTICE: Member change: vc delete of member 1
Nov 15 04:51:22  asw1-eqsin chassisd[1734]: ch_device_ore_change, 2013: ore_present_alive 0 ore_slotid -1
Nov 15 04:51:22  asw1-eqsin chassisd[1734]: [consoleredirect]: No pid
Nov 15 04:51:22  asw1-eqsin chassisd[1734]: [chassism]: No pid
Nov 15 04:51:22  asw1-eqsin init: can not access /usr/sbin/gstatd: No such file or directory
Nov 15 04:51:23  asw1-eqsin init: gstatd (PID 0) started
Nov 15 04:51:23  asw1-eqsin vccpd[1742]: vcdb_write_db2file: open /config/vchassis/vc.tlv.db.1 for write
Nov 15 04:51:23  asw1-eqsin vccpd[1742]: vcdb_write_db2file: open /config/vchassis/vc.tlv.db for write
Nov 15 04:51:23  asw1-eqsin vccpd[1742]: vcdb_write_db2file: open /config/vchassis/vc.db for write
Nov 15 04:51:23  asw1-eqsin rpd[1915]: EVENT <UpDown> bme0.32769 index 5 <Up Broadcast Multicast> address #0 2.0.0.0.0.a
Nov 15 04:51:23  asw1-eqsin /kernel: DEVRT_CHANGE: Mode: B->M
Nov 15 04:51:23  asw1-eqsin chassisd[1734]: DEVRT_CHANGE: Mode: B->M
Nov 15 04:51:23  asw1-eqsin chassisd[1734]: self ID 0 self mode 1 master ch 0 master de 1 back ch 1 back de 0 memb ch 1
Nov 15 04:51:23  asw1-eqsin /kernel: mastership: routing engine 0 becoming master tnp_local_addr:0
Nov 15 04:51:23  asw1-eqsin /kernel: update_slave_peer_gres_status: vksid 0 is_slave_peer_gres_ready 0 is_local_slave_peer_gres_ready 0
Nov 15 04:51:23  asw1-eqsin /kernel: vks[0] 0 vks[1] 0
Nov 15 04:51:23  asw1-eqsin /kernel: PFE-MASTER - vks[0] 1 vks[1] 1
Nov 15 04:51:23  asw1-eqsin /kernel: PURGE AE pending table: 0 entries and 0 child NHs removed
Nov 15 04:51:23  asw1-eqsin /kernel: pfe_listener_disconnect: conn dropped: issu_master = 0,listener idx=12, tnpaddr=0x11000080, reason: invalid stack member
Nov 15 04:51:23  asw1-eqsin chassisd[1734]: CHASSISD_VCHASSIS_MEMBER_UPDATE_NOTICE: Membership update: Member 0->0, Mode Backup->Master, 0M -1B, Master Detected, Members Changed
Nov 15 04:51:23  asw1-eqsin chassisd[1734]: CHASSISD_VCHASSIS_MEMBER_LIST_NOTICE: Members: 0M
Nov 15 04:51:23  asw1-eqsin chassisd[1734]: ch_device_ore_change, 2002: ore_present_alive 0 ore_slotid -1
Nov 15 04:51:23  asw1-eqsin chassisd[1734]: CHASSISD_VCHASSIS_MEMBER_OP_NOTICE: Member change: vc change of member 0
Nov 15 04:51:23  asw1-eqsin irsd.log[1739]: DEVRT_CHANGE: Mode: B->M
Nov 15 04:51:23  asw1-eqsin irsd.log[1739]: irsd_rts_async_ifa_msg: Received async ifa message for bme0.0 op 3 ip:0x80000006
Nov 15 04:51:23  asw1-eqsin init: kernel-replication (PID 1917) exited with status=0 Normal Exit
Nov 15 04:51:23  asw1-eqsin irsd.log[1739]: irsd_rts_async_ifa_msg: Received async ifa message for bme0.0 op 3 ip:0x80000005
Nov 15 04:51:23  asw1-eqsin irsd.log[1739]: irsd_rts_async_ifa_msg: Received async ifa message for bme0 op 2 ip:0x80000010
Nov 15 04:51:23  asw1-eqsin irsd.log[1739]: irsd_rts_async_ifa_msg: Received async ifa message for bme0 op 1 ip:0x80000001
Nov 15 04:51:23  asw1-eqsin irsd.log[1739]: irsd_rts_async_ifa_msg: Received async ifa message for bme0 op 1 ip:0x80000004
Nov 15 04:51:23  asw1-eqsin rpd[1915]: ifl_delete: ifl 0x95c0270 bme0.32769 index 5 refcount 1 irb_bdid 32768 8acc174 8a1fcc1 8a115cf 8a11629 8b1172c 8b0c096 81f700b 81f6759
Nov 15 04:51:23  asw1-eqsin rpd[1915]: EVENT <Delete UpDown> bme0.32769 index 5 <Broadcast Multicast> address #0 2.0.0.0.0.a
Nov 15 04:51:23  asw1-eqsin rpd[1915]: ifl_delete: ifl 0x95c03a8 em0.0 index 6 refcount 1 irb_bdid 32768 8acc174 8a1fcc1 8a115cf 8a11629 8b1172c 8b0c096 81f700b 81f6759
Nov 15 04:51:23  asw1-eqsin rpd[1915]: EVENT <Delete UpDown> em0.0 index 6 <Broadcast Multicast> address #0 58.0.bb.2a.7e.61
Nov 15 04:51:23  asw1-eqsin rpd[1915]: EVENT Delete UpDown bme0.0 index 3 128.0.0.6/2 -> 191.255.255.255 <Broadcast Multicast Localup>
Nov 15 04:51:23  asw1-eqsin rpd[1915]: EVENT Delete UpDown bme0.0 index 3 128.0.0.5/2 -> 191.255.255.255 <Broadcast Multicast Localup>
Nov 15 04:51:23  asw1-eqsin rpd[1915]: EVENT Add bme0.0 index 3 128.0.0.1/2 -> 191.255.255.255 <Up Broadcast Multicast Localup>
Nov 15 04:51:23  asw1-eqsin rpd[1915]: EVENT Add bme0.0 index 3 128.0.0.4/2 -> 191.255.255.255 <Up Broadcast Multicast Localup>
Nov 15 04:51:23  asw1-eqsin dc-pfe: DEVRT_CHANGE: Mode: B->M
Nov 15 04:51:23  asw1-eqsin init: lacp (PID 1992) sending signal usr1: due to "proto-mastership": 0x1
Nov 15 04:51:23  asw1-eqsin chassisd[1734]: CHASSISD_IFDEV_DETACH_FPC: ifdev_detach_fpc(1)
Nov 15 04:51:23  asw1-eqsin init: l2cpd-service (PID 22353) started
[...]

And then plenty more logs related to the VC mastership failover.
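The tcp_timer_keep lines above show Junos internal sockets (the 128.0.0.x addresses on the VC backplane) being torn down once their keepalive budget was exhausted. As a rough sketch, and assuming the logged idle/intvl/cnt values are in seconds (the units are not stated in the log itself, so treat this as an assumption), BSD-style keepalive declares a peer dead after roughly idle + intvl × cnt:

```python
def keepalive_teardown_seconds(idle: int, intvl: int, cnt: int) -> int:
    """Approximate worst-case detection time for a dead peer under
    BSD-style TCP keepalive: wait `idle` units of silence, then send
    up to `cnt` probes spaced `intvl` units apart before dropping."""
    return idle + intvl * cnt

# Values seen in the log lines above (units assumed to be seconds):
print(keepalive_teardown_seconds(300, 300, 5))     # 1800
print(keepalive_teardown_seconds(1000, 1000, 5))   # 6000
```

Whatever the actual units, the different idle/intvl/cnt tuples explain why the drops arrive in batches per socket class rather than all at once.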

Event Timeline

ayounsi triaged this task as High priority.
Restricted Application added a subscriber: Aklapper.

@RobH I couldn't find any open task, so I opened this one. In the future, please make sure a task is opened as soon as any issue happens.
Can you also let us know the impact of the event? You mention Icinga alerts; was there any user impact?
What is the exact order of events? Was the switch port configured before the server was plugged in, or the other way around? Was any rollback attempted?

JTAC case 2022-1115-586910 opened.

I would also like to understand why eqsin was not immediately depooled when it happened.
The patch was already ready on Gerrit. There was some user impact, and no clear expectation, when the first Icinga alerts and the page happened, that it would resolve itself after a few minutes or remain stable after that.

Note that we will need to move the mastership back to asw-0604-eqsin to keep everything standardized. For that, it's better to depool the site.


Please note I was in the process of pulling up the patchset to merge when the failure happened, and then the alerts started to clear. I saw green for all the various cp hosts, and it appeared that eqsin had recovered.

The first batch of alerts were mgmt alerts, leading me to think we simply had an msw power disconnect, since they are single-feed in eqsin. Then, as all the other alerts scrolled by, I started to pull up the patchset to depool eqsin, and then things appeared to recover. Given the discussions earlier in the day with Traffic about how depooling eqsin would adversely affect users, I decided not to depool it when it appeared to be recovering.

I admit I should perhaps have called and woken folks up, but it was still overnight for EU and things appeared to be improving; it seems I was wrong.

Apologies.

Order of operations (all of this happened at or just before 5 AM UTC):

  • Jin racked new hosts and plugged in cp5032. The port was NOT set up and the server was not in Netbox at this time; the port was disabled on the switch (and showed as disabled in Netbox)
  • Icinga alerted on mgmt loss in rack 604
  • we started to check whether the msw had rebooted, checked the cable, etc.
  • other alerts streamed in for servers being wholly offline, followed by the asw crash notice emails
  • I started to pull up the patchset to merge, but then noticed recoveries happening
  • recoveries streamed across Icinga and all items went green

I didn't realize the 5xx spike had been so drastic; I didn't have https://grafana.wikimedia.org/d/-K8NgsUnz/home?orgId=1&from=1668487382007&to=1668488908178&viewPanel=8 up, and I should have.

The analysis of the core dump by JTAC showed that we were victims of this bug: https://prsearch.juniper.net/problemreport/PR1080132
This is even though we're running a Junos version that shouldn't be impacted: we run 14.1X53-D45.3, and the bug was fixed in 14.1X53-D30.
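A minimal sketch of the sanity check that makes this finding surprising: comparing the running D-release against the release where the PR is listed as fixed. The function names are ours, and this assumes both version strings sit on the same 14.1X53 train:

```python
import re

def d_release(version: str) -> tuple:
    """Extract the maintenance-release part of a Junos version string,
    e.g. '14.1X53-D45.3' -> (45, 3)."""
    m = re.search(r"-D(\d+)(?:\.(\d+))?$", version)
    if not m:
        raise ValueError(f"no D-release found in {version!r}")
    return (int(m.group(1)), int(m.group(2) or 0))

def should_contain_fix(running: str, fixed_in: str) -> bool:
    """True if `running` is at or past the release where the PR was
    fixed (same release train assumed)."""
    return d_release(running) >= d_release(fixed_in)

# D45.3 is well past D30, so the fix should be present --
# yet the core dump matched PR1080132 anyway.
print(should_contain_fix("14.1X53-D45.3", "14.1X53-D30"))  # True
```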

The trigger being:

when an interface is flapping or when you issue the CLI command "clear ether-switch table"

And the consequence:

The fxpc process crashes and generates the core file, it will cause traffic interruption until the service is restarted

(it's automatically restarted by a watchdog)

Thus:

We should refrain from performing any action that could cause a physical interface to flap, or at least depool the site before doing so,

until we have upgraded the switch stack (and we should prioritize this upgrade with T316532).

As we have multiple other VC stacks (all eqiad/codfw/ulsfo/fasw) running 14.1X53, even though not the exact same D45, an open question is whether they are impacted as well. Seeing all the flapping they experience every day, I'd guess not.

RobH changed the task status from Open to Stalled. Dec 21 2022, 8:36 PM
RobH moved this task from Backlog to Hardware Failure / Repair on the ops-eqsin board.

I'm setting this to stalled as the upgrade parent task should resolve this issue.

I think this should actually go over to Arzhel, as the mastership change is something they're fixing with the upgrade parent task.

Arzhel: did you want this to remain open to track the issue, or resolve it, since task T316532 encompasses a fix for this?

That's all done.