Network flap on cloudbackup2002
Closed, Duplicate · Public

Description

cloudbackup2002.codfw.wmnet paged and then recovered twice for loss of ping at Sun Nov 15 05:27:35 UTC 2020 and Sun Nov 15 05:39:03 UTC 2020. The host reported that it had not rebooted (uptime 362 days).

However, in dmesg:

[Sun Nov 15 05:08:24 2020] bnxt_en 0000:18:00.0 eno1np0: NIC Link is Down
[Sun Nov 15 05:12:48 2020] bnxt_en 0000:18:00.0 eno1np0: NIC Link is Up, 10000 Mbps full duplex, Flow control: ON - receive & transmit
[Sun Nov 15 05:12:48 2020] bnxt_en 0000:18:00.0 eno1np0: FEC autoneg off encodings: None
[Sun Nov 15 05:19:56 2020] bnxt_en 0000:18:00.0 eno1np0: NIC Link is Down
[Sun Nov 15 05:25:37 2020] bnxt_en 0000:18:00.0 eno1np0: NIC Link is Up, 10000 Mbps full duplex, Flow control: ON - receive & transmit
[Sun Nov 15 05:25:37 2020] bnxt_en 0000:18:00.0 eno1np0: FEC autoneg off encodings: None

That seems like a strange time of day for anyone to bump a cable, so the network cable might be loose, or something similar.
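To put numbers on the flaps in the dmesg excerpt above, here is a minimal Python sketch that pairs each "NIC Link is Down" line with the next "NIC Link is Up" line for the same interface and reports how long the link was down. It is not part of the original report: the function name `link_down_intervals` and the regular expression are illustrative assumptions, and it expects lines in exactly the human-readable timestamp format quoted above.

```python
import re
from datetime import datetime

# Assumed line shape, taken from the excerpt quoted in the description:
# [Sun Nov 15 05:08:24 2020] bnxt_en 0000:18:00.0 eno1np0: NIC Link is Down
LINE_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\]\s+\S+\s+\S+\s+(?P<iface>\S+): "
    r"NIC Link is (?P<state>Up|Down)"
)


def link_down_intervals(lines):
    """Return (iface, down_at, up_at, seconds_down) for each Down/Up pair.

    Hypothetical helper for illustration only.
    """
    down_since = {}  # interface name -> time of the last unmatched Down event
    intervals = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # e.g. the "FEC autoneg" lines carry no link state
        ts = datetime.strptime(match.group("ts"), "%a %b %d %H:%M:%S %Y")
        iface = match.group("iface")
        if match.group("state") == "Down":
            down_since[iface] = ts
        elif iface in down_since:
            start = down_since.pop(iface)
            seconds = int((ts - start).total_seconds())
            intervals.append((iface, start, ts, seconds))
    return intervals


if __name__ == "__main__":
    sample = [
        "[Sun Nov 15 05:08:24 2020] bnxt_en 0000:18:00.0 eno1np0: NIC Link is Down",
        "[Sun Nov 15 05:12:48 2020] bnxt_en 0000:18:00.0 eno1np0: NIC Link is Up, 10000 Mbps full duplex, Flow control: ON - receive & transmit",
        "[Sun Nov 15 05:12:48 2020] bnxt_en 0000:18:00.0 eno1np0: FEC autoneg off encodings: None",
        "[Sun Nov 15 05:19:56 2020] bnxt_en 0000:18:00.0 eno1np0: NIC Link is Down",
        "[Sun Nov 15 05:25:37 2020] bnxt_en 0000:18:00.0 eno1np0: NIC Link is Up, 10000 Mbps full duplex, Flow control: ON - receive & transmit",
    ]
    for iface, down_at, up_at, seconds in link_down_intervals(sample):
        print(f"{iface}: down {down_at:%H:%M:%S} -> up {up_at:%H:%M:%S} ({seconds}s)")
```

On the lines quoted above, this gives two outages on eno1np0 of 264 and 341 seconds, each lasting several minutes, which is far longer than a momentary glitch.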

Event Timeline

Bstorm created this task. Nov 15 2020, 5:52 AM
Restricted Application added a project: SRE. Nov 15 2020, 5:52 AM
Restricted Application added a subscriber: Aklapper.
elukey added a subscriber: elukey. Nov 15 2020, 8:29 AM

Good point @Peachey88, cloudbackup2002 is indeed in rack C7!

@Bstorm closing this task as duplicate of T267865, please re-open if I am missing something.

dcaro added a subscriber: dcaro. Nov 16 2020, 10:54 AM

Wouldn't it be better to mark this one as 'depends on'? That way, when we check for issues with the host cloudbackup2002, we will find an open task that still depends on the actual switch task, and it also lets us track whether the host comes back up correctly once the switch is up, no?

(I'm kinda new, so I don't know the workflows yet and might be missing some context :) )