msw1-a6-eqiad flopping up and down mgmt connections on A6
Closed, Resolved, Public

Description

Initial description:
The db1096 management interface keeps working locally (through local ipmi commands) but has stopped working for both remote ipmi and ssh access. It is not a wrong-password problem: the connection simply times out. Most likely the network has an issue (an IP conflict, a physically disconnected cable, or a link down, as has happened before with other hosts). A cold reset was tried, following the guide; the reset itself worked, but the connectivity issues continued. Either checking the cables/network equipment or a power drain should be tried next.

Note that the host behind this management interface is in critical production. Let us know before doing any work beyond tightening the mgmt interface ethernet cable, so we can depool the whole host.

Update:
The same flapping up and down has been observed on:

  • db1096.mgmt
  • mw1309.mgmt
  • mw1311.mgmt
  • ganeti1006.mgmt

All on A6, so most likely a switch issue.
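The flapping described above could be confirmed from a remote host with a small probe loop. The following is only a sketch, not the tooling actually used on this task: it assumes that a successful TCP connection to the mgmt SSH port (22) is an acceptable stand-in for "the interface is up", and the function names `is_reachable`, `count_flaps` and `watch` are hypothetical.

```python
import socket
import time


def is_reachable(host, port=22, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections and name resolution failures.
        return False


def count_flaps(samples):
    """Count up/down transitions in a chronological list of boolean samples."""
    return sum(1 for prev, cur in zip(samples, samples[1:]) if prev != cur)


def watch(hosts, rounds=10, interval=30.0, probe=is_reachable):
    """Probe every host once per round and return {host: number of flaps}."""
    history = {host: [] for host in hosts}
    for i in range(rounds):
        for host in hosts:
            history[host].append(probe(host))
        if i < rounds - 1:
            time.sleep(interval)
    return {host: count_flaps(samples) for host, samples in history.items()}
```

Calling `watch(["db1096.mgmt", "mw1309.mgmt", "mw1311.mgmt", "ganeti1006.mgmt"])` would probe the four affected interfaces; the short names are taken from the list above and would need whatever domain suffix resolves in your environment. A non-zero flap count on several hosts in the same rack, with none elsewhere, points at shared equipment rather than at any single host.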

Event Timeline

Marostegui triaged this task as Medium priority. Apr 20 2020, 7:46 AM
Marostegui added subscribers: Cmjohnson, Jclark-ctr.

For the record this is an s5 slave.

Marostegui moved this task from Triage to In progress on the DBA board.

I saw a few hosts flop down and up in their host-up status on icinga, all on A6: in addition to db1096.mgmt, mw1311.mgmt and ganeti1006.mgmt. CC @ayounsi

In fact, db1096.mgmt SSH worked for a small window, down again now.

ayounsi raised the priority of this task from Medium to High. Apr 20 2020, 9:24 AM

As msw-a6-eqiad is 9 years old and there are no errors on the msw1-eqiad side, I'd say let's replace it.

jcrespo renamed this task from "db1096 management interface unresposive remotelly, likely connectivity issue" to "msw1-a6-eqiad flopping up and down mgmt connections on A6". Apr 20 2020, 9:42 AM
jcrespo updated the task description.
jcrespo updated the task description.

I have a spare Netgear switch already in storage. I will request access to the cage and complete this on Tuesday 4/21. @wiki_willy, can we order a replacement for the backup, please?

@Cmjohnson - we have a refresh for the eqiad management switches scheduled to be ordered this quarter, so I'll check with Rob to see when those are coming in. If it's going to be a while, we'll just order a couple more spares beforehand. Thanks, Willy

T249048 was approved last Friday (today being Monday), and my plan is to place the info into Coupa later today for ordering. I don't think we'll need another task for just a one-off switch, as this should come in just as fast.

Mentioned in SAL (#wikimedia-operations) [2020-04-21T15:40:19Z] <cmjohnson1> replacing mgmt switch on a6-eqiad T250652

Replaced the management switch and updated Netbox.

Thanks, Chris, for the prompt response!

Thanks! Re-opening so we don't forget to update the cable in Netbox as well.

Updated the cable number.