
DRAC firmware upgrades codfw (was: Flapping codfw management alarm ( contint2001.mgmt/SSH is CRITICAL ))
Closed, Resolved (Public)

Description

I recently started receiving notification alarms from Icinga for contint2001.mgmt/SSH:

Notification Type: PROBLEM

Service: SSH
Host: contint2001.mgmt
Address: 10.193.2.250
State: CRITICAL

That seems rather recent:

contint2001-mgmt-ssh.png (300×900 px, 32 KB)

https://icinga.wikimedia.org/cgi-bin/icinga/histogram.cgi?host=contint2001.mgmt&service=SSH

Kunal stated that it might be the codfw network flapping somehow but we could not find a related task.


Status | Host              | Version | New version | BIOS version | New BIOS version | Comments
[x]    | ores2005.mgmt     | 2.40    | 2.81        |              |                  |
[x]    | gerrit2001.mgmt   | 2.21    | 2.81        |              |                  |
[x]    | ms-fe2006.mgmt    | 2.40    | 2.81        |              |                  |
[x]    | wdqs2001.mgmt     | 2.30    | 2.81        |              |                  |
[x]    | wdqs2002.mgmt     | 2.30    | 2.81        |              |                  |
[ ]    | logstash2021.mgmt | offline |             |              |                  |
[ ]    | logstash2022.mgmt | offline |             |              |                  |
[ ]    | contint2001.mgmt  | 2.21    |             |              |                  |
[x]    | mw2253.mgmt       | 2.40    | 2.81        | 2.3.4        | 2.12             | Reset iDRAC
[x]    | mw2255.mgmt       | 2.40    | 2.81        | 2.3.4        | 2.13             | Reset iDRAC

Event Timeline

Marostegui edited projects, added serviceops, netops; removed SRE.
Marostegui added subscribers: Legoktm, Marostegui.

Tagging also netops in case they can help out

I'd defer to DCops
mgmt is connected to unmanaged switches, so we don't have much visibility on this side.
Either the server's mgmt port is faulty, or the software, or the cable, or the switch port.
I'd recommend trying to change/fix each of them until it's fully fixed, starting with the low-hanging ones.

Looking at the alarm history since midnight UTC at https://icinga.wikimedia.org/cgi-bin/icinga/history.cgi?host=all , I filtered for alarms matching 2\d\d\d.mgmt. There are a few hosts:

contint2001.mgmt
gerrit2001.mgmt
logstash2021.mgmt
logstash2022.mgmt
wdqs2001.mgmt

Though they are in different rows / racks in the datacenter.
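
For reference, a rough shell equivalent of that filter, run against the Icinga server's own log instead of the CGI view; the path and alert format below assume a stock Icinga 1 setup and are not verified against this install:

# Count SSH CRITICAL events per codfw-numbered mgmt host in the Icinga log.
# /var/log/icinga/icinga.log and the host;service;state layout are assumptions.
grep -oE '[a-z0-9.-]+2[0-9]{3}\.mgmt;SSH;CRITICAL' /var/log/icinga/icinga.log \
  | sort | uniq -c | sort -n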

I can confirm these have been happening for a while. The pattern is always:

  • only mgmt
  • only codfw
  • random host
  • resolves itself shortly after

That makes me think it should be the management switch.

ACKed some more today: gerrit2001.mgmt, wdqs2002.mgmt.

This is still happening from time to time. Is there a person or team I can raise this to?

@hashar This should be between netops and dcops I think.

Clearing a few projects. @ayounsi mentioned at T283582#7111164 that it is most probably an issue with an unmanaged switch for the management network.

cmooney added a project: netops.
cmooney added a subscriber: cmooney.

Thanks @hashar. I would agree with @ayounsi's analysis, if considering contint2001.mgmt in isolation.

But given your subsequent reply, and Daniel's response confirming it is spread across racks/rows yet confined to codfw, I think it's unlikely to be the result of issues on the unmanaged management switches. It's unlikely we'd have multiple of them faulty, and restricted only to codfw. It's not impossible that we purchased a dodgy batch of Cat5 cables which have been used there (I did experience an issue like that before), but it's extremely unlikely.

So, given the codfw-only pattern, I think it's more likely to be something with the mr1-codfw firewall, or the msw1-codfw aggregation switch (which connects all the unmanaged ones). I've checked in LibreNMS, and the associated links, CPU checks and other relevant metrics look good for both of these devices, so nothing is jumping out at me. But I can't say for sure there is no problem at this layer.

I observe some packet loss when sending high-speed pings from mr1-codfw to any device, for instance when pinging the CRs over a direct link. However, when sending such pings to devices connected via the management switches, the results are much better towards the CR mgmt interfaces than towards the server mgmt interfaces:

CR via uplink:

cmooney@mr1-codfw> ping 208.80.153.206 size 1400 do-not-fragment rapid count 1000 
PING 208.80.153.206 (208.80.153.206): 1400 data bytes
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!.!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!.!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!.!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!.!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!.!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!.!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!.!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
--- 208.80.153.206 ping statistics ---
1000 packets transmitted, 993 packets received, 0% packet loss
round-trip min/avg/max/stddev = 1.277/2.028/57.115/2.675 ms

CR via msw1-codfw:

cmooney@mr1-codfw> ping 10.193.0.12 size 1400 do-not-fragment rapid count 1000     
PING 10.193.0.12 (10.193.0.12): 1400 data bytes
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!.!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!.!!!!.!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
--- 10.193.0.12 ping statistics ---
1000 packets transmitted, 997 packets received, 0% packet loss
round-trip min/avg/max/stddev = 1.181/3.004/683.287/22.545 ms

contint2001.mgmt via msw1-codfw:

cmooney@mr1-codfw> ping 10.193.2.250 size 1400 do-not-fragment rapid count 1000    
PING 10.193.2.250 (10.193.2.250): 1400 data bytes
!!!!!!.!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!.!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!.!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!.!..!..!..!..!.!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!.!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!.!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!.!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!.!.
--- 10.193.2.250 ping statistics ---
1000 packets transmitted, 340 packets received, 66% packet loss
round-trip min/avg/max/stddev = 1.354/9.492/404.538/44.024 ms

Being honest, the pattern looks like ICMP rate-limiting. And the Dell iDRAC sub-system is known to have about as much processing power as a Casio watch, so this could be a total red herring.

To that last point though, I wonder whether the issue could be the iDRAC modules on these boxes themselves? I've seen issues with those in a past life, timing out on SNMP and SSH connections, so maybe that's it. Looking at the hosts reported above, I notice they are all R430 models purchased in 2015/2016:

Host         | Model                    | Purchase date
contint2001  | Dell PowerEdge R430 (1U) | 2016-03-24
gerrit2001   | Dell PowerEdge R430 (1U) | 2016-03-24
logstash2021 | Dell PowerEdge R430      | 2015-12-07
logstash2022 | Dell PowerEdge R430      | 2016-11-03
wdqs2001     | Dell PowerEdge R430      | 2016-08-15

I've limited time to do more checks right now. But I think it'd be worth pulling a longer list of affected servers, confirming the server models, and then checking the iDRAC firmware version on each. Possibly a firmware update, or even just an iDRAC reset on them, would help.

I'll assign this to myself for now. May not have a huge amount of time to work on it in the coming weeks but will see what I can dig up.
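
As a rough cross-check on the ICMP rate-limiting caveat above, one could time TCP connections to port 22 on a mgmt interface instead, since that is what the Icinga SSH check actually exercises. A minimal sketch, with the host name as just an example:

# Time 20 connections to the mgmt SSH port and read the banner; if these stay
# fast while rapid pings drop, the ICMP loss is likely just rate-limiting.
host=contint2001.mgmt.codfw.wmnet   # example target
for i in $(seq 1 20); do
  t0=$(date +%s.%N)
  if timeout 10 bash -c "exec 3<>/dev/tcp/${host}/22 && head -n1 <&3 >/dev/null"; then
    status=ok
  else
    status=FAIL
  fi
  printf '%s  %.2fs\n' "$status" "$(echo "$(date +%s.%N) - $t0" | bc)"
  sleep 1
done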

That is quite an epic diagnostic @cmooney! It is definitely not trivial to root-cause a specific piece of hardware as the common cause. Well done!

I went and fetched the IRC logs from https://wm-bot.wmflabs.org/libera_logs/%23wikimedia-operations/ , which go back to May 22nd. For codfw hosts (assuming they match the pattern 2...\.mgmt):

grep -o '.*icinga.*PROBLEM.*2...\.mgmt is CRITICAL' *|grep -o on.*mgmt|sort|uniq -c|sort -n
      1 on ores2005.mgmt
      3 on gerrit2001.mgmt
      5 on ms-fe2006.mgmt
     24 on wdqs2001.mgmt
     24 on wdqs2002.mgmt
     30 on logstash2021.mgmt
     52 on contint2001.mgmt

wdqs2002 and ms-fe2006 are PowerEdge R430 as well, with purchase dates in 2016.

Non-codfw hosts, excluding those that had only one alarm:

 2 on mw1273.mgmt
 2 on wdqs1006.mgmt
 3 on mw1305.mgmt
 4 on mw1297.mgmt
13 on mw1279.mgmt
14 on mw1303.mgmt
24 on bast5001.mgmt
46 on cp5005.mgmt
46 on mw1284.mgmt
56 on analytics1069.mgmt

analytics1069 is a PowerEdge R730xd

wdqs1006, mw1303, mw1305, bast5001, cp5005 are PowerEdge R430

mw1279 and mw1284 I could not find; they may have been decommissioned.

We are seeing this issue because all those hosts are running an old firmware version on the iDRAC. Upgrading the iDRAC on some of those servers in the past did fix the problem. It is not a management switch issue.

@Papaul Could we schedule a firmware upgrade for gerrit2001 due to this issue? (not high prio)

@cmooney Thank you very much for all the debugging effort you put into this and thanks @Papaul for confirming it is indeed an issue of firmware upgrades.

Cathal, would you still like to keep this ticket assigned to you? Papaul, should we turn this into a tracking ticket for firmware upgrades with checkboxes of affected hosts?

@Dzahn I will go for turning this into a tracking ticket for firmware upgrades with check boxes of affected hosts.

Sorry @Dzahn I should have updated it before now. Makes sense to re-assign to DC-Ops I think.

@Papaul I think we can start with these hosts:

ores2005.mgmt
gerrit2001.mgmt
ms-fe2006.mgmt
wdqs2001.mgmt
wdqs2002.mgmt
logstash2021.mgmt
logstash2022.mgmt
contint2001.mgmt

What might be good is if we can confirm the iDRAC firmware version(s) on these first. Then possibly we can do a query to identify a full list of devices with that version?

Is the best thing to do now edit the task subject / description to change it to the tracking task as suggested?
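
One way to gather those versions, sketched here on the assumption that the iDRACs accept racadm commands non-interactively over their SSH interface and that root is the right mgmt account; the host list is just the one from this ticket:

# Pull model and iDRAC firmware version from each candidate host.
# logstash2021/2022 are offline and will simply report no response.
for h in ores2005 gerrit2001 ms-fe2006 wdqs2001 wdqs2002 logstash2021 logstash2022 contint2001; do
  echo "== ${h}.mgmt.codfw.wmnet"
  ssh -o ConnectTimeout=10 "root@${h}.mgmt.codfw.wmnet" 'racadm getsysinfo' 2>/dev/null \
    | grep -Ei 'system model|firmware version' || echo 'no response'
done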

Dzahn renamed this task from Flapping codfw management alarm ( contint2001.mgmt/SSH is CRITICAL ) to DRAC firmware upgrades codfw (was: Flapping codfw management alarm ( contint2001.mgmt/SSH is CRITICAL )).Oct 19 2021, 4:29 PM
Dzahn updated the task description.

affected hosts I am ACKing right now in Icinga:

contint2001.mgmt
ms-fe2006.mgmt
mw2253.mgmt

@Papaul Could you maybe get the current versions for the ones I put in the ticket description now?

@cmooney No worries at all. All sounds good to me. I have indeed started editing the ticket name and description, and added some checkboxes and a version table.

ores2005.mgmt     | PE R430
gerrit2001.mgmt   | PE R430
ms-fe2006.mgmt    | PE R430
wdqs2001.mgmt     | PE R430
wdqs2002.mgmt     | PE R430
logstash2021.mgmt | Server is offline
logstash2022.mgmt | Server is offline
contint2001.mgmt  | PE R430

@Papaul @Dzahn I had a go at enumerating the iDRAC firmware version on our Dell R430s. Some of the connections timed out, but for those that didn't, it resulted in the following list of codfw hosts using firmware 2.4 and below (matching the pattern of the affected hosts).

Even with the API calls I could observe that the more recent versions responded quickly, whereas the older ones took ages. Not very scientific, but it does seem to confirm that the older version has issues, even if we don't always see the ping timeouts/alerts for systems on that revision. That said, I'm not sure we've had problems with all of these, so perhaps the issue is slightly more complex than just the firmware version.

Host                                   Version
-------------------------------------------------
cloudcephmon2002-dev.mgmt.codfw.wmnet  2.40.40.40                             
furud.mgmt.codfw.wmnet                 2.40.40.40                             
ganeti2007.mgmt.codfw.wmnet            2.40.40.40                             
ganeti2008.mgmt.codfw.wmnet            2.40.40.40                             
krb2001.mgmt.codfw.wmnet               2.40.40.40                             
kubernetes2001.mgmt.codfw.wmnet        2.40.40.40                             
kubernetes2002.mgmt.codfw.wmnet        2.40.40.40                             
kubernetes2003.mgmt.codfw.wmnet        2.40.40.40                             
ms-fe2005.mgmt.codfw.wmnet             2.40.40.40                             
ms-fe2007.mgmt.codfw.wmnet             2.40.40.40                             
ms-fe2008.mgmt.codfw.wmnet             2.40.40.40                             
mw2252.mgmt.codfw.wmnet                2.40.40.40                             
mw2253.mgmt.codfw.wmnet                2.40.40.40                             
mw2254.mgmt.codfw.wmnet                2.40.40.40                             
mw2255.mgmt.codfw.wmnet                2.40.40.40                             
mw2257.mgmt.codfw.wmnet                2.40.40.40                             
mw2258.mgmt.codfw.wmnet                2.40.40.40                             
ores2001.mgmt.codfw.wmnet              2.40.40.40                             
ores2002.mgmt.codfw.wmnet              2.40.40.40                             
ores2003.mgmt.codfw.wmnet              2.40.40.40                             
ores2004.mgmt.codfw.wmnet              2.40.40.40                             
ores2006.mgmt.codfw.wmnet              2.40.40.40                             
ores2007.mgmt.codfw.wmnet              2.40.40.40                             
ores2008.mgmt.codfw.wmnet              2.40.40.40                             
ores2009.mgmt.codfw.wmnet              2.40.40.40                             
pki2001.mgmt.codfw.wmnet               2.40.40.40                             
prometheus2003.mgmt.codfw.wmnet        2.40.40.40                             
prometheus2004.mgmt.codfw.wmnet        2.40.40.40                             
restbase2010.mgmt.codfw.wmnet          2.40.40.40                             
restbase2012.mgmt.codfw.wmnet          2.40.40.40                             
thumbor2003.mgmt.codfw.wmnet           2.40.40.40                             
thumbor2004.mgmt.codfw.wmnet           2.40.40.40

There are many more in eqiad, but we don't seem to observe the issue there as much (I'm wondering if it's just a race condition and the additional latency from monitoring to codfw is responsible?). Full list in the attached file anyway.
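
For reference, the per-host query behind such a list could look roughly like the sketch below, assuming the Redfish endpoint that iDRAC8 reportedly exposes from firmware 2.40 onwards (older revisions may just return a 404), and using placeholder credentials and host list:

# Read the iDRAC firmware version over Redfish for each host in a file;
# MGMT_USER/MGMT_PASS and r430-mgmt-hosts.txt are placeholders.
while read -r host; do
  ver=$(curl -sk -u "${MGMT_USER}:${MGMT_PASS}" --max-time 30 \
        "https://${host}/redfish/v1/Managers/iDRAC.Embedded.1" \
        | jq -r '.FirmwareVersion // "no Redfish / old firmware"')
  printf '%-40s %s\n' "$host" "${ver:-timeout}"
done < r430-mgmt-hosts.txt

Timing each call (for example with curl's %{time_total}) would also give a rough proxy for the slow-response behaviour mentioned above.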

> There are many more in eqiad, but we don't seem to observe the issue there as much

Yes, there are some, but not as many as in codfw; at first it looked codfw-only, though it isn't. They are fewer and not as frequent.

Right now I do see these examples though:

contint1001.mgmt
kubernetes1003.mgmt

So the ones alerting in eqiad are one case of 2.30.30.30 and one case of "404: API Endpoint Not Found". I guess let's just get rid of all 2.30.* cases first.

@Dzahn I need mw2253 and contint2001 down for me to reset the iDRAC before upgrading.

Thanks.
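
For context, the reset itself is normally a racadm operation against the controller; a hypothetical example (the host is taken from this ticket, and the exact racreset options can vary by iDRAC generation):

# Restart the management controller on mw2253, then check it comes back.
ssh root@mw2253.mgmt.codfw.wmnet 'racadm racreset soft'
sleep 180   # the iDRAC can take a few minutes to return
ping -c 5 mw2253.mgmt.codfw.wmnet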

Mentioned in SAL (#wikimedia-operations) [2021-10-25T14:49:06Z] <mutante> depooling mw2253 for DRAC upgrade (T283582)

@Papaul mw2253 is not a problem. done. it's shut down and downtimed.

contint2001 we have to coordinate with @hashar though

@Papaul Let's go ahead with mw2253.

For contint2001, please consider it stalled and do NOT take it down; it is currently the primary CI host. I made T294271 for that.

Papaul updated the task description.

@Dzahn mw2253 done

@Papaul. Thank you!

mw2253:

  • scap pulled
  • confirmed icinga green
  • repooled to production

@Papaul Afraid this is a long story. I just saw mw2255.mgmt alerting in Icinga.

Mentioned in SAL (#wikimedia-operations) [2021-10-25T19:47:04Z] <mutante> mw2255 - depooled=inactive (incl "dsh groups"), shut down physically for T283582 - can be worked on anytime

Mentioned in SAL (#wikimedia-operations) [2021-10-27T20:47:43Z] <mutante> mw2255 - scap pull, repooling - after DRAC firmware was upgraded - T283582

Thanks @Papaul ! it's back in service now

I am not sure what is next exactly in this ticket. Currently I see no such alerts in Icinga but that can always change of course.

I also see no more 2.3x in codfw; eqiad upgrades should be treated separately.

So while we have a bunch of 2.4x left now, I am not sure if the goal is to upgrade them ALL now, or if we just go "on demand": keep this ticket open for a while, watch it, and add affected hosts when/if they pop up. It's a bit hard to tell when exactly we should close it.

P.S. except that the contint2001 subtask is still open, yep.

@Dzahn thank you. I think it is best to just close this task and go "on demand", since most of those servers were purchased in 2016 and their refresh is next year.

Dzahn claimed this task.

I agree and boldly resolve it, expecting to reopen / add servers if I see them pop up in Icinga. Thanks for all the work on it. Icinga is less noisy already for sure.

I've seen this alert pop up a few times in the last few days, is it related?

[09:16:02] <+icinga-wm> PROBLEM - SSH on thumbor1001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook