
DRAC firmware upgrades codfw (was: Flapping codfw management alarm ( contint2001.mgmt/SSH is CRITICAL ))
Closed, Resolved · Public

Description

I recently started receiving notification alarms from Icinga for contint2001.mgmt/SSH:

Notification Type: PROBLEM

Service: SSH
Host: contint2001.mgmt
Address: 10.193.2.250
State: CRITICAL

That seems rather recent:

contint2001-mgmt-ssh.png (300×900 px, 32 KB)

https://icinga.wikimedia.org/cgi-bin/icinga/histogram.cgi?host=contint2001.mgmt&service=SSH

Kunal stated that it might be the codfw network flapping somehow but we could not find a related task.


Status  Host               Version  New version  BIOS version  New BIOS version  Comments
[x]     ores2005.mgmt      2.40     2.81
[x]     gerrit2001.mgmt    2.21     2.81
[x]     ms-fe2006.mgmt     2.40     2.81
[x]     wdqs2001.mgmt      2.30     2.81
[x]     wdqs2002.mgmt      2.30     2.81
[ ]     logstash2021.mgmt  offline
[ ]     logstash2022.mgmt  offline
[x]     contint2001.mgmt   2.21     2.81         2.3.4         2.12              Reset iDRAC
[x]     mw2253.mgmt        2.40     2.81         2.3.4         2.12              Reset iDRAC
[x]     mw2255.mgmt        2.40     2.81         2.3.4         2.13              Reset iDRAC

Event Timeline


I went to fetch the IRC logs from https://wm-bot.wmflabs.org/libera_logs/%23wikimedia-operations/ , which start on May 22nd. For codfw hosts (assuming they match the pattern 2...\.mgmt):

grep -o '.*icinga.*PROBLEM.*2...\.mgmt is CRITICAL' * | grep -o 'on.*mgmt' | sort | uniq -c | sort -n
      1 on ores2005.mgmt
      3 on gerrit2001.mgmt
      5 on ms-fe2006.mgmt
     24 on wdqs2001.mgmt
     24 on wdqs2002.mgmt
     30 on logstash2021.mgmt
     52 on contint2001.mgmt

wdqs2002 and ms-fe2006 are PowerEdge R430 as well, with dates indicating 2016.

Non-codfw hosts, excluding those that had only one alarm:

 2 on mw1273.mgmt
 2 on wdqs1006.mgmt
 3 on mw1305.mgmt
 4 on mw1297.mgmt
13 on mw1279.mgmt
14 on mw1303.mgmt
24 on bast5001.mgmt
46 on cp5005.mgmt
46 on mw1284.mgmt
56 on analytics1069.mgmt

analytics1069 is a PowerEdge R730xd

wdqs1006, mw1303, mw1305, bast5001, cp5005 are PowerEdge R430

mw1279 and mw1284 I could not find; they may have been decommissioned.

We are seeing this issue because all those hosts are running an old firmware version for the IDRAC. Upgrading the IDRAC on some of those servers in the past did fix the problem. It is not a management switch issue.

@Papaul Could we schedule a firmware upgrade for gerrit2001 due to this issue? (not high prio)

@cmooney Thank you very much for all the debugging effort you put into this and thanks @Papaul for confirming it is indeed an issue of firmware upgrades.

Cathal, would you still like to keep this ticket assigned to you? Papaul, should we turn this into a tracking ticket for firmware upgrades with checkboxes of affected hosts?

@Dzahn I will go for turning this into a tracking ticket for firmware upgrades with check boxes of affected hosts.

Sorry @Dzahn I should have updated it before now. Makes sense to re-assign to DC-Ops I think.

@Papaul I think we can start with these hosts:

ores2005.mgmt
gerrit2001.mgmt
ms-fe2006.mgmt
wdqs2001.mgmt
wdqs2002.mgmt
logstash2021.mgmt
logstash2022.mgmt
contint2001.mgmt

What might be good is to first confirm the iDRAC firmware version(s) on these. Then possibly we can run a query to identify a full list of devices with that version?
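A quick way to spot-check a single host first could be the iDRAC's own SSH interface (a minimal sketch, assuming the mgmt interface answers and root credentials are at hand; the hostname is just an example from the list above):

# Print the iDRAC firmware and BIOS versions via the iDRAC's SSH shell.
ssh root@gerrit2001.mgmt.codfw.wmnet racadm getversion

On these Dell generations, racadm getversion reports both the iDRAC firmware and the BIOS version, i.e. the two columns tracked in the description table.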

Is the best thing to do now to edit the task subject/description, turning it into the tracking task as suggested?

Dzahn renamed this task from Flapping codfw management alarm ( contint2001.mgmt/SSH is CRITICAL ) to DRAC firmware upgrades codfw (was: Flapping codfw management alarm ( contint2001.mgmt/SSH is CRITICAL )).Oct 19 2021, 4:29 PM
Dzahn updated the task description.

Affected hosts I am ACKing right now in Icinga:

contint2001.mgmt
ms-fe2006.mgmt
mw2253.mgmt

@Papaul Could you maybe get the current versions for the ones I put in the ticket description now?

@cmooney No worries at all. All sounds good to me. I did indeed start editing the ticket name and description, and added some checkboxes and a version table.

ores2005.mgmt      PER430
gerrit2001.mgmt    PER430
ms-fe2006.mgmt     PER430
wdqs2001.mgmt      PER430
wdqs2002.mgmt      PER430
logstash2021.mgmt  Server is offline
logstash2022.mgmt  Server is offline
contint2001.mgmt   PER430

@Papaul @Dzahn I had a go at enumerating the iDRAC firmware version on our Dell R430s. Some of the connections timed out, but for those that didn't, it resulted in the following list of codfw hosts using firmware 2.4 and below (matching the pattern in the affected hosts).

Even with the API calls I could observe that the more recent versions responded quickly, whereas the older ones took ages. Not very scientific, but it does seem to confirm that the older version has issues, even if we don't always see the ping timeouts/alerts for systems on that revision. That said, I'm not sure we've had problems with all of these, so perhaps the issue is slightly more complex than just the firmware version.
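For reference, a hedged sketch of how such an enumeration could be scripted (assuming Redfish is what was queried, that hosts.txt holds the mgmt FQDNs, and that the password is in $IDRAC_PASS; the actual method used may have differed). The resulting list follows below:

# Query each iDRAC's Redfish manager resource for its firmware version.
# Hosts that time out or lack Redfish print "no answer".
while read -r host; do
  version=$(curl -sk --max-time 10 -u "root:${IDRAC_PASS}" \
    "https://${host}/redfish/v1/Managers/iDRAC.Embedded.1" \
    | jq -r '.FirmwareVersion? // empty')
  printf '%-38s %s\n' "$host" "${version:-no answer}"
done < hosts.txt

Redfish support on iDRAC8 arrived around firmware 2.40.40.40, which may also explain the "404: API Endpoint Not Found" response mentioned for one eqiad host further down.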

Host                                   Version
-------------------------------------------------
cloudcephmon2002-dev.mgmt.codfw.wmnet  2.40.40.40                             
furud.mgmt.codfw.wmnet                 2.40.40.40                             
ganeti2007.mgmt.codfw.wmnet            2.40.40.40                             
ganeti2008.mgmt.codfw.wmnet            2.40.40.40                             
krb2001.mgmt.codfw.wmnet               2.40.40.40                             
kubernetes2001.mgmt.codfw.wmnet        2.40.40.40                             
kubernetes2002.mgmt.codfw.wmnet        2.40.40.40                             
kubernetes2003.mgmt.codfw.wmnet        2.40.40.40                             
ms-fe2005.mgmt.codfw.wmnet             2.40.40.40                             
ms-fe2007.mgmt.codfw.wmnet             2.40.40.40                             
ms-fe2008.mgmt.codfw.wmnet             2.40.40.40                             
mw2252.mgmt.codfw.wmnet                2.40.40.40                             
mw2253.mgmt.codfw.wmnet                2.40.40.40                             
mw2254.mgmt.codfw.wmnet                2.40.40.40                             
mw2255.mgmt.codfw.wmnet                2.40.40.40                             
mw2257.mgmt.codfw.wmnet                2.40.40.40                             
mw2258.mgmt.codfw.wmnet                2.40.40.40                             
ores2001.mgmt.codfw.wmnet              2.40.40.40                             
ores2002.mgmt.codfw.wmnet              2.40.40.40                             
ores2003.mgmt.codfw.wmnet              2.40.40.40                             
ores2004.mgmt.codfw.wmnet              2.40.40.40                             
ores2006.mgmt.codfw.wmnet              2.40.40.40                             
ores2007.mgmt.codfw.wmnet              2.40.40.40                             
ores2008.mgmt.codfw.wmnet              2.40.40.40                             
ores2009.mgmt.codfw.wmnet              2.40.40.40                             
pki2001.mgmt.codfw.wmnet               2.40.40.40                             
prometheus2003.mgmt.codfw.wmnet        2.40.40.40                             
prometheus2004.mgmt.codfw.wmnet        2.40.40.40                             
restbase2010.mgmt.codfw.wmnet          2.40.40.40                             
restbase2012.mgmt.codfw.wmnet          2.40.40.40                             
thumbor2003.mgmt.codfw.wmnet           2.40.40.40                             
thumbor2004.mgmt.codfw.wmnet           2.40.40.40

There are many more in eqiad, but we don't seem to observe the issue there as much (I'm wondering if it's just a race condition and the additional latency from monitoring to codfw is responsible?). Full list in the attached file anyway.

> There are many more in eqiad, but we don't seem to observe the issue there as much

Yes, there are some, but not as many as in codfw; at first it even looked codfw-only, though it is not. Just fewer hosts, alerting less often.

Right now I do see these examples though:

contint1001.mgmt
kubernetes1003.mgmt

So the ones alerting in eqiad are one case of 2.30.30.30 and one case of "404: API Endpoint Not Found". I guess let's just get rid of all 2.30.* cases first.

@Dzahn I need mw2253 and contint2001 down for me to reset the IDRAC before upgrading.

Thanks.

Mentioned in SAL (#wikimedia-operations) [2021-10-25T14:49:06Z] <mutante> depooling mw2253 for DRAC upgrade (T283582)

@Papaul mw2253 is not a problem. Done; it's shut down and downtimed.

For contint2001 we have to coordinate with @hashar though.

@Papaul Let's go ahead with mw2253.

For contint2001, please consider it stalled and do NOT take it down; it is currently the main CI host. I made T294271 for that.

Papaul updated the task description.

@Dzahn mw2253 done

@Papaul. Thank you!

mw2253:

  • scap pulled
  • confirmed icinga green
  • repooled to production

@Papaul Afraid this is a long story: I just saw mw2255.mgmt alerting in Icinga.

Mentioned in SAL (#wikimedia-operations) [2021-10-25T19:47:04Z] <mutante> mw2255 - depooled=inactive (incl "dsh groups"), shut down physically for T283582 - can be worked on anytime

Mentioned in SAL (#wikimedia-operations) [2021-10-27T20:47:43Z] <mutante> mw2255 - scap pull, repooling - after DRAC firmware was upgraded - T283582

Thanks @Papaul! It's back in service now.

I am not sure what exactly is next in this ticket. Currently I see no such alerts in Icinga, but that can always change of course.

I also see no more 2.3x in codfw; eqiad upgrades should be treated separately.

So while we have a bunch of 2.4x left now, I am not sure whether the goal is to upgrade them ALL now, or whether we just go "on demand": keep this ticket open for a while, watch it, and add affected hosts when/if they pop up. It's a bit hard to tell when exactly we should close it.

P.S. Except that the contint2001 subtask is still open, yep.

@Dzahn thank you. I think it is best to just close this task and go "on demand", since most of those servers were purchased in 2016 and their refresh is next year.

Dzahn claimed this task.

I agree and boldly resolve it, expecting to reopen / add servers if I see them pop up in Icinga. Thanks for all the work on it. Icinga is less noisy already for sure.

I've seen this alert pop up a few times in the last few days, is it related?

[09:16:02] <+icinga-wm> PROBLEM - SSH on thumbor1001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
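For what it's worth, the check behind that alert can be reproduced by hand with the standard monitoring plugin (plugin path as on a typical Debian install, and the full mgmt FQDN assumed; adjust both as needed):

# Same probe Icinga runs: connect to the SSH banner with a 10-second timeout.
/usr/lib/nagios/plugins/check_ssh -t 10 thumbor1001.mgmt.eqiad.wmnet

A healthy iDRAC answers well within the timeout; the affected ones intermittently don't answer at all, which is what produces the "Socket timeout after 10 seconds" CRITICAL.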

one case of mw2252.mgmt right now

@AntiCompositeNumber Not for sure, but it seems likely that it could also be fixed with firmware upgrades, yes. That being said, it still does not technically belong in this ticket, because these are supposed to be "per data center" and this one only covers the hosts in codfw, while the one you report is in eqiad. So if there are more of these, then technically it should be a new ticket like this one, but for ops-eqiad. Before we do that though, we need to weigh the effort of fixing it on old hardware against the actual benefit we get from it. There is also the option to ignore it and permanently downtime it. I would say it depends on how often it happens and how many hosts are affected.

I no longer receive alarms from contint2001.mgmt, which was the purpose of this task. However, looking it up in Icinga now returns Error: Host Not Found!! May we get the management host check back?

Has contint2001.mgmt DRAC been updated?

contint1001.mgmt started alerting a few weeks ago; I got four alarms over the course of the night: https://icinga.wikimedia.org/cgi-bin/icinga/history.cgi?host=contint1001.mgmt&service=SSH . So I guess it will need a DRAC update as well.

db2083 and db2086 were affected today

For the record: I have absolutely no idea why contint2001.mgmt disappeared from Icinga.

Dzahn removed Dzahn as the assignee of this task.Jan 7 2022, 7:32 PM

@Papaul Do you know about contint2001.mgmt status?

Dzahn renamed this task from DRAC firmware upgrades codfw (was: Flapping codfw management alarm ( contint2001.mgmt/SSH is CRITICAL )) to contint2001.mgmt disappeared from Icinga (was: DRAC firmware upgrades codfw (was: Flapping codfw management alarm ( contint2001.mgmt/SSH is CRITICAL ))).Jan 7 2022, 7:46 PM
Dzahn added a project: observability.
hashar renamed this task from contint2001.mgmt disappeared from Icinga (was: DRAC firmware upgrades codfw (was: Flapping codfw management alarm ( contint2001.mgmt/SSH is CRITICAL ))) to DRAC firmware upgrades codfw (was: Flapping codfw management alarm ( contint2001.mgmt/SSH is CRITICAL ))).Jan 10 2022, 8:29 AM
hashar removed a project: observability.

> For the record: I have absolutely no idea why contint2001.mgmt disappeared from Icinga.

I have spun off a standalone task for that, T298861; it is probably related to some refactoring of the monitoring Puppet classes.

The reason I have reopened this task is to check contint1001 and see whether its iDRAC has been updated; if not, we would have to upgrade it as well to prevent the alarm spam ;)

ACK, understood @hashar, then this goes back to ops-codfw

@Papaul it seems that "broken DRAC" is actually the reason for both things, the alerts here and the host disappearing entirely from Icinga (per T298861#7608653). Could you schedule downtime with @hashar?

@Papaul wrote:

> The iDRAC on this server needs a reset. Please coordinate a day and time that is best for this server to be taken offline.

The machine hosts the CI services and interruptions are quite disruptive to our developers and for deployments. Ideally we should avoid conflicts with the scheduled windows at https://wikitech.wikimedia.org/wiki/Deployments

Anytime in your morning (which is my late afternoon) will be fine; my calendar (amusso@wikimedia.org) should be up to date.

@hashar since Monday is a holiday, let us do this on the 18th at 10am CT. Thanks

@Papaul ack, I have sent the following announcement to ops-l and wikitech-l.

The continuous integration server contint2001 will be restarted for
hardware maintenance on Tuesday January 18th at 16:00 UTC. During the
maintenance, the CI systems will be unavailable:

- Jenkins
- Zuul
- https://integration.wikimedia.org/

The out-of-band management system requires an update to address 
intermittent loss of connectivity.  We have to restart the server.


Time conversions:

PST  8:00
CT  10:00
UTC 16:00
CET 17:00

And I have added it to the deployments calendar https://wikitech.wikimedia.org/wiki/Deployments#Tuesday%2C_January_18

@hashar let me know when this is offline so I can take over

Mentioned in SAL (#wikimedia-operations) [2022-01-18T15:59:52Z] <hashar> Shutting down CI for maintenance on contint2001 # T283582

@Papaul the machine is shutting down. I am on IRC if you want to sync up.

Reset iDRAC, upgraded BIOS and iDRAC.

hashar assigned this task to Papaul.

I have restarted ferm.

Zuul/Jenkins seems to behave properly. Thank you @Papaul for the upgrade!

@hashar no problem you can close the task once all is back online.

CI had to be restarted after the machine came back up, due to some oddities. The system is fully back up now. Thank you @Papaul!