db2034 host crashed; mgmt interface unavailable (needs reset and hw check)
Closed, Resolved, Public

Description

db2034 crashed in the middle of cloning its MySQL data to another host.

It does not respond to SSH, ping, or salt, and the mgmt interface is also unavailable. It needs a hard reset, and we should keep an eye out for any hardware issue that may have caused this.

(could it be related to T109282?)

Event Timeline

jcrespo raised the priority of this task from to Needs Triage.
jcrespo updated the task description.
jcrespo added a project: ops-codfw.
jcrespo subscribed.
Restricted Application added subscribers: StudiesWorld, Aklapper.
Cmjohnson subscribed.

Papaul,

Could you please troubleshoot this before you leave? Thanks.

I checked the server and it was completely off. The server's power and the iLO configuration were still in place. I couldn't ssh@localIP, but I can access the server through the web GUI at https://LocalIP with no problem. I can also ping the localIP from the local network and from bast2002 with no problem, but I cannot ssh@localIP or ssh@db2034.

Resetting the iLO and unplugging the server from power for a few minutes also didn't fix the SSH problem.
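The checks described above (web GUI reachable, SSH not) could be scripted roughly as follows. This is only a sketch: the function names `tcp_port_open` and `summarize` are hypothetical, and the port numbers 22 and 443 are the usual defaults for SSH and HTTPS, assumed here rather than taken from this task. It only tests whether a port accepts TCP connections; a session can still be dropped after connecting, so an open port 22 does not prove that SSH login works.

```python
import socket


def tcp_port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the name and connects; any failure
        # (refused, timed out, DNS error) is raised as an OSError subclass.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def summarize(host, ports, timeout=3.0):
    """Probe each named port on host and return {name: reachable}."""
    return {name: tcp_port_open(host, port, timeout) for name, port in ports.items()}
```

Calling `summarize("db2034.mgmt.codfw.wmnet", {"ssh": 22, "https": 443})` would then show at a glance whether only one of the two services is affected on the mgmt interface.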

The issue looks like a network/board problem, right?

As far as I am concerned, this is resolved, unless, @Papaul, you want to add anything strange that you found and that may be the cause of the issue. I will keep an eye on this server in the future. I can also do a stress test (the CPU was under high stress at the time of the poweroff).

Sorry, @Papaul, I misunderstood you. SSH is actually broken, but on the mgmt interface.

We can check it when you are back; at least the server is up and running.

Papaul also mentioned a potential RAID degradation; I will check it.

@Papaul

ssh root@db2034.mgmt.codfw.wmnet
Received disconnect from 10.193.1.84: 11: Client Disconnect

^ can you do a hard reset of the DRAC ?

@Dzahn I already did a hard reset, but it didn't work; same problem.

I found out that the same SSH problem seen on db2034 also affects 8 other boxes (db2035 to db2042). I discussed this with Jynus on IRC; he mentioned that he had noticed it on only 2 boxes.

I logged in to the web interface on two different servers, one with working SSH and one without. The one with working SSH had iLO firmware version 2.03, and the one without had iLO firmware version 1.5. For db2034, I downloaded iLO firmware version 2.30. After upgrading the iLO firmware on db2034, SSH is now working. I am going to upgrade the other db servers that have the same problem.
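The version comparison above could be sketched like this when many hosts need checking. It is an illustration only: the helper names, the `host-a`-style inventory keys, and the choice of 2.03 as the minimum are assumptions. It also assumes the version strings are two-part numbers that compare like decimals, so "1.5" is treated the same as "1.50".

```python
from decimal import Decimal


def needs_upgrade(current, minimum):
    """Return True if the current firmware version is below the minimum."""
    # Assumption: versions such as "1.5", "2.03", "2.30" order like decimals.
    return Decimal(current.strip()) < Decimal(minimum.strip())


def hosts_to_upgrade(inventory, minimum):
    """Return the sorted host names whose firmware is below the minimum."""
    return sorted(host for host, version in inventory.items()
                  if needs_upgrade(version, minimum))


# Hypothetical inventory; the host names are placeholders, not real hosts.
inventory = {"host-a": "1.5", "host-b": "2.03", "host-c": "2.30"}
outdated = hosts_to_upgrade(inventory, "2.03")
```

How the versions get collected (web interface, an inventory export, or something else) is left open here.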

@chris, please see the link below to download the iLO firmware if you have the same problem.
http://h20566.www2.hpe.com/hpsc/swd/public/detail?sp4ts.oid=5194969&swItemId=MTX_d154c6611a304c468f6a13085e&swEnvOid=4168#tab3

This comment was removed by Dzahn.

@Papaul cool! Nice work. Confirmed SSH works again. Disregard my former comment; I had the tab open from earlier and had not seen the latest updates.

hit resolved?

I updated the iLO firmware from 1.5 to 2.30 on db20[3-4][0-9]. SSH is now working on those systems. I am closing this task.
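For reference, the db20[3-4][0-9] shorthand can be expanded into individual host names with a small helper like the one below. This is a sketch: `expand_host_pattern` and `mgmt_fqdn` are hypothetical names, only single-digit `[x-y]` ranges are handled, and the `mgmt.codfw.wmnet` suffix is assumed to apply to every host because it appears in the db2034 SSH attempt earlier in this task.

```python
import itertools
import re


def expand_host_pattern(pattern):
    """Expand single-digit bracket ranges, e.g. 'db20[3-4][0-9]', into names."""
    # Split into literal text and [x-y] groups, keeping the groups.
    parts = re.split(r"(\[\d-\d\])", pattern)
    choices = []
    for part in parts:
        match = re.fullmatch(r"\[(\d)-(\d)\]", part)
        if match:
            low, high = int(match.group(1)), int(match.group(2))
            choices.append([str(digit) for digit in range(low, high + 1)])
        else:
            choices.append([part])  # literal text (possibly empty)
    return ["".join(combo) for combo in itertools.product(*choices)]


def mgmt_fqdn(host, suffix="mgmt.codfw.wmnet"):
    """Build the management-interface name for a host (assumed naming scheme)."""
    return f"{host}.{suffix}"


hosts = expand_host_pattern("db20[3-4][0-9]")
```

The pattern expands to the twenty names db2030 through db2049, which can then be passed through `mgmt_fqdn` to re-check SSH on each management interface.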

Thanks