elastic2020 is powered off and does not want to restart
Open, NormalPublic

Description

elastic2020.codfw.wmnet is marked as down in icinga. Using the management console, I can confirm that the server is powered off (see log below). power on does not seem to work; the server still reports being powered off.

@Papaul I think this will require your expert hands.

gehel@durin:~$ ssh root@elastic2020.mgmt.codfw.wmnet
root@elastic2020.mgmt.codfw.wmnet's password: 
User:root logged-in to ILOMXQ526080P.dasher.com(10.193.2.217 / FE80::EEB1:D7FF:FE78:2BBC)

iLO 4 Advanced 2.20 at  May 20 2015
Server Name: 
Server Power: Off

Based on customer feedback, we will be enhancing the SSH command line
interface in a future release of the iLO 4 firmware.  Our future CLI will
focus on increased usability and improved functionality.  This message is
to provide advance notice of the coming change.  Please see the iLO 4 
Release Notes on www.hp.com/go/iLO for additional information.


</>hpiLO-> power off hard

status=2
status_tag=COMMAND PROCESSING FAILED
error_tag=COMMAND ERROR-UNSPECIFIED
Mon Oct 24 12:09:46 2016

Server power already off.




</>hpiLO-> power reset

status=2
status_tag=COMMAND PROCESSING FAILED
error_tag=COMMAND ERROR-UNSPECIFIED
Mon Oct 24 12:09:54 2016

Server power off.




</>hpiLO-> power on

status=0
status_tag=COMMAND COMPLETED
Mon Oct 24 12:10:03 2016



Server powering on .......



</>hpiLO-> vsp

Virtual Serial Port Active: COM2
 The server is not powered on.  The Virtual Serial Port is not available.

Starting virtual serial port.
Press 'ESC (' to return to the CLI Session.
Gehel created this task.Oct 24 2016, 8:16 PM
Restricted Application added projects: Operations, Discovery-Search.Oct 24 2016, 8:16 PM
Restricted Application added subscribers: Southparkfan, Aklapper.
Gehel added a comment.Oct 24 2016, 8:22 PM

Note that the load on the codfw cluster was fairly high at the time of the failure, due to the fresh deployment on BM25 (T147508).

RobH added a subscriber: RobH.Oct 24 2016, 8:40 PM

I'd suggest fully removing all power from the system (pulling the power cords) and allowing a full power loss event. This will soft reset the lights out manager and may resolve the issue. Then see if power on/off/reset works via remote commands.

If that doesn't work, then proceed with further troubleshooting.

Papaul triaged this task as "Normal" priority.Oct 25 2016, 2:32 PM
Papaul claimed this task.
Papaul closed this task as "Resolved".Oct 25 2016, 3:34 PM

System is back up on-line.

dcausse reopened this task as "Open".Dec 12 2016, 3:53 PM
dcausse added subscribers: akosiaris, dcausse.

Reopening: this host went down today, a few hours after we switched all search traffic to codfw.
@akosiaris tried to power it up, in vain.
It's very suspicious that this host went down again; the first time was also just after traffic switched over to codfw.

Reopening

The server is exhibiting the exact same symptoms. It reports it was powered off by power removal:

</>hpiLO-> show map1/log1/record286

status=0
status_tag=COMMAND COMPLETED
Mon Dec 12 07:47:34 2016



/map1/log1/record286
  Targets
  Properties
    number=286
    severity=Informational
    date=12/12/2016
    time=07:38
    description=Server power removed.

and will not power on again remotely.

@Papaul, any memories on how you fixed it last time ?

@akosiaris I just removed the PSUs for a couple of minutes and plugged them back in. The server is back up, but I am working with HP now to investigate the issue. I will update the task once I have any news.

I contacted HP. According to them, the log file I sent shows no hardware failure and only 1 power supply; a possible reason is that the system is running an outdated iLO version, which is why the log is not accurate. Their suggestion is to update the whole system with the SP2 disks and upload the new log to them once again.

The tech will call me tomorrow at 10:30 am to follow up once he gets the new log, and possibly schedule an on-site tech to check on the issue.

@akosiaris can you please set up a maintenance window for this server tomorrow, Dec 13, between 9:30am and 11am?
Thanks

Dear Mr Papaul Tshibamba,

Thank you for contacting Hewlett Packard Enterprise for your service request. This email confirms your request for service and the details are below.

Your request is being worked on under reference number 5315671772
Status: Case is generated and in Progress

Product description: HP ProLiant DL360 Gen9 E5-2640v3 2.6GHz 8-core 2P 16GB-R P440ar 8 SFF 500W RPS Server/S-Buy
Product number: 780019-S01
Serial number: MXQ526080P
Subject: DL360 Gen9 - Server shuts down

Yours sincerely,
Hewlett Packard Enterprise

Mentioned in SAL (#wikimedia-operations) [2016-12-13T08:40:43Z] <akosiaris> depool elastic2020, T149006

Depooled and powered off. @Papaul server is ready for maintenance.
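For context, depooling on this infrastructure is normally done with conftool. A minimal sketch, assuming the usual confctl selector syntax (verify against your local tooling before running):

```shell
# Hedged sketch: mark elastic2020 inactive in conftool so LVS stops
# sending it traffic, then read back the state to confirm.
confctl select 'name=elastic2020.codfw.wmnet' set/pooled=inactive
confctl select 'name=elastic2020.codfw.wmnet' get
```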

@akosiaris Thanks
Firmware update complete, I am waiting on HP to call me so I can provide them with the new log.

Before firmware update


After firmware update

Spent an hour with HP on the phone. The HP person I spoke to is named Chandi. They came to the conclusion that the problem was the outdated firmware version (2015), and that now that we have updated the firmware we shouldn't see this issue anymore.

@dcausse When are you guys doing the switchover again?

System is back up for now.

All search is currently served by codfw; we expect to switch it back to eqiad in the next few days (after some maintenance has finished). There will be a deployment freeze soon, but after that we can switch traffic to codfw for a few hours one day to see if it triggers the issue again.

@EBernhardson Thanks I will leave this task open for now.

elastic2020 is now repooled. Traffic is still flowing to codfw, but no large shards are allocated on elastic2020 at the moment, let's see if it stays up this time.

@Gehel Has everything gone as planned? I assume silence on this ticket is good news. :-)

Gehel added a comment.Dec 21 2016, 7:11 PM

Silence is a good thing! But traffic has left codfw again, and not long after the firmware upgrade by @Papaul.

So it works, but we have not put all that much stress on the system yet... We could close this ticket and reopen if the problem materializes again. @Deskana: your call!


That's my preference. Closing and opening tickets is cheap. Easy come, easy go! Absolutely, feel free to reopen if there are any issues.

Deskana closed this task as "Resolved".
dcausse reopened this task as "Open".Thu, Mar 16, 7:59 PM

Reopening: it happened again today under exactly the same conditions, a few minutes after a switchover.

Mentioned in SAL (#wikimedia-operations) [2017-03-16T20:06:59Z] <mutante> depooled elastic2010 since it is powered-off/down. (set/pooled=inactive) - (T149006)

Mentioned in SAL (#wikimedia-operations) [2017-03-16T20:08:35Z] <mutante> repooled elastic2010, depooled correct host elastic2020 instead (T149006)

Dzahn added a subscriber: Dzahn.Thu, Mar 16, 8:13 PM

@Papaul confirmed it has the same behaviour again. It shows as status "powered down"; you can then tell it to power on and it claims it is powering on, but if you connect to the console it still claims it "is not powered on". I guess we should repeat what you did last time ("removed the PSUs for a couple of minutes and plugged them back in"?) and contact HP about this happening again on the same hardware.
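If the iLO CLI keeps disagreeing with itself, the same power checks can be run over standard IPMI. A hedged sketch, assuming IPMI-over-LAN is enabled on this iLO and the usual .mgmt hostname:

```shell
# Hedged sketch: query and set chassis power via IPMI against the iLO.
ipmitool -I lanplus -H elastic2020.mgmt.codfw.wmnet -U root chassis power status
ipmitool -I lanplus -H elastic2020.mgmt.codfw.wmnet -U root chassis power on

# Serial-over-LAN console, to see whether the host actually POSTs:
ipmitool -I lanplus -H elastic2020.mgmt.codfw.wmnet -U root sol activate
```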

Mentioned in SAL (#wikimedia-operations) [2017-03-21T07:50:07Z] <gehel> banning elastic2020 from cluster to investigate T149006

Mentioned in SAL (#wikimedia-operations) [2017-03-21T08:43:43Z] <gehel> shutting down elasticsearch on elastic2020, investigating T149006

Gehel closed this task as "Resolved".Tue, Mar 21, 8:56 AM

Running bonnie++ as documented on T153083#2886085 to see if I/O stress has an influence on stability.

Mentioned in SAL (#wikimedia-operations) [2017-03-21T12:47:36Z] <gehel> running stress and bonnie on elastic2020 - T149006

stress is launched with stress --cpu 28 --vm 4
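Putting the two load generators together, the reproduction amounts to something like the following. The stress flags are the ones quoted above; the bonnie++ target directory, user, and a 30-minute cap are assumptions:

```shell
# Hedged sketch of the combined CPU + memory + I/O load test.
stress --cpu 28 --vm 4 --timeout 1800s &  # 28 CPU workers, 4 memory workers
bonnie++ -d /var/tmp -u root &            # I/O benchmark (dir/user assumed)
wait
```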

Gehel reopened this task as "Open".Tue, Mar 21, 3:18 PM

I resolved this by mistake, re-opening.

Gehel added a comment.Tue, Mar 21, 3:26 PM

After ~25 minutes of stress + bonnie, elastic2020 crashed again. That seems to indicate a systematic issue. The test can be seen on grafana: it started at ~12:45 UTC and the server crashed at ~13:10 UTC.

@Papaul now that we mostly have a way to reproduce the issue, what can we do about it?

Gehel added a comment.Tue, Mar 21, 3:29 PM

elastic2020 is banned from the elasticsearch cluster and has a 1-month downtime in icinga. Let's figure out what we can do with it before re-enabling icinga.
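Banning a node from an Elasticsearch cluster is typically done with shard-allocation filtering, so shards drain off the host while it stays reachable for testing. A hedged sketch; the endpoint, port, and choice of the _host attribute are assumptions:

```shell
# Hedged sketch: exclude elastic2020 from shard allocation cluster-wide.
curl -XPUT 'http://localhost:9200/_cluster/settings' -d '{
  "transient": {
    "cluster.routing.allocation.exclude._host": "elastic2020.codfw.wmnet"
  }
}'
```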

Gehel added a comment.Wed, Mar 22, 8:59 AM

Investigation will continue with @Papaul and @Gehel on Thursday, March 23, at 4pm CET (8am PT).