elastic2020 is powered off and will not restart
Open, Normal, Public

Description

elastic2020.codfw.wmnet is marked as down in Icinga. Using the management console, I can see that the server is powered off (see log below). power on does not seem to work; the server still reports being powered off.

@Papaul I think this will require your expert hands.

gehel@durin:~$ ssh root@elastic2020.mgmt.codfw.wmnet
root@elastic2020.mgmt.codfw.wmnet's password: 
User:root logged-in to ILOMXQ526080P.dasher.com(10.193.2.217 / FE80::EEB1:D7FF:FE78:2BBC)

iLO 4 Advanced 2.20 at  May 20 2015
Server Name: 
Server Power: Off

Based on customer feedback, we will be enhancing the SSH command line
interface in a future release of the iLO 4 firmware.  Our future CLI will
focus on increased usability and improved functionality.  This message is
to provide advance notice of the coming change.  Please see the iLO 4 
Release Notes on www.hp.com/go/iLO for additional information.


</>hpiLO-> power off hard

status=2
status_tag=COMMAND PROCESSING FAILED
error_tag=COMMAND ERROR-UNSPECIFIED
Mon Oct 24 12:09:46 2016

Server power already off.




</>hpiLO-> power reset

status=2
status_tag=COMMAND PROCESSING FAILED
error_tag=COMMAND ERROR-UNSPECIFIED
Mon Oct 24 12:09:54 2016

Server power off.




</>hpiLO-> power on

status=0
status_tag=COMMAND COMPLETED
Mon Oct 24 12:10:03 2016



Server powering on .......



</>hpiLO-> vsp

Virtual Serial Port Active: COM2
 The server is not powered on.  The Virtual Serial Port is not available.

Starting virtual serial port.
Press 'ESC (' to return to the CLI Session.
Papaul triaged this task as "Normal" priority. Oct 25 2016, 2:32 PM
Papaul claimed this task.
Papaul closed this task as "Resolved". Oct 25 2016, 3:34 PM

System is back up on-line.

dcausse reopened this task as "Open". Dec 12 2016, 3:53 PM
dcausse added subscribers: akosiaris, dcausse.

Reopening: this host went down today, a few hours after we switched all search traffic to codfw.
@akosiaris tried to power it up, without success.
It's very suspicious that this host went down again; the first time was also just after the traffic switchover to codfw.

Reopening

The server is exhibiting the exact same symptoms. It reports it was powered off by power removal:

</>hpiLO-> show map1/log1/record286

status=0
status_tag=COMMAND COMPLETED
Mon Dec 12 07:47:34 2016



/map1/log1/record286
  Targets
  Properties
    number=286
    severity=Informational
    date=12/12/2016
    time=07:38
    description=Server power removed.

and will not power on again remotely.

@Papaul, any memories of how you fixed it last time?

@akosiaris I just removed the PSUs for a couple of minutes and plugged them back in. The server is back up, but I am working with HP now to investigate the issue. I will update the task once I have any news.

I contacted HP. According to them, the log file I sent shows no hardware failure and only one power supply; a possible reason is that the system is running an outdated iLO version, which is why the log is not accurate. Their suggestion is to update the whole system with the SP2 disks and upload the new log to them once again.

The tech will call me tomorrow at 10:30am to follow up once he gets the new log, and possibly schedule an on-site tech to check on the issue.

@akosiaris can you please set up a maintenance window for this server tomorrow, Dec 13, between 9:30am and 11am?
Thanks

Dear Mr Papaul Tshibamba,

Thank you for contacting Hewlett Packard Enterprise for your service request. This email confirms your request for service and the details are below.

Your request is being worked on under reference number 5315671772
Status: Case is generated and in Progress

Product description: HP ProLiant DL360 Gen9 E5-2640v3 2.6GHz 8-core 2P 16GB-R P440ar 8 SFF 500W RPS Server/S-Buy
Product number: 780019-S01
Serial number: MXQ526080P
Subject: DL360 Gen9 - Server shuts down

Yours sincerely,
Hewlett Packard Enterprise

Mentioned in SAL (#wikimedia-operations) [2016-12-13T08:40:43Z] <akosiaris> depool elastic2020, T149006

Depooled and powered off. @Papaul server is ready for maintenance.

@akosiaris Thanks
Firmware update complete; I am waiting on HP to call me so I can provide them with the new log.

Before firmware update


After firmware update

Spent an hour with HP on the phone. The HP person I spoke to is named Chandi. They came to the conclusion that since the system was running an outdated firmware version (from 2015), now that we have updated the firmware we shouldn't have this issue anymore.

@dcausse When are you guys doing the switchover again?

System is back up for now.

All search is currently served by codfw; we are expecting to switch it back to eqiad in the next few days (after some maintenance has finished). There will be a deployment freeze soon, but after that we can switch traffic to codfw for a few hours one day to see if it triggers the issue again.

@EBernhardson Thanks, I will leave this task open for now.

elastic2020 is now repooled. Traffic is still flowing to codfw, but no large shards are allocated on elastic2020 at the moment, let's see if it stays up this time.

@Gehel Has everything gone as planned? I assume silence on this ticket is good news. :-)

Gehel added a comment. Dec 21 2016, 7:11 PM

Silence is a good thing! But traffic has left codfw again, and not long after the firmware upgrade by @Papaul.

So it works, but we have not put all that much stress on the system yet... We could close this ticket and reopen if the problem materializes again. @Deskana: your call!

That's my preference. Closing and opening tickets is cheap. Easy come, easy go! Absolutely, feel free to reopen if there are any issues.

Deskana closed this task as "Resolved".
dcausse reopened this task as "Open". Mar 16 2017, 7:59 PM

Reopening: it happened again today under exactly the same conditions, a few minutes after a switchover.

Mentioned in SAL (#wikimedia-operations) [2017-03-16T20:06:59Z] <mutante> depooled elastic2010 since it is powered-off/down. (set/pooled=inactive) - (T149006)

Mentioned in SAL (#wikimedia-operations) [2017-03-16T20:08:35Z] <mutante> repooled elastic2010, depooled correct host elastic2020 instead (T149006)

Dzahn added a subscriber: Dzahn. Mar 16 2017, 8:13 PM

@Papaul confirmed it has the same behaviour again. It shows as status "powered down"; you can then tell it to power on and it claims it is powering on... but if you connect to the console it still claims it "is not powered on". I guess we should repeat what you did last time ("removed the PSUs for a couple of minutes and plugged them back in"?) and contact HP about this happening again on the same hardware.

Mentioned in SAL (#wikimedia-operations) [2017-03-21T07:50:07Z] <gehel> banning elastic2020 from cluster to investigate T149006

Mentioned in SAL (#wikimedia-operations) [2017-03-21T08:43:43Z] <gehel> shutting down elasticsearch on elastic2020, investigating T149006

Gehel closed this task as "Resolved". Mar 21 2017, 8:56 AM

Running bonnie++ as documented on T153083#2886085 to see if I/O stress has an influence on stability.

Mentioned in SAL (#wikimedia-operations) [2017-03-21T12:47:36Z] <gehel> running stress and bonnie on elastic2020 - T149006

stress is launched with stress --cpu 28 --vm 4
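
For reference, the combined load test looks roughly like the sketch below. The exact bonnie++ invocation is the one documented in T153083#2886085; the target directory used here is only an illustrative assumption.

gehel@elastic2020:~$ # CPU and memory pressure: 28 busy-loop workers plus 4 memory allocation workers
gehel@elastic2020:~$ stress --cpu 28 --vm 4 &
gehel@elastic2020:~$ # I/O pressure in parallel; -d selects the directory bonnie++ writes its test files to (placeholder path)
gehel@elastic2020:~$ bonnie++ -d /tmp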

Gehel reopened this task as "Open". Mar 21 2017, 3:18 PM

I resolved this by mistake, re-opening.

Gehel added a comment. Mar 21 2017, 3:26 PM

After ~25 minutes of stress + bonnie, elastic2020 crashed again. That seems to indicate a systematic issue. The test can be seen on Grafana: it started at ~12:45 UTC and the server crashed at ~13:10 UTC.

@Papaul now that we mostly have a way to reproduce the issue, what can we do about it?

Gehel added a comment. Mar 21 2017, 3:29 PM

elastic2020 is banned from the elasticsearch cluster and has a 1-month downtime in Icinga. Let's figure out what we can do with it before re-enabling Icinga.

Gehel added a comment. Mar 22 2017, 8:59 AM

Investigation will continue with @Papaul and @Gehel on Thursday, March 23, at 4pm CET (8am PT).

Gehel added a comment. Mar 24 2017, 9:19 AM

We managed to crash that server again, with the same test (stress + bonnie). @Papaul is running a full H/W diagnostic. The server will remain banned from the cluster until we get to the bottom of this.

Here is the result of the HW diagnostic.

Dear Mr Papaul Tshibamba,

Thank you for contacting Hewlett Packard Enterprise for your service request. This email confirms your request for service and the details are below.

Your request is being worked on under reference number 5318387483
Status: Case is generated and in Progress

Product description: HP ProLiant DL360 Gen9 E5-2640v3 2.6GHz 8-core 2P 16GB-R P440ar 8 SFF 500W RPS Server/S-Buy
Product number: 780019-S01
Serial number: MXQ526080P
Subject: DL360 Gen9 - Server Crash Issue

Yours sincerely,
Hewlett Packard Enterprise

@Gehel I opened a case with HP and we are working to find a solution to this issue.

Mentioned in SAL (#wikimedia-operations) [2017-03-28T15:56:45Z] <gehel> banning elastic2021 to run same tests as elastic2020 - T149006

@Gehel Been on the phone with HP for about 45 minutes. We went over all the log files they requested and they can't find any potential HW cause for this issue. According to the HP guy, the OS (Debian) we are running is not supported by the system since it doesn't have the drivers for it. His recommendation was to remove Debian, install Windows on the system and reproduce the crash (upsetting... which doesn't make sense). I told him this cannot be done. His argument for 5 minutes was only to remove Debian from the system.

Since we are not seeing any HW issue on the system, I think we need to investigate more on our side to see what is causing the server to crash.

@Gehel once you finish testing elastic2021: if it does not crash, we will have to take elastic2020 down so I can compare its BIOS settings with elastic2021's; if it does crash, then we know that the problem is not just on elastic2020.

Thanks

Gehel added a comment. Wed, Mar 29, 2:59 PM

The same kind of tests we ran on elastic2020 are running on elastic2021 at the moment. This should help validate that there is an issue with elastic2020 itself and not with our overall configuration.

The same tests have been run on elastic2021 (stress + bonnie++), multiple times, with some pause between runs. elastic2021 has not crashed under that load. Note that when running the tests on elastic2020, much shorter pauses between tests were used. I'm still going to run 2 tests back to back (same as was done on elastic2020) and see what happens. If that does not crash elastic2021, I will put it back into the cluster.

@Gehel Thanks. Once that is done, I will also update the task with the troubleshooting steps for elastic2020.

Gehel added a comment (edited). Thu, Mar 30, 7:39 AM

After multiple tests, generating CPU, memory and IO load on elastic2021, the server has not crashed. Those tests are the same as the tests that crashed elastic2020. The timings can be observed on Grafana.

Conclusion: there is something specific to elastic2020 that makes it crash under load, which cannot be reproduced on elastic2021. @Papaul: I'll let you take over from here.

Mentioned in SAL (#wikimedia-operations) [2017-03-30T07:41:40Z] <gehel> pull elastic2021 back into active duty - T149006

1st crash
Date: October 24, 2016
Troubleshooting: removed both PSUs for a couple of minutes

2nd crash
Date: December 12, 2016
Troubleshooting: called HP and provided them with all the log files
Reference number: 5315671772
According to HP, the logs were not showing any hardware issue; however, the system was running outdated firmware, so their recommendation was to update the firmware. All firmware on the system was updated.

3rd crash
Date: March 16, 2017
Troubleshooting: checked that all the settings in the BIOS were correct
Performed a complete hardware diagnostic on the system; this diagnostic took a day to complete
Result of the diagnostic:
Hard drive short DST check: WARNING
Hard drive long DST check: WARNING
Called HP for support
Reference number: 5318387483
Uploaded all the log files to HP; HP couldn't tell what was causing the problem and affirmed that it is not a HW issue since there was nothing in the logs.

Gehel performed the same test on another identical system (elastic2021) and that system didn't crash. This suggests that the problem has nothing to do with the OS or anything we are running on the system.
HP uses log files to determine whether there is a hardware issue on a system, and the logs are not showing anything even though there is clearly a problem.

@RobH Since HP will not do anything on this case, what is the next step?

faidon reassigned this task from Papaul to RobH. Mon, Apr 3, 4:44 PM
faidon added a subscriber: faidon.

First off, HPE officially supports Debian, so that technician was incorrect here -- and his suggestion to install Windows is absurd.

Second, @RobH, please escalate with HPE about this and ask them for a mainboard or system replacement. Happy to be added in the loop here as an escalation point.

Papaul added a comment. Fri, Apr 7, 5:40 AM

@Gehel @RobH I spoke again yesterday with the HP engineer who helped me on the lvs2002 (T162099) issue about this case. After going over the logs, and taking into consideration what I found about the hard drive warning that the previous HP engineer didn't take the time to investigate, here is what he thinks:

Hi Paul,

As discussed on the call, I noticed that you were using Intel SSDs on the server and these SSDs do not support HPE diagnostics on them and therefore we are unable to pull any details about the SSD from our Smart Storage Administrator. This issue you are facing could possibly be caused by the SSDs you are using but there is no way for HPE to confirm that since these are Intel SSDs.

Thanks & Regards,

Joe added a subscriber: Joe. Fri, Apr 7, 5:45 AM

My best guess is they're trying to divert attention from the real problem by blaming whatever is not covered by their support.

If this keeps happening (HP being so unsupportive and in general a nightmare to interact with), I suggest we start thinking of dropping them altogether as a vendor.

Script wmf_auto_reimage was launched by gehel on neodymium.eqiad.wmnet for hosts:

['elastic2020.codfw.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201704071529_gehel_18197.log.

Mentioned in SAL (#wikimedia-operations) [2017-04-07T15:32:17Z] <gehel> reimaging elstic2020 - T149006

Mentioned in SAL (#wikimedia-operations) [2017-04-10T08:48:40Z] <gehel> reimage elastic2020 - T149006

Script wmf_auto_reimage was launched by gehel on neodymium.eqiad.wmnet for hosts:

['elastic2020.codfw.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201704100849_gehel_11746.log.

This looks similar to https://phabricator.wikimedia.org/T149553, which took us quite some time to debug; in the end it was a faulty CPU.

Mentioned in SAL (#wikimedia-operations) [2017-04-10T10:58:42Z] <gehel> starting load test on elstic2020 - T149006

Gehel added a comment. Mon, Apr 10, 5:17 PM

@Papaul has put new SSDs in place in that server.

I've been running the same kind of load test as before for most of the day (see Grafana for details) and the server did not crash.

It looks to me like changing the SSDs has a significant impact on the stability of that server.

@Marostegui how did you diagnose the CPU issue?

@Papaul / @RobH I'll let you move forward on this. Ping me if there is anything I can do...

RobH added a comment. Mon, Apr 10, 9:16 PM

It seems these were part of the initial Dasher orders, where the Intel disks were supported by Dasher, not HP.

@Papaul: Can you provide me with the model # and serial # of the defective SSD?

Drive Model ATA INTEL SSDSC2BB80

@RobH I was able to pull the information from the HW diagnostic I did last week; please see below.

Disk 1
SCSI Bus 0 (0x00)
SCSIID 1 (0x01)
Block Size 512 Bytes Per Block (0x0200)
Total Blocks 800 GB (0x5d26ceb0)
Reserved Blocks 0x00010000
Drive Model ATA INTEL SSDSC2BB80
Drive Serial Number PHWL524504C0800RGN
Drive Firmware Revision D2010370
SCSI Inquiry Bits 0x02
Compaq Drive Stamped Stamped For Monitoring (0x01)
Last Failure Reason No Failure (0x00)

Disk 2
SCSI Bus 0 (0x00)
SCSIID 0 (0x00)
Block Size 512 Bytes Per Block (0x0200)
Total Blocks 800 GB (0x5d26ceb0)
Reserved Blocks 0x00010000
Drive Model ATA INTEL SSDSC2BB80
Drive Serial Number PHWL524504SB800RGN
Drive Firmware Revision D2010370
SCSI Inquiry Bits 0x02
Compaq Drive Stamped Stamped For Monitoring (0x01)
Last Failure Reason No Failure (0x00)

RobH added a comment. Mon, Apr 10, 9:49 PM

This lists two SSDs; which one is the failed one?

RobH added a comment. Mon, Apr 10, 11:24 PM

Update from IRC:

Papaul wasn't sure which SSD failed; he just pulled both. He'll place one of the two back in, run the diagnostics again and see if it fails, and then do the same with the other one.

That way we'll know which is bad.

@Marostegui how did you diagnose the CPU issue?

At some point we changed the mainboard but apparently not the CPUs, so it took a bit longer; I would have assumed that changing the mainboard would also mean changing the CPUs, but apparently the HP technician didn't do that.
In the end, by chance, in one of the crashes we saw:

description=Uncorrectable Machine Check Exception (Board 0, Processor 1, APIC ID 0x00000000, Bank 0x00000004, Status 0xB2000000'71000402, Address 0x00000000'00000000, Misc 0x00000000'00000000)

Most of the crashes did not log anything if you hadn't cleared the logs manually beforehand.
The server was only crashing under high I/O wait, which is why we thought it was the disks or the RAID controller at the start, but it wasn't.
The whole history is here: T149553. It is a long read, but it is interesting :)

Mentioned in SAL (#wikimedia-operations) [2017-04-12T08:44:30Z] <gehel> reimaging elastic2020 for testing - T149006

Script wmf_auto_reimage was launched by gehel on neodymium.eqiad.wmnet for hosts:

['elastic2020.codfw.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201704120844_gehel_14370.log.

Mentioned in SAL (#wikimedia-operations) [2017-04-12T09:42:19Z] <gehel> starting load on elastic2020 - T149006

Gehel added a comment. Wed, Apr 12, 1:27 PM

elastic2020 had a good workout with the old disks (same stress + bonnie test). No problem seen. More detailed timings can be seen on Grafana.

Mentioned in SAL (#wikimedia-operations) [2017-04-18T14:25:44Z] <gehel> un-ban elastic2020 to get ready for real-life test during switchover - T149006

Last update report:

  • Removed the original disks from the server and put in 2 identical spare disks; the only difference was the disk size
  • Recreated the RAID by putting each disk in a RAID 0 configuration
  • Gehel re-imaged the server and performed the test on the server

Result: no error

  • Removed one spare disk and replaced it with original disk 1 (1 spare disk + original disk 1)
  • HW diagnostic came up with no error on both disks

  • Removed original disk 1 and replaced it with original disk 2 (1 spare disk + original disk 2)
  • HW diagnostic came up with no error on both disks

  • Placed both original disks back in
  • HW diagnostic came up with no error on both disks
  • Recreated the RAID by putting each disk in a RAID 0 configuration
  • Gehel re-imaged the server and performed the test. Result: no error

What was done:
  • Recreated the RAID (RAID 0 on each disk)
  • Re-imaged the server

It is possible that recreating the RAID might have fixed the problem, but we cannot be sure until tomorrow's DC switchover.
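
For future reference, a rough sketch of how the per-disk RAID 0 recreation can be done from the OS with the HP Smart Array CLI. This is a sketch only: the controller slot and drive bay addresses are placeholder assumptions, not the actual values used on elastic2020, and deleting the existing logical drives destroys their data.

# assumes the hpssacli (or newer ssacli) utility is installed; slot and bay IDs below are placeholders
hpssacli ctrl slot=0 ld all show                          # list the current logical drives
hpssacli ctrl slot=0 ld all delete forced                 # remove the existing logical drives (data loss)
hpssacli ctrl slot=0 create type=ld drives=1I:1:1 raid=0  # RAID 0 logical drive on the first disk
hpssacli ctrl slot=0 create type=ld drives=1I:1:2 raid=0  # RAID 0 logical drive on the second disk
hpssacli ctrl slot=0 ld all show                          # verify the two new logical drives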

Gehel added a comment. Wed, Apr 19, 2:44 PM

elastic2020 crashed again after the DC switchover. Back to investigating...

Mentioned in SAL (#wikimedia-operations) [2017-04-19T14:46:58Z] <gehel> banning elastic2020 from codfw cluster - T149006

Gehel added a comment. Wed, Apr 19, 4:32 PM

Looking at /var/log/kern.log and /var/log/syslog, nothing is logged at the time of the crash.

Gehel added a comment. Thu, Apr 20, 2:08 PM

A bad blocks check (as suggested by @Papaul) does not find anything wrong with sda:

gehel@elastic2020:~$ sudo badblocks -sv /dev/sda
Checking blocks 0 to 781379415
Checking for bad blocks (read-only test): done                                                 
Pass completed, 0 bad blocks found. (0/0/0 errors)

Seriously, this looks _really_ similar to T149553 (and it is even the same vendor); is there any way to justify to HP replacing the CPUs, to at least rule that out?

Gehel added a comment. Thu, Apr 20, 2:16 PM

@Marostegui yes, this sounds like a good idea, but this is for @Papaul / @RobH to answer. I am way out of my depth here...

@Marostegui on my side, I would need something to show HP that the CPU is bad. Since I have nothing pointing to a bad CPU, it will be difficult to convince them to replace it, and even harder if both CPUs need to be replaced.

I personally think we need to stop wasting time on this system and ask Dasher to send us a replacement, since this has been going on for a long time.
@faidon @RobH Please advise on this.

Thanks.

@Papaul right!

Just for the record, the way we were able to justify the error was by seeing it in the iLO after one of the crashes (it was not always logged).
Sometimes the server was not logging anything in the iLO; the way I found to overcome this was to clear the log manually while the server was up. If that wasn't done, the server wouldn't log anything. So I would recommend clearing the log manually after every crash (once the server is back up).
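
As an illustration of that recommendation, clearing the IML from the running host could look roughly like the sketch below (assuming the hponcfg utility is installed; the credentials in the RIBCL login element are placeholders and are not checked when the script is applied locally). The same can also be done from the iLO web interface.

gehel@elastic2020:~$ cat > /tmp/clear_iml.xml <<'EOF'
<RIBCL VERSION="2.0">
  <LOGIN USER_LOGIN="placeholder" PASSWORD="placeholder">
    <SERVER_INFO MODE="write">
      <CLEAR_IML/>
    </SERVER_INFO>
  </LOGIN>
</RIBCL>
EOF
gehel@elastic2020:~$ sudo hponcfg -f /tmp/clear_iml.xml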

Gehel added a comment. Tue, Apr 25, 1:38 PM

I summarized the actions taken on https://etherpad.wikimedia.org/p/elastic2020. @Papaul, could you review it and see if I missed anything significant? Thanks!

@Gehel Thanks, everything looks good.

Gehel added a comment. Tue, Apr 25, 3:16 PM

@RobH: the summary is in https://etherpad.wikimedia.org/p/elastic2020. Let me know if it looks good enough to you and if I can do anything else to move this forward...

RobH added a comment. Tue, Apr 25, 3:35 PM

I've sent off an email to Dasher, and cc'd both @Papaul and @Gehel on the email thread.