elastic2020 is powered off and will not restart
Open, Normal, Public

Description

elastic2020.codfw.wmnet is marked as down in icinga. Using the management console, I can confirm that the server is powered off (see log below). power on does not seem to work; the server still reports being powered off.

@Papaul I think this will require your expert hands.

gehel@durin:~$ ssh root@elastic2020.mgmt.codfw.wmnet
root@elastic2020.mgmt.codfw.wmnet's password: 
User:root logged-in to ILOMXQ526080P.dasher.com(10.193.2.217 / FE80::EEB1:D7FF:FE78:2BBC)

iLO 4 Advanced 2.20 at  May 20 2015
Server Name: 
Server Power: Off

Based on customer feedback, we will be enhancing the SSH command line
interface in a future release of the iLO 4 firmware.  Our future CLI will
focus on increased usability and improved functionality.  This message is
to provide advance notice of the coming change.  Please see the iLO 4 
Release Notes on www.hp.com/go/iLO for additional information.


</>hpiLO-> power off hard

status=2
status_tag=COMMAND PROCESSING FAILED
error_tag=COMMAND ERROR-UNSPECIFIED
Mon Oct 24 12:09:46 2016

Server power already off.




</>hpiLO-> power reset

status=2
status_tag=COMMAND PROCESSING FAILED
error_tag=COMMAND ERROR-UNSPECIFIED
Mon Oct 24 12:09:54 2016

Server power off.




</>hpiLO-> power on

status=0
status_tag=COMMAND COMPLETED
Mon Oct 24 12:10:03 2016



Server powering on .......



</>hpiLO-> vsp

Virtual Serial Port Active: COM2
 The server is not powered on.  The Virtual Serial Port is not available.

Starting virtual serial port.
Press 'ESC (' to return to the CLI Session.

Mentioned in SAL (#wikimedia-operations) [2017-03-21T08:43:43Z] <gehel> shutting down elasticsearch on elastic2020, investigating T149006

Gehel closed this task as Resolved.Mar 21 2017, 8:56 AM

Running bonnie++ as documented in T153083#2886085 to see if I/O stress has an influence on stability.

Mentioned in SAL (#wikimedia-operations) [2017-03-21T12:47:36Z] <gehel> running stress and bonnie on elastic2020 - T149006

stress is launched with stress --cpu 28 --vm 4
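For reference, the combined load used in these tests looks roughly like the sketch below. The stress invocation is the one quoted above; the bonnie++ flags and the target directory are illustrative assumptions (the actual invocation is the one documented in T153083#2886085):

# CPU and memory pressure: 28 CPU hogs plus 4 memory (VM) workers
stress --cpu 28 --vm 4 &
# disk I/O pressure in parallel: bonnie++ writing to a scratch directory on the data disks
# (-d = test directory, -u = user to run as; both values here are placeholders)
bonnie++ -d /srv/bonnie-scratch -u nobody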

Gehel reopened this task as Open.Mar 21 2017, 3:18 PM

I resolved this by mistake, re-opening.

Gehel added a comment.Mar 21 2017, 3:26 PM

After ~25 minutes of stress + bonnie, elastic2020 crashed again. That seems to indicate a systematic issue. The test can be seen on Grafana: it started at ~12:45 UTC and the server crashed at ~13:10 UTC.

@Papaul now that we mostly have a way to reproduce the issue, what can we do about it?

Gehel added a comment.Mar 21 2017, 3:29 PM

elastic2020 is banned from the elasticsearch cluster and has a 1-month downtime in icinga. Let's figure out what we can do with it before re-enabling icinga.
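For reference, banning a node this way amounts to excluding it from shard allocation via the cluster settings API; a minimal sketch (the node name pattern is an assumption, and in practice this goes through our usual tooling):

curl -XPUT 'http://localhost:9200/_cluster/settings' -H 'Content-Type: application/json' -d '
{
  "transient": {
    "cluster.routing.allocation.exclude._name": "elastic2020*"
  }
}'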

Gehel added a comment.Mar 22 2017, 8:59 AM

Investigation will continue with @Papaul and @Gehel on Thursday March 23 4pm CET (8am PT)

Gehel added a comment.Mar 24 2017, 9:19 AM

We managed to crash that server again with the same test (stress + bonnie). @Papaul is running a full H/W diagnostic. The server will remain banned from the cluster until we get to the bottom of this.

Here is the result of the HW diagnostic.

Dear Mr Papaul Tshibamba,

Thank you for contacting Hewlett Packard Enterprise for your service request. This email confirms your request for service and the details are below.

Your request is being worked on under reference number 5318387483
Status: Case is generated and in Progress

Product description: HP ProLiant DL360 Gen9 E5-2640v3 2.6GHz 8-core 2P 16GB-R P440ar 8 SFF 500W RPS Server/S-Buy
Product number: 780019-S01
Serial number: MXQ526080P
Subject: DL360 Gen9 - Server Crash Issue

Yours sincerely,
Hewlett Packard Enterprise

@Gehel I opened a case with HP and we are working to find a solution to this issue.

Mentioned in SAL (#wikimedia-operations) [2017-03-28T15:56:45Z] <gehel> banning elastic2021 to run same tests as elastic2020 - T149006

@Gehel I was on the phone with HP for about 45 minutes. We went over all the log files they requested and couldn't find any potential HW cause for this issue. According to the HP technician, the OS (Debian) we are running is not supported by the system since it doesn't have drivers for it. His recommendation was to remove Debian, install Windows on the system, and reproduce the crash (upsetting, and it doesn't make sense). I told him this cannot be done. His only argument for 5 minutes was to remove Debian from the system.

Since we are not seeing any HW issue on the system, I think we need to investigate more on our side to see what is causing the server to crash.

@Gehel once you finish testing elastic2021: if it does not crash, we will have to take elastic2020 down so I can compare its BIOS settings with elastic2021's. If it does crash, then we know that the problem is not just on elastic2020.

Thanks

Gehel added a comment.Mar 29 2017, 2:59 PM

The same kind of tests we ran on elastic2020 are running on elastic2021 at the moment. This should help validate that there is an issue with elastic2020 itself and not with our overall configuration.

The same tests have been run on elastic2021 (stress + bonnie++), multiple times, with some pause between runs. elastic2021 has not crashed under that load. Note that when running the tests on elastic2020, much shorter pauses between tests were used. I'm still going to run 2 tests back to back (same as was done on elastic2020) and see what happens. If that does not crash elastic2021, I will put it back into the cluster.

@Gehel Thanks. Once that is done, I will also update the task with the troubleshooting steps for elastic2020.

Gehel added a comment.EditedMar 30 2017, 7:39 AM

After multiple tests, generating CPU, memory and IO load on elastic2021, the server has not crashed. Those tests are the same as the tests that crashed elastic2020. The timings can be observed on Grafana.

Conclusion: there is something specific to elastic2020 that makes it crash under load, which cannot be reproduced on elastic2021. @Papaul: I'll let you take over from here.

Mentioned in SAL (#wikimedia-operations) [2017-03-30T07:41:40Z] <gehel> pull elastic2021 back into active duty - T149006

1st crash
Date: October 24, 2016
Troubleshooting: removed both PSUs for a couple of minutes

2nd crash
Date: December 12, 2016
Troubleshooting: called HP and provided them with all log files
reference number 5315671772
According to HP, the logs were not showing any hardware issue; however, the system was running outdated firmware, so their recommendation was to update it. All firmware on the system was updated.

3rd crash
Date: March 16, 2017
Troubleshooting: checked that all the settings in the BIOS were correct
Performed a complete hardware diagnostic on the system; this diagnostic took a day to complete
Result of the diagnostic:
Hard drive short DST check: WARNING
Hard drive long DST check: WARNING
Called HP for support
reference number 5318387483
Uploaded all the log files to HP; HP couldn't tell what was causing the problem and affirmed that it is not a HW issue since there was nothing in the logs.

Gehel performed the same test on another identical system (elastic2021) and that system didn't crash. This indicates that the problem has nothing to do with the OS or anything we are running on the system.
HP uses log files to determine whether there is a hardware issue on a system, and the logs are not showing anything even though there is a problem.

@RobH Since HP will not do anything on this case, what is the next step?

faidon reassigned this task from Papaul to RobH.Apr 3 2017, 4:44 PM
faidon added a subscriber: faidon.

First off, HPE officially supports Debian, so that technician was incorrect here -- and his suggestion to install Windows is absurd.

Second, @RobH, please escalate with HPE about this and ask them for a mainboard or system replacement. Happy to be added in the loop here as an escalation point.

Papaul added a comment.Apr 7 2017, 5:40 AM

@Gehel @RobH I spoke again yesterday with the HP engineer who helped me on the lvs2002 (T162099) issue about this case. After going over the logs and taking into consideration what I found about the hard drive warning that the previous HP engineer didn't take time to investigate, here is what he thinks:

Hi Paul,

As discussed on the call, I noticed that you were using Intel SSDs on the server and these SSDs do not support HPE diagnostics on them and therefore we are unable to pull any details about the SSD from our Smart Storage Administrator. This issue you are facing could possibly be caused by the SSDs you are using but there is no way for HPE to confirm that since these are Intel SSDs.

Thanks & Regards,

Joe added a subscriber: Joe.Apr 7 2017, 5:45 AM

My best guess is that they're trying to divert attention from the real problem by blaming whatever is not covered by their support.

If this keeps happening (HP being so unsupportive and in general a nightmare to interact with), I suggest we start thinking of dropping them altogether as a vendor.

Script wmf_auto_reimage was launched by gehel on neodymium.eqiad.wmnet for hosts:

['elastic2020.codfw.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201704071529_gehel_18197.log.

Mentioned in SAL (#wikimedia-operations) [2017-04-07T15:32:17Z] <gehel> reimaging elstic2020 - T149006

Mentioned in SAL (#wikimedia-operations) [2017-04-10T08:48:40Z] <gehel> reimage elastic2020 - T149006

Script wmf_auto_reimage was launched by gehel on neodymium.eqiad.wmnet for hosts:

['elastic2020.codfw.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201704100849_gehel_11746.log.

This looks similar to https://phabricator.wikimedia.org/T149553, which took us quite some time to debug; in the end it was a faulty CPU.

Mentioned in SAL (#wikimedia-operations) [2017-04-10T10:58:42Z] <gehel> starting load test on elstic2020 - T149006

Gehel added a comment.Apr 10 2017, 5:17 PM

@Papaul has put new SSDs in place in that server.

I've been running the same kind of load test as before for most of the day (see Grafana for details) and the server did not crash.

It looks to me like changing the SSDs has a significant impact on the stability of that server.

@Marostegui how did you diagnose the CPU issue?

@Papaul / @RobH I'll let you move forward on this. Ping me if there is anything I can do...

RobH added a comment.Apr 10 2017, 9:16 PM

It seems that these were among the initial Dasher orders, where the Intel disks were supported by Dasher, not HP.

@Papaul: Can you provide me with the model # and serial # of the defective SSD?

Drive Model ATA INTEL SSDSC2BB80

@RobH I was able to pull the information from the HW diagnostic I did last week; please see below.

Disk 1
SCSI Bus 0 (0x00)
SCSIID 1 (0x01)
Block Size 512 Bytes Per Block (0x0200)
Total Blocks 800 GB (0x5d26ceb0)
Reserved Blocks 0x00010000
Drive Model ATA INTEL SSDSC2BB80
Drive Serial Number PHWL524504C0800RGN
Drive Firmware Revision D2010370
SCSI Inquiry Bits 0x02
Compaq Drive Stamped Stamped For Monitoring (0x01)
Last Failure Reason No Failure (0x00)

Disk 2
SCSI Bus 0 (0x00)
SCSIID 0 (0x00)
Block Size 512 Bytes Per Block (0x0200)
Total Blocks 800 GB (0x5d26ceb0)
Reserved Blocks 0x00010000
Drive Model ATA INTEL SSDSC2BB80
Drive Serial Number PHWL524504SB800RGN
Drive Firmware Revision D2010370
SCSI Inquiry Bits 0x02
Compaq Drive Stamped Stamped For Monitoring (0x01)
Last Failure Reason No Failure (0x00)

RobH added a comment.Apr 10 2017, 9:49 PM

This lists two SSDs; which one is the failed one?

RobH added a comment.Apr 10 2017, 11:24 PM

Update from IRC:

Papaul wasn't sure which SSD failed; he just pulled both. He'll place one of the two back in, run the diagnostics again to see if it fails, and then do the same with the other one.

That way we'll know which is bad.

@Marostegui how did you diagnose the CPU issue?

At some point we changed the mainboard but not the CPUs, apparently. It took a bit longer because I had assumed that changing the mainboard would also mean changing the CPUs, but apparently the HP technician didn't do that.
In the end, by chance, in one of the crashes we saw:

description=Uncorrectable Machine Check Exception (Board 0, Processor 1, APIC ID 0x00000000, Bank 0x00000004, Status 0xB2000000'71000402, Address 0x00000000'00000000, Misc 0x00000000'00000000)

Most of the crashes did not log anything unless the logs had been cleared manually beforehand.
The server was only crashing under high I/O wait, which is why we thought it was the disks or the RAID controller at first, but it wasn't.
The whole history is here: T149553. It is a long read, but it is interesting :)

Mentioned in SAL (#wikimedia-operations) [2017-04-12T08:44:30Z] <gehel> reimaging elastic2020 for testing - T149006

Script wmf_auto_reimage was launched by gehel on neodymium.eqiad.wmnet for hosts:

['elastic2020.codfw.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201704120844_gehel_14370.log.

Mentioned in SAL (#wikimedia-operations) [2017-04-12T09:42:19Z] <gehel> starting load on elastic2020 - T149006

Gehel added a comment.Apr 12 2017, 1:27 PM

elastic2020 had a good workout with the old disks (same stress + bonnie test). No problem seen. More detailed timing can be seen on Grafana.

Mentioned in SAL (#wikimedia-operations) [2017-04-18T14:25:44Z] <gehel> un-ban elastic2020 to get ready for real-life test during switchover - T149006

Last update report:

  • Removed the original disks from the server and put in 2 identical spare disks; the only difference was the disk size
  • Recreated the RAID by putting each disk in a RAID 0 configuration
  • Gehel re-imaged the server and performed the test on the server

Result: no error

  • Removed one spare disk and replaced it with original disk (1) (1 spare disk with original disk (1))
  • HW diagnostic came up with no error on both disks

  • Removed original disk (1) and replaced it with original disk (2) (1 spare disk with original disk (2))
  • HW diagnostic came up with no error on both disks

  • Placed both original disks back in
  • HW diagnostic came up with no error on both disks
  • Recreated the RAID by putting each disk in a RAID 0 configuration
  • Gehel re-imaged the server and performed the test. Result: no error

What was done:
Recreating the RAID (RAID 0 on each disk)
Re-imaging the server

It is possible that recreating the RAID might have fixed the problem, but we cannot be sure until tomorrow's DC switchover.
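For reference, recreating the RAID on the P440ar (one RAID 0 logical drive per disk) can be done with HP's Smart Storage Administrator CLI; a rough sketch, assuming hpssacli is available, the controller is in slot 0, and the disks sit in bays 1I:1:1 and 1I:1:2 (all three are assumptions):

# show the current controller / logical drive configuration
hpssacli ctrl slot=0 show config
# after removing any existing logical drives, create one RAID 0 logical drive per physical disk
hpssacli ctrl slot=0 create type=ld drives=1I:1:1 raid=0
hpssacli ctrl slot=0 create type=ld drives=1I:1:2 raid=0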

Gehel added a comment.Apr 19 2017, 2:44 PM

elastic2020 crashed again after the DC switchover. Back to investigations...

Mentioned in SAL (#wikimedia-operations) [2017-04-19T14:46:58Z] <gehel> banning elastic2020 from codfw cluster - T149006

Gehel added a comment.Apr 19 2017, 4:32 PM

Looking at /var/log/kern.log and /var/log/syslog, nothing is logged at the time of the crash.
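For reference, a rough sketch of this kind of check (the journalctl part assumes a persistent systemd journal for the previous boot, which may not be configured here):

# search the on-disk logs for machine-check / hardware error traces around the crash
grep -iE 'mce|machine check|hardware error|panic' /var/log/kern.log /var/log/syslog
# if the journal is persistent, kernel messages from the boot before the crash can be checked too
journalctl -k -b -1 | grep -iE 'mce|machine check|hardware error'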

Gehel added a comment.Apr 20 2017, 2:08 PM

A badblocks check (as suggested by @Papaul) does not find anything wrong with sda:

gehel@elastic2020:~$ sudo badblocks -sv /dev/sda
Checking blocks 0 to 781379415
Checking for bad blocks (read-only test): done                                                 
Pass completed, 0 bad blocks found. (0/0/0 errors)
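A SMART health query is another non-destructive way to look at the drive; a rough sketch, assuming smartmontools is installed. Behind the P440ar controller the physical drives may need to be addressed through the controller (e.g. with a -d cciss,N device type), so the plain /dev/sda form below is an assumption:

sudo smartctl -H /dev/sda      # overall health self-assessment
sudo smartctl -a /dev/sda      # full SMART attributes (reallocated sectors, wear, error counters)
sudo smartctl -t long /dev/sda # start a long self-test; results show up later in the -a output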

Seriously, this looks _really_ similar to T149553 (and it is even the same vendor); is there any way to justify to HP replacing the CPUs, to at least rule that out?

Gehel added a comment.Apr 20 2017, 2:16 PM

@Marostegui yes, this sounds like a good idea, but this is for @Papaul / @RobH to answer. I am way out of my depth here...

@Marostegui on my side, I would need something to show HP that the CPU is bad. Since I have nothing pointing to a bad CPU, it will be difficult to convince them to replace it, even more so if both CPUs need to be replaced.

I personally think we need to stop wasting time on this system and ask Dasher to send us a replacement system, since this has been going on for a long time.
@faidon @RobH Please advise on this.

Thanks.

@Papaul right!

Just for the record, the way we were able to justify the replacement was by seeing the error on the iLO after one of the crashes (it was not always logged).
Sometimes the server did not log anything on the iLO; the way I found to overcome this was to clear the log manually while the server was up. If that wasn't done, the server wouldn't log anything. So I would recommend clearing the log manually after every crash (once the server is back up).
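For reference, a rough sketch of clearing the IML from the OS side, assuming the hp-health tools (hpasmcli) are installed; the same log can also be cleared from the iLO web interface:

# show the Integrated Management Log, then clear it so the next crash has a chance to be recorded
sudo hpasmcli -s "show iml"
sudo hpasmcli -s "clear iml"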

Gehel added a comment.Apr 25 2017, 1:38 PM

I summarized the actions taken on https://etherpad.wikimedia.org/p/elastic2020. @Papaul, could you review it and see if I missed anything significant? Thanks!

@Gehel Thanks everything looks good.

Gehel added a comment.Apr 25 2017, 3:16 PM

@RobH: the summary is in https://etherpad.wikimedia.org/p/elastic2020. Let me know if it looks good enough to you and if I can do anything else to move this forward...

RobH added a comment.Apr 25 2017, 3:35 PM

I've sent off an email to Dasher, and cc'd both @Papaul and @Gehel on the email thread.

Mentioned in SAL (#wikimedia-operations) [2017-04-28T14:55:53Z] <gehel> shutting down elastic2020 for mainboard replacement - T149006

Dzahn added a comment.Apr 28 2017, 3:47 PM

mainboard is being replaced right now

Mainboard replacement complete.

Change 350882 had a related patch set uploaded (by Papaul; owner: Papaul):
[operations/puppet@production] DHCP: Change MAC address for elastic2020:mainboard replaced

https://gerrit.wikimedia.org/r/350882

Change 350882 merged by Dzahn:
[operations/puppet@production] DHCP: Change MAC address for elastic2020:mainboard replaced

https://gerrit.wikimedia.org/r/350882

Script wmf_auto_reimage was launched by gehel on neodymium.eqiad.wmnet for hosts:

['elastic2020.codfw.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201705021209_gehel_30647.log.

Completed auto-reimage of hosts:

['elastic2020.codfw.wmnet']

and were ALL successful.

Mentioned in SAL (#wikimedia-operations) [2017-05-02T13:13:28Z] <gehel> load testing elastic2020 before putting it back in the cluster - T149006

Mentioned in SAL (#wikimedia-operations) [2017-05-02T13:26:54Z] <gehel> stopping load on elastic2020 - T149006

Gehel added a comment.Tue, May 2, 1:29 PM

One of the SSDs is in error; waiting for the new one to arrive before running new load tests.

Papaul added a comment.Tue, May 2, 3:22 PM

After replacing the mainboard, at first boot HP iLO detected that one of the SSDs was bad. After a couple of reboots the error was no longer showing in the iLO, but the HDD LED still indicated that the SSD was bad, which was not the case with the old mainboard. I emailed Brynden @ Dasher to request an SSD replacement and a return label, since the new mainboard was shipped without one. Please see the attachment for the SSD error. Below is the information on the SSD:
Serial Number PHWL524504SB800RGN
Model INTEL
Media Type SSD
Capacity 800 GB

Papaul added a comment.Wed, May 3, 6:44 PM

Shipped back the bad mainboard.

RobH added a comment.Mon, May 15, 4:56 PM

So part of the issue with this system is that it is a lease, not WMF-owned. We cannot just use shelf spares, since we have to use approved lease hardware (due to lease and service contracts).

@Papaul: Can you please clarify, since I wasn't on all the emails for support, exactly what they said about the SSD replacement and when? They should be shipping us a replacement SSD quickly, not taking two weeks.

Wed 5/10/2017 10:45 AM
Thank you Papaul,

I have put in a request to Intel Support.
They will reply with a form that we will need to fill out and then they will send out a new drive.

I’ll email you as soon as I get the warranty form.
Bo Rivera
Integration Engineer
Dasher Technologies, Inc.

Thu 5/11/2017 10:29 AM
Bo Rivera

Please see below.
Hello,
An update was made to service request 02788059 on May 11, 2017:
Hello Dasher support team,

We understand that you have an Intel® SSD DC S3500 Series that has failed.

In order to have a better understanding of this request, please provide us with the following information:

  • What's the exact issue with the SSD? How has it failed?
  • How was the drive being used (boot drive, part of RAID, etc.)?
  • Have you performed any troubleshooting so far? If so, what troubleshooting steps were followed?

Looking forward to assist you,

Eugenio
Sign in to view and update your request or to get additional information. You can also reply to this email with questions or comments.

Regards,
Intel Support Team

RobH added a comment.EditedMon, May 15, 5:15 PM

Ok, I'm going to attempt to summarize what I know to be the current issue(s) with elastic2020.

  • System has had issues starting back in October 2016.
  • HP support is non-cooperative, resulting in multiple tickets and firmware updates, and eventual escalation up to @RobH to handle with Dasher (April 2017).
    • Rob pings Dasher (April 2017), who pings HPE and gets our mainboard swap dispatched.
  • Mainboard is swapped out on May 2nd, 2017. After swap, an SSD is found to be faulty.

Now this is the confusing and not so fun parts:

  • This is a system lease, not a purchase, so all hardware changes likely need to be submitted to Finance and Farnam; using an off-the-shelf spare is likely non-ideal since it means we lose a disk at the end of the lease.
  • These SSDs were from some of the initial orders of Intel SSDs via Dasher/HPE. These initial orders included SSDs that are NOT covered under the HP system service contract. (This issue was pointed out at the time of purchase; see purchase ticket RT#9579.)
  • The SSD likely needs to be swapped out with another of the same model, and these are Intel 3500 SSDs, an older model we no longer use.

So the outstanding question is how long will it take for Dasher to provide the new SSD.

  • Papaul emails Dasher about the failed SSD.
    • @Papaul: What date was this email on?
  • Dasher emails back to provide info needed by Intel.
    • What date was this reply?
  • Papaul emails Dasher the full SSD details
    • What date was this sent?

Has Dasher, at any point, provided any kind of timeline on how long it will take the SSD to be replaced?

@Papaul: advise on above questions, and please review the summary of issues and provide feedback on accuracy. Thanks!

Papaul added a comment.EditedMon, May 15, 5:25 PM

Emailed Dasher about the failed SSD on May 1:
Hello Brynden, I received the mainboard and was in the process of installing and testing everything when I realized that I was dropping connectivity on NIC1 and that SSD 1 had failed. Also, I didn't receive any return label for the bad mainboard. Can you please send me a return label and, if possible, a replacement SSD? Please see below for the SSD information; I have attached the SSD error.
Serial Number PHWL524504SB800RGN
Model INTEL
Media Type SSD
Capacity 800 GB

Emailed Dasher again on May 8:
Hello Brynden,

We have one SSD that is bad on the same system where we replaced the mainboard. I sent you the information on the SSD last week. Can you please give me an update on the replacement SSD?

Thanks.

Dasher replied back on May 8:
Hi Papaul,

I need the SA# please. See the image below for where it is located.

Bo Rivera

I emailed the info again on May 10.

Got a reply back from Dasher on May 10:
Thank you Papaul,

I have put in a request to Intel Support.

They will reply with a form that we will need to fill out and then they will send out a new drive.

I’ll email you as soon as I get the warranty form.

Bo Rivera

RobH added a comment.Mon, May 15, 5:37 PM

Ok, I've emailed Dasher to inquire about this with the following:

Dasher Folks,

So it seems some of this conversation was out of the thread, between Bo and Papaul. I just want to loop back in and see what is going on. My understanding is that Wikimedia provided Dasher with the full SSD details on May 10th, and got a reply back from Bo advising he was contacting Intel support.

So we are now waiting on Intel warranty support, correct?

What exactly is the timeline for replacing the SSD in this system? I ask, since it is now down and not usable with any reliability, and we cannot just put a spare SSD in, since this is a system lease via Farnam.

Is the SSD replacement something that will take days, weeks?

Please advise,

That being said, we could likely just toss in a spare SSD, with a stalled task to swap it back out when the replacement comes in. This swap would be required, as we have to put back in the same model SSD. This would allow the system to remain online with disk redundancy.

@Papaul: Do we have any spare SSDs on the shelf for temp use in this system?

@RobH yes we do, but they are 300GB.

RobH added a comment.Mon, May 15, 8:42 PM

@Papaul:

The spares tracking shows that we have 3 of the Intel S3610 800GB SSDs on the spare shelf. We recently ordered these for shelf spares, along with 2 1.6TB Intel S3610 SSDs.

So are those not on the shelf? Not authorizing them for this use, just asking about it. Please advise.

Dasher has started some actual movement on this (it seems) since I bugged them via email today, so we may see a replacement for this shortly.

@RobH yes, we do have some 800GB SSDs as spares, but the one we are trying to replace is from the DC S3500 series.

RobH added a comment.EditedMon, May 15, 8:53 PM

Ahh, sorry for the miscommunication then.

So, here is where we stand on this system:

  • It is a lease; if a shelf spare is used, it can ONLY be for temporary use. This means that when a warranty replacement SSD arrives, the shelf spare has to be wiped of data and returned to spares.
    • This is a lot of overhead to bring a single system back online, and is only worthwhile if this system cannot remain offline for another couple of weeks.
  • Dasher has responded back to my email about the timeline for this SSD replacement today. Bo@Dasher advises that once we provide Intel with the shipment address details (which I have already confirmed and provided back to Dasher), they typically take a day or two to ship the replacement out.

So if this system can remain offline for another week or two, it would be easier to avoid using the shelf spare. If we use the shelf spare, it CANNOT remain in the system, as the system is a lease and needs to have the Intel S3500 model SSD leased with it returned with the system at the end of the lease. Trying to track this kind of thing long term is a nightmare, so leased systems simply shouldn't have hardware swapped with shelf spares unless absolutely necessary.

If we do have to use a shelf spare (the Intel S3610 800GB), it's non-ideal and only for temporary use. It is an option, but one that is easier for us to avoid at this time.

RobH added a comment.Mon, May 15, 8:54 PM

@Gehel: Can you advise whether this can remain offline for another week or two for the SSD replacement? See my comment above for full details.

Yes, elastic2020 can stay offline for one more week.

RobH added a comment.Mon, May 15, 10:24 PM

Cool, we'll avoid using a shelf spare then, and I'll be following up with Dasher on a daily basis until resolution.

RobH added a comment.EditedMon, May 22, 4:52 PM

We've gotten a notice from Intel, forwarded by Dasher, that they'll be shipping a replacement disk and a return tag for the defective disk.

However, we didn't get a forward of the tracking info email. I've emailed Dasher requesting they give us this info.

Edit addition: it basically took a few days of back and forth last week between Dasher and Intel support to get this going.

RobH added a comment.Mon, May 22, 5:09 PM

So it turns out Intel wants the disk sent back in advance. Can this disk be detected well enough for us to perform a wipe on it?

Otherwise we need to send it back as-is. If we cannot handle the downtime once it's on its way, we may use a shelf spare. The use of a spare can be determined AFTER we get this defective disk sent off.

RobH reassigned this task from RobH to Papaul.Mon, May 22, 5:09 PM
RobH added a comment.Mon, May 22, 5:34 PM

I've put elastic2020 into maint mode in icinga for the next month, and have shut it down.

@Papaul, you can boot the system now to wipe and clear the defective SSD (if it is detected).
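For reference, if the drive does show up, a rough sketch of wiping it (the /dev/sdX device name is a placeholder and must be verified first; the ATA secure erase path only works if the drive is not in a "frozen" state):

# single overwrite pass over the whole device
sudo shred -v -n 1 /dev/sdX
# or, for an SSD, an ATA secure erase via hdparm
sudo hdparm --user-master u --security-set-pass p /dev/sdX
sudo hdparm --user-master u --security-erase p /dev/sdX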

Gehel added a comment.Tue, May 23, 7:17 AM

We can keep elastic2020 down for a few more weeks if needed. The cluster is able to sustain the current load with -1 node.

Disk wipe complete; the system is back up with 1 disk.

Bad disk has been shipped to Intel. Please see below for shipping tracking information.