
elastic2020 is powered off and does not want to restart
Closed, Resolved · Public

Description

elastic2020.codfw.wmnet is marked as down in icinga. Using the management console, I can check that the server is powered off (see log below). power on does not seem to work; the server still reports being powered off.

@Papaul I think this will require your expert hands.

gehel@durin:~$ ssh root@elastic2020.mgmt.codfw.wmnet
root@elastic2020.mgmt.codfw.wmnet's password: 
User:root logged-in to ILOMXQ526080P.dasher.com(10.193.2.217 / FE80::EEB1:D7FF:FE78:2BBC)

iLO 4 Advanced 2.20 at  May 20 2015
Server Name: 
Server Power: Off

Based on customer feedback, we will be enhancing the SSH command line
interface in a future release of the iLO 4 firmware.  Our future CLI will
focus on increased usability and improved functionality.  This message is
to provide advance notice of the coming change.  Please see the iLO 4 
Release Notes on www.hp.com/go/iLO for additional information.


</>hpiLO-> power off hard

status=2
status_tag=COMMAND PROCESSING FAILED
error_tag=COMMAND ERROR-UNSPECIFIED
Mon Oct 24 12:09:46 2016

Server power already off.




</>hpiLO-> power reset

status=2
status_tag=COMMAND PROCESSING FAILED
error_tag=COMMAND ERROR-UNSPECIFIED
Mon Oct 24 12:09:54 2016

Server power off.




</>hpiLO-> power on

status=0
status_tag=COMMAND COMPLETED
Mon Oct 24 12:10:03 2016



Server powering on .......



</>hpiLO-> vsp

Virtual Serial Port Active: COM2
 The server is not powered on.  The Virtual Serial Port is not available.

Starting virtual serial port.
Press 'ESC (' to return to the CLI Session.

Event Timeline


Script wmf_auto_reimage was launched by gehel on neodymium.eqiad.wmnet for hosts:

['elastic2020.codfw.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201704071529_gehel_18197.log.

Script wmf_auto_reimage was launched by gehel on neodymium.eqiad.wmnet for hosts:

['elastic2020.codfw.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201704100849_gehel_11746.log.

This looks similar to https://phabricator.wikimedia.org/T149553, which took us quite some time to debug; in the end it was a faulty CPU.

Mentioned in SAL (#wikimedia-operations) [2017-04-10T10:58:42Z] <gehel> starting load test on elstic2020 - T149006

@Papaul has put new SSDs in place in that server.

I've been running the same kind of load test as before for most of the day (see Grafana for details) and the server did not crash.

It looks to me like changing the SSDs has a significant impact on the stability of that server.

@Marostegui how did you diagnose the CPU issue?

@Papaul / @RobH I'll let you move forward on this. Ping me if there is anything I can do...

It seems that these were from the initial Dasher orders, where the Intel disks were supported by Dasher, not HP.

@Papaul: Can you provide me with the model # and serial # of the defective SSD?


@RobH I was able to pull the information from the HW diagnostic I did last week. Please see below.

Disk 1
SCSI Bus 0 (0x00)
SCSIID 1 (0x01)
Block Size 512 Bytes Per Block (0x0200)
Total Blocks 800 GB (0x5d26ceb0)
Reserved Blocks 0x00010000
Drive Model ATA INTEL SSDSC2BB80
Drive Serial Number PHWL524504C0800RGN
Drive Firmware Revision D2010370
SCSI Inquiry Bits 0x02
Compaq Drive Stamped Stamped For Monitoring (0x01)
Last Failure Reason No Failure (0x00)

Disk 2
SCSI Bus 0 (0x00)
SCSIID 0 (0x00)
Block Size 512 Bytes Per Block (0x0200)
Total Blocks 800 GB (0x5d26ceb0)
Reserved Blocks 0x00010000
Drive Model ATA INTEL SSDSC2BB80
Drive Serial Number PHWL524504SB800RGN
Drive Firmware Revision D2010370
SCSI Inquiry Bits 0x02
Compaq Drive Stamped Stamped For Monitoring (0x01)
Last Failure Reason No Failure (0x00)

This lists two SSDs, which one is the failed one?

Update from IRC:

Papaul wasn't sure which SSD failed; he just pulled both. He'll place one of the two back in, run the diagnostics again, and see if it fails, and then do the same with the other one.

That way we'll know which is bad.

@Marostegui how did you diagnose the CPU issue?

At some point we changed the mainboard but not the CPUs, so it took a bit longer than expected; I would have assumed that changing the mainboard would also mean changing the CPUs, but apparently the HP technician didn't do that.
In the end, by chance, in one of the crashes we saw:

description=Uncorrectable Machine Check Exception (Board 0, Processor 1, APIC ID 0x00000000, Bank 0x00000004, Status 0xB2000000'71000402, Address 0x00000000'00000000, Misc 0x00000000'00000000)

Most of the crashes did not log anything unless the logs had been cleared manually beforehand.
The server was only crashing under high IO wait, which is why we initially thought it was the disks or the RAID controller, but it wasn't.
The whole history is here: T149553. It is a long read, but it is interesting :)
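
For what it's worth, when a crash does make it into the OS logs, machine check events can be grepped for there as well. A generic sketch, not specific to elastic2020:

# check the kernel ring buffer and the persisted kernel log for MCEs
dmesg | grep -i 'machine check'
grep -i 'machine check' /var/log/kern.log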

Mentioned in SAL (#wikimedia-operations) [2017-04-12T08:44:30Z] <gehel> reimaging elastic2020 for testing - T149006

Script wmf_auto_reimage was launched by gehel on neodymium.eqiad.wmnet for hosts:

['elastic2020.codfw.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201704120844_gehel_14370.log.

Mentioned in SAL (#wikimedia-operations) [2017-04-12T09:42:19Z] <gehel> starting load on elastic2020 - T149006

elastic2020 has had a good workout with the old disks (same stress + bonnie test). No problems seen. More detailed timing can be seen on Grafana.
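
For reference, the load test was roughly of this shape; a sketch only, the exact parameters and target directory are assumptions, not the exact invocation used:

# CPU / IO / memory pressure, one hour
stress --cpu 16 --io 8 --vm 4 --vm-bytes 2G --timeout 3600
# disk throughput test on the data partition
bonnie++ -d /srv -u nobody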

Mentioned in SAL (#wikimedia-operations) [2017-04-18T14:25:44Z] <gehel> un-ban elastic2020 to get ready for real-life test during switchover - T149006

Last update report:

  • Removed the original disks from the server and put in 2 identical spare disks; the only difference was the disk size
  • Recreated the RAID by putting each disk in a RAID 0 configuration
  • Gehel re-imaged the server and ran the tests on the server

Result: no error

  • Removed one spare disk and replaced it with original disk (1) (1 spare disk with original disk (1))
  • HW diagnostic came up with no error on both disks
  • Removed original disk (1) and replaced it with original disk (2) (1 spare disk with original disk (2))
  • HW diagnostic came up with no error on both disks
  • Placed both original disks back in
  • HW diagnostic came up with no error on both disks
  • Recreated the RAID by putting each disk in a RAID 0 configuration
  • Gehel re-imaged the server and performed the test. Result: no error

What was done:
  • Recreated the RAID (RAID 0 on each disk)
  • Re-imaged the server

It is possible that recreating the RAID might have fixed the problem, but we cannot be sure until tomorrow's DC switchover.

elastic2020 crashed again after DC switch. Back to investigations...

Mentioned in SAL (#wikimedia-operations) [2017-04-19T14:46:58Z] <gehel> banning elastic2020 from codfw cluster - T149006

Looking at /var/log/kern.log and /var/log/syslog, nothing was logged at the time of the crash.

A badblocks check (as suggested by @Papaul) does not find anything wrong with sda:

gehel@elastic2020:~$ sudo badblocks -sv /dev/sda
Checking blocks 0 to 781379415
Checking for bad blocks (read-only test): done                                                 
Pass completed, 0 bad blocks found. (0/0/0 errors)
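
Note that -sv is a read-only pass. If the data on the disk can be sacrificed, a destructive read-write pass exercises the drive much harder, and SMART attributes may show problems a read test misses. A sketch, to be run only on a disk whose contents we don't care about:

# DESTRUCTIVE: writes test patterns across the whole disk
sudo badblocks -wsv /dev/sda
# dump SMART attributes and self-test results
sudo smartctl -a /dev/sda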

Seriously, this looks _really_ similar to T149553 (and it is even the same vendor). Is there any way to justify to HP replacing the CPUs, to at least rule that out?

@Marostegui yes, this sounds like a good idea, but this is for @Papaul / @RobH to answer. I am way out of my depth here...

@Marostegui on my side, I will need something to show HP that the CPU is bad. Since I have nothing pointing to a bad CPU, it will be difficult to convince them to send a replacement, and worse again if both CPUs need to be replaced.

I personally think we need to stop wasting time on this system and ask Dasher to send us a replacement system, since this has been going on for a long time.
@faidon @RobH please advise on this.

Thanks.

@Papaul right!

Just for the record, the way we were able to justify the error was by seeing it on the iLO after one of the crashes (it was not always logged).
Sometimes the server did not log anything on the iLO; the way I found to overcome this was to clear the log manually while the server was up. If that wasn't done, the server wouldn't log anything. So I would recommend clearing the log manually after every crash (once the server is back up).
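
If the hp-health package is installed, the IML can also be shown and cleared from the OS with hpasmcli; a sketch, assuming the tool is present (the iLO web UI works too):

# show the Integrated Management Log, then clear it
sudo hpasmcli -s "show iml"
sudo hpasmcli -s "clear iml"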

I summarized the actions taken on https://etherpad.wikimedia.org/p/elastic2020. @Papaul, could you review it and see if I missed anything significant? Thanks!

@RobH: the summary is in https://etherpad.wikimedia.org/p/elastic2020. Let me know if it looks good enough to you and if I can do anything else to move this forward...

I've sent off an email to Dasher, and cc'd both @Papaul and @Gehel on the email thread.

Mentioned in SAL (#wikimedia-operations) [2017-04-28T14:55:53Z] <gehel> shutting down elastic2020 for mainboard replacement - T149006

mainboard is being replaced right now

Mainboard replacement complete.

Change 350882 had a related patch set uploaded (by Papaul; owner: Papaul):
[operations/puppet@production] DHCP: Change MAC address for elastic2020:mainboard replaced

https://gerrit.wikimedia.org/r/350882

Change 350882 merged by Dzahn:
[operations/puppet@production] DHCP: Change MAC address for elastic2020:mainboard replaced

https://gerrit.wikimedia.org/r/350882

Script wmf_auto_reimage was launched by gehel on neodymium.eqiad.wmnet for hosts:

['elastic2020.codfw.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201705021209_gehel_30647.log.

Completed auto-reimage of hosts:

['elastic2020.codfw.wmnet']

and were ALL successful.

Mentioned in SAL (#wikimedia-operations) [2017-05-02T13:13:28Z] <gehel> load testing elastic2020 before putting it back in the cluster - T149006

Mentioned in SAL (#wikimedia-operations) [2017-05-02T13:26:54Z] <gehel> stopping load on elastic2020 - T149006

one of the SSD is in error, waiting for the new one to arrive before running new load tests.

After replacing the mainboard, at first boot the HP iLO detected that one of the SSDs was bad. After a couple of reboots the error was no longer showing in the iLO, but the HDD LED was still showing that the SSD was bad, which was not the case with the old mainboard. I emailed Brynden @ Dasher to request an SSD replacement, and a return label since the new mainboard was shipped without one. Please see the attachment for the SSD error. Below is the information on the SSD:
Serial Number PHWL524504SB800RGN
Model INTEL
Media Type SSD
Capacity 800 GB

Selection_003.png (422×740 px, 22 KB)

Shipped back the bad main board.

Selection_004.png (656×452 px, 397 KB)

So part of the issue with this system is that it is a lease, not WMF owned. We cannot just use shelf spares, since we have to use approved lease hardware (due to the lease and service contracts).

@Papaul: Can you please clarify, since I wasn't on all the emails for support, exactly what they said about the SSD replacement and when? They should be shipping us a replacement SSD quickly, not taking two weeks.

Wed 5/10/2017 10:45 AM
Thank you Papaul,

I have put in a request to Intel Support.
They will reply with a form that we will need to fill out and then they will send out a new drive.

I’ll email you as soon as I get the warranty form.
Bo Rivera
Integration Engineer
Dasher Technologies, Inc.

Thu 5/11/2017 10:29 AM from Bo Rivera:

Please see below.
Hello,
An update was made to service request 02788059 on May 11, 2017:
Hello Dasher support team,

We understand that you have an Intel® SSD DC S3500 Series that has failed.

In order to have a better understanding of this request, please provide us with the following information:

  • What's the exact issue with the SSD? How has it failed?
  • How was the drive being used (boot drive, part of RAID, etc.)?
  • Have you performed any troubleshooting so far? If so, what troubleshooting steps were followed?

Looking forward to assist you,

Eugenio
Sign in to view and update your request or to get additional information. You can also reply to this email with questions or comments.

Regards,
Intel Support Team

Ok, I'm going to attempt to summarize what I know to be the current issue(s) with elastic2020.

  • System has had issues going back to October 2016.
  • HP support was uncooperative, resulting in multiple tickets and firmware updates, and eventual escalation up to @RobH to handle with Dasher (April 2017).
    • Rob pings Dasher (April 2017), who ping HPE and get our mainboard swap dispatched.
  • Mainboard is swapped out on May 2nd, 2017. After the swap, an SSD is found to be faulty.

Now for the confusing and not-so-fun parts:

  • This is a system lease, not a purchase, so all hardware changes likely need to be submitted to Finance and Farnam; using an off-the-shelf spare is likely non-ideal since it means we lose a disk at the end of the lease.
  • These SSDs were from some of the initial orders of Intel SSDs via Dasher/HPE. These initial orders included SSDs that are NOT covered under the HP system service contract. (This issue was pointed out at the time of purchase; see purchase ticket RT#9579.)
  • The SSD likely needs to be swapped out for another of the same model, and these are Intel 3500 SSDs, an older model we no longer use.

So the outstanding question is how long it will take for Dasher to provide the new SSD.

  • Papaul emails Dasher about the failed SSD.
    • @Papaul: What date was this email on?
  • Dasher emails back to provide info needed by Intel.
    • What date was this reply?
  • Papaul emails Dasher the full SSD details
    • What date was this sent?

Has Dasher, at any point, provided any kind of timeline on how long it will take the SSD to be replaced?

@Papaul: please advise on the above questions, and please review the summary of issues and provide feedback on its accuracy. Thanks!

Emailed Dasher about the failed SSD on May 1:
Hello Brynden, I received the mainboard and was in the process of installing and testing everything when I realized that I was dropping connectivity on NIC1 and that SSD 1 had failed. Also, I didn't receive any return label for the bad mainboard. Can you please send me a return label and, if possible, a replacement SSD? Please see below for the SSD information; I have attached the SSD error.
Serial Number PHWL524504SB800RGN
Model INTEL
Media Type SSD
Capacity 800 GB

May 8, emailed Dasher again:
Hello Brynden,

We have one SSD that is bad on the same system where we replaced the mainboard. I sent you the information on the SSD last week. Can you please give me an update on the replacement SSD?

Thanks.

Dasher replied back on May 8:
Hi Papaul,

I need the SA# please. See the image below for where it is located.

Bo Rivera

I emailed the info again on May 10.

Got a reply back from Dasher on May 10:
Thank you Papaul,

I have put in a request to Intel Support.

They will reply with a form that we will need to fill out and then they will send out a new drive.

I’ll email you as soon as I get the warranty form.

Bo Rivera

Ok, I've emailed Dasher to inquire about this with the following:

Dasher Folks,

So it seems some of this conversation was out of the thread, between Bo and Papaul. I just want to loop back in and see what is going on. My understanding is that Wikimedia provided Dasher with the full SSD details on May 10th, and got a reply back from Bo advising he was contacting Intel support.

So we are now waiting on Intel warranty support, correct?

What exactly is the timeline for replacing the SSD in this system? I ask, since it is now down and not usable with any reliability, and we cannot just put a spare SSD in, since this is a system lease via Farnam.

Is the SSD replacement something that will take days, weeks?

Please advise,

That being said, we could likely just toss in a spare SSD, with a stalled task to swap it back out when the replacement comes in. This swap would be required, as we have to put back in the same model SSD. This would allow the system to remain online with disk redundancy.

@Papaul: Do we have any spare SSDs on the shelf for temp use in this system?

@Papaul:

The spares tracking shows that we have 3 of the Intel S3610 800GB SSDs on the spare shelf? We recently ordered these as shelf spares, along with 2 of the 1.6TB Intel S3610 SSDs.

So are those not on the shelf? Not authorizing them for this use, just asking about it. Please advise.

Dasher has started some actual movement on this (it seems) since I bugged them via email today, so we may see a replacement for this shortly.

@RobH yes, we do have some 800GB SSDs as spares, but the one we are trying to replace is from the DC S3500 series.

Ahh, sorry for the miscommunication then.

So, here is where we stand on this system:

  • It is a lease; if a shelf spare is used, it can ONLY be for temporary use. This means that when the warranty replacement SSD arrives, the shelf spare has to be wiped of data and returned to spares.
    • This is a lot of overhead to bring a single system back online, and is only worthwhile if this system cannot remain offline for another couple of weeks.
  • Dasher has responded to my email about the timeline for this SSD replacement today. Bo@Dasher advises that once we provide Intel with the shipment address details (which I have already confirmed and provided back to Dasher), they typically take a day or two to ship the replacement out.

So if this system can remain offline for another week or two, it would be easier to avoid using the shelf spare. If we use the shelf spare, it CANNOT remain in the system: the system is a lease, and the Intel S3500 SSD leased with it needs to be returned with the system at the end of the lease. Trying to track this kind of thing long term is a nightmare, so leased systems simply shouldn't have hardware swapped with shelf spares unless absolutely necessary.

If we do have to use a shelf spare (the Intel S3610 800GB), it's non-ideal and only for temporary use. It is an option, but one that is easier for us to avoid at this time.

@Gehel: Can you advise whether this can remain offline for another week or two for the SSD replacement? See my comment above for full details.

Yes, elastic2020 can stay offline for one more week.

Cool, we'll avoid using a shelf spare then, and I'll be following up with Dasher on a daily basis until resolution.

We've gotten a notice from Intel, forwarded from Dasher, that they'll be shipping a replacement disk and a return tag for the defective disk.

However, we didn't get a forward of the tracking info email. I've emailed Dasher requesting they give us this info.

Edit addition: Basically it took a few days last week of back and forth between Dasher and Intel support to get this going.

So it turns out Intel wants the disk sent back in advance. Does this disk still detect well enough for us to perform a wipe on it?

Otherwise we need to send this back. If we cannot handle the downtime once it's on its way, we may use a shelf spare. The use of a spare can be determined AFTER we get this defective disk sent off.

I've put elastic2020 into maint mode in icinga for the next month, and have shut it down.

@Papaul, you can now boot the system into a wipe and clear the defective SSD (if it detects).
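
Assuming the disk still enumerates (as /dev/sda here, an assumption), an ATA secure erase is probably the quickest way to wipe an SSD; a sketch, and obviously destructive:

# confirm the drive supports the security feature set and is "not frozen"
sudo hdparm -I /dev/sda
# DESTRUCTIVE: set a temporary password, then issue the secure erase
sudo hdparm --user-master u --security-set-pass p /dev/sda
sudo hdparm --user-master u --security-erase p /dev/sda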

We can keep elastic2020 down for a few more weeks if needed. The cluster is able to sustain the current load with -1 node.

Disk wipe complete; the system is back up with 1 disk.

Bad disk has been shipped to Intel. Please see below for shipping tracking information.

Selection_005.png (747×667 px, 348 KB)

Waiting for a new SSD. Then reimage, test, and back in the cluster if everything looks fine.

I just got a notice from Dasher in the last ten minutes.

Tracking 1Z3AX9710270067734. This will ship out to Papaul, and once it arrives he can install it.

@Gehel The new SSD is in place.

Script wmf_auto_reimage was launched by gehel on neodymium.eqiad.wmnet for hosts:

['elastic2020.codfw.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201706121255_gehel_18389.log.

Script wmf_auto_reimage was launched by gehel on neodymium.eqiad.wmnet for hosts:

['elastic2020.codfw.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201706121316_gehel_10193.log.

Only one disk is seen by the Debian installer; the RAID probably needs to be re-created outside of the OS. I'm checking...
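
For reference, on these HP Smart Array controllers the per-disk RAID 0 logical drives can be recreated from a live environment with hpssacli; a sketch, where the controller slot and drive IDs below are assumptions:

# list the controller, logical and physical drives
sudo hpssacli ctrl all show config
# create one RAID 0 logical drive per physical disk
sudo hpssacli ctrl slot=0 create type=ld drives=1I:1:1 raid=0
sudo hpssacli ctrl slot=0 create type=ld drives=1I:1:2 raid=0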

Script wmf_auto_reimage was launched by gehel on neodymium.eqiad.wmnet for hosts:

['elastic2020.codfw.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201706121510_gehel_5308.log.

Completed auto-reimage of hosts:

['elastic2020.codfw.wmnet']

and were ALL successful.

Mentioned in SAL (#wikimedia-operations) [2017-06-12T17:24:46Z] <gehel> running stress + bonnie on elastic2020 to check new hardware - T149006

Mentioned in SAL (#wikimedia-operations) [2017-06-13T08:01:39Z] <gehel> adding elastic2020 back in the elasticsearch cluster - T149006

Gehel added a subscriber: debt.

elastic2020 is back in rotation; stress tests show no issues. @debt: this can be closed...

Actually, we are still going to do one last test: switching traffic from eqiad to codfw and seeing whether that server crashes or not.

Change 359007 had a related patch set uploaded (by Gehel; owner: Gehel):
[operations/mediawiki-config@master] Revert "Test elastic2020 does not fall out of cluster"

https://gerrit.wikimedia.org/r/359007

Change 359007 merged by jenkins-bot:
[operations/mediawiki-config@master] Revert "Test elastic2020 does not fall out of cluster"

https://gerrit.wikimedia.org/r/359007

Mentioned in SAL (#wikimedia-operations) [2017-06-14T23:30:16Z] <catrope@tin> Synchronized wmf-config/InitialiseSettings.php: Send search traffic back to eqiad T149006 (duration: 00m 44s)

Tested a switchover and about 5 hours of traffic; elastic2020 seemed happy enough and acted like the rest of the servers. I think it's safe to call this fixed.

Nothing more we can test at this point. It looks like elastic2020 is alive and well. Let's cross a few fingers to make sure it stays that way.