|Open||None||T130702 Several es20XX servers keep crashing (es2017, es2019, es2015, es2014) since 23 March|
|Resolved||jcrespo||T160242 es2015 crashed on 2017-03-11|
There are multiple errors on that host, related to memory and CPU (maybe it is the wrong DIMM bank affecting the CPU or the other way around as those can be related to each other):
CreationTimestamp = 20170311055148.000000-360 ElementName = System Event Log Entry RecordData = Multi-bit memory errors detected on a memory device at location(s) DIMM_B3.
CreationTimestamp = 20170311054814.000000-360 ElementName = System Event Log Entry RecordData = CPU 2 has an internal error (IERR).
CreationTimestamp = 20170311054810.000000-360 ElementName = System Event Log Entry RecordData = Multi-bit memory errors detected on a memory device at location(s) DIMM_B3.
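To line these events up with the SAL (which logs in UTC), the timestamps have to be converted. A minimal sketch, assuming the values are CIM-style datetimes (yyyymmddHHMMSS.mmmmmm followed by a signed UTC offset in minutes); the function name `parse_cim_datetime` is made up for this illustration:

```python
import re
from datetime import datetime, timedelta, timezone

# Assumed layout: 14 digits of date/time, 6 digits of microseconds,
# then a sign and a 3-digit UTC offset expressed in minutes.
_CIM_RE = re.compile(r"^(\d{14})\.(\d{6})([+-])(\d{3})$")


def parse_cim_datetime(value):
    """Convert a CIM-style timestamp string to an aware UTC datetime."""
    match = _CIM_RE.match(value)
    if match is None:
        raise ValueError(f"not a CIM-style timestamp: {value!r}")
    stamp, micros, sign, offset = match.groups()
    minutes = int(offset) if sign == "+" else -int(offset)
    local = datetime.strptime(stamp, "%Y%m%d%H%M%S").replace(
        microsecond=int(micros),
        tzinfo=timezone(timedelta(minutes=minutes)),
    )
    return local.astimezone(timezone.utc)


# First DIMM_B3 error from the log above.
first_error = parse_cim_datetime("20170311054810.000000-360")
print(first_error.isoformat())
```

Under that assumption, -360 means the controller clock is six hours behind UTC, so the first error would have been logged at 11:48:10 UTC.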
@Papaul should we just try to change that DIMM and see what happens?
Description: CPU 1 has an internal error (IERR).
Mentioned in SAL (#wikimedia-operations) [2016-10-18T14:40:20Z] <marostegui> Shutting down es2015 for hardware maintenance - T147769
Papaul claimed this task. Oct 18 2016, 17:24
Papaul added a comment. Oct 18 2016, 17:32
Below are the steps taken to troubleshoot this issue.
1- Swapped CPU 1 with CPU 2
Dell will replace the main board and the CPU.
Please accept my apologies for the delayed response. I just came into office and hence there was a delay in the response.
The call has been booked for tomorrow. I have also requested the Dell Engineer to call you before visiting the place.
Dispatch Number - 324956405
Service Request Number – 945329880
Address - 1649 W Frankford Rd, Carrollton, TX 75007, USA
Do let me know if you have any further queries and I will be glad to assist.
Enterprise Technical Support Analyst
Dell EMC | NA Basic Server Support
Enterprise Remote Services and Solutions
It is already disabled since the last crash, so that is done at least:
# es2
'cluster24' => [
    '10.192.48.41' => 1, # es2016, master
    '10.192.0.141' => 2, # es2014 - compressed data
    # '10.192.32.130' => 1, # es2015, crashed T147769
@Papaul can you check if the network cable is plugged?
The system doesn't have network.
root@es2015:~# rm -fr /etc/udev/rules.d/70-persistent-net.rules
That is a usual issue when replacing mainboards (with integrated ethernet), but after rebooting there is still no link.
root@es2015:~# mii-tool eth0
eth0: no link
I am not sure what has happened, but something weird is going on and we may have lost its data.
The server rebooted itself while I was on it, went into PXE boot and started the installation process. I stopped it, but I am not sure whether I managed to do so before it wiped the disks.
Why did it boot via PXE??
I forced it to boot from disk, but it is not booting.
The RAID looks healthy from the raid controller (and bios) raid menu, the virtual disk is there.
But the server isn't booting anything:
Booting from Hard drive C:
I think it has been wiped, as it doesn't even show GRUB after selecting to boot from disk.
As I said, the hard disks are being shown in the RAID and BIOS menus.
0 Non-RAID Disk(s) found on the host adapter
0 Non-RAID Disk(s) handled by BIOS
1 Virtual Drive(s) found on the host adapter.
1 Virtual Drive(s) handled by BIOS

Integrated RAID Controller 1: Dell PERC <PERC H730 Mini> Configuration Utility
Main Menu > Physical Disk Management
Physical Disk 00:01:00: HDD, SATA, 1.818TB, Online, (512B)
Physical Disk 00:01:01: HDD, SATA, 1.818TB, Online, (512B)
Physical Disk 00:01:02: HDD, SATA, 1.818TB, Online, (512B)
Physical Disk 00:01:03: HDD, SATA, 1.818TB, Online, (512B)
Physical Disk 00:01:04: HDD, SATA, 1.818TB, Online, (512B)
Physical Disk 00:01:05: HDD, SATA, 1.818TB, Online, (512B)
Physical Disk 00:01:06: HDD, SATA, 1.818TB, Online, (512B)
Physical Disk 00:01:07: HDD, SATA, 1.818TB, Online, (512B)
Physical Disk 00:01:08: HDD, SATA, 1.818TB, Online, (512B)
Physical Disk 00:01:09: HDD, SATA, 1.818TB, Online, (512B)
Physical Disk 00:01:10: HDD, SATA, 1.818TB, Online, (512B)
Physical Disk 00:01:11: HDD, SATA, 1.818TB, Online, (512B)

Main Menu > Virtual Disk Management
Virtual Disk 0: RAID10, 10.913TB, Ready
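As a quick sanity check that the virtual disk still spans all twelve drives, the reported size can be compared with what a RAID10 of those disks should give. A small sketch using only the numbers shown by the controller; the variable names are illustrative:

```python
# Figures as displayed by the PERC configuration utility.
disk_count = 12
disk_size_tb = 1.818
reported_virtual_disk_tb = 10.913

# RAID10 mirrors every disk, so usable space is half of the raw total.
usable_tb = disk_count * disk_size_tb / 2
difference_tb = abs(usable_tb - reported_virtual_disk_tb)

print(f"expected ~{usable_tb:.3f} TB, reported {reported_virtual_disk_tb} TB")
print(f"difference: {difference_tb:.3f} TB")
```

The few GB of difference is presumably down to the per-disk size being rounded for display, so the array geometry looks intact. That says nothing about whether the contents survived.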
I got the shell on the installer which says:
┌────────────────────────┤ [!] Execute a shell ├────────────────────────┐
│                                                                       │
│ Interactive shell                                                     │
│ After this message, you will be running "ash", a Bourne-shell clone.  │
│                                                                       │
│ The root file system is a RAM disk. The hard disk file systems are    │
│ mounted on "/target".
But unfortunately there is no /target
~ # ls /
bin firmware lib mnt run tmp
dev init lib64 proc sbin usr
etc initrd media root sys var
~ # ls /mnt
~ #
I also tried to reinstall grub, just in case it was the only thing deleted, but that failed too. So maybe it was indeed reimaged, and when I stopped it, it was already halfway through the installation process :-(
Probably our best chance is reimage it and copy the content from es2014.
The new mainboard is configured to always boot from PXE.
System BIOS Settings > Boot Settings > BIOS Boot Settings
Boot Sequence
[Integrated NIC 1...]
[Hard drive C:]
Once the server booted up the first time, as it didn't have the ethernet connected, it booted fine to the disk, that is why we could get into it via idrac.
When I rebooted it to make sure it would boot up fine, with the correct date taken from NTP etc, it went to PXE boot as it is the first option, and got reimaged until I caught it in the middle of the process, but probably too late.
Before doing a proper reimage, we need to change the boot sequence to boot first from disk and if not, from the NIC.
I am not able to change that remotely from the BIOS, @Papaul can you change that for us?
If we do not change that before the reimage, the server will reimage every time it gets rebooted, and since it gets rebooted at the end of the reimage process, it will loop on reimaging forever.
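The loop described above can be illustrated with a toy model. This is only a sketch: it assumes the network always offers this host an installer when asked, and the function names and device labels are invented for the example.

```python
def boot_target(boot_sequence, pxe_offers_install, disk_bootable):
    """Return what the machine ends up running for a given boot order."""
    for device in boot_sequence:
        if device == "nic" and pxe_offers_install:
            return "installer"
        if device == "disk" and disk_bootable:
            return "os"
    return "no boot device"


def simulate(boot_sequence, reboots):
    """Boot repeatedly; the installer reboots the host once it finishes."""
    history = []
    for _ in range(reboots):
        # Assumption: PXE always answers with an installer, and the disk
        # stays bootable because each install leaves a fresh OS behind.
        history.append(boot_target(boot_sequence, True, True))
    return history


print(simulate(["nic", "disk"], 3))  # NIC first: reinstall on every boot
print(simulate(["disk", "nic"], 3))  # disk first: boots the installed OS
```

With the NIC first, every boot lands in the installer again; with the disk first, PXE is only reached when the disk is not bootable.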
Setting this to High Priority to make sure this is done soon, as we need to copy 4TB over and it can take several days so ideally we would like to start the transfer as soon as possible.
@MoritzMuehlenhoff kindly helped and suggested:
racadm config -g cfgServerInfo -o cfgServerFirstBootDevice HDD
I tried it, but it had no effect on the boot order:
/admin1-> racadm config -g cfgServerInfo -o cfgServerFirstBootDevice HDD
Object value modified successfully
But after restarting it keeps going to PXE
Probably what happened is that on the board change the BIOS was reset and never changed back to the default "boot from disk", a problem I think we had with some of the servers in the past.
The right way to fix this is to explicitly "unlock" in the DHCP configuration on puppet only those servers that are meant to boot from PXE, to avoid accidental reimages. Otherwise this will keep happening: a single person running the reimage script on the wrong server, for example, means it is wiped and its data is lost.
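To make the "unlock" idea concrete, here is a rough sketch of the decision it implies. It is not how DHCP or puppet are configured here; `PxeGate` and its methods are hypothetical names, and the one-shot behaviour (the unlock is consumed by the first install) is an assumption added for the illustration.

```python
class PxeGate:
    """Toy model: a host only gets an installer if it was unlocked first."""

    def __init__(self):
        self._unlocked = set()

    def unlock(self, hostname):
        """Explicitly allow one reimage of the given host."""
        self._unlocked.add(hostname)

    def action_for(self, hostname):
        """Decide what a PXE-booting host should be told to do."""
        if hostname in self._unlocked:
            # Assumed one-shot: the reboot after the install is not served
            # another installer.
            self._unlocked.discard(hostname)
            return "install"
        return "localboot"


gate = PxeGate()
print(gate.action_for("es2015"))  # not unlocked: falls back to the disk
gate.unlock("es2015")
print(gate.action_for("es2015"))  # unlocked once: installer is served
print(gate.action_for("es2015"))  # unlock consumed: back to the disk
```

With something along these lines, a wrong BIOS boot order or a reimage pointed at the wrong host would end in a normal boot from disk instead of a wipe.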
Yes, agreed that we also need a long-term fix, otherwise this can indeed happen again.
Meanwhile, and as a shortcut, let's try to change the BIOS order and get this server back as soon as we can (as the DC switchover is approaching)
The server is now set up, and ready to get the data from es2014.
Things that have been done:
- Tested a new way to prevent a server from wiping its partitions if it happens to boot up via PXE without being told to do so
- Papaul enabled the IPMI and it now works - was disabled.
I've started the transfer from es2016 to es2015. The transfer may take 11-12 hours, so it should finish by ~6-7 UTC. es2016 and es2015 will be down during it. I have not changed the replication topology on es2014, but it would be available should an emergency happen (although lagged).
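For reference, the throughput that estimate implies can be worked out from the figures in this task. A back-of-the-envelope sketch, assuming the dataset is the roughly 4TB mentioned earlier and that TB means 10^12 bytes; `required_rate_mb_s` is just an illustrative helper:

```python
DATASET_BYTES = 4 * 10**12  # assumed: ~4 TB, decimal units


def required_rate_mb_s(hours, size_bytes=DATASET_BYTES):
    """Average rate in MB/s needed to move size_bytes in the given time."""
    return size_bytes / (hours * 3600) / 10**6


for hours in (11, 12):
    print(f"{hours} h -> {required_rate_mb_s(hours):.1f} MB/s")
```

That works out to somewhere around 93-101 MB/s sustained, which is in the range of a single 1 Gbit/s link (125 MB/s before overhead).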