es2015 crashed on 2017-03-11
Closed, Resolved (Public)

jcrespo created this task. Sat, Mar 11, 8:25 AM

There are multiple errors on that host, related to both memory and CPU (possibly a faulty DIMM affecting the CPU, or the other way around, as the two can be related):

		CreationTimestamp = 20170311055148.000000-360
		ElementName = System Event Log Entry
		RecordData = Multi-bit memory errors detected on a memory device at location(s) DIMM_B3.
		CreationTimestamp = 20170311054814.000000-360
		ElementName = System Event Log Entry
		RecordData = CPU 2 has an internal error (IERR).
		CreationTimestamp = 20170311054810.000000-360
		ElementName = System Event Log Entry
		RecordData = Multi-bit memory errors detected on a memory device at location(s) DIMM_B3.

@Papaul should we just try to change that DIMM and see what happens?

Marostegui moved this task from Triage to In progress on the DBA board. Mon, Mar 13, 9:47 AM

From T147769:

Description: CPU 1 has an internal error (IERR).

Mentioned in SAL (#wikimedia-operations) [2016-10-18T14:40:20Z] <marostegui> Shutting down es2015 for hardware maintenance - T147769
Papaul claimed this task. Oct 18 2016, 17:24
Papaul added a comment. Oct 18 2016, 17:32

Below are the steps taken to troubleshoot this issue.

1- Swapped CPU 1 and CPU 2

So it looks like the CPU is broken and needs replacement.
@Papaul, shall we drop the DIMM change and instead replace the CPU, which has now failed twice?

Since we swapped the CPUs in T147769 and we still have the same error, I will contact Dell once on site tomorrow for a CPU replacement.

Sounds good - thank you! If you need to "justify" it, the iDRAC logs are here: T160242#3094702

Dell will replace the main board and the CPU.

Hi Papaul,

Please accept my apologies for the delayed response. I just came into the office, hence the delay in responding.

The call has been booked for tomorrow. I have also requested the Dell Engineer to call you before visiting the place.

Dispatch Number - 324956405
Service Request Number – 945329880
Address - 1649 W Frankford Rd, Carrollton, TX 75007, USA

Do let me know if you have any further queries and I will be glad to assist.

Regards,
Madhu RN,
Enterprise Technical Support Analyst
Dell EMC | NA Basic Server Support
Enterprise Remote Services and Solutions

Thank you very much! We will shut down the server tomorrow ahead of time - ping us if you have more details about the predicted schedule for it.

It has been disabled since the last crash, so that is done at least:

        # es2
        'cluster24' => [
                '10.192.48.41'  => 1, # es2016, master
                '10.192.0.141'  => 2, # es2014 - compressed data
#               '10.192.32.130' => 1, # es2015, crashed T147769

Thanks Papaul!

Mentioned in SAL (#wikimedia-operations) [2017-03-16T13:38:07Z] <marostegui> Shutdown es2015 for maintenance - T160242

@Papaul es2015 is now off.
Please turn it on once you are done with the main board replacement.

Thank you!

The Dell technician didn't show up, and @Papaul has arranged another appointment for Monday.
I will be off on Monday, so the server will need to be powered off by @jcrespo.

I have just powered on the server and started MySQL and replication.

Mentioned in SAL (#wikimedia-operations) [2017-03-20T13:55:35Z] <jynus> shutting down es2015 for maintenance T160242

I see the server is still down, @Papaul did the technician finally show up yesterday?

Papaul assigned this task to Marostegui. Tue, Mar 21, 5:43 PM

Main board and CPU2 replacement complete. System back up online.

Thanks Papaul! I will take it from here

@Papaul can you check if the network cable is plugged?
The system doesn't have network.
I ran:

root@es2015:~# rm -fr /etc/udev/rules.d/70-persistent-net.rules

That is a usual fix when replacing mainboards with integrated ethernet, but after rebooting there is still no link:

root@es2015:~# mii-tool eth0
eth0: no link
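
For the record, a couple of extra checks that help tell a physical-layer problem (cable or switch port) from an interface-naming problem after removing the persistent-net rule. This is a generic sketch, not output captured on es2015:

    # Did udev regenerate the persistent-net rule with the new onboard MAC?
    cat /etc/udev/rules.d/70-persistent-net.rules
    # Is the interface present, up, and does the kernel see a carrier?
    ip link show eth0
    ethtool eth0 | grep -i 'link detected'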

es2015.codfw.wmnet needs a mysql_upgrade run before restarting replication. BTW, I fixed some things in the new package: mysql_upgrade can now be run correctly from the path.
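
As a reference, the steps on the replica would roughly be the ones below; a minimal sketch assuming the standard mysql_upgrade workflow, with replication resumed by hand afterwards (credentials, defaults files and the exact service name depend on our mariadb packaging):

    # Sketch only, not a log from es2015: upgrade the system tables
    # against the running server, then restart and resume replication.
    mysql_upgrade
    service mysql restart        # service name varies with the package
    mysql -e "START SLAVE;"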

I have restarted it myself - it responds to ping again - I assume Papaul did something?

Papaul just replugged the cable and it works now:

root@es2015:~# mii-tool eth0
eth0: negotiated, link ok

Thanks @Papaul!

I am not sure what has happened, but something went weird and we may have lost its data.
The server rebooted itself while I was on it, went into PXE boot, and started the installation process. I stopped it, but I am not sure whether I did so before it wiped the disks.

Why did it boot via PXE??

I forced it to boot from disk, but it is not booting.
The RAID looks healthy from the RAID controller (and BIOS) menu, and the virtual disk is there.
But the server isn't booting anything:

Booting from Hard drive C:

I think it has been wiped, as it doesn't even show GRUB after selecting to boot from disk.
As I said, the hard disks are shown in the RAID and BIOS menus:

0 Non-RAID Disk(s) found on the host adapter
0 Non-RAID Disk(s) handled by BIOS

1 Virtual Drive(s) found on the host adapter.
1 Virtual Drive(s) handled by BIOS
Integrated RAID Controller 1: Dell PERC <PERC H730 Mini> Configuration
  Utility
  Main Menu > Physical Disk Management

  Physical Disk 00:01:00: HDD, SATA, 1.818TB, Online, (512B)
  Physical Disk 00:01:01: HDD, SATA, 1.818TB, Online, (512B)
  Physical Disk 00:01:02: HDD, SATA, 1.818TB, Online, (512B)
  Physical Disk 00:01:03: HDD, SATA, 1.818TB, Online, (512B)
  Physical Disk 00:01:04: HDD, SATA, 1.818TB, Online, (512B)
  Physical Disk 00:01:05: HDD, SATA, 1.818TB, Online, (512B)
  Physical Disk 00:01:06: HDD, SATA, 1.818TB, Online, (512B)
  Physical Disk 00:01:07: HDD, SATA, 1.818TB, Online, (512B)
  Physical Disk 00:01:08: HDD, SATA, 1.818TB, Online, (512B)
  Physical Disk 00:01:09: HDD, SATA, 1.818TB, Online, (512B)
  Physical Disk 00:01:10: HDD, SATA, 1.818TB, Online, (512B)
  Physical Disk 00:01:11: HDD, SATA, 1.818TB, Online, (512B)
Main Menu > Virtual Disk Management

Virtual Disk 0: RAID10, 10.913TB, Ready

I got a shell in the installer, which says:

┌────────────────────────┤ [!] Execute a shell ├────────────────────────┐
   │                                                                       │
   │                           Interactive shell                           │
   │ After this message, you will be running "ash", a Bourne-shell clone.  │
   │                                                                       │
   │ The root file system is a RAM disk. The hard disk file systems are    │
   │ mounted on "/target".

But unfortunately there is no /target

~ # ls /
bin       firmware  lib       mnt       run       tmp
dev       init      lib64     proc      sbin      usr
etc       initrd    media     root      sys       var
~ # ls /mnt
~ #

I also tried to reinstall GRUB, in case that was the only thing deleted, but that failed as well. So it was probably indeed reimaged, and when I stopped it the installation was already halfway through :-(
Probably our best option is to reimage it and copy the content over from es2014.
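
For context, the GRUB reinstall attempt from the installer's rescue shell would roughly look like the sketch below (device and volume names are illustrative, not taken from es2015); it only works if the old root filesystem is still intact, which did not seem to be the case here:

    # Sketch only: mount the old root and reinstall GRUB from a chroot.
    vgchange -ay                         # activate any surviving LVM volumes
    mount /dev/mapper/vg0-root /mnt      # hypothetical root LV
    mount --bind /dev  /mnt/dev
    mount --bind /proc /mnt/proc
    mount --bind /sys  /mnt/sys
    chroot /mnt grub-install /dev/sda
    chroot /mnt update-grub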

The new mainboard is configured to always boot from PXE.

System BIOS Settings > Boot Settings > BIOS Boot Settings

Boot Sequence                                         [Integrated NIC 1...]
                                                      [Hard drive C:]

When the server booted up the first time, as it didn't have ethernet connected, it booted fine from disk; that is why we could get into it via iDRAC.
When I rebooted it to make sure it would come up fine, with the correct date taken from NTP etc., it went to PXE boot, as that is the first option, and started reimaging until I caught it in the middle of the process - but probably too late.

Marostegui reassigned this task from Marostegui to Papaul. Wed, Mar 22, 8:20 AM
Marostegui raised the priority of this task from "Normal" to "High".

Before doing a proper reimage, we need to change the boot sequence to boot from disk first and only fall back to the NIC.
I am not able to change that remotely from the BIOS. @Papaul, can you change that for us?
If we do not change it before reimaging, the server will reimage every time it gets rebooted, and it gets rebooted at the end of the reimage process, so it would keep looping through reimages forever.

Setting this to High priority to make sure this is done soon, as we need to copy 4 TB over, which can take several days, so ideally we would like to start the transfer as soon as possible.

@MoritzMuehlenhoff kindly helped and suggested: racadm config -g cfgServerInfo -o cfgServerFirstBootDevice HDD
Which I tried, but it had no effect on the boot order:

/admin1-> racadm config -g cfgServerInfo -o cfgServerFirstBootDevice HDD
Object value modified successfully

But after restarting, it keeps going to PXE.
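
For what it's worth, on newer iDRAC firmware the boot order is normally changed through the get/set interface rather than the legacy config groups; a hedged sketch below (the attribute and device names vary per BIOS generation, so they may need adjusting), which stages the change as a BIOS job and power-cycles to apply it:

    # Sketch only: put the hard disk ahead of the integrated NIC.
    racadm set BIOS.BiosBootSettings.BootSeq HardDisk.List.1-1,NIC.Integrated.1-1-1
    # Schedule the BIOS configuration job and reboot so it takes effect.
    racadm jobqueue create BIOS.Setup.1-1 -r pwrcycle -s TIME_NOW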

Probably what happened is that, on the board change, the BIOS was reset and not changed back to the default "boot from disk" - a problem I think we have had with some servers in the past.

The right way to fix this is to require certain servers to be explicitly "unlocked" in the DHCP configuration in puppet before they can boot from PXE, to avoid accidental reimages. Otherwise this will keep happening - a single person running the reimage script on the wrong server, for example, means it is wiped and its data is lost.

Yes, agreed that we also need a long-term fix, otherwise this can indeed happen again.
Meanwhile, as a shortcut, let's try to change the BIOS boot order and get this server back as soon as we can (the DC switchover is approaching).

Script wmf_auto_reimage was launched by jynus on neodymium.eqiad.wmnet for hosts:

['es2015.codfw.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201703221221_jynus_23855.log.

Change 344149 had a related patch set uploaded (by Jcrespo):
[operations/puppet] install_server: Test the new db recipe db-no-srv-format on es2015

https://gerrit.wikimedia.org/r/344149

Change 344149 merged by Jcrespo:
[operations/puppet] install_server: Test the new db recipe db-no-srv-format on es2015

https://gerrit.wikimedia.org/r/344149

Script wmf_auto_reimage was launched by jynus on neodymium.eqiad.wmnet for hosts:

['es2015.codfw.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201703221446_jynus_415.log.

Change 344160 had a related patch set uploaded (by Jcrespo):
[operations/puppet] install_server: Do not remove the lvm partition for db-no-srv-format

https://gerrit.wikimedia.org/r/344160

Change 344160 merged by Jcrespo:
[operations/puppet] install_server: Do not remove the lvm partition for db-no-srv-format

https://gerrit.wikimedia.org/r/344160

Marostegui lowered the priority of this task from "High" to "Normal". Wed, Mar 22, 4:42 PM
Marostegui removed Papaul as the assignee of this task.

The server is now set up and ready to get the data from es2014.
Things that have been done:

  • Tested a new way to prevent a server from wiping its partitions if it happens to boot via PXE without being told to do so
  • Papaul enabled IPMI and it now works - it had been disabled.

Completed auto-reimage of hosts:

['es2015.codfw.wmnet']

Of which those FAILED:

set(['es2015.codfw.wmnet'])

I am going to use the codfw master es2016, not es2014, because the latter has compressed tables - something we have yet to fix and do not want to propagate.

I've started the transfer from es2016 to es2015; it may take 11-12 hours, so it should finish by ~6-7 UTC. es2016 and es2015 will be down during it. I have not changed the replication topology on es2014, but it would be available should an emergency happen (although lagged).
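
The task doesn't record the copy method, but the quoted numbers are consistent with a plain stream of the datadir over the network: ~4 TB in 11-12 hours is roughly 100 MB/s, i.e. close to filling a 1 Gbit link. A sketch of such a copy, assuming both mysqld instances are stopped; the hostnames are the ones from this task, while the datadir path and netcat flags are illustrative (netcat option syntax differs between implementations):

    # On the destination (es2015), with mysqld stopped and the datadir empty:
    nc -l -p 4444 | tar -x -C /srv/sqldata

    # On the source (es2016), with mysqld stopped:
    tar -c -C /srv/sqldata . | nc es2015.codfw.wmnet 4444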

jcrespo closed this task as "Resolved". Thu, Mar 23, 9:48 AM