db2085/db1106 don't boot with 4.9.0-8-amd64
Closed, Resolved. Public

Description

I did a full-upgrade and the server doesn't boot anymore with 4.9.0-8-amd64. This was tried several times.

While this could be a hardware issue, the server does boot with 4.9.0-7-amd64, so there is a small but real chance that this is a kernel/OS regression.

Loading freezes at:

Loading ramdisk... step

I will leave the server with 4.9.0-7-amd64 for now, as I cannot leave it depooled for long.

Details

Related Gerrit Patches:
operations/mediawiki-config : master : db-codfw.php: Depool db2085
operations/mediawiki-config : master : db-eqiad.php: Depool db1106

Event Timeline

jcrespo created this task. Jan 28 2019, 1:37 PM
Restricted Application added a subscriber: Aklapper. Jan 28 2019, 1:37 PM

db2089 failed once when rebooted into 4.9.0-8-amd64 and worked the second time. I am worried that this may be a random failure.

Marostegui added a comment. Edited Jan 30 2019, 2:13 PM

I had a chat with Moritz about this, and he was not sure it is a kernel issue as such (as in something really wrong with the kernel). It could be some sort of hardware issue, or just a one-off, although you mentioned it was tried several times.

Do we have any other exact combination of that hardware and kernel?

db2085 and db2089 likely come from the same batch, and no other batch showed those issues, so it may be happening only on those hosts.

We could narrow this down further by enabling debug flags for the initrd. I don't remember the specific options off the top of my head, but we can look into this next week. As Manuel mentioned, my hunch is that this is a hw issue which manifests during the reboots, but which is not caused by the kernel change between -7 and -8 itself.

Marostegui triaged this task as Medium priority. Feb 1 2019, 11:18 PM
Marostegui moved this task from Triage to In progress on the DBA board.

Same thing just happened with db1106 (PowerEdge R630 - same chassis as db2085)
@MoritzMuehlenhoff can you help us with the approach you mentioned at T214840#4918369 ?

Marostegui renamed this task from "db2085 doesn't boot with 4.9.0-8-amd64" to "db2085/db1106 don't boot with 4.9.0-8-amd64". Feb 11 2019, 5:07 PM
Marostegui updated the task description. Feb 11 2019, 5:17 PM

@paravoid gave us some food for thought:

stuck at "loading ramdisk" is sometimes an indication of misconfigured serial redirection after boot
basically when Linux and the BIOS are fighting over control of the serial port

Change 490079 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-codfw.php: Depool db2085

https://gerrit.wikimedia.org/r/490079

Change 490079 merged by jenkins-bot:
[operations/mediawiki-config@master] db-codfw.php: Depool db2085

https://gerrit.wikimedia.org/r/490079

Mentioned in SAL (#wikimedia-operations) [2019-02-12T15:38:08Z] <marostegui@deploy1001> Synchronized wmf-config/db-codfw.php: Depool db2085 - T214840 (duration: 00m 47s)

Mentioned in SAL (#wikimedia-operations) [2019-02-12T15:38:11Z] <marostegui> Stop MySQL on db2085 - T214840

@MoritzMuehlenhoff has installed 4.9.144-3 on db2085.
Out of 8 reboots, two of them got stuck (in a row).
1st reboot by @MoritzMuehlenhoff OK
2nd reboot by @MoritzMuehlenhoff OK
3rd reboot by @Marostegui OK
4th reboot by @Marostegui OK
5th reboot by @Marostegui OK
6th reboot by @Marostegui FAIL
7th reboot by @Marostegui FAIL
8th reboot by @Marostegui OK

Marostegui added a comment. Edited Feb 12 2019, 4:31 PM

After restarting with the previous kernel 4.9.0-7-amd64 on db2085, the first time it didn't boot up, the second time it did.

@MoritzMuehlenhoff has removed -8 kernel from db2085 and I have rebooted it 8 times with -7 now

1st reboot: OK
2nd reboot: OK
3rd reboot: OK
4th reboot: OK
5th reboot: OK
6th reboot: FAIL
7th reboot: OK
8th reboot: OK

This is the same pattern as -8 had at T214840#4948016 where the 6th reboot failed.

There is also the fact that yesterday @jcrespo rebooted db1106 with -6 and the first time it didn't work (similar to what happened with -7 at T214840#4948088). So it might not be related to any specific kernel version?
We should probably investigate T214840#4944412 to either confirm that or rule it out.

Mentioned in SAL (#wikimedia-operations) [2019-02-12T17:04:55Z] <marostegui> Start MySQL again on db2085 for s1 and s8 - T214840

Mentioned in SAL (#wikimedia-operations) [2019-02-13T06:12:04Z] <marostegui> Stop MySQL on db2085 to keep debugging kernel issues - T214840

db2085:
I can confirm that the BIOS Serial Communication setting redirects to COM2 (which is ttyS1).
That matches the console= parameter on the kernel command line:

linux   /boot/vmlinuz-4.9.0-7-amd64 root=UUID=63e5ddbd-3c18-4bf5-ad22-88458ec175b7 ro ixgbe.allow_unsupported_sfp=1 console=ttyS1,115200n8 elevator=deadline

I have changed the BIOS to COM1 (ttyS0).
Now I am going to reboot this kernel 4.9.0-7-amd64 8 times again to see how it goes.
If it goes fine, I will install -8 and do the same.
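
The COM2/ttyS1 comparison above can also be done mechanically. What follows is only a minimal sketch: the helper names (`serial_consoles`, `matches_bios`) are made up for illustration, and the mapping table only covers the two ports discussed in this task. On a running host the command line would come from /proc/cmdline; here the line quoted above is embedded as a string.

```python
import re
from typing import List

# Conventional mapping between BIOS COM ports and Linux serial devices
# (COM1 -> ttyS0, COM2 -> ttyS1), as used in the comment above.
COM_TO_TTY = {"COM1": "ttyS0", "COM2": "ttyS1"}


def serial_consoles(cmdline: str) -> List[str]:
    """Return the serial devices named by console= parameters, in order."""
    return re.findall(r"(?:^|\s)console=(ttyS\d+)", cmdline)


def matches_bios(cmdline: str, bios_com_port: str) -> bool:
    """True if the last serial console= entry is the port the BIOS redirects to."""
    consoles = serial_consoles(cmdline)
    return bool(consoles) and consoles[-1] == COM_TO_TTY[bios_com_port]


# Kernel command line as quoted in this task (hypothetical variable name).
CMDLINE = (
    "/boot/vmlinuz-4.9.0-7-amd64 root=UUID=63e5ddbd-3c18-4bf5-ad22-88458ec175b7 ro "
    "ixgbe.allow_unsupported_sfp=1 console=ttyS1,115200n8 elevator=deadline"
)

if __name__ == "__main__":
    print("serial consoles:", serial_consoles(CMDLINE))
    print("matches BIOS COM2:", matches_bios(CMDLINE, "COM2"))
    print("matches BIOS COM1:", matches_bios(CMDLINE, "COM1"))
```

With the BIOS switched to COM1 and the kernel still on console=ttyS1, the last check reports a mismatch, which is the configuration being tested in the next set of reboots.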

db2085 with kernel 4.9.0-7-amd64 reboots, another FAIL at the 6th and 7th reboot (similar pattern as with kernel -8 at T214840#4948016):

1st reboot: OK
2nd reboot: OK
3rd reboot: OK
4th reboot: OK
5th reboot: OK
6th reboot: FAIL
7th reboot: FAIL
8th reboot: OK

Marostegui added a comment. Edited Feb 13 2019, 7:01 AM

db2085 current BIOS setting:

Marostegui added a comment. Edited Feb 13 2019, 7:27 AM

db2085: debug added to the kernel boot, to see if we catch something

	linux	/boot/vmlinuz-4.9.0-7-amd64 root=UUID=63e5ddbd-3c18-4bf5-ad22-88458ec175b7 ro ixgbe.allow_unsupported_sfp=1 console=ttyS1,115200n8 elevator=deadline debug

Rebooting 8 times again.

db2085 reboots with 4.9.0-7 with debug enabled - all fine:

1st reboot: OK
2nd reboot: OK
3rd reboot: OK
4th reboot: OK
5th reboot: OK
6th reboot: OK
7th reboot: OK
8th reboot: OK
9th reboot: OK
10th reboot: OK

db2085 reboots with 4.9.0-8 with debug enabled:

1st reboot: OK
2nd reboot: FAIL (unfortunately no kernel error trace just an automatic reboot)

However, after that automatic reboot, we got something on the startup:

UEFI0078: One or more Machine Check errors occurred in the previous boot.
Check the System Event Log (SEL) to identify the source of the Machine Check
error and resolve the issues.

Digging through the logs we can see:

		CreationTimestamp = 20190213081039.000000-360
		ElementName = System Event Log Entry
		RecordData = An OEM diagnostic event occurred.
		RecordFormat = string Description
		RecordID = 73

And then the interesting ones

/admin1/system1/logs1/log1-> show record13

	properties
		CreationTimestamp = 20190213081038.000000-360
		ElementName = System Event Log Entry
		RecordData = CPU 1 machine check error detected.
		RecordFormat = string Description
		RecordID = 61

/admin1/system1/logs1/log1-> show record26

	properties
		CreationTimestamp = 20190213081037.000000-360
		ElementName = System Event Log Entry
		RecordData = CPU 2 machine check error detected.
		RecordFormat = string Description
		RecordID = 48

db2085 got stuck when booting up on:

[    0.560579] x86: Booting SMP configuration:
[    0.565246] .... node  #1, CPUs:        #1
[    0.674090] .... node  #0, CPUs:    #2

Marostegui assigned this task to Papaul. Feb 13 2019, 8:59 AM
Marostegui edited projects, added ops-codfw; removed Packaging.
Marostegui added a subscriber: Papaul.

After power cycling db2085, this is what happened:

reboot: OK
reboot: OK
reboot: FAIL
reboot: FAIL
reboot: FAIL

Error on POST:

Enumerating Boot options... Done

UEFI0078: One or more Machine Check errors occurred in the previous boot.
Check the System Event Log (SEL) to identify the source of the Machine Check
error and resolve the issues.


Available Actions:
F1 to Continue and Retry Boot Order
F2 for System Setup (BIOS)
F10 for LifeCycle Controller
- Enable/Configure iDRAC
- Update or Backup/Restore Server Firmware
- Help Install an Operating System
F11 for Boot Manager

And we get again CPU errors on getsel:

Record:      1
Date/Time:   02/13/2019 08:37:22
Source:      system
Severity:    Ok
Description: Log cleared.
-------------------------------------------------------------------------------
Record:      2
Date/Time:   02/13/2019 08:53:48
Source:      system
Severity:    Ok
Description: A problem was detected related to the previous server boot.
-------------------------------------------------------------------------------
Record:      3
Date/Time:   02/13/2019 08:53:48
Source:      system
Severity:    Critical
Description: CPU 1 machine check error detected.
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Record:      8
Date/Time:   02/13/2019 08:53:48
Source:      system
Severity:    Critical
Description: CPU 2 machine check error detected.
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Record:      21
Date/Time:   02/13/2019 08:53:49
Source:      system
Severity:    Critical
Description: CPU 1 machine check error detected.
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Record:      34
Date/Time:   02/13/2019 08:53:50
Source:      system
Severity:    Critical
Description: CPU 2 machine check error detected.
-------------------------------------------------------------------------------

Apart from those errors, we have dozens of:

Record:      7
Date/Time:   02/13/2019 08:53:48
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
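
With dozens of OEM diagnostic entries around them, the critical records are easy to miss. The snippet below is a rough sketch of how the getsel output could be filtered, assuming it has been saved as text in the same "Key: value" layout with dashed separators shown above. The function names and the `SAMPLE` excerpt are illustrative only; the sample reuses four records from this task.

```python
from collections import Counter


def parse_sel(text):
    """Split getsel-style text into a list of {field: value} dicts."""
    records, current = [], {}
    for raw in text.splitlines():
        line = raw.strip()
        # A dashed separator (or blank line) ends the current record.
        if not line or set(line) == {"-"}:
            if current:
                records.append(current)
                current = {}
            continue
        key, sep, value = line.partition(":")
        if sep:
            current[key.strip()] = value.strip()
    if current:
        records.append(current)
    return records


def critical_summary(text):
    """Count descriptions of records whose Severity is Critical."""
    return Counter(
        r.get("Description", "")
        for r in parse_sel(text)
        if r.get("Severity") == "Critical"
    )


SAMPLE = """\
Record:      2
Date/Time:   02/13/2019 08:53:48
Source:      system
Severity:    Ok
Description: A problem was detected related to the previous server boot.
-------------------------------------------------------------------------------
Record:      3
Date/Time:   02/13/2019 08:53:48
Source:      system
Severity:    Critical
Description: CPU 1 machine check error detected.
-------------------------------------------------------------------------------
Record:      7
Date/Time:   02/13/2019 08:53:48
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      8
Date/Time:   02/13/2019 08:53:48
Source:      system
Severity:    Critical
Description: CPU 2 machine check error detected.
-------------------------------------------------------------------------------
"""

if __name__ == "__main__":
    for description, count in critical_summary(SAMPLE).items():
        print(count, description)
```

Splitting on the first colon only keeps the timestamp in Date/Time intact, since that value contains colons of its own.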

I am going to assign this to @Papaul to get some advice on how we should proceed with regard to these CPU errors.
We should try to reproduce this on db1106 and see if we get the same issue.

Change 490291 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Depool db1106

https://gerrit.wikimedia.org/r/490291

Change 490291 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Depool db1106

https://gerrit.wikimedia.org/r/490291

Mentioned in SAL (#wikimedia-operations) [2019-02-13T09:21:50Z] <marostegui@deploy1001> Synchronized wmf-config/db-eqiad.php: Depool db1106 T214840 (duration: 00m 53s)

Mentioned in SAL (#wikimedia-operations) [2019-02-13T09:22:10Z] <marostegui> Stop MySQL on db1106 - T214840

db1106 with 4.9.0-8 with debug enabled on the kernel, reboot sequence:

1st reboot: FAIL
2nd reboot: OK
3rd reboot: OK
4th reboot: FAIL
5th reboot: OK
6th reboot: FAIL
7th reboot: OK
8th reboot: FAIL

I found this on one of the failed boots:

UEFI0019: Lifecycle Controller (LC) is unable to complete a requested task or
function and prevented the boot process from completing on multiple attempts.
LC is in Recovery Mode.
Repair Lifecycle Controller firmware using the Lifecycle Controller Dell Update
Package (DUP) or Lifecycle Controller Repair Package via iDRAC. For more
information, see Lifecycle Controller User's Guide.

I will ping @Cmjohnson about it, let's try to upgrade all firmware asap.

Mentioned in SAL (#wikimedia-operations) [2019-02-13T15:46:02Z] <marostegui> Stop MySQL on db1106 for onsite maintenance - this will generate lag on s1 labs - T214840

Chris has upgraded FW/BIOS on db1106 (thanks!) - so tomorrow I will do a few more reboots to keep debugging this.

@Marostegui in most cases the CPU1/CPU2 machine check error is caused by an outdated BIOS. I would recommend that we first update the BIOS. The system BIOS right now is at 2.4.3 and there is a new version out (2.9.1) from 11/02/2019. After this we can check some settings in the BIOS under BIOS profile.

Thank you - let me know when it is a good moment to put this host down and get it upgraded

Mentioned in SAL (#wikimedia-operations) [2019-02-14T07:36:54Z] <marostegui> Stop MySQL on db1106 for reboot - T214840

After the FW and BIOS upgrade I have rebooted db1106 a number of times with 4.9.0-8 and this is the result:

1st reboot: OK
2nd reboot: OK
3rd reboot: OK
4th reboot: OK
5th reboot: OK
6th reboot: OK
7th reboot: OK
8th reboot: OK
9th reboot: OK
10th reboot: OK

So it looks like the BIOS and FW upgrade worked on db1106.
Let's wait for db2085 to be upgraded by @Papaul and do the same test.
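
As a rough sanity check on how convincing ten clean reboots are, the pre-upgrade failure rate on db1106 can be plugged into a simple calculation. This is only a sketch: it assumes each reboot fails independently with a fixed probability, which is a simplification (the failures earlier in this task sometimes came in a row), and the function names are made up for illustration.

```python
def failure_rate(outcomes):
    """Fraction of reboots recorded as FAIL."""
    return outcomes.count("FAIL") / len(outcomes)


def prob_all_ok(p_fail, n):
    """Chance of n consecutive OK reboots if each fails independently with p_fail."""
    return (1.0 - p_fail) ** n


# db1106 with 4.9.0-8 before the FW/BIOS upgrade, as listed earlier in this task.
DB1106_BEFORE = ["FAIL", "OK", "OK", "FAIL", "OK", "FAIL", "OK", "FAIL"]

if __name__ == "__main__":
    p = failure_rate(DB1106_BEFORE)
    print(f"failure rate before upgrade: {p:.2f}")
    print(f"chance of 10 OK reboots at that rate: {prob_all_ok(p, 10):.4%}")
```

At the 4-out-of-8 failure rate seen before the upgrade, ten OK reboots in a row would happen by chance only about 0.1% of the time under this model, so the clean run is unlikely to be luck alone.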

Mentioned in SAL (#wikimedia-operations) [2019-02-14T08:10:31Z] <marostegui@deploy1001> Synchronized wmf-config/db-eqiad.php: Repool db1106 T214840 (duration: 00m 52s)

After the FW and BIOS upgrade I have rebooted db1106 a number of times with 4.9.0-8 and this is the result

yay

@Marostegui this can be done anytime today. Just let me know when the server is down. Thanks

@Papaul thanks - I am going to put it down now. Will ping you on IRC once it is down
Thanks!

Mentioned in SAL (#wikimedia-operations) [2019-02-14T14:20:46Z] <marostegui> Stop MySQL on db2085 for on-site maintenance - T214840

Papaul reassigned this task from Papaul to Marostegui. Feb 14 2019, 3:10 PM

Upgraded:
BIOS from 2.4.3 to 2.9.1
iDRAC from 2.40. to 2.60

Thank you! I will delete the idrac logs and start testing

Mentioned in SAL (#wikimedia-operations) [2019-02-14T15:12:08Z] <marostegui> Clear idrac logs from db2085 - T214840

Marostegui closed this task as Resolved. Feb 14 2019, 3:44 PM
Marostegui added a subscriber: faidon.

Reboot tests with db2085 4.9.0-8 after getting the BIOS and FW upgraded by Papaul (T214840#4954418)

1st reboot: OK
2nd reboot: OK
3rd reboot: OK
4th reboot: OK
5th reboot: OK
6th reboot: OK
7th reboot: OK
8th reboot: OK
9th reboot: OK
10th reboot: OK

So after the BIOS upgrade, this host is also fixed.
Same as db1106 (T214840#4953504).

In both cases it seems to have been some sort of (possibly spurious) hardware issue that did not always show up (T214840#4950395 and T214840#4950294), that went away after the BIOS upgrade, and that could have prevented the kernel from booting at times.
Thanks everyone for helping out to fix this mystery @MoritzMuehlenhoff @jcrespo @faidon @Cmjohnson @Papaul!

Are there other servers of that batch besides db1106 and db2085?

Yeah, I would like to see this applied to similar servers. While not in a hurry, I prefer this done rather than suffering after a crash or an emergency reboot, especially on codfw where it is easy to reboot servers (and these servers are not going to be decommissioned soon).

Mentioned in SAL (#wikimedia-operations) [2019-02-14T16:10:15Z] <marostegui@deploy1001> Synchronized wmf-config/db-codfw.php: Repool db2085 - T214840 (duration: 00m 52s)

All eqiad servers from the same batch as db1106 are running 4.9.0-8 already
db1096-db1106

All codfw servers from the same batch as db2085 are running 4.9.0-8 already as well.
db2071-db2092

My suggestion would be to take one or two codfw servers, reboot them a few times and see if they suffer the same issues. Maybe I just got lucky and the next time we reboot one (under pressure) it gets stuck.

It is really easy to reboot codfw servers, and I can take care of that if you want me to.

Sure - go ahead :-)

Creating a separate ticket for that, will refer here.

JFTR, the next Stretch update (this weekend) will update the kernel to 4.9.144-2, so that can be piggybacked.