
icinga1001 crashed
Closed, Resolved · Public

Description

I found icinga1001 crashed (no ping, no ssh, black screen at the console). I've forced a reboot and it seems to have come back fine. I just had to restart ircecho as it didn't connect properly the first time (icinga-wm was not in the operations channel).

We were lucky that I happened to be awake, although it was very late, and noticed the email from our external monitoring, which, despite a few false positives in recent months, should probably be promoted to a paging alert.

As for diagnostics, racadm getsel reported:

-------------------------------------------------------------------------------
Record:      2
Date/Time:   01/26/2019 10:35:33
Source:      system
Severity:    Ok
Description: A problem was detected during Power-On Self-Test (POST).
-------------------------------------------------------------------------------
Record:      3
Date/Time:   01/26/2019 10:35:33
Source:      system
Severity:    Critical
Description: The watchdog timer reset the system.
-------------------------------------------------------------------------------

while racadm getraclog reported:

--------------------------------------------------------------------------------
SeqNumber       = 210
Message ID      = SYS1003
Category        = Audit
AgentID         = DE
Severity        = Information
Timestamp       = 2019-01-26 10:34:05
Message         = System CPU Resetting.
FQDD            = iDRAC.Embedded.1#HostPowerCtrl
--------------------------------------------------------------------------------
SeqNumber       = 209
Message ID      = SYS1000
Category        = Audit
AgentID         = DE
Severity        = Information
Timestamp       = 2019-01-26 10:33:54
Message         = System is turning on.
FQDD            = iDRAC.Embedded.1#HostPowerCtrl
--------------------------------------------------------------------------------
SeqNumber       = 208
Message ID      = LOG007
Category        = Audit
AgentID         = DE
Severity        = Information
Timestamp       = 2019-01-26 10:33:54
Message         = The previous log entry was repeated 1 times.
Message Arg   1 = 1
--------------------------------------------------------------------------------
SeqNumber       = 206
Message ID      = SYS1001
Category        = Audit
AgentID         = DE
Severity        = Information
Timestamp       = 2019-01-26 10:33:45
Message         = System is turning off.
FQDD            = iDRAC.Embedded.1#HostPowerCtrl
--------------------------------------------------------------------------------
SeqNumber       = 205
Message ID      = SYS1003
Category        = Audit
AgentID         = DE
Severity        = Information
Timestamp       = 2019-01-26 10:33:45
Message         = System CPU Resetting.
FQDD            = iDRAC.Embedded.1#HostPowerCtrl
--------------------------------------------------------------------------------
SeqNumber       = 203
Message ID      = SYS1000
Category        = Audit
AgentID         = DE
Severity        = Information
Timestamp       = 2019-01-26 10:33:06
Message         = System is turning on.
FQDD            = iDRAC.Embedded.1#HostPowerCtrl
--------------------------------------------------------------------------------
SeqNumber       = 202
Message ID      = SYS1001
Category        = Audit
AgentID         = DE
Severity        = Information
Timestamp       = 2019-01-26 10:32:57
Message         = System is turning off.
FQDD            = iDRAC.Embedded.1#HostPowerCtrl
--------------------------------------------------------------------------------
SeqNumber       = 201
Message ID      = SYS1003
Category        = Audit
AgentID         = DE
Severity        = Information
Timestamp       = 2019-01-26 10:32:57
Message         = System CPU Resetting.
FQDD            = iDRAC.Embedded.1#HostPowerCtrl
--------------------------------------------------------------------------------
SeqNumber       = 199
Message ID      = RAC0703
Category        = Audit
AgentID         = RACLOG
Severity        = Information
Timestamp       = 2019-01-26 10:32:41
Message         = Requested system hardreset.
FQDD            = iDRAC.Embedded.1
--------------------------------------------------------------------------------

Event Timeline

Joe triaged this task as High priority. Feb 7 2019, 12:45 PM

The host crashed again today and got rebooted; nothing in getsel, and from getraclog I just got:

--------------------------------------------------------------------------------
SeqNumber       = 226
Message ID      = SYS1001
Category        = Audit
AgentID         = DE
Severity        = Information
Timestamp       = 2019-02-09 11:32:17
Message         = System is turning off.
FQDD            = iDRAC.Embedded.1#HostPowerCtrl
--------------------------------------------------------------------------------
SeqNumber       = 225
Message ID      = SYS1003
Category        = Audit
AgentID         = DE
Severity        = Information
Timestamp       = 2019-02-09 11:32:16
Message         = System CPU Resetting.
FQDD            = iDRAC.Embedded.1#HostPowerCtrl
--------------------------------------------------------------------------------
SeqNumber       = 224
Message ID      = RAC0703
Category        = Audit
AgentID         = RACLOG
Severity        = Information
Timestamp       = 2019-02-09 11:32:00
Message         = Requested system hardreset.
FQDD            = iDRAC.Embedded.1
--------------------------------------------------------------------------------

Adding ops-eqiad too at this point, as we're starting to have too many crashes given the specific role of this host.
It's unclear at this point whether the reboot is related to the AWOL passive checks (see T196336).

Mentioned in SAL (#wikimedia-operations) [2019-02-10T19:16:50Z] <volans|off> forcing reboot of icinga1001 because it's stuck again (no ping, no ssh, CPU stuck messages on console) - T214760

icinga1001 was stuck again today, but in a slightly different way that gave us some additional information.
No ping, no ssh and no Icinga web UI were working and no racadm errors were logged, but after attaching to the console, although unable to get a prompt, I was able to capture the following, which was repeated every few seconds:

[113545.865202] NMI watchdog: BUG: soft lockup - CPU#27 stuck for 22s! [kworker/27:2:245071]
[113554.688447] INFO: rcu_sched self-detected stall on CPU[113554.692485] INFO: rcu_sched detected stalls on CPUs/tasks:
[113554.692493] 	27-...: (1217459 ticks this GP) idle=af9/140000000000001/0 softirq=10652833/10652835 fqs=386235
[113554.692495]
[113554.710741] 	27-...: (1217459 ticks this GP) idle=af9/140000000000001/0 softirq=10652833/10652835 fqs=386238
[113554.720719] 	 (t=1218394 jiffies g=3031637 c=3031636 q=156534)
[113581.780720] NMI watchdog: BUG: soft lockup - CPU#27 stuck for 23s! [kworker/27:2:245071]

See also the related kernel documentation for the RCU stall.

One of the possible options is still hardware failure (quote from the above link):

o	A hardware failure.  This is quite unlikely, but has occurred
	at least once in real life.  A CPU failed in a running system,
	becoming unresponsive, but not causing an immediate crash.
	This resulted in a series of RCU CPU stall warnings, eventually
	leading the realization that the CPU had failed.

Although this fits our case, there may be other valid explanations for the same symptoms, so some more digging is needed.
Also, the fact that icinga2001 has an uptime of 87 days, runs the same software apart from ircecho, and runs all the checks (so the same load, in theory) without showing any of those symptoms (so far, fingers crossed) makes me lean towards the hardware-issue option for icinga1001.
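As a side note, if we wanted the host to reboot itself on a future lockup instead of hanging until someone notices, the kernel can be told to panic on soft lockups and RCU stalls. A minimal sketch using standard kernel knobs (whether auto-reboot is desirable on the monitoring host is debatable, and it wouldn't fix the underlying hardware problem):

# Panic on a detected soft lockup instead of just logging it.
sudo sysctl kernel.softlockup_panic=1
# Panic when an RCU stall is detected (available on newer kernels).
sudo sysctl kernel.panic_on_rcu_stall=1
# Automatically reboot 60 seconds after a panic rather than sitting at the console.
sudo sysctl kernel.panic=60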

I propose to fail over to icinga2001 until we find out what's wrong with this host and fix it.
Thoughts?

An additional datapoint from T210108, which I've just merged into this task: we had 2 reboots that showed the same symptoms during provisioning, I think even before the Icinga software was running.

@Cmjohnson @RobH
My idea would be to first physically check that the CPUs are correctly seated, and then try to get a replacement for the apparently faulty one. Do you think the evidence we have is enough?

I propose to fail over to icinga2001 until we find out what's wrong with this host and fix it.
Thoughts?

The existing downtimes would be lost?

The existing downtimes would be lost?

Absolutely not: we already sync the Icinga state file periodically, and the failover procedure has a specific step to do a final sync before failing over; see https://wikitech.wikimedia.org/wiki/Icinga#Failover_Icinga_between_the_active_and_passive_servers
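For reference, the core of that sync is just copying the Icinga retention file (which holds downtimes, acknowledgements and comments) from the active host to the passive one. A rough sketch of the idea, with illustrative paths (the authoritative steps are on the wikitech page above):

# Icinga 1.x writes its state to retention.dat periodically and on shutdown/restart.
# Copy it to the passive host so downtimes and acknowledgements survive the failover.
rsync -a /var/lib/icinga/retention.dat icinga2001.wikimedia.org:/var/lib/icinga/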

Thanks for clarifying that, @Volans!
Then failing over to another host is probably a good idea, so we can debug icinga1001 without service interruptions.

Thanks!

Change 489777 had a related patch set uploaded (by Volans; owner: Volans):
[operations/dns@master] Failover icinga to icinga2001

https://gerrit.wikimedia.org/r/489777

Change 489790 had a related patch set uploaded (by Volans; owner: Volans):
[operations/puppet@production] icinga: cleanup legacy code

https://gerrit.wikimedia.org/r/489790

Change 489791 had a related patch set uploaded (by Volans; owner: Volans):
[operations/puppet@production] icinga: failover to icinga2001

https://gerrit.wikimedia.org/r/489791

Change 489790 merged by Volans:
[operations/puppet@production] icinga: cleanup legacy code

https://gerrit.wikimedia.org/r/489790

Change 489791 merged by CDanis:
[operations/puppet@production] icinga: failover to icinga2001

https://gerrit.wikimedia.org/r/489791

Change 489777 merged by CDanis:
[operations/dns@master] Failover icinga to icinga2001

https://gerrit.wikimedia.org/r/489777

Mentioned in SAL (#wikimedia-operations) [2019-02-11T22:53:21Z] <cdanis> icinga.w.o-->icinga2001 DNS change deployed T214760

Mentioned in SAL (#wikimedia-operations) [2019-02-11T23:22:13Z] <cdanis> T214760 icinga2001% sudo killall nsca

Icinga was failed over to icinga2001. @Cmjohnson, @RobH: we can now proceed to check whether the CPU is properly seated and/or try to get some replacement parts based on the current evidence.

Pulled from racadm getsel

/admin1-> racadm getsel
Record:      1
Date/Time:   05/30/2018 17:49:01
Source:      system
Severity:    Ok
Description: Log cleared.
-------------------------------------------------------------------------------
Record:      2
Date/Time:   01/26/2019 10:35:33
Source:      system
Severity:    Ok
Description: A problem was detected during Power-On Self-Test (POST).
-------------------------------------------------------------------------------
Record:      3
Date/Time:   01/26/2019 10:35:33
Source:      system
Severity:    Critical
Description: The watchdog timer reset the system.
-------------------------------------------------------------------------------
RobH moved this task from Backlog to Hardware Failure / Troubleshoot on the ops-eqiad board.

Icinga was failed over to icinga2001. @Cmjohnson, @RobH: we can now proceed to check whether the CPU is properly seated and/or try to get some replacement parts based on the current evidence.

So it sounds like we can reboot as needed?

I don't see any kind of issue where reseating the CPU would fix it, mostly because it is impossible for the CPU to become unseated unless the heatsink is also removed. So it's either a bad CPU or a bad mainboard socket, by the looks of @Volans' comment on T214760#4941652.

After syncing with @Volans via the dcops irc channel, he'd like this host to remain online for a short while longer while things stabilize over on icinga2001. Once this host can be put offline, I'd like to do the following:

  • Reboot the host via serial console (see the SOL sketch below) and read the actual POST error referenced by the iDRAC SEL (System Event Log).
  • Update the firmware and see if the error goes away.
  • If the error doesn't go away, hopefully the POST error will indicate the specific failure; additional troubleshooting will be detailed based on the POST error message.
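For the serial console step, one common way to watch POST remotely is IPMI Serial-over-LAN, assuming SOL is enabled on the iDRAC; the hostname and user below are illustrative. A sketch:

# Attach to the serial console over IPMI SOL; -a prompts for the password.
ipmitool -I lanplus -H icinga1001.mgmt.eqiad.wmnet -U root -a sol activate
# From another terminal, trigger the reboot and watch POST scroll by in the SOL session.
ipmitool -I lanplus -H icinga1001.mgmt.eqiad.wmnet -U root -a power reset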

@CDanis: Please assign this back to me tomorrow (Tuesday, Feb 12th) after confirming icinga2001 is stable enough for me to perform the above tests.

icinga2001 looks stable; go for it Rob

Mentioned in SAL (#wikimedia-operations) [2019-02-12T21:10:02Z] <robh> working on troubleshooting icinga1001 via T214760

Ok, rebooted the system and watched it POST; no errors. A quick grep of the SEL shows no additional entries beyond those in T214760#4945789.

I'm now flashing the BIOS from 1.3.7 (December 2017) to 1.7.0 (the latest for the R440). Once that is done, I'll check POST and run the built-in Dell hardware tests.

Also updating the iDRAC from 3.15.17.15 to the latest, 3.21.26.22.

Mentioned in SAL (#wikimedia-operations) [2019-02-12T21:38:14Z] <robh> icinga1001 in hardware testing, dont mess with it T214760

Ok, I've run the hardware tests and nothing reports as broken.

I'd suggest we return this to service, since we aren't seeing any further errors. A single unspecified POST error that never repeats (and I rebooted it half a dozen times in a row) and doesn't show up in the hardware tests (run twice) means that this is likely fine.

@Volans or @CDanis: Would one of you review, and if you don't disagree, go ahead and return this to service?

If it has another issue in the future, we at least have a full dump of the info on this task.

OK, but IMO let's keep it passive. icinga can continue to run on icinga2001 for now.

@RobH I actually disagree: the host crashed twice before it was even in production (so without any Icinga-related load), then at least once a couple of months ago, and I think 4 times in the last 3 weeks, so I wouldn't call it stable at all 😉

OK, but IMO let's keep it passive. icinga can continue to run on icinga2001 for now.

@CDanis: since eqiad is our main active datacenter, we prefer to have the active Icinga in the same DC for obvious reasons.

Ok, so I'm going to address some of the error messages and log messages here:

The host crashed again today and got rebooted; nothing in getsel, and from getraclog I just got:

--------------------------------------------------------------------------------
SeqNumber       = 226
Message ID      = SYS1001
Category        = Audit
AgentID         = DE
Severity        = Information
Timestamp       = 2019-02-09 11:32:17
Message         = System is turning off.
FQDD            = iDRAC.Embedded.1#HostPowerCtrl
--------------------------------------------------------------------------------
SeqNumber       = 225
Message ID      = SYS1003
Category        = Audit
AgentID         = DE
Severity        = Information
Timestamp       = 2019-02-09 11:32:16
Message         = System CPU Resetting.
FQDD            = iDRAC.Embedded.1#HostPowerCtrl
--------------------------------------------------------------------------------
SeqNumber       = 224
Message ID      = RAC0703
Category        = Audit
AgentID         = RACLOG
Severity        = Information
Timestamp       = 2019-02-09 11:32:00
Message         = Requested system hardreset.
FQDD            = iDRAC.Embedded.1
--------------------------------------------------------------------------------

Adding ops-eqiad too at this point, as we're starting to have too many crashes given the specific role of this host.
It's unclear at this point whether the reboot is related to the AWOL passive checks (see T196336).

So, every time a racadm power command for powering up or power cycling is sent, getraclog will show those three entries. They are not an error; they are the audit trail of the power reset command, including the CPU message. I've audited another spare-pool system at random (wmf7426) and it shows the same entries in its log when power reset commands are sent. The 'errors' listed on T210108 seem to show the same thing: a remote power cycle command was sent, with no actual errors in the SEL.
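In other words, the SYS1000/SYS1001/SYS1003 triple is just the audit trail of a power command, and it is easy to reproduce on any idle host; a sketch:

# From the iDRAC, send a hard reset and then dump the audit log:
racadm serveraction hardreset
racadm getraclog
# Expect a RAC0703 "Requested system hardreset" entry followed by the
# SYS1003/SYS1001/SYS1000 power-cycle entries, with no matching error in getsel.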

Pulled from racadm getsel

/admin1-> racadm getsel
Record:      1
Date/Time:   05/30/2018 17:49:01
Source:      system
Severity:    Ok
Description: Log cleared.
-------------------------------------------------------------------------------
Record:      2
Date/Time:   01/26/2019 10:35:33
Source:      system
Severity:    Ok
Description: A problem was detected during Power-On Self-Test (POST).
-------------------------------------------------------------------------------
Record:      3
Date/Time:   01/26/2019 10:35:33
Source:      system
Severity:    Critical
Description: The watchdog timer reset the system.
-------------------------------------------------------------------------------

This is the ONLY actual error I can find anywhere in the hardware logs, and I cannot reproduce it even though I rebooted the host half a dozen times today before the firmware/BIOS updates and half a dozen times after.

Then there is:

icinga1001 was stuck again today, but in a slightly different way that gave us some additional information.
No ping, no ssh and no Icinga web UI were working and no racadm errors were logged, but after attaching to the console, although unable to get a prompt, I was able to capture the following, which was repeated every few seconds:

[113545.865202] NMI watchdog: BUG: soft lockup - CPU#27 stuck for 22s! [kworker/27:2:245071]
[113554.688447] INFO: rcu_sched self-detected stall on CPU[113554.692485] INFO: rcu_sched detected stalls on CPUs/tasks:
[113554.692493] 	27-...: (1217459 ticks this GP) idle=af9/140000000000001/0 softirq=10652833/10652835 fqs=386235
[113554.692495]
[113554.710741] 	27-...: (1217459 ticks this GP) idle=af9/140000000000001/0 softirq=10652833/10652835 fqs=386238
[113554.720719] 	 (t=1218394 jiffies g=3031637 c=3031636 q=156534)
[113581.780720] NMI watchdog: BUG: soft lockup - CPU#27 stuck for 23s! [kworker/27:2:245071]

See also the related kernel documentation for the RCU stall.

One of the possible options is still hardware failure (quote from the above link):

o	A hardware failure.  This is quite unlikely, but has occurred
	at least once in real life.  A CPU failed in a running system,
	becoming unresponsive, but not causing an immediate crash.
	This resulted in a series of RCU CPU stall warnings, eventually
	leading the realization that the CPU had failed.

Although this fits our case, there may be other valid explanations for the same symptoms, so some more digging is needed.
Also, the fact that icinga2001 has an uptime of 87 days, runs the same software apart from ircecho, and runs all the checks (so the same load, in theory) without showing any of those symptoms (so far, fingers crossed) makes me lean towards the hardware-issue option for icinga1001.

I propose to fail over to icinga2001 until we find out what's wrong with this host and fix it.
Thoughts?

So this shows a CPU error, but the hardware tests pass, and I'm not sure Dell will accept that as enough. I'll try to determine which CPU it means.

So with the comments from @Volans on T214760#4941652, it seems this may be an issue with CPU#27, which maps to the second physical CPU. It may be enough to get another CPU sent by Dell, since returning it to service in the other socket and hoping for a failure seems problematic.
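For what it's worth, the logical-to-physical mapping is easy to confirm from the OS once the host is back up; a sketch assuming Linux sysfs/lscpu:

# Which physical socket does logical CPU 27 live on? 0 = CPU1, 1 = CPU2.
cat /sys/devices/system/cpu/cpu27/topology/physical_package_id
# Or print the full logical-to-socket map:
lscpu -e=CPU,SOCKET,CORE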

Chris,

Can you open a support request with Dell and insist on a replacement CPU due to the output of T214760#4941652 please?

@RobH thanks a lot for the tests and follow-up!
Just to clarify a detail regarding the "normal" reboot messages: on at least a couple of occasions (for the others I'm not fully sure), the host rebooted itself (triggered either by the mgmt board or the kernel) and was not manually rebooted by us.

I requested a new CPU, but without Dell's iDRAC log stating it's a CPU issue, there is a good chance they will kick it back.

You have successfully submitted request SR986384843.

I requested a new CPU, but without Dell's iDRAC log stating it's a CPU issue, there is a good chance they will kick it back.

You have successfully submitted request SR986384843.

If they kick it back and won't just send a new CPU, let me know and we'll escalate it to our account team.

10:12 < cmjohnson1> : robh Dell approved everything....the disks for cloudvirts and the cpu for icinga1001

So we are on track. Chris can update this task with case info and the like, but I wanted to get the update on the task showing this was submitted for CPU replacement and approved by Dell.

CPU2 was replaced

Shipping Info
USPS 9202 3946 5301 2441 0151 11
FEDEX 9611918 2393026 77765139

icinga1001 is unresponsive this morning (no ping, no ssh, black console), re-opening

Mentioned in SAL (#wikimedia-operations) [2019-02-21T09:35:43Z] <volans> force rebooting unresponsive icinga1001 T214760

Hardware logs:

$ sudo ipmi-sel
[... SNIP ...]
5   | Feb-20-2019 | 13:34:48 | CPU Machine Chk  | Processor                   | transition to Non-recoverable ; OEM Event Data2 code = 02h ; OEM Event Data3 code = 10h
6   | Feb-20-2019 | 13:34:48 | MSR Info Log     | OEM Reserved                | OEM Event Offset = 01h ; OEM Event Data3 code = 00h
7   | Feb-20-2019 | 13:34:48 | MSR Info Log     | OEM Reserved                | OEM Event Offset = 00h
8   | Feb-20-2019 | 13:34:48 | MSR Info Log     | OEM Reserved                | OEM Event Offset = 00h ; OEM Event Data2 code = 00h
9   | Feb-20-2019 | 13:34:48 | MSR Info Log     | OEM Reserved                | OEM Event Offset = 00h
10  | Feb-20-2019 | 13:34:48 | MSR Info Log     | OEM Reserved                | OEM Event Offset = 02h ; OEM Event Data3 code = 00h
11  | Feb-20-2019 | 13:34:49 | MSR Info Log     | OEM Reserved                | OEM Event Offset = 00h
12  | Feb-20-2019 | 13:34:49 | MSR Info Log     | OEM Reserved                | OEM Event Offset = 0Dh ; OEM Event Data2 code = B6h
13  | Feb-20-2019 | 13:34:49 | MSR Info Log     | OEM Reserved                | OEM Event Offset = 0Fh
14  | Feb-20-2019 | 13:34:49 | MSR Info Log     | OEM Reserved                | OEM Event Offset = 03h ; OEM Event Data3 code = 00h
15  | Feb-20-2019 | 13:34:49 | MSR Info Log     | OEM Reserved                | OEM Event Offset = 00h
16  | Feb-20-2019 | 13:34:49 | MSR Info Log     | OEM Reserved                | OEM Event Offset = 0Dh ; OEM Event Data2 code = B6h
17  | Feb-20-2019 | 13:34:49 | MSR Info Log     | OEM Reserved                | OEM Event Offset = 0Fh
18  | Feb-20-2019 | 13:34:49 | Chipset Info     | OEM Reserved                | OEM Event Offset = 00h ; OEM Event Data2 code = F2h ; OEM Event Data3 code = 85h
19  | Feb-20-2019 | 13:34:49 | Err Reg Pointer  | OEM Reserved                | OEM Event Offset = 00h
20  | Feb-20-2019 | 13:34:49 | MSR Info Log     | OEM Reserved                | OEM Event Offset = 0Dh
21  | Feb-20-2019 | 13:34:49 | MSR Info Log     | OEM Reserved                | OEM Event Offset = 00h
22  | Feb-20-2019 | 13:34:49 | MSR Info Log     | OEM Reserved                | OEM Event Offset = 00h ; OEM Event Data2 code = 00h
23  | Feb-20-2019 | 13:34:49 | MSR Info Log     | OEM Reserved                | OEM Event Offset = 00h
24  | Feb-20-2019 | 13:34:50 | MSR Info Log     | OEM Reserved                | OEM Event Offset = 0Eh
25  | Feb-20-2019 | 13:34:50 | MSR Info Log     | OEM Reserved                | OEM Event Offset = 00h
26  | Feb-20-2019 | 13:34:50 | MSR Info Log     | OEM Reserved                | OEM Event Offset = 0Dh ; OEM Event Data2 code = B6h
27  | Feb-20-2019 | 13:34:50 | MSR Info Log     | OEM Reserved                | OEM Event Offset = 0Fh
28  | Feb-20-2019 | 13:34:50 | MSR Info Log     | OEM Reserved                | OEM Event Offset = 0Fh
29  | Feb-20-2019 | 13:34:50 | MSR Info Log     | OEM Reserved                | OEM Event Offset = 00h
30  | Feb-20-2019 | 13:34:50 | MSR Info Log     | OEM Reserved                | OEM Event Offset = 0Dh ; OEM Event Data2 code = B6h
31  | Feb-20-2019 | 13:34:50 | MSR Info Log     | OEM Reserved                | OEM Event Offset = 0Fh
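The "CPU Machine Chk ... transition to Non-recoverable" entry is the meaningful one here; the MSR Info Log lines are the raw machine-check register dumps that accompany it. The same event should also surface as an MCE on the OS side; a sketch, assuming rasdaemon (or at least kernel MCE logging) is available on the host:

# Kernel messages around the crash:
journalctl -k | grep -iE 'mce|machine check'
# If rasdaemon is running, its decoded machine-check history:
ras-mc-ctl --errors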

@RobH and it crashed again already! I'll leave it down in case @Cmjohnson wants to attach a physical console.
Anyway, it's all yours; it can be shut down/rebooted at will.

/admin1-> racadm getsel
Record:      1
Date/Time:   02/12/2019 21:44:15
Source:      system
Severity:    Ok
Description: Log cleared.
-------------------------------------------------------------------------------
Record:      2
Date/Time:   02/19/2019 17:13:26
Source:      system
Severity:    Critical
Description: The chassis is open while the power is off.
-------------------------------------------------------------------------------
Record:      3
Date/Time:   02/19/2019 17:13:31
Source:      system
Severity:    Ok
Description: The chassis is closed while the power is off.
-------------------------------------------------------------------------------
Record:      4
Date/Time:   02/20/2019 13:34:48
Source:      system
Severity:    Ok
Description: A problem was detected related to the previous server boot.
-------------------------------------------------------------------------------
Record:      5
Date/Time:   02/20/2019 13:34:48
Source:      system
Severity:    Critical
Description: CPU 2 machine check error detected.
-------------------------------------------------------------------------------
Record:      6
Date/Time:   02/20/2019 13:34:48
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      7
Date/Time:   02/20/2019 13:34:48
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      8
Date/Time:   02/20/2019 13:34:48
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      9
Date/Time:   02/20/2019 13:34:48
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      10
Date/Time:   02/20/2019 13:34:48
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      11
Date/Time:   02/20/2019 13:34:49
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      12
Date/Time:   02/20/2019 13:34:49
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      13
Date/Time:   02/20/2019 13:34:49
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      14
Date/Time:   02/20/2019 13:34:49
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      15
Date/Time:   02/20/2019 13:34:49
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      16
Date/Time:   02/20/2019 13:34:49
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      17
Date/Time:   02/20/2019 13:34:49
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      18
Date/Time:   02/20/2019 13:34:49
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      19
Date/Time:   02/20/2019 13:34:49
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      20
Date/Time:   02/20/2019 13:34:49
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      21
Date/Time:   02/20/2019 13:34:49
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      22
Date/Time:   02/20/2019 13:34:49
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      23
Date/Time:   02/20/2019 13:34:49
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      24
Date/Time:   02/20/2019 13:34:50
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      25
Date/Time:   02/20/2019 13:34:50
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      26
Date/Time:   02/20/2019 13:34:50
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      27
Date/Time:   02/20/2019 13:34:50
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      28
Date/Time:   02/20/2019 13:34:50
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      29
Date/Time:   02/20/2019 13:34:50
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      30
Date/Time:   02/20/2019 13:34:50
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      31
Date/Time:   02/20/2019 13:34:50
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      32
Date/Time:   02/21/2019 11:12:05
Source:      system
Severity:    Ok
Description: A problem was detected related to the previous server boot.
-------------------------------------------------------------------------------
Record:      33
Date/Time:   02/21/2019 11:12:05
Source:      system
Severity:    Critical
Description: CPU 2 machine check error detected.
-------------------------------------------------------------------------------
Record:      34
Date/Time:   02/21/2019 11:12:05
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      35
Date/Time:   02/21/2019 11:12:05
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      36
Date/Time:   02/21/2019 11:12:05
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      37
Date/Time:   02/21/2019 11:12:06
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      38
Date/Time:   02/21/2019 11:12:06
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      39
Date/Time:   02/21/2019 11:12:06
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      40
Date/Time:   02/21/2019 11:12:06
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      41
Date/Time:   02/21/2019 11:12:06
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      42
Date/Time:   02/21/2019 11:12:06
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      43
Date/Time:   02/21/2019 11:12:06
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      44
Date/Time:   02/21/2019 11:12:06
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      45
Date/Time:   02/21/2019 11:12:06
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      46
Date/Time:   02/21/2019 11:12:06
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      47
Date/Time:   02/21/2019 11:12:06
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      48
Date/Time:   02/21/2019 11:12:06
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      49
Date/Time:   02/21/2019 11:12:06
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      50
Date/Time:   02/21/2019 11:12:06
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      51
Date/Time:   02/21/2019 11:12:06
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      52
Date/Time:   02/21/2019 11:12:07
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      53
Date/Time:   02/21/2019 11:12:07
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      54
Date/Time:   02/21/2019 11:12:07
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      55
Date/Time:   02/21/2019 11:12:07
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      56
Date/Time:   02/21/2019 11:12:07
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      57
Date/Time:   02/21/2019 11:12:07
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      58
Date/Time:   02/21/2019 11:12:07
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      59
Date/Time:   02/21/2019 11:12:07
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------

Mentioned in SAL (#wikimedia-operations) [2019-02-21T18:30:56Z] <robh> ignore icinga1001 alerts, rebooting it into hardware tests via T214760

icinga1001 passes ALL Dell hardware tests when the built-in extended tests are run. Annoyingly, it doesn't show the errors we see in the OS logs.

I swapped CPU1 with CPU2 and cleared the log. Please monitor to see whether the error continues, and if so, whether it moves.
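If the fault follows the part, the next machine check should be attributed to CPU 1 instead of CPU 2; after the next crash, something like this (a sketch) would tell us which socket it is pinned to:

# Show the SEL record around any machine check entry:
racadm getsel | grep -B 4 'machine check'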

FYI it crashed again:

--------------------------------------------------------------------------------
SeqNumber       = 481
Message ID      = SYS1003
Category        = Audit
AgentID         = DE
Severity        = Information
Timestamp       = 2019-02-22 04:04:12
Message         = System CPU Resetting.
FQDD            = iDRAC.Embedded.1#HostPowerCtrl
--------------------------------------------------------------------------------
SeqNumber       = 480
Message ID      = RAC0703
Category        = Audit
AgentID         = RACLOG
Severity        = Information
Timestamp       = 2019-02-22 04:04:10
Message         = Requested system hardreset.
FQDD            = iDRAC.Embedded.1
--------------------------------------------------------------------------------

with:

-------------------------------------------------------------------------------
Record:      2
Date/Time:   02/22/2019 04:05:55
Source:      system
Severity:    Ok
Description: A problem was detected related to the previous server boot.
-------------------------------------------------------------------------------
Record:      3
Date/Time:   02/22/2019 04:05:55
Source:      system
Severity:    Critical
Description: CPU 1 machine check error detected.
-------------------------------------------------------------------------------
Record:      4
Date/Time:   02/22/2019 04:05:55
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
[...CUT a ton of other OEM diagnostic events ...]

So it seems the error has moved to CPU1 now.

Interesting! So a new CPU2 from Dell is throwing errors, and it replaced another CPU that was throwing errors.

That's not really all that interesting; they send us refurbished parts that don't work all the time. I will submit another ticket for a new CPU.

I forgot to update the task

You have successfully submitted request SR986940908.

Mentioned in SAL (#wikimedia-operations) [2019-02-27T16:20:59Z] <volans> force-rebooting icinga1001 (to test some puppet changes) - T214760

@Cmjohnson my tests on icinga1001 are completed, so feel free to shut it down at will when parts are available.

The new CPU came in and I replaced CPU1

Return Shipping
USPS 9202 3946 5301 2441 1128 27
FEDEX 9611918 2393026 77862845

@Volans: Can you return this server to service so we can see if the problem has been resolved?

@RobH Icinga runs on both hosts and generates the same load; being active or passive changes very little. I've been monitoring this host with our beta Icinga meta-monitoring and so far so good. It now has 11 days of uptime, and I have a proposal for tomorrow's Foundation meeting to fail back to icinga1001 as the active server either this Thu. or next Mon., given that it seems stable now.

I'll update this task after tomorrow's decision.

RobH changed the task status from Open to Stalled. Mar 12 2019, 6:40 PM
RobH lowered the priority of this task from High to Low.

I believe this is done now -- resolving