
db1117 went away
Closed, Resolved · Public

Description

TL;DR: RAM issue in DIMM A3 & A7; hardware support needed.

On 2018-10-27 21:12 UTC the following alerts appeared on the #wikimedia-operations IRC channel:

+icinga-wm> IRC echo bot PROBLEM - Host db1117 is DOWN: PING CRITICAL - Packet loss = 100%
23:13 PROBLEM - haproxy failover on dbproxy1007 is CRITICAL: CRITICAL check_failover servers up 1 down 1
23:13 PROBLEM - haproxy failover on dbproxy1002 is CRITICAL: CRITICAL check_failover servers up 1 down 1
23:13 PROBLEM - haproxy failover on dbproxy1003 is CRITICAL: CRITICAL check_failover servers up 1 down 1
23:13 PROBLEM - haproxy failover on dbproxy1006 is CRITICAL: CRITICAL check_failover servers up 1 down 1
23:13 PROBLEM - haproxy failover on dbproxy1008 is CRITICAL: CRITICAL check_failover servers up 1 down 1
23:13 PROBLEM - haproxy failover on dbproxy1001 is CRITICAL: CRITICAL check_failover servers up 1 down 1

The host was indeed down; I was not able to open an SSH connection to it.
I connected to the server's management interface and started a virtual console, but the screen was blank and the console showed no activity.
I checked the host's power status, and it was 'ON'.
I decided to power cycle the host with:

racadm serveraction powercycle

After the power cycle the host booted up.
On the console I saw that the Prometheus exporter was not able to start:

[  OK  ] Started Prometheus exporter for MySQL server.
[  OK  ] Stopped Prometheus exporter for MySQL server.
[FAILED] Failed to start Prometheus exporter for MySQL server.
See 'systemctl status prometheus-mysqld-exporter.service' for details.
[  OK  ] Started LSB: exim Mail Transport Agent.

I started the host's MariaDB instances one by one:

systemctl start mariadb@m1
systemctl start mariadb@m2
systemctl start mariadb@m3
systemctl start mariadb@m5
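The one-by-one starts can be expressed as a short loop. A sketch only: the systemctl call is commented out so the loop itself can be exercised anywhere, and the instance list matches this host.

```shell
#!/bin/sh
# Start each MariaDB instance on this multi-instance host, one at a time,
# recording the order in a log file.
for inst in m1 m2 m3 m5; do
    unit="mariadb@${inst}"
    echo "starting ${unit}"
    # systemctl start "${unit}"    # uncomment when running on the actual host
done > /tmp/start-order.log
```

Starting the instances sequentially (rather than all at once) keeps the InnoDB recovery of each instance from competing for I/O with the others.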

The services started up without any trouble, replication started, and the instances caught up.
According to the output of journalctl -u mariadb@m1 (likewise m2, m3, m5), all recovery (where needed) finished.

All the Icinga checks became OK except the systemd state.
Investigating the prometheus-mysqld-exporter service showed it could not start because the /var/lib/prometheus/.my.cnf file was missing.
I checked db2078 (also a multi-instance host, in the same cluster) and looked in its /var/lib/prometheus home directory for the .my.cnf file. It was there, and it claimed the mysqld instance is available under /run/mysqld/mysqld.sock, which is not true; no MySQL instance uses that socket.
I manually created the same .my.cnf file and started the prometheus-mysqld-exporter, which then reported a running state.
This is not good; we have to puppetize this.
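For illustration only, a sketch of creating such a file. The /tmp path and the per-instance socket name are assumptions, not the contents of the deployed file; the authoritative version should come from Puppet.

```shell
#!/bin/sh
# Hypothetical sketch: write a minimal .my.cnf for the exporter.
# On the host the real path would be /var/lib/prometheus/.my.cnf, and the
# socket below is an assumed per-instance path, not a verified one.
cat > /tmp/prometheus-my.cnf <<'EOF'
[client]
socket = /run/mysqld/mysqld.m1.sock
EOF
```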
After the server became available I started to investigate the root cause of the outage, but I was not able to find anything.
The syslog file has the following entries around the given timestamp:

Oct 27 21:08:44 db1117 systemd[1]: Started Time & Date Service.
Oct 27 21:09:01 db1117 CRON[22366]: (prometheus) CMD (/usr/local/bin/prometheus-puppet-agent-stats --outfile /var/lib/prometheus/node.d/puppet_agent.prom)
^@^@^@^@ ... ^@^@^@^@ (long run of NUL bytes)
Oct 27 21:28:47 db1117 systemd-modules-load[873]: Inserted module 'nf_conntrack'
Oct 27 21:28:47 db1117 systemd-modules-load[873]: Inserted module 'ipmi_devintf'
Oct 27 21:28:47 db1117 systemd[1]: Started Load Kernel Modules.
Oct 27 21:28:47 db1117 systemd[1]: Started Remount Root and Kernel File Systems.
Oct 27 21:28:47 db1117 systemd[1]: Started Create list of required static device nodes for the current kernel.
Oct 27 21:28:47 db1117 systemd[1]: Started LVM2 metadata daemon.
Oct 27 21:28:47 db1117 systemd[1]: Starting Create Static Device Nodes in /dev...
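The ^@ characters above are NUL bytes: the file was mid-write when the host died, and the crash left a zero-filled hole before logging resumed at boot. Stripping them makes the file greppable again; a minimal sketch, using a small sample file in place of the real /var/log/syslog:

```shell
#!/bin/sh
# The ^@ run in syslog is NUL padding left by the crash. Strip it from a
# copy so grep/less behave normally. A two-line sample stands in for the
# real /var/log/syslog here.
printf 'line before crash\0\0\0\0line after reboot\n' > /tmp/syslog.sample
tr -d '\0' < /tmp/syslog.sample > /tmp/syslog.clean
```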

The /var/log/messages file is similarly empty around the incident, jumping straight from before the crash to the reboot:

Oct 27 20:57:08 db1117 puppet-agent-cronjob: INFO:debmonitor:Successfully sent the upgradable update to the DebMonitor server
Oct 27 21:28:47 db1117 kernel: [    0.000000] microcode: microcode updated early to revision 0xb00002e, date = 2018-04-19

In the Prometheus graphs there are some anomalies around that time period, but I am not sure how to interpret them.

Event Timeline

I triaged this as 'high' as we have a server which went down, and we don't know the root cause.
I am not against lowering the priority or even marking the ticket resolved, but I need a second opinion about it.

prometheus-mysqld-exporter should not run on a multi-instance host; there is prometheus-mysqld-exporter@m1, @m2, and so on. That is specified in Puppet.

That was what confused me: after the host came back, this error appeared in Icinga, which I don't remember seeing in the past, so I ran Puppet to see if it would fix everything, but it didn't.

jcrespo added a subscriber: Cmjohnson.

So the error happens because the service was started manually, which is not a big deal if it errors out: just delete any file you may have added. I ran systemctl disable <service manually started that failed> and then systemctl reset-failed, and it should never fail again.

From the logs:

Sat Oct 27 2018 21:11:38  Multi-bit memory errors detected on a memory device at location(s) DIMM_A3.
Sat Oct 27 2018 21:11:38  Multi-bit memory errors detected on a memory device at location(s) DIMM_A3.
Sat Oct 27 2018 21:11:38  A problem was detected in Memory Reference Code (MRC).
Sat Oct 27 2018 21:09:04  Multi-bit memory errors detected on a memory device at location(s) DIMM_A3.
Sat Oct 27 2018 21:09:00  Multi-bit memory errors detected on a memory device at location(s) DIMM_A3.
Sat Oct 27 2018 21:09:00  Correctable memory error rate exceeded for DIMM_A7.
Sat Oct 27 2018 21:08:57  Correctable memory error rate exceeded for DIMM_A7.
2018-10-29T08:25:44-0500  USR0030   Successfully logged in using root, from 10.64.32.25 and GUI.
2018-10-27T21:58:52-0500  USR0032   The session for root from 208.80.154.86 using SSH is logged off.
2018-10-27T21:58:52-0500  LOG007    The previous log entry was repeated 1 times.
2018-10-27T21:26:39-0500  SYS1003   System CPU Resetting.
2018-10-27T21:26:38-0500  SYS1000   System is turning on.
2018-10-27T21:26:31-0500  RAC0702   Requested system powercycle.
2018-10-27T21:26:25-0500  SYS1003   System CPU Resetting.
2018-10-27T21:26:25-0500  SYS1001   System is turning off.
2018-10-27T21:17:20-0500  USR0030   Successfully logged in using root, from 208.80.154.86 and SSH.
2018-10-27T21:11:50-0500  UEFI0081  Memory size has changed from the last time the system was started.
2018-10-27T21:11:49-0500  UEFI0107  One or more memory errors have occurred on memory slot: A3.
2018-10-27T21:11:42-0500  MEM0001   Multi-bit memory errors detected on a memory device at location(s) DIMM_A3.
2018-10-27T21:11:41-0500  MEM0001   Multi-bit memory errors detected on a memory device at location(s) DIMM_A3.
2018-10-27T21:11:38-0500  UEFI0058  An uncorrectable Memory Error has occurred because a Dual Inline Memory Module (DIMM) is not functioning.
2018-10-27T21:11:38-0500  PST0091   A problem was detected in Memory Reference Code (MRC).
2018-10-27T21:09:19-0500  PWR2264   The Intel Management Engine has reported normal system operation.
2018-10-27T21:09:08-0500  RAC0703   Requested system hardreset.
2018-10-27T21:09:08-0500  LOG007    The previous log entry was repeated 1 times.
2018-10-27T21:09:07-0500  SYS1003   System CPU Resetting.
2018-10-27T21:09:07-0500  CPU0000   Internal error has occurred check for additional logs.
2018-10-27T21:09:07-0500  PWR2262   The Intel Management Engine has reported an internal system error.
2018-10-27T21:09:06-0500  SYS1003   System CPU Resetting.
2018-10-27T21:09:04-0500  MEM0001   Multi-bit memory errors detected on a memory device at location(s) DIMM_A3.
2018-10-27T21:09:01-0500  MEM0001   Multi-bit memory errors detected on a memory device at location(s) DIMM_A3.
2018-10-27T21:09:00-0500  MEM0702   Correctable memory error rate exceeded for DIMM_A7.
2018-10-27T21:08:57-0500  MEM0701   Correctable memory error rate exceeded for DIMM_A7.
2018-10-27T21:08:57-0500  LOG007    The previous log entry was repeated 1 times.
2018-05-23T10:21:26-0500  SYS1003   System CPU Resetting.
2018-05-17T05:51:24-0500  BAT1034   The battery properties for Integrated RAID Controller 1 have changed.
2018-04-26T08:11:31-0500  USR0032   The session for root from 10.64.32.20 using IPMI over LAN is logged off.
2018-04-26T08:11:31-0500  USR0030   Successfully logged in using root, from 10.64.32.20 and IPMI over LAN.
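Tallying the SEL entries by event code shows how heavily the memory errors dominate. A sketch; /tmp/sel.txt is a hypothetical file holding a sample of the entries above, with each line as "timestamp code message":

```shell
#!/bin/sh
# Count SEL entries per event code; MEM* codes dominating points at the DIMMs.
cat > /tmp/sel.txt <<'EOF'
2018-10-27T21:11:42-0500 MEM0001 Multi-bit memory errors detected on a memory device at location(s) DIMM_A3.
2018-10-27T21:11:41-0500 MEM0001 Multi-bit memory errors detected on a memory device at location(s) DIMM_A3.
2018-10-27T21:09:04-0500 MEM0001 Multi-bit memory errors detected on a memory device at location(s) DIMM_A3.
2018-10-27T21:09:00-0500 MEM0702 Correctable memory error rate exceeded for DIMM_A7.
2018-10-27T21:08:57-0500 MEM0701 Correctable memory error rate exceeded for DIMM_A7.
2018-10-27T21:09:07-0500 CPU0000 Internal error has occurred check for additional logs.
EOF
# Field 2 is the event code; count occurrences of each.
awk '{ count[$2]++ } END { for (c in count) print c, count[c] }' /tmp/sel.txt | sort
```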

This is clearly a RAM issue. CC @Cmjohnson

What CLI command did you use for getting those logs?

@jcrespo @Banyek It is not clearly a RAM issue; it could be a CPU issue as well, on CPU1. I will need to do a few things to the server: swap the supposedly bad DIMM to the B side and see if the error follows. If the error follows the DIMM, then it's obviously the DIMM; if not, then it's most likely the CPU, and I will need to swap the CPUs to see what happens. It could also be the system board (unlikely).

Can you bring this down for a few mins for me to move the DIMM around please?

Thanks

iDRAC log:


Record: 2
Date/Time: 10/27/2018 21:08:57
Source: system
Severity: Non-Critical

Description: Correctable memory error rate exceeded for DIMM_A7.

Record: 3
Date/Time: 10/27/2018 21:09:00
Source: system
Severity: Critical

Description: Correctable memory error rate exceeded for DIMM_A7.

Record: 4
Date/Time: 10/27/2018 21:09:00
Source: system
Severity: Critical

Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_A3.

Record: 5
Date/Time: 10/27/2018 21:09:04
Source: system
Severity: Critical

Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_A3.

Record: 6
Date/Time: 10/27/2018 21:11:38
Source: system
Severity: Ok

Description: A problem was detected in Memory Reference Code (MRC).

Record: 7
Date/Time: 10/27/2018 21:11:38
Source: system
Severity: Critical

Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_A3.

Record: 8
Date/Time: 10/27/2018 21:11:38
Source: system
Severity: Critical

Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_A3.

Sure! Can I shut down the server now for you? @Cmjohnson

Mentioned in SAL (#wikimedia-operations) [2018-10-29T14:58:46Z] <banyek> shutting down db1117 for hardware maintenance (T208150)

It is not clearly a RAM issue

Sorry, my comment was in the context of "it is not a software issue" and "it expresses/reveals itself as a memory error", so we could discard MySQL issues, which is what we normally suspect first.

There is nothing else left for DBAs here except waiting for errors.

jcrespo lowered the priority of this task from High to Medium. Oct 30 2018, 2:38 PM
Marostegui assigned this task to Cmjohnson.

I see no more errors on the idrac logs since the reboot.
Let's close this and re-open if this happens again and then we'll need to get the vendor involved.

		EnabledState = 2 (Enabled)
		HealthState = 5 (OK)