
db1117 went away
Closed, Resolved · Public

Description

TL;DR: RAM issue in DIMM A3 & A7; hardware support needed.

On 2018-10-27 21:12 UTC the following alerts appeared on the #wikimedia-operations IRC channel:

+icinga-wm> IRC echo bot PROBLEM - Host db1117 is DOWN: PING CRITICAL - Packet loss = 100%
23:13 PROBLEM - haproxy failover on dbproxy1007 is CRITICAL: CRITICAL check_failover servers up 1 down 1
23:13 PROBLEM - haproxy failover on dbproxy1002 is CRITICAL: CRITICAL check_failover servers up 1 down 1
23:13 PROBLEM - haproxy failover on dbproxy1003 is CRITICAL: CRITICAL check_failover servers up 1 down 1
23:13 PROBLEM - haproxy failover on dbproxy1006 is CRITICAL: CRITICAL check_failover servers up 1 down 1
23:13 PROBLEM - haproxy failover on dbproxy1008 is CRITICAL: CRITICAL check_failover servers up 1 down 1
23:13 PROBLEM - haproxy failover on dbproxy1001 is CRITICAL: CRITICAL check_failover servers up 1 down 1

The host was indeed down; I was not able to open an SSH connection to it.
I connected to the server's management interface and started a virtual console, but the screen was blank and the console showed no activity.
I checked the host's power status, and it was 'ON'.
I decided to power cycle the host with:

racadm serveraction powercycle

After the power cycle the host booted up.
On the console I saw that the Prometheus exporter was not able to start:

[  OK  ] Started Prometheus exporter for MySQL server.
[  OK  ] Stopped Prometheus exporter for MySQL server.
[FAILED] Failed to start Prometheus exporter for MySQL server.
See 'systemctl status prometheus-mysqld-exporter.service' for details.
[  OK  ] Started LSB: exim Mail Transport Agent.

I started the host's MariaDB instances one by one:

systemctl start mariadb@m1
systemctl start mariadb@m2
systemctl start mariadb@m3
systemctl start mariadb@m5
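The one-by-one starts can be expressed as a short loop. A sketch only: the systemctl call is commented out so the loop itself can be exercised anywhere, and the instance list matches this host.

```shell
#!/bin/sh
# Start each MariaDB instance on this multi-instance host, one at a time,
# recording the order in a log file.
for inst in m1 m2 m3 m5; do
    unit="mariadb@${inst}"
    echo "starting ${unit}"
    # systemctl start "${unit}"    # uncomment when running on the actual host
done > /tmp/start-order.log
```

Starting the instances sequentially (rather than all at once) keeps the InnoDB recovery of each instance from competing for I/O with the others.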

The services started up without any trouble, replication started, and the instances caught up.
According to the output of journalctl -u mariadb@m1 (likewise m2, m3, m5), all recovery (where needed) finished.

All the Icinga checks became OK except the systemd state.
Investigating the prometheus-mysqld-exporter service showed it could not start because the /var/lib/prometheus/.my.cnf file was missing.
I checked db2078 (also a multi-instance host, in the same cluster) and looked in its /var/lib/prometheus home directory for the .my.cnf file. It was there, and it claimed the mysqld instance is available under /run/mysqld/mysqld.sock, which is not true; no MySQL instance uses that socket.
I manually created the same .my.cnf file and started the prometheus-mysqld-exporter, which then reported a running state.
This is not good; we have to puppetize this.
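For illustration only, a sketch of creating such a file. The /tmp path and the per-instance socket name are assumptions, not the contents of the deployed file; the authoritative version should come from Puppet.

```shell
#!/bin/sh
# Hypothetical sketch: write a minimal .my.cnf for the exporter.
# On the host the real path would be /var/lib/prometheus/.my.cnf, and the
# socket below is an assumed per-instance path, not a verified one.
cat > /tmp/prometheus-my.cnf <<'EOF'
[client]
socket = /run/mysqld/mysqld.m1.sock
EOF
```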
After the server became available I started to investigate the root cause of the outage, but I was not able to find anything.
The syslog file has the following entries around the given timestamp:

Oct 27 21:08:44 db1117 systemd[1]: Started Time & Date Service.
Oct 27 21:09:01 db1117 CRON[22366]: (prometheus) CMD (/usr/local/bin/prometheus-puppet-agent-stats --outfile /var/lib/prometheus/node.d/puppet_agent.prom)
^@^@^@^@ ... ^@^@^@^@ (long run of NUL bytes)
Oct 27 21:28:47 db1117 systemd-modules-load[873]: Inserted module 'nf_conntrack'
Oct 27 21:28:47 db1117 systemd-modules-load[873]: Inserted module 'ipmi_devintf'
Oct 27 21:28:47 db1117 systemd[1]: Started Load Kernel Modules.
Oct 27 21:28:47 db1117 systemd[1]: Started Remount Root and Kernel File Systems.
Oct 27 21:28:47 db1117 systemd[1]: Started Create list of required static device nodes for the current kernel.
Oct 27 21:28:47 db1117 systemd[1]: Started LVM2 metadata daemon.
Oct 27 21:28:47 db1117 systemd[1]: Starting Create Static Device Nodes in /dev...
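The ^@ characters above are NUL bytes: the file was mid-write when the host died, and the crash left a zero-filled hole before logging resumed at boot. Stripping them makes the file greppable again; a minimal sketch, using a small sample file in place of the real /var/log/syslog:

```shell
#!/bin/sh
# The ^@ run in syslog is NUL padding left by the crash. Strip it from a
# copy so grep/less behave normally. A two-line sample stands in for the
# real /var/log/syslog here.
printf 'line before crash\0\0\0\0line after reboot\n' > /tmp/syslog.sample
tr -d '\0' < /tmp/syslog.sample > /tmp/syslog.clean
```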

The /var/log/messages file is similarly empty around the incident, jumping straight from before the crash to the reboot:

Oct 27 20:57:08 db1117 puppet-agent-cronjob: INFO:debmonitor:Successfully sent the upgradable update to the DebMonitor server
Oct 27 21:28:47 db1117 kernel: [    0.000000] microcode: microcode updated early to revision 0xb00002e, date = 2018-04-19

In the Prometheus graphs there are some anomalies around that time period, but I am not sure how to interpret them.

Event Timeline

I triaged this as 'high' as we have a server which went down, and we don't know the root cause.
I am not against lowering the priority or even marking the ticket resolved, but I need a second opinion about it.

prometheus-mysqld-exporter should not run on a multi-instance host; there is prometheus-mysqld-exporter@m1, @m2, and so on. That is specified in Puppet.

That was what confused me: after the host came back, this error appeared in Icinga, which I don't remember seeing in the past, so I ran Puppet to see if it would fix everything, but it didn't.

jcrespo added a subscriber: Cmjohnson.

So the error happens because the service was started manually, which is not a big deal if it errors out: just delete any file you may have added. I ran systemctl disable <service manually started that failed> and then systemctl reset-failed, and it should never fail again.

From the logs:

Sat Oct 27 2018 21:11:38  Multi-bit memory errors detected on a memory device at location(s) DIMM_A3.
Sat Oct 27 2018 21:11:38  Multi-bit memory errors detected on a memory device at location(s) DIMM_A3.
Sat Oct 27 2018 21:11:38  A problem was detected in Memory Reference Code (MRC).
Sat Oct 27 2018 21:09:04  Multi-bit memory errors detected on a memory device at location(s) DIMM_A3.
Sat Oct 27 2018 21:09:00  Multi-bit memory errors detected on a memory device at location(s) DIMM_A3.
Sat Oct 27 2018 21:09:00  Correctable memory error rate exceeded for DIMM_A7.
Sat Oct 27 2018 21:08:57  Correctable memory error rate exceeded for DIMM_A7.
2018-10-29T08:25:44-0500  USR0030   Successfully logged in using root, from 10.64.32.25 and GUI.
2018-10-27T21:58:52-0500  USR0032   The session for root from 208.80.154.86 using SSH is logged off.
2018-10-27T21:58:52-0500  LOG007    The previous log entry was repeated 1 times.
2018-10-27T21:26:39-0500  SYS1003   System CPU Resetting.
2018-10-27T21:26:38-0500  SYS1000   System is turning on.
2018-10-27T21:26:31-0500  RAC0702   Requested system powercycle.
2018-10-27T21:26:25-0500  SYS1003   System CPU Resetting.
2018-10-27T21:26:25-0500  SYS1001   System is turning off.
2018-10-27T21:17:20-0500  USR0030   Successfully logged in using root, from 208.80.154.86 and SSH.
2018-10-27T21:11:50-0500  UEFI0081  Memory size has changed from the last time the system was started.
2018-10-27T21:11:49-0500  UEFI0107  One or more memory errors have occurred on memory slot: A3.
2018-10-27T21:11:42-0500  MEM0001   Multi-bit memory errors detected on a memory device at location(s) DIMM_A3.
2018-10-27T21:11:41-0500  MEM0001   Multi-bit memory errors detected on a memory device at location(s) DIMM_A3.
2018-10-27T21:11:38-0500  UEFI0058  An uncorrectable Memory Error has occurred because a Dual Inline Memory Module (DIMM) is not functioning.
2018-10-27T21:11:38-0500  PST0091   A problem was detected in Memory Reference Code (MRC).
2018-10-27T21:09:19-0500  PWR2264   The Intel Management Engine has reported normal system operation.
2018-10-27T21:09:08-0500  RAC0703   Requested system hardreset.
2018-10-27T21:09:08-0500  LOG007    The previous log entry was repeated 1 times.
2018-10-27T21:09:07-0500  SYS1003   System CPU Resetting.
2018-10-27T21:09:07-0500  CPU0000   Internal error has occurred check for additional logs.
2018-10-27T21:09:07-0500  PWR2262   The Intel Management Engine has reported an internal system error.
2018-10-27T21:09:06-0500  SYS1003   System CPU Resetting.
2018-10-27T21:09:04-0500  MEM0001   Multi-bit memory errors detected on a memory device at location(s) DIMM_A3.
2018-10-27T21:09:01-0500  MEM0001   Multi-bit memory errors detected on a memory device at location(s) DIMM_A3.
2018-10-27T21:09:00-0500  MEM0702   Correctable memory error rate exceeded for DIMM_A7.
2018-10-27T21:08:57-0500  MEM0701   Correctable memory error rate exceeded for DIMM_A7.
2018-10-27T21:08:57-0500  LOG007    The previous log entry was repeated 1 times.
2018-05-23T10:21:26-0500  SYS1003   System CPU Resetting.
2018-05-17T05:51:24-0500  BAT1034   The battery properties for Integrated RAID Controller 1 have changed.
2018-04-26T08:11:31-0500  USR0032   The session for root from 10.64.32.20 using IPMI over LAN is logged off.
2018-04-26T08:11:31-0500  USR0030   Successfully logged in using root, from 10.64.32.20 and IPMI over LAN.
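Tallying the SEL entries by event code shows how heavily the memory errors dominate. A sketch; /tmp/sel.txt is a hypothetical file holding a sample of the entries above, with each line as "timestamp code message":

```shell
#!/bin/sh
# Count SEL entries per event code; MEM* codes dominating points at the DIMMs.
cat > /tmp/sel.txt <<'EOF'
2018-10-27T21:11:42-0500 MEM0001 Multi-bit memory errors detected on a memory device at location(s) DIMM_A3.
2018-10-27T21:11:41-0500 MEM0001 Multi-bit memory errors detected on a memory device at location(s) DIMM_A3.
2018-10-27T21:09:04-0500 MEM0001 Multi-bit memory errors detected on a memory device at location(s) DIMM_A3.
2018-10-27T21:09:00-0500 MEM0702 Correctable memory error rate exceeded for DIMM_A7.
2018-10-27T21:08:57-0500 MEM0701 Correctable memory error rate exceeded for DIMM_A7.
2018-10-27T21:09:07-0500 CPU0000 Internal error has occurred check for additional logs.
EOF
# Field 2 is the event code; count occurrences of each.
awk '{ count[$2]++ } END { for (c in count) print c, count[c] }' /tmp/sel.txt | sort
```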

This is clearly a RAM issue. CC @Cmjohnson

What CLI command did you use for getting those logs?

@jcrespo @Banyek It is not clearly a RAM issue; it could be a CPU issue as well, on CPU1. I will need to do a few things to the server: swap the supposedly bad DIMM to the B side and see if the error follows. If the error follows the DIMM, then it's obviously the DIMM; if not, then it's most likely the CPU, and I will need to swap the CPUs to see what happens. It could also be the system board (unlikely).

Can you bring this down for a few mins for me to move the DIMM around please?

Thanks

iDRAC log:


Record: 2
Date/Time: 10/27/2018 21:08:57
Source: system
Severity: Non-Critical

Description: Correctable memory error rate exceeded for DIMM_A7.

Record: 3
Date/Time: 10/27/2018 21:09:00
Source: system
Severity: Critical

Description: Correctable memory error rate exceeded for DIMM_A7.

Record: 4
Date/Time: 10/27/2018 21:09:00
Source: system
Severity: Critical

Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_A3.

Record: 5
Date/Time: 10/27/2018 21:09:04
Source: system
Severity: Critical

Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_A3.

Record: 6
Date/Time: 10/27/2018 21:11:38
Source: system
Severity: Ok

Description: A problem was detected in Memory Reference Code (MRC).

Record: 7
Date/Time: 10/27/2018 21:11:38
Source: system
Severity: Critical

Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_A3.

Record: 8
Date/Time: 10/27/2018 21:11:38
Source: system
Severity: Critical

Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_A3.

Sure! Can I shut down the server now for you? @Cmjohnson

Mentioned in SAL (#wikimedia-operations) [2018-10-29T14:58:46Z] <banyek> shutting down db1117 for hardware maintenance (T208150)

It is not clearly a RAM issue

Sorry, my comment was in the context of "it is not a software issue" and "it expresses/reveals itself as a memory error", so we could discard MySQL issues, which is what we normally suspect first.

There is nothing else left for DBAs here except waiting for errors.

jcrespo lowered the priority of this task from High to Medium. Oct 30 2018, 2:38 PM
Marostegui assigned this task to Cmjohnson.

I see no more errors on the idrac logs since the reboot.
Let's close this and re-open if this happens again and then we'll need to get the vendor involved.

		EnabledState = 2 (Enabled)
		HealthState = 5 (OK)