
db1115 (tendril DB) had OOM for some processes and some hw (memory) issues
Closed, Resolved · Public

Description

db1115 had out of memory (MySQL didn't crash):

Jun 08 06:14:58 db1115 puppet-agent[9904]: Failed to apply catalog: Cannot allocate memory -  /usr/bin/dpkg-query -W --showformat '${Status} ${Package} ${Version}\n' 2>&1
Jun 08 06:15:02 db1115 nrpe[892]: fork() failed with error 12, bailing out...
Jun 08 06:15:03 db1115 systemd[1]: nagios-nrpe-server.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
Jun 08 06:15:03 db1115 systemd[1]: nagios-nrpe-server.service: Failed to fork: Cannot allocate memory
Jun 08 06:15:03 db1115 systemd[1]: nagios-nrpe-server.service: Failed to run 'stop-post' task: Cannot allocate memory
Jun 08 06:15:03 db1115 systemd[1]: nagios-nrpe-server.service: Unit entered failed state.
Jun 08 06:15:03 db1115 systemd[1]: nagios-nrpe-server.service: Failed with result 'exit-code'.
Jun 08 06:15:04 db1115 cron[25646]: /usr/sbin/sendmail: Cannot allocate memory

(fork() error 12 is ENOMEM: the kernel could not allocate memory for new processes.)

On 16 Dec 2018, 32 GB of memory "disappeared" and memory errors were detected.

https://grafana.wikimedia.org/render/dashboard-solo/file/server-board.json?refresh=1m&panelId=14&orgId=1&var-server=db1115&var-network=eth0&from=1525847607665&to=1528439607665&width=1000&height=500&tz=UTC%2B02%3A00

https://grafana.wikimedia.org/render/dashboard-solo/file/server-board.json?refresh=1m&panelId=14&orgId=1&var-server=db1115&var-network=eth0&from=1528436037689&to=1528439637689&width=1000&height=500&tz=UTC%2B02%3A00

Apparently it started swapping on the 22nd at 12:45: https://grafana.wikimedia.org/render/dashboard-solo/file/server-board.json?refresh=1m&panelId=16&orgId=1&var-server=db1115&var-network=eth0&from=1525847676603&to=1528439676604&width=1000&height=500&tz=UTC%2B02%3A00

The MySQL buffer pool size is only 10 GB.
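For reference, the configured value can be checked directly on the host (a minimal sketch; innodb_buffer_pool_size is the standard variable backing the InnoDB pool):

# a sketch: print the configured InnoDB buffer pool size (in bytes)
mysql -e "SHOW GLOBAL VARIABLES LIKE 'innodb_buffer_pool_size';"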

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald Transcript · Jun 8 2018, 6:35 AM
Marostegui updated the task description. (Show Details) · Jun 8 2018, 6:36 AM
Marostegui triaged this task as Normal priority. · Jun 8 2018, 6:48 AM

Nothing in the HW logs; I checked them just in case there was something.

Marostegui moved this task from Triage to In progress on the DBA board. · Jun 8 2018, 7:00 AM
Marostegui updated the task description. (Show Details) · Jun 8 2018, 7:02 AM

Just to be clear, MySQL didn't crash.

mysql:root@localhost [(none)]> show global status like 'Uptime';
+---------------+---------+
| Variable_name | Value   |
+---------------+---------+
| Uptime        | 8621979 |
+---------------+---------+
1 row in set (0.01 sec)
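(8,621,979 seconds of uptime is roughly 99.8 days, i.e. mysqld had been running since well before the incident.)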
Marostegui renamed this task from db1115 (tendril DB) crashed with OOM to db1115 (tendril DB) had OOM for some processes. · Jun 8 2018, 7:06 AM
Marostegui updated the task description. (Show Details)
Marostegui closed this task as Resolved. · Jun 8 2018, 7:31 AM
Marostegui claimed this task.

I am going to close this, as there is not much else I can debug to see what caused it. At least we have it tracked now, and we can re-open it if it happens again.

Vvjjkkii renamed this task from db1115 (tendril DB) had OOM for some processes to pebaaaaaaa. · Jul 1 2018, 1:06 AM
Vvjjkkii reopened this task as Open.
Vvjjkkii removed Marostegui as the assignee of this task.
Vvjjkkii raised the priority of this task from Normal to High.
Vvjjkkii updated the task description. (Show Details)
Vvjjkkii removed a subscriber: Aklapper.
Marostegui renamed this task from pebaaaaaaa to db1115 (tendril DB) had OOM for some processes. · Jul 2 2018, 4:58 AM
Marostegui closed this task as Resolved.
Marostegui claimed this task.
Marostegui lowered the priority of this task from High to Normal.
Marostegui updated the task description. (Show Details)

This happened again on Sunday 16 Dec around 07:14 AM (UTC) and it paged due to Nagios (NRPE) restarting (Tendril also stopped working):

07:17 <+icinga-wm> PROBLEM - puppet last run on db1115 is CRITICAL: connect to address 10.64.0.122 port 5666: Connection refused
07:17 <+icinga-wm> PROBLEM - Check whether ferm is active by checking the default input chain on db1115 is CRITICAL: connect to address 10.64.0.122 port 5666: Connection refused
07:17 <+icinga-wm> PROBLEM - mysqld processes on db1115 is CRITICAL: connect to address 10.64.0.122 port 5666: Connection refused
07:17 <+icinga-wm> PROBLEM - Disk space on db1115 is CRITICAL: connect to address 10.64.0.122 port 5666: Connection refused
07:17 <+icinga-wm> PROBLEM - DPKG on db1115 is CRITICAL: connect to address 10.64.0.122 port 5666: Connection refused
07:17 <+icinga-wm> PROBLEM - Check size of conntrack table on db1115 is CRITICAL: connect to address 10.64.0.122 port 5666: Connection refused
07:17 <+icinga-wm> PROBLEM - MD RAID on db1115 is CRITICAL: connect to address 10.64.0.122 port 5666: Connection refused
07:18 < marostegui> MySQL is up and with a high uptime
07:21 < apergos> Dec 16 07:20:00 db1115 systemd[1]: Reloading.
07:21 < apergos> Dec 16 07:20:00 db1115 systemd[1]: Starting Nagios Remote Plugin Executor...
07:25 < marostegui> Dec 16 07:14:42 db1115 puppet-agent[17019]: Failed to apply catalog: Cannot allocate memory -  /usr/bin/dpkg-query -W --showformat '${Status} ${Package} ${Version}\n' 2>&1
07:28 < marostegui> !log Stop MySQL on db1115 so tendril can get back to work

Tendril stopped working and I had to restart MySQL to get Tendril up again (I also upgraded the kernel and MySQL).
https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=db1115&var-datasource=eqiad%20prometheus%2Fops&var-cluster=mysql&from=1544924936845&to=1544957192845
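For reference, the recovery boiled down to something like the following (a rough sketch only; the exact service unit and package names used on db1115 are assumptions):

# rough sketch of the recovery steps; unit/package names are assumptions
apt-get update && apt-get dist-upgrade   # pull in the newer kernel and mysql packages
systemctl restart mariadb                # restart MySQL so tendril can reconnect
# (a reboot would be needed for the new kernel to take effect)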

Dec 16 07:14:42 db1115 puppet-agent[17019]: Failed to apply catalog: Cannot allocate memory -  /usr/bin/dpkg-query -W --showformat '${Status} ${Package} ${Version}\n' 2>&1
Dec 16 07:14:53 db1115 nrpe[30452]: fork() failed with error 12, bailing out...
Dec 16 07:14:53 db1115 systemd[1]: nagios-nrpe-server.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
Dec 16 07:14:53 db1115 systemd[1]: nagios-nrpe-server.service: Unit entered failed state.
Dec 16 07:14:53 db1115 systemd[1]: nagios-nrpe-server.service: Failed with result 'exit-code'.
jcrespo reopened this task as Open. · Jan 11 2019, 3:38 PM
jcrespo removed Marostegui as the assignee of this task.
jcrespo added a subscriber: jcrespo.

Something else happened on the 17th:

Total memory went down to 96 GB:

root@db1115:~$ free -m
              total        used        free      shared  buff/cache   available
Mem:          96594       80682         485         941       15426       14060
Swap:           953         858          94
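A quick way to see which slots the OS still considers populated is to read the SMBIOS tables (a sketch; requires root and the dmidecode package):

# list per-slot DIMM sizes and locators as reported by the firmware
dmidecode -t memory | grep -E 'Locator|Size'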
jcrespo assigned this task to Cmjohnson. · Jan 11 2019, 3:50 PM
jcrespo added a project: ops-eqiad.
jcrespo added a subscriber: Cmjohnson.

Asking @Cmjohnson to move this memory module around; either it got disconnected or it broke completely. FYI, 128 GB of memory should be detected, but only 96 GB are (aiming for maintenance on Tuesday):

Sun Dec 16 2018 07:33:18	Multi-bit memory errors detected on a memory device at location(s) DIMM_B2.	
Sun Dec 16 2018 07:33:18	A problem was detected in Memory Reference Code (MRC).	
Sun Dec 16 2018 07:33:18	Multi-bit memory errors detected on a memory device at location(s) DIMM_B2.	
Sun Dec 16 2018 07:33:18	A problem was detected in Memory Reference Code (MRC).	
Sun Jul 22 2018 19:05:40	Correctable memory error rate exceeded for DIMM_B2.	
Sun Jul 22 2018 17:39:17	Correctable memory error rate exceeded for DIMM_B2.
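(The numbers are consistent: losing the 32 GB module at DIMM_B2 takes the expected 128 GB down to the ~96 GB the OS now reports.)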
Restricted Application added a project: Operations. · View Herald Transcript · Jan 11 2019, 3:50 PM
jcrespo renamed this task from db1115 (tendril DB) had OOM for some processes to db1115 (tendril DB) had OOM for some processes and some hw (memory) issues. · Jan 11 2019, 3:54 PM
jcrespo updated the task description. (Show Details)

More logs, confirming the module is probably dead:

2018-12-16T07:40:35-0600  PR8       **Device not detected: DDR4 DIMM(Socket B2)**
2018-12-16T07:39:31-0600  SYS1003   System CPU Resetting.
2018-12-16T07:39:30-0600  RAC0703   Requested system hardreset.
2018-12-16T07:39:29-0600  SYS1003   System CPU Resetting.
2018-12-16T07:38:27-0600  USR0030   Successfully logged in using root, from 208.80.153.5 and SSH.
2018-12-16T07:38:18-0600  USR0032   The session for root from 208.80.153.5 using SSH is logged off.
2018-12-16T07:36:58-0600  USR0030   Successfully logged in using root, from 208.80.153.5 and SSH.
2018-12-16T07:33:30-0600  UEFI0081  Memory size has changed from the last time the system was started.
2018-12-16T07:33:30-0600  LOG007    The previous log entry was repeated 1 times.
2018-12-16T07:33:30-0600  UEFI0107  One or more memory errors have occurred on memory slot: B2.
2018-12-16T07:33:22-0600  MEM0001   Multi-bit memory errors detected on a memory device at location(s) DIMM_B2.
2018-12-16T07:33:21-0600  PST0091   A problem was detected in Memory Reference Code (MRC).
2018-12-16T07:33:20-0600  MEM0001   Multi-bit memory errors detected on a memory device at location(s) DIMM_B2.
2018-12-16T07:33:18-0600  UEFI0058  An uncorrectable Memory Error has occurred because a Dual Inline Memory Module (DIMM) is not functioning.
2018-12-16T07:33:18-0600  PST0091   A problem was detected in Memory Reference Code (MRC).
2018-12-16T07:33:18-0600  LOG007    The previous log entry was repeated 1 times.
2018-12-16T06:27:43-0600  SYS1003   System CPU Resetting.
2018-09-26T07:43:47-0500  USR0032   The session for root from 208.80.154.86 using SSH is logged off.
2018-09-26T07:35:39-0500  USR0030   Successfully logged in using root, from 208.80.154.86 and SSH.
2018-09-26T07:35:19-0500  USR0032   The session for root from 208.80.154.86 using SSH is logged off.
2018-09-26T07:35:19-0500  LOG007    The previous log entry was repeated 1 times.
2018-09-26T07:28:00-0500  SYS1003   System CPU Resetting.
2018-09-26T07:20:12-0500  USR0030   Successfully logged in using root, from 208.80.154.86 and SSH.
2018-07-22T19:05:40-0500  MEM0702   Correctable memory error rate exceeded for DIMM_B2.
2018-07-22T17:39:17-0500  MEM0701   Correctable memory error rate exceeded for DIMM_B2.
2018-07-22T17:39:17-0500  LOG007    The previous log entry was repeated 1 times.
2018-07-06T10:07:49-0500  SYS1003   System CPU Resetting.

This host should also be under warranty, I guess?

@Cmjohnson can we request a new DIMM from Dell?

@Marostegui I have to move the DIMM to another slot and see whether the error follows the DIMM or stays with the slot. Can you take this server down and out of rotation on Tuesday? I can do this around the same time as es1019.

Yep, we can do that!
Just ping us when you are ready for it
Thanks!

Mentioned in SAL (#wikimedia-operations) [2019-01-15T13:53:08Z] <marostegui> Downtime db1115 and es1019 for 4 hours - T196726 T213422

@Cmjohnson The most likely scenario is that we move the DIMM and still only detect 96 GB of RAM, and then we will ask you to request a replacement. Otherwise we will reboot it and keep observing.

racadm SEL

root@db1115.mgmt.eqiad.wmnet's password:
/admin1-> racadm getsel
Record: 1
Date/Time: 01/18/2018 11:23:25
Source: system
Severity: Ok

Description: Log cleared.

Record: 2
Date/Time: 07/22/2018 17:39:17
Source: system
Severity: Non-Critical

Description: Correctable memory error rate exceeded for DIMM_B2.

Record: 3
Date/Time: 07/22/2018 19:05:40
Source: system
Severity: Critical

Description: Correctable memory error rate exceeded for DIMM_B2.

Record: 4
Date/Time: 12/16/2018 07:33:18
Source: system
Severity: Ok

Description: A problem was detected in Memory Reference Code (MRC).

Record: 5
Date/Time: 12/16/2018 07:33:18
Source: system
Severity: Critical

Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_B2.

Record: 6
Date/Time: 12/16/2018 07:33:18
Source: system
Severity: Ok

Description: A problem was detected in Memory Reference Code (MRC).

Record: 7
Date/Time: 12/16/2018 07:33:18
Source: system
Severity: Critical

Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_B2.

I swapped the DIMM from A2 to B2 and cleared the log. Please put it back in rotation and let's see if and where the error occurs.
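For reference, clearing the SEL from the management interface is typically done with the counterpart of the getsel command shown above (a sketch, using the same racadm session):

/admin1-> racadm clrsel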

jcrespo changed the task status from Open to Stalled. · Jan 16 2019, 3:10 PM

Stalling; no errors so far, but I doubt this is the last time we hear about this. Backups are on dbstore1001 just in case.

Marostegui closed this task as Resolved. · Jan 26 2019, 3:02 AM

I am going to close this for now, as it has been 10 days without issues:

root@db1115:~# free -g
              total        used        free      shared  buff/cache   available
Mem:            125          65           1           1          59          58
Swap:             0           0           0
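(free -g reports GiB, so the 125 GiB total means the full 128 GB is visible again, versus the ~94 GiB shown while DIMM_B2 was out.)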