db1069 (x1 master) memory errors
Closed, Resolved, Public

Description

db1069 has reported memory errors

Service: Memory correctable errors -EDAC-
On Host: db1069

As per HP documentation we'd need to restart the host and upgrade the BIOS.
db1069 is x1 primary master, so let's wait until the DC failover so we can operate it without causing downtime.

Event Timeline

Marostegui changed the task status from Open to Stalled. Aug 3 2018, 5:22 AM
Marostegui triaged this task as Normal priority.
Marostegui created this task.
Marostegui moved this task from Triage to Next on the DBA board.
Cmjohnson moved this task from Backlog to Up next on the ops-eqiad board. Aug 3 2018, 2:44 PM
Marostegui closed this task as Resolved. Aug 7 2018, 1:06 PM

It recovered itself:

˜/icinga-wm 15:02> RECOVERY - Memory correctable errors -EDAC- on db1069 is OK: (C)4 ge (W)2 ge 1 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=db1069&var-datasource=eqiad%2520prometheus%252Fops

https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=db1069&service=Memory+correctable+errors+-EDAC-
Nothing on HW logs.

So I am going to close this for now and we'll see if it happens again.

Indeed, this can happen: the alert counts errors over the last four days, so if no new errors come in, the alert recovers on its own.
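The recovery behaviour described above can be sketched as follows. This is only an illustration, not the actual Icinga check: the function names, the sample data, and the simple "last minus first" difference are assumptions. The thresholds (warning at 2, critical at 4) and the four-day window are taken from the alert output quoted in this task. The real check reports fractional values such as 8.001, so it evidently computes the figure differently.

```python
from datetime import datetime, timedelta

# Values taken from the alert text in this task: "(C)4 ge (W)2", four-day window.
WINDOW = timedelta(days=4)
WARNING_THRESHOLD = 2
CRITICAL_THRESHOLD = 4


def errors_in_window(samples, now, window=WINDOW):
    """Return the growth of a cumulative error counter inside the window.

    `samples` is a hypothetical list of (timestamp, cumulative_count) pairs,
    sorted by timestamp. Counter resets (e.g. after a reboot) are not handled.
    """
    counts = [count for ts, count in samples if now - window <= ts <= now]
    if len(counts) < 2:
        return 0
    return counts[-1] - counts[0]


def classify(value, warning=WARNING_THRESHOLD, critical=CRITICAL_THRESHOLD):
    """Map a windowed error count to an Icinga-style state ("ge" = >=)."""
    if value >= critical:
        return "CRITICAL"
    if value >= warning:
        return "WARNING"
    return "OK"


# Made-up samples: eight errors appear on Oct 13, then nothing new.
samples = [
    (datetime(2018, 10, 12), 0),
    (datetime(2018, 10, 13), 8),
    (datetime(2018, 10, 16), 8),
    (datetime(2018, 10, 17), 8),
]

during = classify(errors_in_window(samples, now=datetime(2018, 10, 13)))
after = classify(errors_in_window(samples, now=datetime(2018, 10, 18)))
print(during, after)
```

Once the burst of errors falls outside the four-day window, the windowed count drops to zero and the state returns to OK, even though nothing was done to the hardware.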

Dzahn reopened this task as Open. Oct 13 2018, 3:13 AM
Dzahn added a subscriber: Dzahn.

https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=db1069&service=Memory+correctable+errors+-EDAC-

Service: Memory correctable errors -EDAC-
On Host: db1069

Current Status: CRITICAL (for 0d 5h 4m 29s)
Status Information: 8.001 ge 4

Marostegui moved this task from Next to In progress on the DBA board. Oct 13 2018, 7:45 AM
Marostegui closed this task as Resolved. Oct 17 2018, 5:10 AM
Marostegui claimed this task.

As happened before, this recovered by itself, so I am closing it for now:

04:26 < icinga-wm> RECOVERY - Memory correctable errors -EDAC- on db1069 is OK: (C)4 ge (W)2 ge 1 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=db1069&var-datasource=eqiad%2520prometheus%252Fops

Just for the record

Host: db1069
Service: Memory correctable errors -EDAC-
Status: WARNING
Last Check: 2019-02-20 10:45:24
Duration: 2d 19h 28m 54s
Attempt: 3/3
Status Information: 2 ge 2