db1069 (x1 master) memory errors
Closed, Resolved (Public)

Description

db1069 has reported memory errors:

Service
Memory correctable errors -EDAC-
On Host
db1069

As per HP documentation we'd need to restart the host and upgrade the BIOS.
db1069 is the x1 primary master, so let's wait until the DC failover, when we can work on it without causing downtime.

Event Timeline

Marostegui changed the task status from Open to Stalled.
Marostegui triaged this task as Medium priority.
Marostegui moved this task from Triage to Pending comment on the DBA board.

It recovered by itself:

15:02 < icinga-wm> RECOVERY - Memory correctable errors -EDAC- on db1069 is OK: (C)4 ge (W)2 ge 1 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=db1069&var-datasource=eqiad%2520prometheus%252Fops

https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=db1069&service=Memory+correctable+errors+-EDAC-
Nothing in the HW logs.

So I am going to close this for now and we'll see if it happens again.

Indeed, that can happen: the alert counts errors over the last four days, so if no new errors come in, the alert will recover on its own.
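
To illustrate why the alert clears without any intervention, here is a minimal sketch of a four-day sliding-window check. It is not the actual Icinga/Prometheus check: the thresholds (warning at 2, critical at 4) are taken from the "(C)4 ge (W)2" output above, while the function names, the event timestamps, and the simple event-counting model are assumptions made for illustration only.

```python
from datetime import datetime, timedelta

WARNING_THRESHOLD = 2        # "(W)2" in the Icinga output
CRITICAL_THRESHOLD = 4       # "(C)4" in the Icinga output
WINDOW = timedelta(days=4)   # "errors over four days"


def errors_in_window(error_times, now, window=WINDOW):
    """Count error events that fall inside the window ending at `now`."""
    return sum(1 for t in error_times if now - window < t <= now)


def classify(value, warning=WARNING_THRESHOLD, critical=CRITICAL_THRESHOLD):
    """Map a value to an alert state using 'greater or equal' thresholds."""
    if value >= critical:
        return "CRITICAL"
    if value >= warning:
        return "WARNING"
    return "OK"


# Hypothetical burst of 8 correctable errors, one per minute.
burst_start = datetime(2019, 1, 1, 12, 0)
errors = [burst_start + timedelta(minutes=i) for i in range(8)]

# Re-evaluate the check 1, 3 and 5 days after the burst.
for days_later in (1, 3, 5):
    now = burst_start + timedelta(days=days_later)
    count = errors_in_window(errors, now)
    print(days_later, count, classify(count))
```

In this sketch the burst keeps the check CRITICAL while it is still inside the window; once four days pass with no new errors, the count drops to zero and the state returns to OK, even though nothing was done to the host.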

Dzahn subscribed.

https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=db1069&service=Memory+correctable+errors+-EDAC-

Service
Memory correctable errors -EDAC-
On Host
db1069
(db1069)

Current Status: CRITICAL (for 0d 5h 4m 29s)
Status Information: 8.001 ge 4

Marostegui claimed this task.

As happened before, this recovered by itself, so closing for now:

04:26 < icinga-wm> RECOVERY - Memory correctable errors -EDAC- on db1069 is OK: (C)4 ge (W)2 ge 1 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=db1069&var-datasource=eqiad%2520prometheus%252Fops

Just for the record:

db1069

Memory correctable errors -EDAC-
Status: WARNING
Last Check: 2019-02-20 10:45:24
Duration: 2d 19h 28m 54s
Attempt: 3/3
Status Information: 2 ge 2