correctable memory errors db1068 (commons primary master database)
Closed, Resolved (Public)

Description

Current status as of the writing of this task: 13 ge 4

Service Critical[2019-01-12 09:24:46] SERVICE ALERT: db1068;Memory correctable errors -EDAC-;CRITICAL;HARD;3;4.001 ge 4


Service Warning[2019-01-12 04:22:59] SERVICE ALERT: db1068;Memory correctable errors -EDAC-;WARNING;HARD;3;2 ge 2
Service Warning[2019-01-12 04:17:45] SERVICE ALERT: db1068;Memory correctable errors -EDAC-;WARNING;SOFT;2;2 ge 2
Service Warning[2019-01-12 04:12:33] SERVICE ALERT: db1068;Memory correctable errors -EDAC-;WARNING;SOFT;1;2 ge 2

Service Ok[2018-12-19 03:24:29] SERVICE ALERT: db1068;Memory correctable errors -EDAC-;OK;HARD;3;(C)4 ge (W)2 ge 1

Service Warning[2018-12-15 15:23:27] SERVICE ALERT: db1068;Memory correctable errors -EDAC-;WARNING;HARD;3;2 ge 2
Service Warning[2018-12-15 15:18:15] SERVICE ALERT: db1068;Memory correctable errors -EDAC-;WARNING;SOFT;2;2 ge 2
Service Warning[2018-12-15 15:13:05] SERVICE ALERT: db1068;Memory correctable errors -EDAC-;WARNING;SOFT;1;2 ge 2
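
For anyone who wants to tally these alerts, here is a minimal sketch of how the log lines above could be split into fields and how a count might map to a state. It assumes the semicolon-separated layout seen in the lines above (host;service;state;state type;attempt;output) and that "ge" means "greater than or equal to", with a warning threshold of 2 and a critical threshold of 4 as shown in the output. The names `parse_alert`, `classify`, and `Alert` are hypothetical helpers, not part of Icinga or of any tooling used on this task.

```python
import re
from typing import NamedTuple

# Thresholds as read from the alert output above ("2 ge 2", "4.001 ge 4").
# Assumption: "ge" means "greater than or equal to".
WARN_THRESHOLD = 2.0
CRIT_THRESHOLD = 4.0

# Matches "[timestamp] SERVICE ALERT: host;service;state;state_type;attempt;output".
# Any text before the opening bracket (e.g. "Service Warning") is ignored.
ALERT_RE = re.compile(
    r"\[(?P<timestamp>[^\]]+)\]\s+SERVICE ALERT:\s+"
    r"(?P<host>[^;]+);(?P<service>[^;]+);(?P<state>[^;]+);"
    r"(?P<state_type>[^;]+);(?P<attempt>\d+);(?P<output>.*)$"
)


class Alert(NamedTuple):
    timestamp: str
    host: str
    service: str
    state: str
    state_type: str
    attempt: int
    output: str


def parse_alert(line: str) -> Alert:
    """Split one SERVICE ALERT log line into its fields."""
    match = ALERT_RE.search(line.strip())
    if match is None:
        raise ValueError(f"not a SERVICE ALERT line: {line!r}")
    fields = match.groupdict()
    return Alert(
        timestamp=fields["timestamp"],
        host=fields["host"],
        service=fields["service"],
        state=fields["state"],
        state_type=fields["state_type"],
        attempt=int(fields["attempt"]),
        output=fields["output"],
    )


def classify(value: float) -> str:
    """Map an error count to a state using the assumed thresholds."""
    if value >= CRIT_THRESHOLD:
        return "CRITICAL"
    if value >= WARN_THRESHOLD:
        return "WARNING"
    return "OK"
```

Under these assumptions, `classify(13)` yields `"CRITICAL"`, which matches the "13 ge 4" status quoted at the top of the description, and `classify(1)` yields `"OK"`, matching the recovery line.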

Event Timeline

jcrespo renamed this task from correctable memory errors db1068 (commons primary master database to correctable memory errors db1068 (commons primary master database). Jan 13 2019, 7:30 PM
jcrespo updated the task description.
jcrespo added a subscriber: Marostegui.

Those have shown up before and normally clear up by themselves after a few days.
I guess these hosts are too old and are showing, more often than usual, symptoms that they need to be retired, which will happen once we have T211613: rack/setup/install db11[26-38].eqiad.wmnet ready.

Removing the ops-eqiad tag as there is no action needed from Chris at this point.

I created to track it, it has gone up to 21 since yesterday. We have to consider the possibility of it crashing due to uncorrectable errors and be prepared for a failover.

Change 484213 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/mediawiki-config@master] mariadb: Depool db1081 for maintenance

https://gerrit.wikimedia.org/r/484213

CDanis triaged this task as High priority. Jan 14 2019, 2:39 PM

Change 484213 merged by jenkins-bot:
[operations/mediawiki-config@master] mariadb: Depool db1081 for maintenance

https://gerrit.wikimedia.org/r/484213

Change 484368 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/mediawiki-config@master] mariadb: Depool db1091 for maintenance

https://gerrit.wikimedia.org/r/484368

Change 484368 merged by jenkins-bot:
[operations/mediawiki-config@master] mariadb: Depool db1091 for maintenance

https://gerrit.wikimedia.org/r/484368

Change 484399 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/mediawiki-config@master] mariadb: Repool db1091 with low load after maintenance

https://gerrit.wikimedia.org/r/484399

Change 484400 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/mediawiki-config@master] mariadb: Depool db1103 from s2 and s4

https://gerrit.wikimedia.org/r/484400

Change 484399 merged by jenkins-bot:
[operations/mediawiki-config@master] mariadb: Repool db1091 with low load after maintenance

https://gerrit.wikimedia.org/r/484399

Change 484400 merged by jenkins-bot:
[operations/mediawiki-config@master] mariadb: Depool db1103 from s2 and s4

https://gerrit.wikimedia.org/r/484400

jcrespo claimed this task.

The errors are mostly gone. Resolving, and I will keep an eye on it in case they happen again.

CDanis subscribed.

Seems like it is happening again

[times are CET]
[15:34] <moritzm> there's an EDAC Icinga alert for db1068, system is OOW, known/worth opening a Phab task? (sometimes we have the same DIMM module from a decomed server as a replacement)
[15:37] <jynus> moritzm: https://phabricator.wikimedia.org/T213664
[15:37] <jynus> it comes and goes
[15:38] <jynus> as long as it holds up, it has to wait until decommision T211613
[15:38] <stashbot> T211613: rack/setup/install db11[26-38].eqiad.wmnet - https://phabricator.wikimedia.org/T211613

T211613: rack/setup/install db11[26-38].eqiad.wmnet

And back again: RECOVERY - EDAC syslog messages on db1068 is OK: (C)4 ge (W)2 ge 1
As Jaime said in T213664#4924636, this won't be fully gone until the host is fully decommissioned.

Thanks for letting us know!
This master will be replaced once the hosts at T211613: rack/setup/install db11[26-38].eqiad.wmnet are racked and installed.

It recovered again, but the host still needs replacement, as I'm sure it will become critical again soonish.
Closing again for now, until it recurs.

06:06:27 <+ icinga-wm> RECOVERY - EDAC syslog messages on db1068 is OK: (C)4 ge (W)2 ge 1 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=db1068&var-datasource=eqiad+prometheus/ops

It now says: CRITICAL: Devices (12) not equal to PDs (2)

Ignore the above, that is unrelated.

For the record, the master failover for this host will be scheduled for the 19th June.

This host is no longer a master and will be decommissioned in a few days.