
correctable memory errors db1068 (commons primary master database)
Closed, Resolved (Public)

Description

Current status as of the writing of this task: 13 ge 4
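The alert output in this task reads as "value ge threshold", i.e. the error count is greater than or equal to a threshold. The following is only a minimal sketch of that comparison, not the actual check's code: the function names `classify` and `describe` are hypothetical, and the warning and critical thresholds of 2 and 4 are taken from the alert lines quoted in this task.

```python
def classify(count: float, warning: float = 2.0, critical: float = 4.0) -> str:
    """Map an error count to a state using 'greater than or equal' thresholds.

    The defaults (2 and 4) are the thresholds seen in this task's alerts;
    the function itself is an illustrative assumption, not the real plugin.
    """
    if count >= critical:
        return "CRITICAL"
    if count >= warning:
        return "WARNING"
    return "OK"


def describe(count: float, warning: float = 2.0, critical: float = 4.0) -> str:
    """Render the comparison in the 'value ge threshold' style of the alerts."""
    state = classify(count, warning, critical)
    if state == "CRITICAL":
        return f"{count:g} ge {critical:g}"
    if state == "WARNING":
        return f"{count:g} ge {warning:g}"
    return f"(C){critical:g} ge (W){warning:g} ge {count:g}"


if __name__ == "__main__":
    for count in (1, 2, 4.001, 13):
        print(classify(count), "-", describe(count))
```

Under these assumptions, the status of 13 quoted above is well past the critical threshold of 4.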

Service Critical [2019-01-12 09:24:46] SERVICE ALERT: db1068;Memory correctable errors -EDAC-;CRITICAL;HARD;3;4.001 ge 4

Service Warning [2019-01-12 04:22:59] SERVICE ALERT: db1068;Memory correctable errors -EDAC-;WARNING;HARD;3;2 ge 2
Service Warning [2019-01-12 04:17:45] SERVICE ALERT: db1068;Memory correctable errors -EDAC-;WARNING;SOFT;2;2 ge 2
Service Warning [2019-01-12 04:12:33] SERVICE ALERT: db1068;Memory correctable errors -EDAC-;WARNING;SOFT;1;2 ge 2

Service Ok [2018-12-19 03:24:29] SERVICE ALERT: db1068;Memory correctable errors -EDAC-;OK;HARD;3;(C)4 ge (W)2 ge 1

Service Warning [2018-12-15 15:23:27] SERVICE ALERT: db1068;Memory correctable errors -EDAC-;WARNING;HARD;3;2 ge 2
Service Warning [2018-12-15 15:18:15] SERVICE ALERT: db1068;Memory correctable errors -EDAC-;WARNING;SOFT;2;2 ge 2
Service Warning [2018-12-15 15:13:05] SERVICE ALERT: db1068;Memory correctable errors -EDAC-;WARNING;SOFT;1;2 ge 2
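Each alert line above packs several semicolon-separated fields after `SERVICE ALERT:`, which appear to be host, service name, state, state type (SOFT or HARD), attempt number, and check output. As a rough sketch for pulling those fields apart when reviewing the history, something like the following could be used; the `ServiceAlert` type and `parse_alert` function are hypothetical names introduced here, and the field meanings are an assumption based on the lines quoted in this task.

```python
import re
from typing import NamedTuple, Optional


class ServiceAlert(NamedTuple):
    timestamp: str
    host: str
    service: str
    state: str       # e.g. OK, WARNING, CRITICAL
    state_type: str  # SOFT or HARD
    attempt: int
    output: str


# Matches "[timestamp] SERVICE ALERT: host;service;state;type;attempt;output".
# Any text before the opening bracket (such as a status label) is ignored.
ALERT_RE = re.compile(
    r"\[(?P<timestamp>[^\]]+)\]\s+SERVICE ALERT:\s+"
    r"(?P<host>[^;]+);(?P<service>[^;]+);(?P<state>[^;]+);"
    r"(?P<state_type>[^;]+);(?P<attempt>\d+);(?P<output>.*)"
)


def parse_alert(line: str) -> Optional[ServiceAlert]:
    """Return the parsed alert, or None if the line is not a SERVICE ALERT."""
    match = ALERT_RE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["attempt"] = int(fields["attempt"])
    return ServiceAlert(**fields)


if __name__ == "__main__":
    sample = (
        "[2019-01-12 09:24:46] SERVICE ALERT: db1068;"
        "Memory correctable errors -EDAC-;CRITICAL;HARD;3;4.001 ge 4"
    )
    alert = parse_alert(sample)
    print(alert.host, alert.state, alert.state_type, alert.attempt, alert.output)
```

Filtering the parsed alerts on `state_type == "HARD"` would leave only the confirmed state changes, which is what the timeline in this task is mostly concerned with.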

Details

Related Gerrit Patches:
[operations/mediawiki-config@master] mariadb: Depool db1103 from s2 and s4
[operations/mediawiki-config@master] mariadb: Repool db1091 with low load after maintenance
[operations/mediawiki-config@master] mariadb: Depool db1091 for maintenance
[operations/mediawiki-config@master] mariadb: Depool db1081 for maintenance

Event Timeline

jcrespo created this task. Jan 13 2019, 7:27 PM
Restricted Application added a subscriber: Aklapper. Jan 13 2019, 7:27 PM
jcrespo renamed this task from correctable memory errors db1068 (commons primary master database to correctable memory errors db1068 (commons primary master database). Jan 13 2019, 7:30 PM
jcrespo updated the task description.
jcrespo added a subscriber: Marostegui.
Marostegui added a comment. Edited Jan 13 2019, 8:09 PM

These have shown up before and normally clear on their own after a few days.
I guess these hosts are too old and are showing, more often than usual, symptoms that they need to be retired. That will happen once T211613: rack/setup/install db11[26-38].eqiad.wmnet is ready.

Removing the ops-eqiad tag, as there is no action needed from Chris at this point.

Marostegui moved this task from Triage to In progress on the DBA board. Jan 14 2019, 8:54 AM

I created to track it; the error count has gone up to 21 since yesterday. We have to consider the possibility of the host crashing due to uncorrectable errors and be prepared for a failover.

Change 484213 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/mediawiki-config@master] mariadb: Depool db1081 for maintenance

https://gerrit.wikimedia.org/r/484213

CDanis triaged this task as High priority. Jan 14 2019, 2:39 PM

Change 484213 merged by jenkins-bot:
[operations/mediawiki-config@master] mariadb: Depool db1081 for maintenance

https://gerrit.wikimedia.org/r/484213

Change 484368 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/mediawiki-config@master] mariadb: Depool db1091 for maintenance

https://gerrit.wikimedia.org/r/484368

Change 484368 merged by jenkins-bot:
[operations/mediawiki-config@master] mariadb: Depool db1091 for maintenance

https://gerrit.wikimedia.org/r/484368

Change 484399 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/mediawiki-config@master] mariadb: Repool db1091 with low load after maintenance

https://gerrit.wikimedia.org/r/484399

Change 484400 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/mediawiki-config@master] mariadb: Depool db1103 from s2 and s4

https://gerrit.wikimedia.org/r/484400

Change 484399 merged by jenkins-bot:
[operations/mediawiki-config@master] mariadb: Repool db1091 with low load after maintenance

https://gerrit.wikimedia.org/r/484399

Change 484400 merged by jenkins-bot:
[operations/mediawiki-config@master] mariadb: Depool db1103 from s2 and s4

https://gerrit.wikimedia.org/r/484400

jcrespo closed this task as Resolved. Jan 18 2019, 10:47 AM
jcrespo claimed this task.

The errors are mostly gone. Resolving, and keeping an eye on it in case it happens again.

CDanis reopened this task as Open. Feb 4 2019, 2:58 PM
CDanis added a subscriber: CDanis.

Seems like it is happening again

[times are CET]
[15:34] <moritzm> there's an EDAC Icinga alert for db1068, system is OOW, known/worth opening a Phab task? (sometimes we have the same DIMM module from a decomed server as a replacement)
[15:37] <jynus> moritzm: https://phabricator.wikimedia.org/T213664
[15:37] <jynus> it comes and goes
[15:38] <jynus> as long as it holds up, it has to wait until decommission T211613
[15:38] <stashbot> T211613: rack/setup/install db11[26-38].eqiad.wmnet - https://phabricator.wikimedia.org/T211613

T211613: rack/setup/install db11[26-38].eqiad.wmnet

Marostegui closed this task as Resolved. Feb 7 2019, 9:02 AM

And back again: RECOVERY - EDAC syslog messages on db1068 is OK: (C)4 ge (W)2 ge 1
As Jaime said (T213664#4924636), this won't be fully gone until the host is fully decommissioned.

Thanks for letting us know!
This master will be replaced once the hosts at T211613: rack/setup/install db11[26-38].eqiad.wmnet are racked and installed.

Dzahn removed a subscriber: Dzahn. May 7 2019, 9:48 PM
Marostegui closed this task as Resolved. May 11 2019, 5:12 AM

It recovered again. It still needs replacement, though, as I'm sure it will become critical again soonish.
Closing again for now, until it happens again.

06:06:27 <+ icinga-wm> RECOVERY - EDAC syslog messages on db1068 is OK: (C)4 ge (W)2 ge 1 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=db1068&var-datasource=eqiad+prometheus/ops

It now says: CRITICAL: Devices (12) not equal to PDs (2)

Ignore the above, that is unrelated.

For the record, the master failover for this host will be scheduled for the 19th of June.

This host is no longer a master and will be decommissioned in a few days.