Current status as of the writing of this task: 13 ge 4
Service Critical [2019-01-12 09:24:46] SERVICE ALERT: db1068;Memory correctable errors -EDAC-;CRITICAL;HARD;3;4.001 ge 4
Service Warning [2019-01-12 04:22:59] SERVICE ALERT: db1068;Memory correctable errors -EDAC-;WARNING;HARD;3;2 ge 2
Service Warning [2019-01-12 04:17:45] SERVICE ALERT: db1068;Memory correctable errors -EDAC-;WARNING;SOFT;2;2 ge 2
Service Warning [2019-01-12 04:12:33] SERVICE ALERT: db1068;Memory correctable errors -EDAC-;WARNING;SOFT;1;2 ge 2
Service Ok [2018-12-19 03:24:29] SERVICE ALERT: db1068;Memory correctable errors -EDAC-;OK;HARD;3;(C)4 ge (W)2 ge 1
Service Warning [2018-12-15 15:23:27] SERVICE ALERT: db1068;Memory correctable errors -EDAC-;WARNING;HARD;3;2 ge 2
Service Warning [2018-12-15 15:18:15] SERVICE ALERT: db1068;Memory correctable errors -EDAC-;WARNING;SOFT;2;2 ge 2
Service Warning [2018-12-15 15:13:05] SERVICE ALERT: db1068;Memory correctable errors -EDAC-;WARNING;SOFT;1;2 ge 2
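The alert history above uses the semicolon-separated `SERVICE ALERT` layout of Nagios/Icinga logs (host; service; state; state type; attempt; plugin output). As a rough sketch for anyone who wants to tally these entries, the snippet below parses one such entry. The `ServiceAlert` type and `parse_service_alert` function are hypothetical names made up for this illustration, not part of any Icinga tooling.

```python
import re
from typing import NamedTuple


class ServiceAlert(NamedTuple):
    """One parsed SERVICE ALERT entry (hypothetical helper type)."""
    timestamp: str
    host: str
    service: str
    state: str        # OK, WARNING, CRITICAL, ...
    state_type: str   # SOFT or HARD
    attempt: int
    output: str


# Matches "[<timestamp>] SERVICE ALERT: <semicolon-separated fields>"
ALERT_RE = re.compile(r"\[(?P<ts>[^\]]+)\] SERVICE ALERT: (?P<body>.*)")


def parse_service_alert(line: str) -> ServiceAlert:
    """Parse a single alert entry; raises ValueError if it does not match."""
    match = ALERT_RE.search(line)
    if match is None:
        raise ValueError(f"not a SERVICE ALERT line: {line!r}")
    # Split into at most 6 fields so the plugin output may itself contain ';'
    host, service, state, state_type, attempt, output = (
        match.group("body").split(";", 5)
    )
    return ServiceAlert(
        match.group("ts"), host, service, state, state_type,
        int(attempt), output.strip(),
    )


alert = parse_service_alert(
    "[2019-01-12 09:24:46] SERVICE ALERT: db1068;"
    "Memory correctable errors -EDAC-;CRITICAL;HARD;3;4.001 ge 4"
)
print(alert.state, alert.state_type, alert.output)  # CRITICAL HARD 4.001 ge 4
```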
Description
Details
Status | Subtype | Assigned | Task
---|---|---|---
Resolved | | Marostegui | T217396 Decommission db1061-db1073
Resolved | | jcrespo | T213664 correctable memory errors db1068 (commons primary master database)
Event Timeline
Those have shown up before and normally correct themselves after a few days.
I guess these hosts are too old and are showing, more often than usual, symptoms that they need to be retired, which will happen once T211613: rack/setup/install db11[26-38].eqiad.wmnet is ready.
I created to track it. It has gone up to 21 since yesterday. We have to consider the possibility of it crashing due to uncorrectable errors and be prepared for a failover.
Change 484213 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/mediawiki-config@master] mariadb: Depool db1081 for maintenance
Change 484213 merged by jenkins-bot:
[operations/mediawiki-config@master] mariadb: Depool db1081 for maintenance
Change 484368 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/mediawiki-config@master] mariadb: Depool db1091 for maintenance
Change 484368 merged by jenkins-bot:
[operations/mediawiki-config@master] mariadb: Depool db1091 for maintenance
Change 484399 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/mediawiki-config@master] mariadb: Repool db1091 with low load after maintenance
Change 484400 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/mediawiki-config@master] mariadb: Depool db1103 from s2 and s4
Change 484399 merged by jenkins-bot:
[operations/mediawiki-config@master] mariadb: Repool db1091 with low load after maintenance
Change 484400 merged by jenkins-bot:
[operations/mediawiki-config@master] mariadb: Depool db1103 from s2 and s4
The errors are mostly gone. Resolving, and keeping an eye on it in case it happens again.
[15:34] <moritzm> there's an EDAC Icinga alert for db1068, system is OOW, known/worth opening a Phab task? (sometimes we have the same DIMM module from a decomed server as a replacement)
[15:37] <jynus> moritzm: https://phabricator.wikimedia.org/T213664
[15:37] <jynus> it comes and goes
[15:38] <jynus> as long as it holds up, it has to wait until decommision T211613
[15:38] <stashbot> T211613: rack/setup/install db11[26-38].eqiad.wmnet - https://phabricator.wikimedia.org/T211613
And back again: RECOVERY - EDAC syslog messages on db1068 is OK: (C)4 ge (W)2 ge 1
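The recovery message above, `(C)4 ge (W)2 ge 1`, can be read as "critical threshold 4 ≥ warning threshold 2 ≥ current value 1", while outputs such as `13 ge 4` and `2 ge 2` show the value reaching the critical or warning threshold. A minimal sketch of that reading follows. The thresholds 2 and 4 and the "greater or equal" comparison are inferred from the messages quoted in this task, not taken from the check's source, and `classify` is a hypothetical name.

```python
def classify(value: float, warning: float = 2, critical: float = 4) -> str:
    """Map an error count to a state, assuming 'ge' (>=) thresholds.

    The defaults (warning=2, critical=4) are assumptions inferred from
    the "(C)4 ge (W)2 ge 1" output quoted in this task.
    """
    if value >= critical:
        return "CRITICAL"
    if value >= warning:
        return "WARNING"
    return "OK"


# Values taken from the alert outputs in this task
for value in (13, 4.001, 2, 1):
    print(value, classify(value))
```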
As Jaime said in T213664#4924636, this won't be fully gone until the host is fully decommissioned.
https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=db1068&service=EDAC+syslog+messages
https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=db1068&service=Memory+correctable+errors+-EDAC-
EDAC syslog messages & Memory correctable errors -EDAC- are alerting again on db1068
Thanks for letting us know!
This master will be replaced once the hosts at T211613: rack/setup/install db11[26-38].eqiad.wmnet are racked and installed.
It recovered again. It still needs replacement, though, as I'm sure it will become critical again soonish.
Closing again for now, until it happens again.
06:06:27 <+ icinga-wm> RECOVERY - EDAC syslog messages on db1068 is OK: (C)4 ge (W)2 ge 1 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=db1068&var-datasource=eqiad+prometheus/ops
For the record, the master failover for this host will be scheduled for the 19th of June.