correctable memory errors db1068 (commons primary master database)
Closed, Resolved (Public)

Assigned To

Authored By

	jcrespo
	Jan 13 2019, 7:27 PM

Description

Current status as of the writing of this task: 13 ge 4

Service Critical[2019-01-12 09:24:46] SERVICE ALERT: db1068;Memory correctable errors -EDAC-;CRITICAL;HARD;3;4.001 ge 4


Service Warning[2019-01-12 04:22:59] SERVICE ALERT: db1068;Memory correctable errors -EDAC-;WARNING;HARD;3;2 ge 2
Service Warning[2019-01-12 04:17:45] SERVICE ALERT: db1068;Memory correctable errors -EDAC-;WARNING;SOFT;2;2 ge 2
Service Warning[2019-01-12 04:12:33] SERVICE ALERT: db1068;Memory correctable errors -EDAC-;WARNING;SOFT;1;2 ge 2

Service Ok[2018-12-19 03:24:29] SERVICE ALERT: db1068;Memory correctable errors -EDAC-;OK;HARD;3;(C)4 ge (W)2 ge 1

Service Warning[2018-12-15 15:23:27] SERVICE ALERT: db1068;Memory correctable errors -EDAC-;WARNING;HARD;3;2 ge 2
Service Warning[2018-12-15 15:18:15] SERVICE ALERT: db1068;Memory correctable errors -EDAC-;WARNING;SOFT;2;2 ge 2
Service Warning[2018-12-15 15:13:05] SERVICE ALERT: db1068;Memory correctable errors -EDAC-;WARNING;SOFT;1;2 ge 2
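
The trailing field of each alert above (`2 ge 2`, `4.001 ge 4`, `(C)4 ge (W)2 ge 1`) appears to read as "value ge threshold", with a warning threshold of 2 and a critical threshold of 4. A minimal sketch of that comparison follows. The function name `classify` and the default thresholds are assumptions inferred from the log lines, not the actual Icinga check.

```python
def classify(value, warn=2.0, crit=4.0):
    """Map a correctable-error value to an alert state.

    Thresholds are inferred from the alert text in this task
    (WARNING at >= 2, CRITICAL at >= 4); they are assumptions.
    """
    if value >= crit:
        return "CRITICAL"
    if value >= warn:
        return "WARNING"
    return "OK"


# Values taken from the alert lines quoted in the description.
for sample in (1, 2, 4.001, 13):
    print(sample, classify(sample))
```

Under this reading, the "13 ge 4" status quoted at the top of the description is well past the critical threshold.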

Details

Subject	Repo	Branch	Lines +/-
mariadb: Depool db1103 from s2 and s4	operations/mediawiki-config	master	+12 -12
mariadb: Repool db1091 with low load after maintenance	operations/mediawiki-config	master	+1 -1
mariadb: Depool db1091 for maintenance	operations/mediawiki-config	master	+1 -1
mariadb: Depool db1081 for maintenance	operations/mediawiki-config	master	+3 -2

Related Objects

		Status	Subtype	Assigned	Task
		Resolved		Marostegui	T217396 Decommission db1061-db1073
		Resolved		jcrespo	T213664 correctable memory errors db1068 (commons primary master database)

Event Timeline

jcrespo created this task. Jan 13 2019, 7:27 PM

Restricted Application added a subscriber: Aklapper. Jan 13 2019, 7:27 PM

Paladox added projects: SRE, DBA. Jan 13 2019, 7:29 PM

jcrespo renamed this task from "correctable memory errors db1068 (commons primary master database" to "correctable memory errors db1068 (commons primary master database)". Jan 13 2019, 7:30 PM

jcrespo updated the task description.

jcrespo added a subscriber: Marostegui.

Reedy added a project: ops-eqiad. Jan 13 2019, 7:31 PM

Those have shown up before and normally correct themselves after a few days.
I guess these hosts are too old and are showing, more often than usual, symptoms that they need to be retired. That will happen once T211613: rack/setup/install db11[26-38].eqiad.wmnet is ready.

Removing the ops-eqiad tag, as no action is needed from Chris at this point.

Marostegui moved this task from Triage to In progress on the DBA board. Jan 14 2019, 8:54 AM

I created to track it, it has gone up to 21 since yesterday. We have to consider the possibility of it crashing due to uncorrectable errors and be prepared for a failover.
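
As a way to watch the counters directly on a host, the Linux EDAC subsystem exposes per-memory-controller correctable and uncorrectable error counts under sysfs. The sketch below sums them. It is an illustration only, not the check used for the alerts in this task; the function name `read_edac_counts` is hypothetical, and it assumes the standard `/sys/devices/system/edac/mc/mc*/ce_count` and `ue_count` files are present.

```python
from pathlib import Path


def read_edac_counts(base="/sys/devices/system/edac/mc"):
    """Sum correctable (ce) and uncorrectable (ue) error counts
    across all memory controllers found under `base`.

    Returns zeros if no controller directories exist.
    """
    totals = {"ce": 0, "ue": 0}
    for mc in sorted(Path(base).glob("mc[0-9]*")):
        for kind in totals:
            counter = mc / f"{kind}_count"
            if counter.is_file():
                totals[kind] += int(counter.read_text().strip())
    return totals


if __name__ == "__main__":
    counts = read_edac_counts()
    print(f"correctable={counts['ce']} uncorrectable={counts['ue']}")
```

A non-zero uncorrectable count would be the signal to start a failover rather than wait for decommissioning.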

Change 484213 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/mediawiki-config@master] mariadb: Depool db1081 for maintenance

https://gerrit.wikimedia.org/r/484213

gerritbot added a project: Patch-For-Review. Jan 14 2019, 12:26 PM

CDanis triaged this task as High priority. Jan 14 2019, 2:39 PM

Change 484213 merged by jenkins-bot:
[operations/mediawiki-config@master] mariadb: Depool db1081 for maintenance

https://gerrit.wikimedia.org/r/484213

Change 484368 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/mediawiki-config@master] mariadb: Depool db1091 for maintenance

https://gerrit.wikimedia.org/r/484368

Change 484368 merged by jenkins-bot:
[operations/mediawiki-config@master] mariadb: Depool db1091 for maintenance

https://gerrit.wikimedia.org/r/484368

Change 484399 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/mediawiki-config@master] mariadb: Repool db1091 with low load after maintenance

https://gerrit.wikimedia.org/r/484399

Change 484400 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/mediawiki-config@master] mariadb: Depool db1103 from s2 and s4

https://gerrit.wikimedia.org/r/484400

Change 484399 merged by jenkins-bot:
[operations/mediawiki-config@master] mariadb: Repool db1091 with low load after maintenance

https://gerrit.wikimedia.org/r/484399

Change 484400 merged by jenkins-bot:
[operations/mediawiki-config@master] mariadb: Depool db1103 from s2 and s4

https://gerrit.wikimedia.org/r/484400

The errors are mostly gone. Resolving, and keeping an eye on it in case they come back.

Seems like it is happening again

[times are CET]

[15:34] <moritzm> there's an EDAC Icinga alert for db1068, system is OOW, known/worth opening a Phab task? (sometimes we have the same DIMM module from a decomed server as a replacement)
[15:37] <jynus> moritzm: https://phabricator.wikimedia.org/T213664
[15:37] <jynus> it comes and goes
[15:38] <jynus> as long as it holds up, it has to wait until decommision T211613
[15:38] <stashbot> T211613: rack/setup/install db11[26-38].eqiad.wmnet - https://phabricator.wikimedia.org/T211613

T211613: rack/setup/install db11[26-38].eqiad.wmnet

And back again: RECOVERY - EDAC syslog messages on db1068 is OK: (C)4 ge (W)2 ge 1
As Jaime said (T213664#4924636), this won't be fully gone until the host is decommissioned.

https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=db1068&service=EDAC+syslog+messages
https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=db1068&service=Memory+correctable+errors+-EDAC-

EDAC syslog messages & Memory correctable errors -EDAC- are alerting again on db1068

Thanks for letting us know!
This master will be replaced once the hosts at T211613: rack/setup/install db11[26-38].eqiad.wmnet are racked and installed.

Marostegui added a parent task: T217396: Decommission db1061-db1073. May 7 2019, 10:05 AM

Dzahn unsubscribed. May 7 2019, 9:48 PM

It recovered again, but the host still needs replacing, as I'm sure it will become critical again soonish.
Closing again for now, until it recurs.

06:06:27 <+ icinga-wm> RECOVERY - EDAC syslog messages on db1068 is OK: (C)4 ge (W)2 ge 1 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=db1068&var-datasource=eqiad+prometheus/ops

It now says: CRITICAL: Devices (12) not equal to PDs (2)

Ignore the above, that is unrelated.

Marostegui mentioned this in T224516: Database primary master failover on s4 (commonswiki). May 28 2019, 7:19 PM

For the record, the master failover for this host will be scheduled for the 19th June.

Marostegui mentioned this in T224852: Failover s4 primary master: db1068 to db1081. Jun 3 2019, 5:40 AM

This host is no longer a master and will be decommissioned in a few days.

Marostegui mentioned this in T226689: decommission db1068. Jul 1 2019, 5:25 AM
