
db1099 memory issues
Closed, Resolved (Public)

Description

db1099 is having memory issues:

racadm>>racadm getsel

racadm getsel
Record:      1
Date/Time:   03/30/2017 16:15:55
Source:      system
Severity:    Ok
Description: Log cleared.
-------------------------------------------------------------------------------
Record:      2
Date/Time:   04/20/2019 13:33:36
Source:      system
Severity:    Non-Critical
Description: Correctable memory error rate exceeded for DIMM_A5.
-------------------------------------------------------------------------------
Record:      3
Date/Time:   04/20/2019 13:56:56
Source:      system
Severity:    Critical
Description: Correctable memory error rate exceeded for DIMM_A5.
-------------------------------------------------------------------------------

Should we swap this module with another existing one and see whether the errors follow it to a different location? That would let us rule out a memory slot issue.
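For anyone wanting to track this across reboots or a swap, here is a rough sketch of tallying the correctable-error entries per DIMM from pasted `racadm getsel` text. It assumes the output keeps the exact layout shown above (one `Field: value` pair per line, records separated by dashed lines). The names `parse_sel` and `dimm_error_counts` are made up for illustration and are not part of racadm or any existing tooling.

```python
import re
from collections import Counter

# Field names as they appear in the pasted log above.
FIELDS = {"Record", "Date/Time", "Source", "Severity", "Description"}
SEPARATOR = re.compile(r"^-{5,}[ \t]*$", re.MULTILINE)
DIMM_ERROR = re.compile(r"Correctable memory error rate exceeded for (DIMM_\w+)")

# Shortened copy of the log from this task, used only as sample input.
SAMPLE = """\
Record:      1
Date/Time:   03/30/2017 16:15:55
Source:      system
Severity:    Ok
Description: Log cleared.
-------------------------------------------------------------------------------
Record:      2
Date/Time:   04/20/2019 13:33:36
Source:      system
Severity:    Non-Critical
Description: Correctable memory error rate exceeded for DIMM_A5.
-------------------------------------------------------------------------------
Record:      3
Date/Time:   04/20/2019 13:56:56
Source:      system
Severity:    Critical
Description: Correctable memory error rate exceeded for DIMM_A5.
-------------------------------------------------------------------------------
"""


def parse_sel(text):
    """Split pasted SEL text into a list of dicts keyed by field name."""
    records = []
    for chunk in SEPARATOR.split(text):
        record = {}
        for line in chunk.splitlines():
            # Split on the first colon only, so timestamps stay intact.
            key, sep, value = line.partition(":")
            if sep and key.strip() in FIELDS:
                record[key.strip()] = value.strip()
        if "Record" in record:
            records.append(record)
    return records


def dimm_error_counts(records):
    """Count correctable-error entries per DIMM name."""
    counts = Counter()
    for record in records:
        match = DIMM_ERROR.search(record.get("Description", ""))
        if match:
            counts[match.group(1)] += 1
    return counts


if __name__ == "__main__":
    print(dimm_error_counts(parse_sel(SAMPLE)))  # Counter({'DIMM_A5': 2})
```

Comparing these counts before and after moving the module shows at a glance whether the errors stay on `DIMM_A5` or appear under another name.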

Event Timeline

Change 505481 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Depool db1099

https://gerrit.wikimedia.org/r/505481

Change 505481 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Depool db1099

https://gerrit.wikimedia.org/r/505481

Mentioned in SAL (#wikimedia-operations) [2019-04-22T05:25:52Z] <marostegui@deploy1001> Synchronized wmf-config/db-eqiad.php: Depool db1099 T221502 (duration: 01m 15s)

Mentioned in SAL (#wikimedia-operations) [2019-04-22T05:25:59Z] <marostegui> Stop MySQL and reboot db1099 to see if memory errors clear up T221502

I rebooted the host to see whether the memory errors would clear up, but they did not. I guess we have to either contact Dell or move the DIMM to a different slot and wait to see whether the errors reappear in a different location.
@Cmjohnson please advise

The host recovered on its own, so closing for now as there is nothing to be done.

Change 513278 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Depool db1099

https://gerrit.wikimedia.org/r/513278

Change 513278 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Depool db1099

https://gerrit.wikimedia.org/r/513278

Mentioned in SAL (#wikimedia-operations) [2019-05-30T13:02:58Z] <marostegui@deploy1001> Synchronized wmf-config/db-eqiad.php: Depool db1099 T221502 (duration: 00m 56s)

Mentioned in SAL (#wikimedia-operations) [2019-05-30T13:03:20Z] <marostegui> Stop MySQL on db1099 for onsite maintenance - T221502

Re-opening, as this is going to be worked on.
MySQL is stopped on s1 and s8, the host is downtimed and the OS has been upgraded. @Cmjohnson can take it at any time.
Please power it back on when done and comment here so I can repool.

Thanks!

Mentioned in SAL (#wikimedia-operations) [2019-05-30T15:26:54Z] <cmjohnson1> shutting down db1099 to swap DIMM T221502

Swapped DIMM A5 with DIMM B5 and cleared the racadm log.
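As a note on how to read the log from here on, the swap logic can be sketched as below. This is only an illustration of the reasoning discussed earlier in this task, not an existing tool: `diagnose_swap` and its return strings are hypothetical, and this task does not record which outcome was actually observed after the swap.

```python
def diagnose_swap(faulty_slot_before, swapped_with, slots_with_errors_after):
    """Interpret which slots report errors after two DIMMs exchanged slots.

    faulty_slot_before: slot name that reported errors before the swap
    swapped_with: slot the suspect module was moved into
    slots_with_errors_after: slot names seen in the log after the swap
    """
    after = set(slots_with_errors_after)
    old_slot_bad = faulty_slot_before in after
    new_slot_bad = swapped_with in after

    if new_slot_bad and not old_slot_bad:
        # Errors followed the module to its new slot.
        return "module"
    if old_slot_bad and not new_slot_bad:
        # Errors stayed put: the slot, or something upstream of it, is suspect.
        return "slot"
    # No errors yet, or errors in both places: keep watching.
    return "inconclusive"


if __name__ == "__main__":
    # Hypothetical outcome, for illustration only.
    print(diagnose_swap("DIMM_A5", "DIMM_B5", ["DIMM_B5"]))  # module
```

Since the racadm log was cleared during the swap, any new entry naming `DIMM_A5` or `DIMM_B5` can be fed straight into a check like this.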

MySQL has been started and replication is catching up.

Change 513304 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Slowly repool db1099

https://gerrit.wikimedia.org/r/513304

Change 513304 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Slowly repool db1099

https://gerrit.wikimedia.org/r/513304

Server has been fully repooled by Jaime by pushing T221502#5225565