Page MenuHomePhabricator

Investigate and repool db1134
Closed, ResolvedPublic

Description

db1134 paged for high replication lag:

22:07:54 <+icinga-wm> PROBLEM - MariaDB Replica Lag: s1 #page on db1134 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1468.87 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica

We depooled it: !log rzl@cumin1001 dbctl commit (dc=all): 'depool db1134', diff saved to https://phabricator.wikimedia.org/P14310 and previous config saved to /var/cache/conftool/dbconfig/20210211-031048-rzl.json

As of this writing, still digging into why, although the graphs suggest whatever happened happened at 2245: https://grafana-rw.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1134&var-port=9104&from=1613006340000&to=1613013540000

And @colewhite notes Feb 11 02:42:44 db1134 mysqld[3122]: 210211 2:42:44 [ERROR] mysqld got signal 7, and that the syslog indicates memory corruption.

DBA, I'm leaving it depooled overnight for you to investigate during your working hours. <3

Logs:

Feb 11 02:42:44 db1134 kernel: [1694159.917831] {1}Hardware error detected on CPU12
Feb 11 02:42:44 db1134 kernel: [1694159.917842] {1}event severity: recoverable
Feb 11 02:42:44 db1134 kernel: [1694159.917843] {1} Error 0, type: recoverable
Feb 11 02:42:44 db1134 kernel: [1694159.917844] {1} fru_text: B3
Feb 11 02:42:44 db1134 kernel: [1694159.917844] {1}  section_type: memory error
Feb 11 02:42:44 db1134 kernel: [1694159.917845] {1}  error_status: 0x0000000000000400
Feb 11 02:42:44 db1134 kernel: [1694159.917846] {1}  physical_address: 0x0000006901926840
Feb 11 02:42:44 db1134 kernel: [1694159.917848] {1}  node: 2 card: 2 module: 0 rank: 1 bank: 0 row: 675 column: 0 
Feb 11 02:42:44 db1134 kernel: [1694159.917850] {1}  DIMM location: not present. DMI handle: 0x0000 
Feb 11 02:42:44 db1134 kernel: [1694159.919460] Memory failure: 0x6901926: Killing mysqld:3122 due to hardware memory corruption
Feb 11 02:42:44 db1134 kernel: [1694159.928058] Memory failure: 0x6901926: recovery action for dirty LRU page: Recovered
Feb 11 02:42:44 db1134 mysqld[3122]: 210211  2:42:44 [ERROR] mysqld got signal 7 ;
racadm getsel

Record:      11
Date/Time:   02/11/2021 01:38:37
Source:      system
Severity:    Critical
Description: Correctable memory error rate exceeded for DIMM_B3.
-------------------------------------------------------------------------------
Record:      12
Date/Time:   02/11/2021 01:38:44
Source:      system
Severity:    Critical
Description: Correctable memory error logging disabled for a memory device at location DIMM_B3.
-------------------------------------------------------------------------------
Record:      13
Date/Time:   02/11/2021 02:42:43
Source:      system
Severity:    Critical
Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_B3.

Event Timeline

Feb 11 02:42:44 db1134 kernel: [1694159.910376] Disabling lock debugging due to kernel taint
Feb 11 02:42:44 db1134 kernel: [1694159.910497] mce: [Hardware Error]: Machine check events logged
Feb 11 02:42:44 db1134 kernel: [1694159.910510] mce: Uncorrected hardware memory error in user-access at 6901926840
Feb 11 02:42:44 db1134 kernel: [1694159.917831] {1}Hardware error detected on CPU12
Feb 11 02:42:44 db1134 kernel: [1694159.917842] {1}event severity: recoverable
Feb 11 02:42:44 db1134 kernel: [1694159.917843] {1} Error 0, type: recoverable
Feb 11 02:42:44 db1134 kernel: [1694159.917844] {1} fru_text: B3
Feb 11 02:42:44 db1134 kernel: [1694159.917844] {1}  section_type: memory error
Feb 11 02:42:44 db1134 kernel: [1694159.917845] {1}  error_status: 0x0000000000000400
Feb 11 02:42:44 db1134 kernel: [1694159.917846] {1}  physical_address: 0x0000006901926840
Feb 11 02:42:44 db1134 kernel: [1694159.917848] {1}  node: 2 card: 2 module: 0 rank: 1 bank: 0 row: 675 column: 0 
Feb 11 02:42:44 db1134 kernel: [1694159.917850] {1}  DIMM location: not present. DMI handle: 0x0000 
Feb 11 02:42:44 db1134 kernel: [1694159.919460] Memory failure: 0x6901926: Killing mysqld:3122 due to hardware memory corruption
Feb 11 02:42:44 db1134 kernel: [1694159.928058] Memory failure: 0x6901926: recovery action for dirty LRU page: Recovered
Feb 11 02:42:44 db1134 mysqld[3122]: 210211  2:42:44 [ERROR] mysqld got signal 7 ;
Feb 11 02:42:44 db1134 mysqld[3122]: This could be because you hit a bug. It is also possible that this binary
Feb 11 02:42:44 db1134 mysqld[3122]: or one of the libraries it was linked against is corrupt, improperly built,
Feb 11 02:42:44 db1134 mysqld[3122]: or misconfigured. This error can also be caused by malfunctioning hardware.
Feb 11 02:42:44 db1134 mysqld[3122]: To report this bug, see https://mariadb.com/kb/en/reporting-bugs
Feb 11 02:42:44 db1134 mysqld[3122]: We will try our best to scrape up some info that will hopefully help
Feb 11 02:42:44 db1134 mysqld[3122]: diagnose the problem, but since we have already crashed,
Feb 11 02:42:44 db1134 mysqld[3122]: something is definitely wrong and this may fail.
Feb 11 02:42:44 db1134 mysqld[3122]: Server version: 10.1.44-MariaDB
Feb 11 02:42:44 db1134 mysqld[3122]: key_buffer_size=134217728
Feb 11 02:42:44 db1134 mysqld[3122]: read_buffer_size=131072
Feb 11 02:42:44 db1134 mysqld[3122]: max_used_connections=3319
Feb 11 02:42:44 db1134 mysqld[3122]: max_threads=2010
Feb 11 02:42:44 db1134 mysqld[3122]: thread_count=255
Feb 11 02:42:44 db1134 mysqld[3122]: It is possible that mysqld could use up to
Feb 11 02:42:44 db1134 mysqld[3122]: key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 4712695 K  bytes of memory
Feb 11 02:42:44 db1134 mysqld[3122]: Hope that's ok; if not, decrease some variables in the equation.
Feb 11 02:42:44 db1134 mysqld[3122]: Thread pointer: 0x7f5183925008
Feb 11 02:42:44 db1134 mysqld[3122]: Attempting backtrace. You can use the following information to find out
Feb 11 02:42:44 db1134 mysqld[3122]: where mysqld died. If you see no messages after this, something went
Feb 11 02:42:44 db1134 mysqld[3122]: terribly wrong...
Feb 11 02:42:44 db1134 mysqld[3122]: stack_bottom = 0x7f519e8cfbc8 thread_stack 0x30000
RLazarus renamed this task from db1134 placeholder task to Investigate and repool db1134.Thu, Feb 11, 3:21 AM
RLazarus triaged this task as High priority.
RLazarus added a project: DBA.
RLazarus updated the task description. (Show Details)
RLazarus added a subscriber: RLazarus.

racadm getsel

Record:      11
Date/Time:   02/11/2021 01:38:37
Source:      system
Severity:    Critical
Description: Correctable memory error rate exceeded for DIMM_B3.
-------------------------------------------------------------------------------
Record:      12
Date/Time:   02/11/2021 01:38:44
Source:      system
Severity:    Critical
Description: Correctable memory error logging disabled for a memory device at location DIMM_B3.
-------------------------------------------------------------------------------
Record:      13
Date/Time:   02/11/2021 02:42:43
Source:      system
Severity:    Critical
Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_B3.

<3 the response, you not only did exactly with manuel would have done (depool from traffic), you also discovered the core reason why mysql failed (hw memory errors). Thank you a lot!

Change 663525 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] icinga: Disable notifications for db1134 while under maintenance

https://gerrit.wikimedia.org/r/663525

Change 663525 merged by Jcrespo:
[operations/puppet@production] icinga: Disable notifications for db1134 while under maintenance

https://gerrit.wikimedia.org/r/663525

db1134 is likely to be unavailable for a long period of time due to T274472#6821332. It was the candidate master, which means we have to choose other one for that.

  • db1083
  • db2112 (and other codfw hosts) wrong datacenter
  • db1154:3311 (muliinstance)
  • db1140:3311 (multiinstance/backup source)
  • db1139:3311 (multiinstance/backup source)
  • db1133 (test host)
  • db1124:3311 (multiinstance)
  • db1106 (sanitarium's master/must be ROW)
  • db1105:3311 (muliinstance)
  • db1099:3311 (muliinstance)
  • clouddb1017:3311 (cloud db host)
  • clouddb1013:3311 (cloud db host)
  • db1134 (it is this failed host)

Potential candidates:

  • db1169: 10.4
  • db1135: 10.4
  • db1084: 10.4
  • db1119: 10.4
  • db1118: 10.4
$ mw-section-groups s1 eqiad
db1083                 0 <none>
db1084               200 api
db1099:3311           50 contributions,logpager,recentchanges,recentchangeslinked,watchlist
db1105:3311           50 contributions,logpager,recentchanges,recentchangeslinked,watchlist
db1106                50 dump,vslow
db1118               325 <none>
db1119               200 api
db1134               200 api
db1135               200 api
db1169               200 api

db1118 maybe?

db1118 it is, then, seems also the most reliable one of the list (based on no past crashes/longevity serving traffic).

Change 663531 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb: Promote db1118 as the new candidate master for s1

https://gerrit.wikimedia.org/r/663531

Change 663531 merged by Jcrespo:
[operations/puppet@production] mariadb: Promote db1118 as the new candidate master for s1

https://gerrit.wikimedia.org/r/663531

Mentioned in SAL (#wikimedia-operations) [2021-02-11T10:19:40Z] <kormat@cumin1001> START - Cookbook sre.hosts.downtime for 1:00:00 on db1118.eqiad.wmnet with reason: Depooling to change binglog_format T274472

Mentioned in SAL (#wikimedia-operations) [2021-02-11T10:19:45Z] <kormat@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1118.eqiad.wmnet with reason: Depooling to change binglog_format T274472

Mentioned in SAL (#wikimedia-operations) [2021-02-11T10:34:40Z] <kormat@cumin1001> dbctl commit (dc=all): 'db1118 (re)pooling @ 33%: changed binlog_format T274472', diff saved to https://phabricator.wikimedia.org/P14313 and previous config saved to /var/cache/conftool/dbconfig/20210211-103440-kormat.json

Mentioned in SAL (#wikimedia-operations) [2021-02-11T10:49:44Z] <kormat@cumin1001> dbctl commit (dc=all): 'db1118 (re)pooling @ 66%: changed binlog_format T274472', diff saved to https://phabricator.wikimedia.org/P14314 and previous config saved to /var/cache/conftool/dbconfig/20210211-104943-kormat.json

Change 663549 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] db1163: Reimage to stretch to potentially become s1 candidate master

https://gerrit.wikimedia.org/r/663549

Mentioned in SAL (#wikimedia-operations) [2021-02-11T11:04:47Z] <kormat@cumin1001> dbctl commit (dc=all): 'db1118 (re)pooling @ 100%: changed binlog_format T274472', diff saved to https://phabricator.wikimedia.org/P14315 and previous config saved to /var/cache/conftool/dbconfig/20210211-110447-kormat.json

Change 663549 merged by Jcrespo:
[operations/puppet@production] db1163: Reimage to stretch to potentially become s1 candidate master

https://gerrit.wikimedia.org/r/663549

Script wmf-auto-reimage was launched by jynus on cumin1001.eqiad.wmnet for hosts:

db1163.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202102111126_jynus_20793_db1163_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['db1163.eqiad.wmnet']

and were ALL successful.

Change 663570 had a related patch set uploaded (by LSobanski; owner: LSobanski):
[operations/puppet@production] instances.yaml: Add db1163 to dbctl

https://gerrit.wikimedia.org/r/663570

Change 663570 merged by LSobanski:
[operations/puppet@production] instances.yaml: Add db1163 to dbctl

https://gerrit.wikimedia.org/r/663570

Change 663608 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb: Reenable notifications for db1163 once it has been pooled

https://gerrit.wikimedia.org/r/663608

Change 663608 merged by Jcrespo:
[operations/puppet@production] mariadb: Reenable notifications for db1163 once it has been pooled

https://gerrit.wikimedia.org/r/663608

jcrespo moved this task from Backlog to Hardware Failure / Troubleshoot on the ops-eqiad board.
jcrespo added subscribers: Cmjohnson, Jclark-ctr.

Chris or John, please help us with this- Based on hw logs, it seems a typical case of memory stick going wrong.

Host is depooled from service and can be rebooted/serviced in any way. Even if that means uselessly moving sticks around upon vendor suggestion.

Note: I said uselessly because vendors sometimes ask for debugging that are not very useful- YOUR help is always very helpful.

wiki_willy added a subscriber: wiki_willy.

@Cmjohnson /@Jclark-ctr - just a heads up, this is higher priority and the server is still under warranty, through November 2021. Thanks, Willy

pasting system event log

Record: 10
Date/Time: 02/10/2021 23:35:50
Source: system
Severity: Non-Critical

Description: Correctable memory error rate exceeded for DIMM_B3.

Record: 11
Date/Time: 02/11/2021 01:38:37
Source: system
Severity: Critical

Description: Correctable memory error rate exceeded for DIMM_B3.

Record: 12
Date/Time: 02/11/2021 01:38:44
Source: system
Severity: Critical

Description: Correctable memory error logging disabled for a memory device at location DIMM_B3.

Record: 13
Date/Time: 02/11/2021 02:42:43
Source: system
Severity: Critical

Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_B3.

a ticket has been created with Dell for a new DIMM.

Ticket number SR1051489398

Moving over to @Jclark-ctr to receive and replace the memory, since @Cmjohnson is out on vacation next week. Thanks, Willy

Thanks everyone who responded to this incident!

Change 664230 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] db1134: Do not be tag as candidate master

https://gerrit.wikimedia.org/r/664230

Change 664230 merged by Marostegui:
[operations/puppet@production] db1134: Do not be tag as candidate master

https://gerrit.wikimedia.org/r/664230

@jcrespo Is host still down can maintenance be preformed on host?

Yes, you can proceed anytime
Thanks

Thanks John - I have powered up the host and it looks good now. I will take it from here

root@db1134:~# free -g
              total        used        free      shared  buff/cache   available
Mem:            502           0         501           0           0         499
Swap:             7           0           7

I am going to close this - the task to track next steps is: T275343
Thanks everyone for helping out here

Mentioned in SAL (#wikimedia-operations) [2021-03-04T11:10:20Z] <kormat@cumin1001> START - Cookbook sre.hosts.downtime for 1:00:00 on 6 hosts with reason: Needs fixing after T274472

Mentioned in SAL (#wikimedia-operations) [2021-03-04T11:10:28Z] <kormat@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 6 hosts with reason: Needs fixing after T274472