db1134 paged for high replication lag:
22:07:54 <+icinga-wm> PROBLEM - MariaDB Replica Lag: s1 #page on db1134 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1468.87 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
We depooled it: !log rzl@cumin1001 dbctl commit (dc=all): 'depool db1134', diff saved to https://phabricator.wikimedia.org/P14310 and previous config saved to /var/cache/conftool/dbconfig/20210211-031048-rzl.json
As of this writing we are still digging into the cause, although the graphs suggest that whatever happened occurred at 2245: https://grafana-rw.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1134&var-port=9104&from=1613006340000&to=1613013540000
@colewhite notes that mysqld received signal 7 (Feb 11 02:42:44 db1134 mysqld[3122]: 210211 2:42:44 [ERROR] mysqld got signal 7) and that the syslog indicates memory corruption.
DBA, I'm leaving it depooled overnight for you to investigate during your working hours. <3
Logs:
Feb 11 02:42:44 db1134 kernel: [1694159.917831] {1}Hardware error detected on CPU12
Feb 11 02:42:44 db1134 kernel: [1694159.917842] {1}event severity: recoverable
Feb 11 02:42:44 db1134 kernel: [1694159.917843] {1} Error 0, type: recoverable
Feb 11 02:42:44 db1134 kernel: [1694159.917844] {1} fru_text: B3
Feb 11 02:42:44 db1134 kernel: [1694159.917844] {1} section_type: memory error
Feb 11 02:42:44 db1134 kernel: [1694159.917845] {1} error_status: 0x0000000000000400
Feb 11 02:42:44 db1134 kernel: [1694159.917846] {1} physical_address: 0x0000006901926840
Feb 11 02:42:44 db1134 kernel: [1694159.917848] {1} node: 2 card: 2 module: 0 rank: 1 bank: 0 row: 675 column: 0
Feb 11 02:42:44 db1134 kernel: [1694159.917850] {1} DIMM location: not present. DMI handle: 0x0000
Feb 11 02:42:44 db1134 kernel: [1694159.919460] Memory failure: 0x6901926: Killing mysqld:3122 due to hardware memory corruption
Feb 11 02:42:44 db1134 kernel: [1694159.928058] Memory failure: 0x6901926: recovery action for dirty LRU page: Recovered
Feb 11 02:42:44 db1134 mysqld[3122]: 210211 2:42:44 [ERROR] mysqld got signal 7 ;
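The address in the hardware error record and the page named in the "Memory failure" lines can be tied together with a little arithmetic. The sketch below assumes 4 KiB pages (the usual x86-64 default, not confirmed for this host) and that the value printed after "Memory failure:" is a page frame number; `page_frame_number` is a hypothetical helper name used only for illustration.

```python
PAGE_SHIFT = 12  # assumption: 4 KiB pages, so the low 12 bits are the in-page offset


def page_frame_number(physical_address: int, page_shift: int = PAGE_SHIFT) -> int:
    """Return the page frame number that contains a physical address."""
    return physical_address >> page_shift


# Values copied from the kernel log above.
physical_address = 0x0000006901926840  # "physical_address:" line
reported_page = 0x6901926              # "Memory failure:" lines

pfn = page_frame_number(physical_address)
offset = physical_address & ((1 << PAGE_SHIFT) - 1)

print(f"pfn={pfn:#x} offset={offset:#x} matches_log={pfn == reported_page}")
```

Under those assumptions the two values agree, which suggests that the page the kernel killed mysqld over is the same one the hardware error record reported on the B3 module.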
racadm getsel
Record: 11
Date/Time: 02/11/2021 01:38:37
Source: system
Severity: Critical
Description: Correctable memory error rate exceeded for DIMM_B3.
-------------------------------------------------------------------------------
Record: 12
Date/Time: 02/11/2021 01:38:44
Source: system
Severity: Critical
Description: Correctable memory error logging disabled for a memory device at location DIMM_B3.
-------------------------------------------------------------------------------
Record: 13
Date/Time: 02/11/2021 02:42:43
Source: system
Severity: Critical
Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_B3.
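To put the SEL records on a timeline, something like the following could be used. It is only a sketch: it assumes the output has one "Key: value" field per line with records separated by a line of dashes, and that timestamps are MM/DD/YYYY HH:MM:SS. `parse_sel` and `SEL_TEXT` are hypothetical names, not part of racadm.

```python
from datetime import datetime

# The three records from the SEL output above, one field per line (assumed layout).
SEL_TEXT = """\
Record: 11
Date/Time: 02/11/2021 01:38:37
Source: system
Severity: Critical
Description: Correctable memory error rate exceeded for DIMM_B3.
-------------------------------------------------------------------------------
Record: 12
Date/Time: 02/11/2021 01:38:44
Source: system
Severity: Critical
Description: Correctable memory error logging disabled for a memory device at location DIMM_B3.
-------------------------------------------------------------------------------
Record: 13
Date/Time: 02/11/2021 02:42:43
Source: system
Severity: Critical
Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_B3.
"""


def parse_sel(text: str) -> list[dict[str, str]]:
    """Split SEL text into one dict per record, keyed by field name."""
    records: list[dict[str, str]] = []
    current: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if set(line) == {"-"}:  # a separator line closes the current record
            if current:
                records.append(current)
                current = {}
            continue
        key, sep, value = line.partition(":")  # split on the first colon only
        if sep:
            current[key.strip()] = value.strip()
    if current:
        records.append(current)
    return records


records = parse_sel(SEL_TEXT)
times = [datetime.strptime(r["Date/Time"], "%m/%d/%Y %H:%M:%S") for r in records]
gap_seconds = int((times[-1] - times[0]).total_seconds())

for r in records:
    print(r["Date/Time"], r["Severity"], r["Description"])
print("seconds from first correctable-error alert to multi-bit error:", gap_seconds)
```

Read this way, the correctable-error threshold on DIMM_B3 was crossed roughly an hour before the multi-bit error, which was logged one second before the kernel killed mysqld.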