20:27:17 <+icinga-wm> PROBLEM - Host db1082 is DOWN: PING CRITICAL - Packet loss = 100%
20:35:53 <+icinga-wm> PROBLEM - MariaDB Replica IO: s5 on db1124 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db1082.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db1082.eqiad.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
20:37:04 <+logmsgbot> !log cdanis@cumin1001 dbctl commit (dc=all): 'depool db1082, it crashed', diff saved to https://phabricator.wikimedia.org/P11951 and previous config saved to /var/cache/conftool/dbconfig/20200718-203704-cdanis.json
Description
Details
Subject | Repo | Branch | Lines +/-
---|---|---|---
db1082: Clarify BBU status | operations/puppet | production | +1 -0
db1082: Disable notifications | operations/puppet | production | +1 -0
Status | Assigned | Task
---|---|---
Unknown Object (Task) | |
Resolved | Marostegui | T258361 Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers)
Declined | None | T258386 db1080-95 batch possibly suffering BBU issues
Resolved | Marostegui | T258336 db1082 crashed
Resolved | Jclark-ctr | T258910 Remove db1082's BBU on-site
Event Timeline
Change 614292 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] db1082: Disable notifications
Host rebooted itself:
root@db1082:~# uptime
 20:45:47 up 6 min,  1 user,  load average: 0.00, 0.17, 0.13
The cause is yet to be investigated (HW logs, etc.).
Change 614292 merged by Marostegui:
[operations/puppet@production] db1082: Disable notifications
Related hpiLO logs below:
</>hpiLO-> show /system1/log1/record17

status=0
status_tag=COMMAND COMPLETED
Sat Jul 18 20:50:11 2020

/system1/log1/record17
  Targets
  Properties
    number=17
    severity=Critical
    date=07/18/2020
    time=20:37
    description=ASR Detected by System ROM
  Verbs
    cd version exit show

</>hpiLO-> show /system1/log1/record16

status=0
status_tag=COMMAND COMPLETED
Sat Jul 18 20:50:18 2020

/system1/log1/record16
  Targets
  Properties
    number=16
    severity=Caution
    date=07/18/2020
    time=20:19
    description=Smart Storage Battery has exceeded the maximum amount of devices supported (Battery 1, service information: 0x07). Action: 1. Remove additional devices. 2. Consult server troubleshooting guide. 3. Gather AHS log and contact Support
  Verbs
    cd version exit show
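For reference, records like these can be pulled straight from the iLO console; a minimal sketch, assuming the usual mgmt DNS naming and the standard SMASH CLP verbs (the record numbers are only examples):

# Hypothetical session: SSH to the management interface of db1082 and browse the IML.
ssh root@db1082.mgmt.eqiad.wmnet
# At the </>hpiLO-> prompt:
show /system1/log1            # list the available log records
show /system1/log1/record17   # show one record in detail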
Mentioned in SAL (#wikimedia-operations) [2020-07-18T20:52:50Z] <marostegui> Due to db1082 crash there will be replication lag on s5 on labsdb hosts - T258336
Mentioned in SAL (#wikimedia-operations) [2020-07-19T18:51:41Z] <marostegui> Upgrade and reboot db1082 T258336
Mentioned in SAL (#wikimedia-operations) [2020-07-19T18:57:31Z] <marostegui> Start mysql on db1082 T258336
I have rebooted the host and the BBU looks ok.
MySQL started fine as well, and crash recovery completed without issues.
Replication has been started.
PST | UTC | Lag (h:mm) | Elapsed (h:mm) | Lag recovered (min) | Decay rate (min/min)
---|---|---|---|---|---
12:30 | 19:30 | 17:43 | | |
12:56 | 19:56 | 12:44 | 0:26 | 299 | 11.50
13:42 | 20:42 | 5:38 | 0:46 | 426 | 9.26
13:52 | 20:52 | 4:31 | 0:10 | 67 | 6.70
14:08 | 21:08 | 1:20 | 0:16 | 191 | 11.94
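The decay rate above is just the lag recovered (in minutes) divided by the wall-clock minutes between samples. A minimal shell sketch of taking such a sample on the replica (hypothetical; assumes local mysql CLI access and a 10-minute interval):

# Sample Seconds_Behind_Master twice and compute how fast the replica is catching up.
lag1=$(mysql -e 'SHOW SLAVE STATUS\G' | awk '/Seconds_Behind_Master/ {print $2}')
sleep 600
lag2=$(mysql -e 'SHOW SLAVE STATUS\G' | awk '/Seconds_Behind_Master/ {print $2}')
# Seconds of lag recovered per second of wall clock; > 1 means the replica is catching up.
awk -v a="$lag1" -v b="$lag2" 'BEGIN { printf "decay rate: %.2f\n", (a - b) / 600 }'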
Mentioned in SAL (#wikimedia-operations) [2020-07-20T05:18:47Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Slowly repool db1082 after a crash T258336', diff saved to https://phabricator.wikimedia.org/P11953 and previous config saved to /var/cache/conftool/dbconfig/20200720-051846-marostegui.json
Mentioned in SAL (#wikimedia-operations) [2020-07-20T05:39:24Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Slowly repool db1082 after a crash T258336', diff saved to https://phabricator.wikimedia.org/P11955 and previous config saved to /var/cache/conftool/dbconfig/20200720-053816-marostegui.json
Mentioned in SAL (#wikimedia-operations) [2020-07-20T05:47:47Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Slowly repool db1082 after a crash T258336', diff saved to https://phabricator.wikimedia.org/P11956 and previous config saved to /var/cache/conftool/dbconfig/20200720-054747-marostegui.json
Mentioned in SAL (#wikimedia-operations) [2020-07-20T05:56:15Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Fully repool db1082 after a crash T258336', diff saved to https://phabricator.wikimedia.org/P11957 and previous config saved to /var/cache/conftool/dbconfig/20200720-055614-marostegui.json
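For context, a staged repool like the one logged above is typically done by raising the pooled percentage with dbctl and committing between steps; a rough sketch only (the percentages are illustrative, and the exact flags are the ones documented for dbctl on Wikitech):

# Repool db1082 gradually, letting the buffer pool warm up between increases.
dbctl instance db1082 pool -p 25
dbctl config commit -m 'Slowly repool db1082 after a crash T258336'
# ...wait, watch query latency and replication lag, then repeat with higher percentages...
dbctl instance db1082 pool -p 100
dbctl config commit -m 'Fully repool db1082 after a crash T258336'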
db1082 is fully repooled. I am going to consider this resolved.
There is not much else we can do, really: this host will be replaced and refreshed in Q2 (T258361), as it is quite old and of course out of warranty.
We might need to discuss whether we want to accelerate that, given the recent crashes.
In UK time (UTC+1) today, it went down for 13 mins:
09:55:58 <icinga-wm> PROBLEM - Host db1082 is DOWN: PING CRITICAL - Packet loss = 100%
10:03:58 <icinga-wm> PROBLEM - MariaDB Replica IO: s5 on db1124 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db1082.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db1082.eqiad.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
10:06:41 <icinga-wm> RECOVERY - Host db1082 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms
Mentioned in SAL (#wikimedia-operations) [2020-07-25T09:16:16Z] <oblivian@cumin1001> dbctl commit (dc=all): 'Depool db1082 T258336', diff saved to https://phabricator.wikimedia.org/P12040 and previous config saved to /var/cache/conftool/dbconfig/20200725-091616-oblivian.json
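For context, the depool itself is roughly the following dbctl sequence (a sketch only; see the Wikitech dbctl documentation for the authoritative workflow):

# Take db1082 out of the MediaWiki database configuration and push the change.
dbctl instance db1082 depool
dbctl config commit -m 'Depool db1082 T258336'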
Related HW logs for this new crash look the same as last week's:
</>hpiLO-> show /system1/log1/record19

status=0
status_tag=COMMAND COMPLETED
Sat Jul 25 09:22:56 2020

/system1/log1/record19
  Targets
  Properties
    number=19
    severity=Critical
    date=07/25/2020
    time=09:04
    description=ASR Detected by System ROM
  Verbs
    cd version exit show

</>hpiLO-> show /system1/log1/record18

status=0
status_tag=COMMAND COMPLETED
Sat Jul 25 09:22:58 2020

/system1/log1/record18
  Targets
  Properties
    number=18
    severity=Caution
    date=07/25/2020
    time=08:53
    description=Smart Storage Battery has exceeded the maximum amount of devices supported (Battery 1, service information: 0x07). Action: 1. Remove additional devices. 2. Consult server troubleshooting guide. 3. Gather AHS log and contact Support
  Verbs
    cd version exit show
For reference, the mysqld start logs:
Jul 25 09:17:37 db1082 mysqld[3356]: 2020-07-25 9:17:37 139825736157440 [Note] /opt/wmf-mariadb101/bin/mysqld (mysqld 10.1.44-MariaDB) starting as process 3356 ...
Jul 25 09:17:37 db1082 mysqld[3356]: 2020-07-25 9:17:37 139825736157440 [Warning] No argument was provided to --log-bin and neither --log-basename or --log-bin-index where used; This may cause repliction to break when this server acts as a master and has its hostname changed! Please use '--log-basename=db1082' or '--log-bin=db1082-bin' to avoid this problem.
Jul 25 09:17:37 db1082 mysqld[3356]: 2020-07-25 9:17:37 139825736157440 [ERROR] Plugin 'rpl_semi_sync_slave' already installed
Jul 25 09:17:37 db1082 mysqld[3356]: 2020-07-25 9:17:37 139825736157440 [Note] InnoDB: Using mutexes to ref count buffer pool pages
Jul 25 09:17:37 db1082 mysqld[3356]: 2020-07-25 9:17:37 139825736157440 [Note] InnoDB: The InnoDB memory heap is disabled
Jul 25 09:17:37 db1082 mysqld[3356]: 2020-07-25 9:17:37 139825736157440 [Note] InnoDB: Mutexes and rw_locks use GCC atomic builtins
Jul 25 09:17:37 db1082 mysqld[3356]: 2020-07-25 9:17:37 139825736157440 [Note] InnoDB: GCC builtin __atomic_thread_fence() is used for memory barrier
Jul 25 09:17:37 db1082 mysqld[3356]: 2020-07-25 9:17:37 139825736157440 [Note] InnoDB: Compressed tables use zlib 1.2.11
Jul 25 09:17:37 db1082 mysqld[3356]: 2020-07-25 9:17:37 139825736157440 [Note] InnoDB: Using Linux native AIO
Jul 25 09:17:37 db1082 mysqld[3356]: 2020-07-25 9:17:37 139825736157440 [Note] InnoDB: Using SSE crc32 instructions
Jul 25 09:17:37 db1082 mysqld[3356]: 2020-07-25 9:17:37 139825736157440 [Note] InnoDB: Initializing buffer pool, size = 378.0G
Jul 25 09:17:50 db1082 prometheus-mysqld-exporter[593]: time="2020-07-25T09:17:50Z" level=error msg="Error pinging mysqld: dial unix /run/mysqld/mysqld.sock: connect: no such file or directory" source="mysqld_exporter.go:268"
Jul 25 09:17:50 db1082 prometheus-mysqld-exporter[593]: time="2020-07-25T09:17:50Z" level=error msg="Error pinging mysqld: dial unix /run/mysqld/mysqld.sock: connect: no such file or directory" source="mysqld_exporter.go:268"
Jul 25 09:17:52 db1082 mysqld[3356]: 2020-07-25 9:17:52 139825736157440 [Note] InnoDB: Completed initialization of buffer pool
Jul 25 09:17:56 db1082 mysqld[3356]: 2020-07-25 9:17:56 139825736157440 [Note] InnoDB: Highest supported file format is Barracuda.
Jul 25 09:17:56 db1082 mysqld[3356]: 2020-07-25 9:17:56 139825736157440 [Note] InnoDB: Starting crash recovery from checkpoint LSN=21774247445863
Jul 25 09:17:56 db1082 mysqld[3356]: 2020-07-25 9:17:56 139825736157440 [Note] InnoDB: Restoring possible half-written data pages from the doublewrite buffer...
Jul 25 09:18:04 db1082 mysqld[3356]: 2020-07-25 9:18:04 139825736157440 [Note] InnoDB: Starting final batch to recover 89050 pages from redo log
Jul 25 09:18:11 db1082 mysqld[3356]: 2020-07-25 9:18:11 139390777009920 [Note] InnoDB: To recover: 2577 pages from log
Jul 25 09:18:15 db1082 mysqld[3356]: InnoDB: Last MySQL binlog file position 0 554389490, file name ./db1082-bin.000931
Jul 25 09:18:15 db1082 mysqld[3356]: 2020-07-25 9:18:15 139825736157440 [Note] InnoDB: 128 rollback segment(s) are active.
Jul 25 09:18:15 db1082 mysqld[3356]: 2020-07-25 9:18:15 139825736157440 [Note] InnoDB: Waiting for purge to start
Jul 25 09:18:15 db1082 mysqld[3356]: 2020-07-25 9:18:15 139825736157440 [Note] InnoDB: Percona XtraDB (http://www.percona.com) 5.6.46-86.2 started; log sequence number 21774972892176
Jul 25 09:18:19 db1082 mysqld[3356]: 2020-07-25 9:18:19 139382581331712 [Note] InnoDB: Dumping buffer pool(s) not yet started
Jul 25 09:18:19 db1082 mysqld[3356]: 2020-07-25 09:18:19 7ec4893f8700 InnoDB: Loading buffer pool(s) from .//ib_buffer_pool
Jul 25 09:18:19 db1082 mysqld[3356]: 2020-07-25 9:18:19 139825736157440 [Note] Recovering after a crash using db1082-bin
Jul 25 09:18:22 db1082 mysqld[3356]: 2020-07-25 9:18:22 139825736157440 [Note] Starting crash recovery...
Jul 25 09:18:22 db1082 mysqld[3356]: 2020-07-25 9:18:22 139825736157440 [Note] Crash recovery finished.
Jul 25 09:18:22 db1082 mysqld[3356]: 2020-07-25 9:18:22 139825736157440 [Note] Server socket created on IP: '::'.
Jul 25 09:18:22 db1082 mysqld[3356]: 2020-07-25 9:18:22 139825736157440 [Note] Server socket created on IP: '::'.
Jul 25 09:18:22 db1082 mysqld[3356]: 2020-07-25 9:18:22 139825736157440 [ERROR] mysqld: Table './mysql/event' is marked as crashed and should be repaired
Jul 25 09:18:22 db1082 mysqld[3356]: 2020-07-25 9:18:22 139825736157440 [Warning] Checking table: './mysql/event'
Jul 25 09:18:22 db1082 mysqld[3356]: 2020-07-25 9:18:22 139825736157440 [ERROR] mysql.event: 1 client is using or hasn't closed the table properly
Jul 25 09:18:22 db1082 mysqld[3356]: 2020-07-25 9:18:22 139825735550720 [Note] Event Scheduler: scheduler thread started with id 2
Jul 25 09:18:22 db1082 mysqld[3356]: 2020-07-25 9:18:22 139825736157440 [Warning] Neither --relay-log nor --relay-log-index were used; so replication may break when this MySQL server acts as a slave and has his hostname changed!! Please use '--log-basename=#' or '--relay-log=db1082-relay-bin' to avoid this problem.
Jul 25 09:18:22 db1082 mysqld[3356]: 2020-07-25 9:18:22 139825735141120 [Note] Slave I/O thread: Start semi-sync replication to master 'repl@db1100.eqiad.wmnet:3306' in log 'db1100-bin.002560' at position 4
Jul 25 09:18:22 db1082 mysqld[3356]: 2020-07-25 9:18:22 139825736157440 [Note] /opt/wmf-mariadb101/bin/mysqld: ready for connections.
Jul 25 09:18:22 db1082 mysqld[3356]: Version: '10.1.44-MariaDB' socket: '/run/mysqld/mysqld.sock' port: 3306 MariaDB Server
Jul 25 09:18:22 db1082 mysqld[3356]: 2020-07-25 9:18:22 139825735141120 [Note] Slave I/O thread: connected to master 'repl@db1100.eqiad.wmnet:3306',replication starts at GTID position '171974884-171974884-1473084269,0-180359179-5734605861,171974853-171974853-973629796,171970704-171970704-351094624,180363367-180363367-133158799,171978768-171978768-202416,171978777-171978777-2596176831,180367364-180367364-67917352,180359179-180359179-96523837'
Jul 25 09:19:02 db1082 mysqld[3356]: 2020-07-25 9:19:02 139398398473984 [Note] Start asynchronous binlog_dump to slave (server_id: 171970577), pos(./db1082-bin.000932, 4)
Note in particular:
Jul 25 09:18:22 db1082 mysqld[3356]: 2020-07-25 9:18:22 139825736157440 [ERROR] mysqld: Table './mysql/event' is marked as crashed and should be repaired
Jul 25 09:18:22 db1082 mysqld[3356]: 2020-07-25 9:18:22 139825736157440 [ERROR] mysql.event: 1 client is using or hasn't closed the table properly
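mysql.event is a non-transactional system table (MyISAM on this MariaDB version), so it is not covered by InnoDB crash recovery; the server auto-checked it at startup here, but it could also be verified by hand with something like this (a sketch, not something that was run on db1082):

# Hypothetical manual check/repair of the flagged system table.
mysql -e 'CHECK TABLE mysql.event; REPAIR TABLE mysql.event;'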
Interestingly, the OS still sees the BBU (something we have seen happen before as well).
Battery/Capacitor Count: 1
This means the above crash can still happen again. It would be better if the BBU were marked as failed; maybe we can remove it on-site.
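For reference, the battery count quoted above comes from the RAID controller detail output; a sketch of how to check it, assuming the HP ssacli utility is installed on the host:

# Query the Smart Array controller for battery/cache status.
ssacli ctrl all show detail | grep -iE 'battery|cache status'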
Anyway, I will try to see if we can accelerate the retirement of the batch of hosts with BBU issues (T258386), which is scheduled for Q2 (T258361).
Mentioned in SAL (#wikimedia-operations) [2020-07-25T12:41:05Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Pool db1096:3315 into s5 api afte db1082 crashed T258336', diff saved to https://phabricator.wikimedia.org/P12041 and previous config saved to /var/cache/conftool/dbconfig/20200725-124104-marostegui.json
Change 616649 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] db1082: Clarify BBU status
Change 616649 merged by Marostegui:
[operations/puppet@production] db1082: Clarify BBU status
This host has been fully repooled.
The BBU is now gone (T258910: Remove db1082's BBU on-site), so this same crash shouldn't happen again.
This host will be decommissioned in Q2 (T258361: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers)).