In the morning, all of a sudden, dbstore2001 hit "too many connections": https://grafana-admin.wikimedia.org/dashboard/db/mysql?orgId=1&from=1494418663213&to=1494505063213&var-dc=codfw%20prometheus%2Fops&var-server=dbstore2001
Trying a graceful stop didn't work, so I had to kill the mysqld process. It recovered fine, but it never showed MySQL as ready for connections, even after around 40 minutes.
Going through the logs of past issues, I saw it normally takes around 40 minutes to do so after InnoDB has recovered.
I left it for an hour, and nothing happened.
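A simple way to watch for it coming back is to poll the server from the shell, something like this (a sketch, assuming the default local socket and client credentials are usable):

# Ping the local server once a minute until it answers (sketch; relies on default socket/credentials)
until mysqladmin ping >/dev/null 2>&1; do
    echo "$(date) mysqld not ready yet"
    sleep 60
done
echo "$(date) mysqld is accepting connections"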
I proceeded to kill -9 both the mysqld and mysqld_safe processes for a completely fresh start.
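The forced kill itself is the obvious one; a minimal sketch, assuming mysqld is still running under mysqld_safe:

# Kill the wrapper first so it cannot respawn mysqld, then the server itself
kill -9 "$(pgrep -f mysqld_safe)"
kill -9 "$(pgrep -x mysqld)"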
InnoDB started recovery and it finished at:
InnoDB: Progress in percent: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99
InnoDB: Apply batch completed
InnoDB: In a MySQL replication slave the last master binlog file
InnoDB: position 0 1022925931, file name db1038-bin.000776
InnoDB: Last MySQL binlog file position 0 608459318, file name ./db2057-bin.001131
170511 11:05:19 [Note] InnoDB: 128 rollback segment(s) are active.
170511 11:05:19 [Note] InnoDB: Waiting for purge to start
170511 11:05:19 [Note] InnoDB: Percona XtraDB (http://www.percona.com) 5.6.32-79.0 started; log sequence number 46742343284199
2017-05-11 11:05:19 7fce45bf9700 InnoDB: Loading buffer pool(s) from .//ib_buffer_pool
170511 11:05:20 [Note] Plugin 'FEEDBACK' is disabled.
170511 11:05:20 [ERROR] Plugin 'unix_socket' already installed
170511 11:05:20 [Note] CONNECT: Version 1.04.0008 August 10, 2016
170511 11:05:20 [Note] Server socket created on IP: '::'.
170511 11:05:20 [ERROR] mysqld: Table './mysql/event' is marked as crashed and should be repaired
170511 11:05:20 [Warning] Checking table: './mysql/event'
170511 11:05:20 [ERROR] mysql.event: 1 client is using or hasn't closed the table properly
170511 11:05:20 [Note] Event Scheduler: scheduler thread started with id 1
170511 11:05:31 [ERROR] Event Scheduler: [root@localhost][ops.wmf_dbstore_delay] get_lock
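The './mysql/event' errors at the end of that log are a side effect of the crash; once the server is reachable again they can be dealt with explicitly (a sketch, standard MySQL statements, nothing host-specific):

# Check and, if still flagged, repair the system table marked as crashed in the log above
mysql -e "CHECK TABLE mysql.event"
mysql -e "REPAIR TABLE mysql.event"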
Those log entries are from around an hour ago, and I still don't have access to MySQL. I am going to leave it there for a few hours to see how long it takes.
I believe it is still trying to load the buffer pool, because it is clearly doing work on disk (see disk util):
root@dbstore2001:~# iostat -x 1 5
Linux 4.4.0-3-amd64 (dbstore2001)   05/11/2017   _x86_64_   (32 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           2.98    0.00    0.26    9.44    0.00   87.32

Device:   rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda         0.00    18.86  314.25  577.29  5371.20 21109.10    59.40     0.25    0.28    0.71    0.04   0.18  16.47
dm-0        0.00     0.00  314.03  579.93  5367.03 21017.42    59.03     0.02    0.02    0.71    0.12   0.18  16.47

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.37    0.00    0.03    8.78    0.00   90.82

Device:   rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda         0.00     0.00  442.00  484.00  6480.00 13352.00    42.83     8.08   15.50   32.43    0.04   1.08 100.00
dm-0        0.00     0.00  426.00  484.00  6296.00 13352.00    43.18     8.08   15.78   33.65    0.04   1.10 100.00

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.81    0.00    0.16    8.16    0.00   90.87

Device:   rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda         0.00     7.00 1273.00 1158.00 16000.00 30884.00    38.57    88.22   35.32   67.40    0.05   0.41 100.00
dm-0        0.00     0.00 1292.00 1155.00 16304.00 30844.00    38.54    89.44   35.63   67.13    0.39   0.41 100.00

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.53    0.00    0.13    4.04    0.00   95.30

Device:   rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda         0.00     0.00  721.00  802.00  9040.00 21608.00    40.25    61.27   33.24   70.15    0.05   0.66 100.00
dm-0        0.00     0.00  843.00  802.00 10448.00 21608.00    38.97    61.19   30.97   60.38    0.05   0.61 100.00

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.41    0.00    0.06    7.47    0.00   92.06

Device:   rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda         0.00     0.00  557.00  848.00  6520.00 22700.00    41.59    41.64   38.33   96.62    0.05   0.71 100.00
dm-0        0.00     0.00  416.00  848.00  4808.00 22700.00    43.53    42.05   42.99  130.18    0.22   0.79 100.00
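Once it does accept connections, the buffer pool load progress can be confirmed directly instead of inferring it from disk util (a sketch; the error log path below is illustrative, not the real one on this host):

# Last buffer pool related messages in the error log (path is illustrative)
grep -i 'buffer pool' /srv/sqldata/dbstore2001.err | tail -n 5

# Or ask the server itself, once connections are possible again
mysql -e "SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_load_status'"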
Storage and BBU look good. There are no dmesg errors, RAID issues or anything like that.
Same goes for the iDRAC: there are no logs with any HW issues.
We will see how long it takes...
UPDATE as of 12th May 2017
It looks like it doesn't really matter whether it is a crash or a normal stop + start: it takes around 3 hours to get it fully up.
I believe we have hit the performance limits of running so many shards on a single instance without SSDs.
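A rough way to back that up (a sketch; the datadir path is illustrative) is to compare the on-disk dataset of all the shards against the configured buffer pool:

# Total on-disk size of all shards on this instance (datadir path is illustrative)
du -sh /srv/sqldata

# Configured InnoDB buffer pool size, in GiB
mysql -e "SELECT @@innodb_buffer_pool_size / 1024 / 1024 / 1024 AS buffer_pool_gib"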
All the details about the normal stop + start at: T165033#3257985