Page MenuHomePhabricator

dbstore2001 takes 3 hours to start MySQL (was: dbstore2001 takes 3 hours to start MySQL after a crash)
Closed, ResolvedPublic

Description

In the morning and all of a sudden dbstore2001 was in too many connections: https://grafana-admin.wikimedia.org/dashboard/db/mysql?orgId=1&from=1494418663213&to=1494505063213&var-dc=codfw%20prometheus%2Fops&var-server=dbstore2001

Trying a graceful stop didn't work so I had to kill the mysqld process. It recovered fine, but it never showed up MySQL ready for connections, after around 40 minutes.
Going thru the logs of past issues, I saw it normally takes around 40 minutes to do so after InnoDB has recovered.

I left 1 hour, and nothing happened.

I proceeded to kill -9 both, mysqld and msyqld_safe process for a completely new fresh start.
InnoDB started recovery and it finished at:

InnoDB: Progress in percent: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38
 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83
 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99
InnoDB: Apply batch completed
InnoDB: In a MySQL replication slave the last master binlog file
InnoDB: position 0 1022925931, file name db1038-bin.000776
InnoDB: Last MySQL binlog file position 0 608459318, file name ./db2057-bin.001131
170511 11:05:19 [Note] InnoDB: 128 rollback segment(s) are active.
170511 11:05:19 [Note] InnoDB: Waiting for purge to start
170511 11:05:19 [Note] InnoDB:  Percona XtraDB (http://www.percona.com) 5.6.32-79.0 started; log sequence number 46742343284199
2017-05-11 11:05:19 7fce45bf9700 InnoDB: Loading buffer pool(s) from .//ib_buffer_pool
170511 11:05:20 [Note] Plugin 'FEEDBACK' is disabled.
170511 11:05:20 [ERROR] Plugin 'unix_socket' already installed
170511 11:05:20 [Note] CONNECT: Version 1.04.0008 August 10, 2016
170511 11:05:20 [Note] Server socket created on IP: '::'.
170511 11:05:20 [ERROR] mysqld: Table './mysql/event' is marked as crashed and should be repaired
170511 11:05:20 [Warning] Checking table:   './mysql/event'
170511 11:05:20 [ERROR] mysql.event: 1 client is using or hasn't closed the table properly
170511 11:05:20 [Note] Event Scheduler: scheduler thread started with id 1
170511 11:05:31 [ERROR] Event Scheduler: [root@localhost][ops.wmf_dbstore_delay] get_lock

That is around 1 hour ago, and I still don't have access to MySQL. I am going to leave it there for hours to see how long it takes.
I believe it is still trying to load the buffer pool, because it is actually doing stuff (see disk util):

root@dbstore2001:~# iostat -x 1 5
Linux 4.4.0-3-amd64 (dbstore2001) 	05/11/2017 	_x86_64_	(32 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           2.98    0.00    0.26    9.44    0.00   87.32

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00    18.86  314.25  577.29  5371.20 21109.10    59.40     0.25    0.28    0.71    0.04   0.18  16.47
dm-0              0.00     0.00  314.03  579.93  5367.03 21017.42    59.03     0.02    0.02    0.71    0.12   0.18  16.47

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.37    0.00    0.03    8.78    0.00   90.82

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00     0.00  442.00  484.00  6480.00 13352.00    42.83     8.08   15.50   32.43    0.04   1.08 100.00
dm-0              0.00     0.00  426.00  484.00  6296.00 13352.00    43.18     8.08   15.78   33.65    0.04   1.10 100.00

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.81    0.00    0.16    8.16    0.00   90.87

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00     7.00 1273.00 1158.00 16000.00 30884.00    38.57    88.22   35.32   67.40    0.05   0.41 100.00
dm-0              0.00     0.00 1292.00 1155.00 16304.00 30844.00    38.54    89.44   35.63   67.13    0.39   0.41 100.00

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.53    0.00    0.13    4.04    0.00   95.30

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00     0.00  721.00  802.00  9040.00 21608.00    40.25    61.27   33.24   70.15    0.05   0.66 100.00
dm-0              0.00     0.00  843.00  802.00 10448.00 21608.00    38.97    61.19   30.97   60.38    0.05   0.61 100.00

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.41    0.00    0.06    7.47    0.00   92.06

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00     0.00  557.00  848.00  6520.00 22700.00    41.59    41.64   38.33   96.62    0.05   0.71 100.00
dm-0              0.00     0.00  416.00  848.00  4808.00 22700.00    43.53    42.05   42.99  130.18    0.22   0.79 100.00

Storage and BBU look good. There are no dmesg errors os RAID issues or anything.
Same goes for the idrac, no logs with any HW issues

We will see how long it takes...

UPDATE as of 12th May 2017
It looks like it doesn't really matter whether it is a crash or a normal stop+start, it takes around 3 hours to get it fully up.
I believe we have touched performance limitations with so many shards on a single instance and not having SSDs.
All the details about the normal stop + start at: T165033#3257985

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald TranscriptMay 11 2017, 12:22 PM
Marostegui moved this task from Triage to In progress on the DBA board.May 11 2017, 12:22 PM
Marostegui updated the task description. (Show Details)May 11 2017, 12:25 PM

After a while the event scheduler started to complain (I still don't have access to MySQL):

OpenTable: (2002) Can't connect to local MySQL server through socket '/tmp/mysql.sock' (11 "Resource temporarily unavailable")
170511 12:53:10 [ERROR] Event Scheduler: [root@localhost][ops.wmf_dbstore_delay] Got error 174 '(2002) Can't connect to local MySQL server through socket '/tmp/mysql.sock' (11 "Resource temporarily unavailable")' from CONNECT
170511 12:53:10 [Note] Event Scheduler: [root@localhost].[ops.wmf_dbstore_delay] event execution failed.

So what I have done is:

  • Attempt to stop MySQL - no luck
  • Killed it and started it with the event scheduler disabled, to see if that helps the server to come back up faster

In the end MySQL started fine after almost 3 hours after finishing the InnoDB recovery:

170511 13:17:59 [Note] InnoDB: Starting an apply batch of log records to the database...
InnoDB: Progress in percent: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99
InnoDB: Apply batch completed
InnoDB: In a MySQL replication slave the last master binlog file
InnoDB: position 0 1022925931, file name db1038-bin.000776
InnoDB: Last MySQL binlog file position 0 608459318, file name ./db2057-bin.001131
170511 13:44:03 [Note] InnoDB: 128 rollback segment(s) are active.
170511 13:44:03 [Note] InnoDB: Waiting for purge to start
170511 13:44:03 [Note] InnoDB:  Percona XtraDB (http://www.percona.com) 5.6.32-79.0 started; log sequence number 46743235607451
2017-05-11 13:44:03 7f1c617f9700 InnoDB: Loading buffer pool(s) from .//ib_buffer_pool
170511 13:44:03 [Note] Plugin 'FEEDBACK' is disabled.
170511 13:44:03 [ERROR] Plugin 'unix_socket' already installed
170511 13:44:03 [Note] CONNECT: Version 1.04.0008 August 10, 2016
170511 13:44:03 [Note] Server socket created on IP: '::'.
170511 13:44:04 [ERROR] mysqld: Table './mysql/event' is marked as crashed and should be repaired
170511 13:44:04 [Warning] Checking table:   './mysql/event'
170511 13:44:04 [ERROR] mysql.event: 1 client is using or hasn't closed the table properly
2017-05-11 16:28:09 7f1c617f9700 InnoDB: Buffer pool(s) load completed at 170511 16:28:09
170511 16:59:21 [Warning] Neither --relay-log nor --relay-log-index were used; so replication may break when this MySQL server acts as a slave and has his hostname changed!! Please use '--log-basename=#' or '--relay-log=dbstore2001-relay-bin' to avoid this problem.
170511 16:59:21 [Note] /opt/wmf-mariadb10/bin/mysqld: ready for connections.
Version: '10.0.28-MariaDB'  socket: '/tmp/mysql.sock'  port: 3306  MariaDB Server
Marostegui renamed this task from dbstore2001 not in a healthy status to dbstore2001 takes 3 hours to start MySQL after a crash.May 11 2017, 5:52 PM
Marostegui triaged this task as Medium priority.

I have started all the slaves withouth any issues.
Also, manually started event scheduler again.

I will test tomorrow a normal stop and a normal start to see how long it takes.

Mentioned in SAL (#wikimedia-operations) [2017-05-12T05:53:55Z] <marostegui> Stop MySQL dbstore2001 for testing - T165033

Stop process - 6 minutes:

170512  5:55:05 [Note] /opt/wmf-mariadb10/bin/mysqld: Normal shutdown
<snip>
170512  6:01:23 [Note] /opt/wmf-mariadb10/bin/mysqld: Shutdown complete

170512 06:01:23 mysqld_safe mysqld from pid file /srv/sqldata/dbstore2001.pid ended

Start process:

170512 06:03:14 mysqld_safe Starting mysqld daemon with databases from /srv/sqldata
170512  6:03:15 [Note] /opt/wmf-mariadb10/bin/mysqld (mysqld 10.0.28-MariaDB) starting as process 63685 ...
<snip>
170512  6:03:30 [Note] Server socket created on IP: '::'.
170512  6:03:31 [Note] Event Scheduler: scheduler thread started with id 1
170512  6:03:41 [ERROR] Event Scheduler: [root@localhost][ops.wmf_dbstore_delay] get_lock
170512  6:03:41 [Note] Event Scheduler: [root@localhost].[ops.wmf_dbstore_delay] event execution failed.
170512  6:03:46 [ERROR] Event Scheduler: [root@localhost][ops.wmf_dbstore_delay] get_lock
<snip>

After 1:40h we got this:

170512  7:43:35 [ERROR] Event Scheduler: [root@localhost][ops.wmf_dbstore_delay] Got error 174 '(2013) Lost connection to MySQL server at 'waiting for initial communication packet', system error: 110 "Connection timed out"' from CONNECT
170512  7:43:35 [Note] Event Scheduler: [root@localhost].[ops.wmf_dbstore_delay] event execution failed.
<snip>

And finally after around 3:30h MySQL is up and ready:

170512  9:33:18 [Note] /opt/wmf-mariadb10/bin/mysqld: ready for connections.
Version: '10.0.28-MariaDB'  socket: '/tmp/mysql.sock'  port: 3306  MariaDB Server
Marostegui renamed this task from dbstore2001 takes 3 hours to start MySQL after a crash to dbstore2001 takes 3 hours to start MySQL (was: dbstore2001 takes 3 hours to start MySQL after a crash).EditedMay 12 2017, 9:41 AM
Marostegui closed this task as Resolved.
Marostegui claimed this task.
Marostegui updated the task description. (Show Details)

I am going to resolve this ticket now as the investigation is over and there is not much we can do at this point I believe.
At least we now have it tracked for future issues.

It took now 8:30h to get a normal start: T168354#3363993