Page MenuHomePhabricator

db1034 was killed 22-05-16 at 14:17:06
Closed, ResolvedPublic

Event Timeline

jcrespo renamed this task from db1034 was willed today at 14:17:06 to db1034 was killed today at 14:17:06.May 22 2016, 2:29 PM
jcrespo added a project: DBA.
160521 23:55:33 [ERROR] Event Scheduler: [root@localhost][ops.wmf_slave_wikiuser_slow] Unknown thread id: 2084456573
160521 23:55:33 [Note] Event Scheduler: [root@localhost].[ops.wmf_slave_wikiuser_slow] event execution failed.
160522 11:38:35 [ERROR] Event Scheduler: [root@localhost][ops.wmf_slave_wikiuser_sleep] Unknown thread id: 2103706958
160522 11:38:35 [Note] Event Scheduler: [root@localhost].[ops.wmf_slave_wikiuser_sleep] event execution failed.
Killed
160522 14:17:06 mysqld_safe Number of processes running now: 0
160522 14:17:06 mysqld_safe mysqld restarted
160522 14:17:06 [Warning] No argument was provided to --log-bin and neither --log-basename or --log-bin-index where used;  This may cause repliction to break when this server acts as a master and has its hostname changed! Please use '--log-basename=db1034' or '--log-bin=db1034-bin' to avoid this problem.
160522 14:17:06 [Note] InnoDB: Using mutexes to ref count buffer pool pages
160522 14:17:06 [Note] InnoDB: The InnoDB memory heap is disabled
160522 14:17:06 [Note] InnoDB: Mutexes and rw_locks use GCC atomic builtins
160522 14:17:06 [Note] InnoDB: Memory barrier is not used
160522 14:17:06 [Note] InnoDB: Compressed tables use zlib 1.2.7
160522 14:17:06 [Note] InnoDB: Using CPU crc32 instructions
160522 14:17:06 [Note] InnoDB: Initializing buffer pool, size = 47.0G
160522 14:17:08 [Note] InnoDB: Completed initialization of buffer pool
160522 14:17:09 [Note] InnoDB: Highest supported file format is Barracuda.
160522 14:17:09 [Note] InnoDB: Log scan progressed past the checkpoint lsn 6946172412995
160522 14:17:09 [Note] InnoDB: Database was not shutdown normally!
160522 14:17:09 [Note] InnoDB: Starting crash recovery.
160522 14:17:09 [Note] InnoDB: Reading tablespace information from the .ibd files...
160522 14:17:19 [Note] InnoDB: Restoring possible half-written data pages 
160522 14:17:19 [Note] InnoDB: from the doublewrite buffer...
InnoDB: Doing recovery: scanned up to log sequence number 6946177655808

mysql process was killed by the OOM after the server fill it's swap and there was no more memory (swappiness is zero). The server was having it's swap almost full since May 20th at 11:47:

May 22 14:17:01 db1034 kernel: [28209888.201248] Out of memory: Kill process 18619 (mysqld) score 843 or sacrifice child
May 22 14:17:01 db1034 kernel: [28209888.209173] Killed process 18619 (mysqld) total-vm:63090356kB, anon-rss:55561420kB, file-rss:0kB

From logs the only other activity around that time was a Puppet run that started applying the status just 7 seconds before and lasted 45.5s:

May 22 14:16:54 db1034 puppet-agent[17536]: Applying configuration version '1463926544'
May 22 14:17:36 db1034 puppet-agent[17536]: Finished catalog run in 44.45 seconds

I've run Puppet and saw that it's memory usage is around 55MB at the start and raise to 128MB of RAM after that logging messag with a peak of 160MB.
Surely negligible on a host with 64GB of RAM but might have been the last straw in this case.

No peak of slow queries from tendril before the kill, just an obvious increase just after the restart.

Based on how soon we plan to decommission this server we might want to adjust the memory configuration lowering the innodb_buffer_pool or max connections to be a bit more on the safe side and fix also T135353.
From tendril is not clear if the peak in connections is also before it was killed, the data is already aggregated. From a theoretical calculus the maximum memory usage of mysql is ~65GB with 5000 connections, the host has 64GB of RAM, but is unlikely we were around max connections at that time.

The swap usage was 0 until May 17th at 11:17 and was full since May 20th at 11:47 see here:

Screen Shot 2016-05-23 at 11.36.00.png (2,538×1,336 px, 212 KB)

All puppet runs are around :16 and :46 and align with most of the drops in the graph for free swap.

Luke081515 renamed this task from db1034 was killed today at 14:17:06 to db1034 was killed 22-05-16 at 14:17:06.May 23 2016, 2:24 PM
jcrespo claimed this task.