Related Objects
Event Timeline
160521 23:55:33 [ERROR] Event Scheduler: [root@localhost][ops.wmf_slave_wikiuser_slow] Unknown thread id: 2084456573 160521 23:55:33 [Note] Event Scheduler: [root@localhost].[ops.wmf_slave_wikiuser_slow] event execution failed. 160522 11:38:35 [ERROR] Event Scheduler: [root@localhost][ops.wmf_slave_wikiuser_sleep] Unknown thread id: 2103706958 160522 11:38:35 [Note] Event Scheduler: [root@localhost].[ops.wmf_slave_wikiuser_sleep] event execution failed. Killed 160522 14:17:06 mysqld_safe Number of processes running now: 0 160522 14:17:06 mysqld_safe mysqld restarted 160522 14:17:06 [Warning] No argument was provided to --log-bin and neither --log-basename or --log-bin-index where used; This may cause repliction to break when this server acts as a master and has its hostname changed! Please use '--log-basename=db1034' or '--log-bin=db1034-bin' to avoid this problem. 160522 14:17:06 [Note] InnoDB: Using mutexes to ref count buffer pool pages 160522 14:17:06 [Note] InnoDB: The InnoDB memory heap is disabled 160522 14:17:06 [Note] InnoDB: Mutexes and rw_locks use GCC atomic builtins 160522 14:17:06 [Note] InnoDB: Memory barrier is not used 160522 14:17:06 [Note] InnoDB: Compressed tables use zlib 1.2.7 160522 14:17:06 [Note] InnoDB: Using CPU crc32 instructions 160522 14:17:06 [Note] InnoDB: Initializing buffer pool, size = 47.0G 160522 14:17:08 [Note] InnoDB: Completed initialization of buffer pool 160522 14:17:09 [Note] InnoDB: Highest supported file format is Barracuda. 160522 14:17:09 [Note] InnoDB: Log scan progressed past the checkpoint lsn 6946172412995 160522 14:17:09 [Note] InnoDB: Database was not shutdown normally! 160522 14:17:09 [Note] InnoDB: Starting crash recovery. 160522 14:17:09 [Note] InnoDB: Reading tablespace information from the .ibd files... 160522 14:17:19 [Note] InnoDB: Restoring possible half-written data pages 160522 14:17:19 [Note] InnoDB: from the doublewrite buffer... InnoDB: Doing recovery: scanned up to log sequence number 6946177655808
mysql process was killed by the OOM after the server fill it's swap and there was no more memory (swappiness is zero). The server was having it's swap almost full since May 20th at 11:47:
May 22 14:17:01 db1034 kernel: [28209888.201248] Out of memory: Kill process 18619 (mysqld) score 843 or sacrifice child May 22 14:17:01 db1034 kernel: [28209888.209173] Killed process 18619 (mysqld) total-vm:63090356kB, anon-rss:55561420kB, file-rss:0kB
From logs the only other activity around that time was a Puppet run that started applying the status just 7 seconds before and lasted 45.5s:
May 22 14:16:54 db1034 puppet-agent[17536]: Applying configuration version '1463926544' May 22 14:17:36 db1034 puppet-agent[17536]: Finished catalog run in 44.45 seconds
I've run Puppet and saw that it's memory usage is around 55MB at the start and raise to 128MB of RAM after that logging messag with a peak of 160MB.
Surely negligible on a host with 64GB of RAM but might have been the last straw in this case.
No peak of slow queries from tendril before the kill, just an obvious increase just after the restart.
Based on how soon we plan to decommission this server we might want to adjust the memory configuration lowering the innodb_buffer_pool or max connections to be a bit more on the safe side and fix also T135353.
From tendril is not clear if the peak in connections is also before it was killed, the data is already aggregated. From a theoretical calculus the maximum memory usage of mysql is ~65GB with 5000 connections, the host has 64GB of RAM, but is unlikely we were around max connections at that time.
The swap usage was 0 until May 17th at 11:17 and was full since May 20th at 11:47 see here:
All puppet runs are around :16 and :46 and align with most of the drops in the graph for free swap.
