According to the action log on Horizon I've had to force reboot deployment-cache-text06 three times since April this year. As far as I remember, all have been similar cases where the instance continues to serve traffic but is refusing SSH connections into it and shows up as a Puppet failure on https://vpsalertmanager.toolforge.org/?q=project%3Ddeployment-prep. This time the logs look like this:
[3210793.683289] Out of memory: Kill process 17901 ([TS_MAIN]) score 449 or sacrifice child [3210793.684360] Killed process 17915 (traffic_crashlo) total-vm:15904kB, anon-rss:980kB, file-rss:0kB, shmem-rss:0kB [3210798.430599] Out of memory: Kill process 17901 ([TS_MAIN]) score 449 or sacrifice child [3210798.432936] Killed process 17901 ([TS_MAIN]) total-vm:2929664kB, anon-rss:1814184kB, file-rss:0kB, shmem-rss:0kB [5362978.986816] INFO: task jbd2/vda2-8:192 blocked for more than 120 seconds. [5362978.989405] Not tainted 4.19.0-16-amd64 #1 Debian 4.19.181-1 [5362978.991796] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [5362978.995270] INFO: task [ET_TASK 1]:23757 blocked for more than 120 seconds. [5362978.997892] Not tainted 4.19.0-16-amd64 #1 Debian 4.19.181-1 [5362978.999709] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [5362979.001889] INFO: task mtail:8027 blocked for more than 120 seconds. [5362979.003488] Not tainted 4.19.0-16-amd64 #1 Debian 4.19.181-1 [5362979.005089] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [5362979.007120] INFO: task mtail:22308 blocked for more than 120 seconds. [5362979.008731] Not tainted 4.19.0-16-amd64 #1 Debian 4.19.181-1 [5362979.009726] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [5362979.011053] INFO: task mtail:13827 blocked for more than 120 seconds. [5362979.012115] Not tainted 4.19.0-16-amd64 #1 Debian 4.19.181-1 [5362979.013138] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [5363099.809121] INFO: task jbd2/vda2-8:192 blocked for more than 120 seconds. [5363099.811819] Not tainted 4.19.0-16-amd64 #1 Debian 4.19.181-1 [5363099.814246] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [5363099.818255] INFO: task cadvisor:434 blocked for more than 120 seconds. [5363099.820836] Not tainted 4.19.0-16-amd64 #1 Debian 4.19.181-1 [5363099.823331] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [5363099.825677] INFO: task [ET_AIO 2:34]:23748 blocked for more than 120 seconds. [5363099.827357] Not tainted 4.19.0-16-amd64 #1 Debian 4.19.181-1 [5363099.828802] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [5363099.830741] INFO: task [ET_TASK 1]:23757 blocked for more than 120 seconds. [5363099.832340] Not tainted 4.19.0-16-amd64 #1 Debian 4.19.181-1 [5363099.833801] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [5363099.835129] INFO: task mtail:8027 blocked for more than 120 seconds. [5363099.836105] Not tainted 4.19.0-16-amd64 #1 Debian 4.19.181-1 [5363099.837083] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [6075128.106940] Out of memory: Kill process 23727 ([TS_MAIN]) score 420 or sacrifice child [6075128.107930] Killed process 23737 (traffic_crashlo) total-vm:15904kB, anon-rss:984kB, file-rss:0kB, shmem-rss:0kB [6076160.882743] Out of memory: Kill process 23727 ([TS_MAIN]) score 420 or sacrifice child [6076160.884022] Killed process 23727 ([TS_MAIN]) total-vm:2707532kB, anon-rss:1693880kB, file-rss:0kB, shmem-rss:0kB
This task is to investigate why that happens and implement a solution to fix that. Possible solutions:
- Increasing resources allocated to the singular virtual machine
- Use LVS or HAProxy in TCP mode to distribute requests among multiple cache nodes