
Figure out why deployment-cache-text06 keeps crashing
Closed, Resolved · Public

Description

According to the action log on Horizon, I've had to force-reboot deployment-cache-text06 three times since April this year. As far as I remember, all of them have been similar cases: the instance continues to serve traffic but refuses SSH connections and shows up as a Puppet failure on https://vpsalertmanager.toolforge.org/?q=project%3Ddeployment-prep. This time the logs look like this:

[3210793.683289] Out of memory: Kill process 17901 ([TS_MAIN]) score 449 or sacrifice child
[3210793.684360] Killed process 17915 (traffic_crashlo) total-vm:15904kB, anon-rss:980kB, file-rss:0kB, shmem-rss:0kB
[3210798.430599] Out of memory: Kill process 17901 ([TS_MAIN]) score 449 or sacrifice child
[3210798.432936] Killed process 17901 ([TS_MAIN]) total-vm:2929664kB, anon-rss:1814184kB, file-rss:0kB, shmem-rss:0kB
[5362978.986816] INFO: task jbd2/vda2-8:192 blocked for more than 120 seconds.
[5362978.989405]       Not tainted 4.19.0-16-amd64 #1 Debian 4.19.181-1
[5362978.991796] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[5362978.995270] INFO: task [ET_TASK 1]:23757 blocked for more than 120 seconds.
[5362978.997892]       Not tainted 4.19.0-16-amd64 #1 Debian 4.19.181-1
[5362978.999709] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[5362979.001889] INFO: task mtail:8027 blocked for more than 120 seconds.
[5362979.003488]       Not tainted 4.19.0-16-amd64 #1 Debian 4.19.181-1
[5362979.005089] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[5362979.007120] INFO: task mtail:22308 blocked for more than 120 seconds.
[5362979.008731]       Not tainted 4.19.0-16-amd64 #1 Debian 4.19.181-1
[5362979.009726] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[5362979.011053] INFO: task mtail:13827 blocked for more than 120 seconds.
[5362979.012115]       Not tainted 4.19.0-16-amd64 #1 Debian 4.19.181-1
[5362979.013138] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[5363099.809121] INFO: task jbd2/vda2-8:192 blocked for more than 120 seconds.
[5363099.811819]       Not tainted 4.19.0-16-amd64 #1 Debian 4.19.181-1
[5363099.814246] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[5363099.818255] INFO: task cadvisor:434 blocked for more than 120 seconds.
[5363099.820836]       Not tainted 4.19.0-16-amd64 #1 Debian 4.19.181-1
[5363099.823331] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[5363099.825677] INFO: task [ET_AIO 2:34]:23748 blocked for more than 120 seconds.
[5363099.827357]       Not tainted 4.19.0-16-amd64 #1 Debian 4.19.181-1
[5363099.828802] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[5363099.830741] INFO: task [ET_TASK 1]:23757 blocked for more than 120 seconds.
[5363099.832340]       Not tainted 4.19.0-16-amd64 #1 Debian 4.19.181-1
[5363099.833801] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[5363099.835129] INFO: task mtail:8027 blocked for more than 120 seconds.
[5363099.836105]       Not tainted 4.19.0-16-amd64 #1 Debian 4.19.181-1
[5363099.837083] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[6075128.106940] Out of memory: Kill process 23727 ([TS_MAIN]) score 420 or sacrifice child
[6075128.107930] Killed process 23737 (traffic_crashlo) total-vm:15904kB, anon-rss:984kB, file-rss:0kB, shmem-rss:0kB
[6076160.882743] Out of memory: Kill process 23727 ([TS_MAIN]) score 420 or sacrifice child
[6076160.884022] Killed process 23727 ([TS_MAIN]) total-vm:2707532kB, anon-rss:1693880kB, file-rss:0kB, shmem-rss:0kB

This task is to investigate why that happens and implement a fix. Possible solutions:

  • Increase the resources allocated to the single virtual machine
  • Use LVS or HAProxy in TCP mode to distribute requests among multiple cache nodes (a sketch follows below)
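
As an illustration of the HAProxy option, a minimal TCP-mode sketch is below. The host names and addresses are placeholders, and this is not part of any merged change:

# haproxy.cfg fragment (hypothetical hosts and addresses)
frontend cache_text_in
    bind *:80
    mode tcp
    option tcplog
    default_backend cache_text_nodes

backend cache_text_nodes
    mode tcp
    balance roundrobin
    # health-checked cache nodes; names and IPs are placeholders
    server deployment-cache-text06 172.16.0.106:80 check
    server deployment-cache-text07 172.16.0.107:80 check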

Event Timeline

taavi triaged this task as High priority. Jul 12 2021, 6:10 PM
taavi created this task.

Mentioned in SAL (#wikimedia-releng) [2021-08-20T16:30:18Z] <majavah> restart sssd on deployment-cache-text06, T286502?

Mentioned in SAL (#wikimedia-cloud) [2021-10-05T12:03:55Z] <majavah> root@deployment-cache-text06:~# systemctl restart sssd # T286502

The instance has 4G of memory, of which up to 1G is used by the Varnish cache and 2G by the ATS backend RAM cache (the proxy.config.cache.ram_cache.size setting), leaving less than 1G for everything else. Indeed the system sometimes runs out of memory, and the OOM killer sacrifices the ATS process in those cases:

root@deployment-cache-text06:~# dmesg -T | grep 'Out of'
[Fri Aug  6 06:51:21 2021] Out of memory: Kill process 592 ([TS_MAIN]) score 355 or sacrifice child
[Fri Aug  6 06:51:22 2021] Out of memory: Kill process 592 ([TS_MAIN]) score 355 or sacrifice child
[Fri Sep 10 03:15:05 2021] Out of memory: Kill process 7477 ([TS_MAIN]) score 398 or sacrifice child
[Fri Sep 10 03:15:23 2021] Out of memory: Kill process 7477 ([TS_MAIN]) score 398 or sacrifice child
[Tue Sep 28 07:37:31 2021] Out of memory: Kill process 25081 ([TS_MAIN]) score 457 or sacrifice child
[Tue Sep 28 07:38:41 2021] Out of memory: Kill process 25081 ([TS_MAIN]) score 457 or sacrifice child

We should definitely lower the RAM cache size for ATS there and see if it helps. It is also perhaps worth investigating whether sssd gets confused when memory is low.
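
For reference, lowering the backend RAM cache to something like 128M corresponds to a records.config setting along these lines (in practice the value is managed through Puppet, so the path and exact form here are illustrative):

# records.config (illustrative; managed via Puppet in practice)
# 128M = 134217728 bytes
CONFIG proxy.config.cache.ram_cache.size INT 134217728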

Change 726607 had a related patch set uploaded (by Ema; author: Ema):

[operations/puppet@production] ATS: make backend ram cache size configurable

https://gerrit.wikimedia.org/r/726607

Change 726607 merged by Ema:

[operations/puppet@production] ATS: make backend ram cache size configurable

https://gerrit.wikimedia.org/r/726607

Change 726886 had a related patch set uploaded (by Ema; author: Ema):

[operations/puppet@production] beta: lower ATS ram cache size to 128M

https://gerrit.wikimedia.org/r/726886

Change 726886 merged by Ema:

[operations/puppet@production] beta: lower ATS ram cache size to 128M

https://gerrit.wikimedia.org/r/726886

ema claimed this task.

After lowering the amount of memory used for the ATS backend RAM cache, there is now more memory available on the system:

ema@deployment-cache-text06:~$ free -m
              total        used        free      shared  buff/cache   available
Mem:           3946        1377        1974         288         595        2066
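
If the instance acts up again, the effective RAM cache size can be double-checked on the host itself; a sketch, assuming traffic_ctl is available there:

root@deployment-cache-text06:~# traffic_ctl config get proxy.config.cache.ram_cache.size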

Tentatively closing the task; feel free to reopen if the instance misbehaves again.