
Figure out why deployment-cache-text06 keeps crashing
Closed, Resolved · Public

Description

According to the action log on Horizon, I've had to force-reboot deployment-cache-text06 three times since April this year. As far as I remember, all of them have been similar cases: the instance continues to serve traffic but refuses SSH connections and shows up as a Puppet failure on https://vpsalertmanager.toolforge.org/?q=project%3Ddeployment-prep. This time the logs look like this:

[3210793.683289] Out of memory: Kill process 17901 ([TS_MAIN]) score 449 or sacrifice child
[3210793.684360] Killed process 17915 (traffic_crashlo) total-vm:15904kB, anon-rss:980kB, file-rss:0kB, shmem-rss:0kB
[3210798.430599] Out of memory: Kill process 17901 ([TS_MAIN]) score 449 or sacrifice child
[3210798.432936] Killed process 17901 ([TS_MAIN]) total-vm:2929664kB, anon-rss:1814184kB, file-rss:0kB, shmem-rss:0kB
[5362978.986816] INFO: task jbd2/vda2-8:192 blocked for more than 120 seconds.
[5362978.989405]       Not tainted 4.19.0-16-amd64 #1 Debian 4.19.181-1
[5362978.991796] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[5362978.995270] INFO: task [ET_TASK 1]:23757 blocked for more than 120 seconds.
[5362978.997892]       Not tainted 4.19.0-16-amd64 #1 Debian 4.19.181-1
[5362978.999709] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[5362979.001889] INFO: task mtail:8027 blocked for more than 120 seconds.
[5362979.003488]       Not tainted 4.19.0-16-amd64 #1 Debian 4.19.181-1
[5362979.005089] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[5362979.007120] INFO: task mtail:22308 blocked for more than 120 seconds.
[5362979.008731]       Not tainted 4.19.0-16-amd64 #1 Debian 4.19.181-1
[5362979.009726] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[5362979.011053] INFO: task mtail:13827 blocked for more than 120 seconds.
[5362979.012115]       Not tainted 4.19.0-16-amd64 #1 Debian 4.19.181-1
[5362979.013138] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[5363099.809121] INFO: task jbd2/vda2-8:192 blocked for more than 120 seconds.
[5363099.811819]       Not tainted 4.19.0-16-amd64 #1 Debian 4.19.181-1
[5363099.814246] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[5363099.818255] INFO: task cadvisor:434 blocked for more than 120 seconds.
[5363099.820836]       Not tainted 4.19.0-16-amd64 #1 Debian 4.19.181-1
[5363099.823331] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[5363099.825677] INFO: task [ET_AIO 2:34]:23748 blocked for more than 120 seconds.
[5363099.827357]       Not tainted 4.19.0-16-amd64 #1 Debian 4.19.181-1
[5363099.828802] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[5363099.830741] INFO: task [ET_TASK 1]:23757 blocked for more than 120 seconds.
[5363099.832340]       Not tainted 4.19.0-16-amd64 #1 Debian 4.19.181-1
[5363099.833801] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[5363099.835129] INFO: task mtail:8027 blocked for more than 120 seconds.
[5363099.836105]       Not tainted 4.19.0-16-amd64 #1 Debian 4.19.181-1
[5363099.837083] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[6075128.106940] Out of memory: Kill process 23727 ([TS_MAIN]) score 420 or sacrifice child
[6075128.107930] Killed process 23737 (traffic_crashlo) total-vm:15904kB, anon-rss:984kB, file-rss:0kB, shmem-rss:0kB
[6076160.882743] Out of memory: Kill process 23727 ([TS_MAIN]) score 420 or sacrifice child
[6076160.884022] Killed process 23727 ([TS_MAIN]) total-vm:2707532kB, anon-rss:1693880kB, file-rss:0kB, shmem-rss:0kB

This task is to investigate why that happens and implement a fix. Possible solutions:

  • Increase the resources allocated to the single virtual machine
  • Use LVS or HAProxy in TCP mode to distribute requests among multiple cache nodes (a sketch follows below)
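
As an illustration of the HAProxy option, a minimal TCP-mode sketch is below. The host names and addresses are placeholders, and this is not part of any merged change:

# haproxy.cfg fragment (hypothetical hosts and addresses)
frontend cache_text_in
    bind *:80
    mode tcp
    option tcplog
    default_backend cache_text_nodes

backend cache_text_nodes
    mode tcp
    balance roundrobin
    # health-checked cache nodes; names and IPs are placeholders
    server deployment-cache-text06 172.16.0.106:80 check
    server deployment-cache-text07 172.16.0.107:80 check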

Event Timeline

taavi triaged this task as High priority. Jul 12 2021, 6:10 PM
taavi created this task.

Mentioned in SAL (#wikimedia-releng) [2021-08-20T16:30:18Z] <majavah> restart sssd on deployment-cache-text06, T286502?

Mentioned in SAL (#wikimedia-cloud) [2021-10-05T12:03:55Z] <majavah> root@deployment-cache-text06:~# systemctl restart sssd # T286502

The instance has 4G of memory, of which up to 1G is used by the Varnish cache and 2G by the ATS backend RAM cache (the proxy.config.cache.ram_cache.size setting), leaving less than 1G for everything else. Indeed the system sometimes runs out of memory, and the OOM killer sacrifices the ATS process in those cases:

root@deployment-cache-text06:~# dmesg -T | grep 'Out of'
[Fri Aug  6 06:51:21 2021] Out of memory: Kill process 592 ([TS_MAIN]) score 355 or sacrifice child
[Fri Aug  6 06:51:22 2021] Out of memory: Kill process 592 ([TS_MAIN]) score 355 or sacrifice child
[Fri Sep 10 03:15:05 2021] Out of memory: Kill process 7477 ([TS_MAIN]) score 398 or sacrifice child
[Fri Sep 10 03:15:23 2021] Out of memory: Kill process 7477 ([TS_MAIN]) score 398 or sacrifice child
[Tue Sep 28 07:37:31 2021] Out of memory: Kill process 25081 ([TS_MAIN]) score 457 or sacrifice child
[Tue Sep 28 07:38:41 2021] Out of memory: Kill process 25081 ([TS_MAIN]) score 457 or sacrifice child

We should definitely lower the RAM cache size for ATS there and see if it helps. It is also perhaps worth investigating whether sssd gets confused when memory is low.
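
For reference, lowering the backend RAM cache to something like 128M corresponds to a records.config setting along these lines (in practice the value is managed through Puppet, so the path and exact form here are illustrative):

# records.config (illustrative; managed via Puppet in practice)
# 128M = 134217728 bytes
CONFIG proxy.config.cache.ram_cache.size INT 134217728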

Change 726607 had a related patch set uploaded (by Ema; author: Ema):

[operations/puppet@production] ATS: make backend ram cache size configurable

https://gerrit.wikimedia.org/r/726607

Change 726607 merged by Ema:

[operations/puppet@production] ATS: make backend ram cache size configurable

https://gerrit.wikimedia.org/r/726607

Change 726886 had a related patch set uploaded (by Ema; author: Ema):

[operations/puppet@production] beta: lower ATS ram cache size to 128M

https://gerrit.wikimedia.org/r/726886

Change 726886 merged by Ema:

[operations/puppet@production] beta: lower ATS ram cache size to 128M

https://gerrit.wikimedia.org/r/726886

ema claimed this task.

After lowering the amount of memory used for the ATS backend RAM cache, there is now more memory available on the system:

ema@deployment-cache-text06:~$ free -m
              total        used        free      shared  buff/cache   available
Mem:           3946        1377        1974         288         595        2066
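
If the instance acts up again, the effective RAM cache size can be double-checked on the host itself; a sketch, assuming traffic_ctl is available there:

root@deployment-cache-text06:~# traffic_ctl config get proxy.config.cache.ram_cache.size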

Tentatively closing the task; feel free to reopen if the instance misbehaves again.