
varnish 7.1.1-1.1~bpo11+wmf1 crash
Closed, Resolved · Public

Description

varnish crashed this morning on cp3066:

Wrong turn at cache/cache_wrk.c:629:
Worker Pool Queue does not move
version = varnish-7.1.1 revision 7cee1c581bead20e88d101ab3d72afb29f14d87a, vrt api = 15.0
ident = Linux,5.10.0-30-amd64,x86_64,-junix,-smalloc,-smalloc,-hcritbit,epoll
now = 23667458.893371 (mono), 1744103393.647477 (real)
Backtrace:
  0x56342d11c54c: /usr/sbin/varnishd(+0x5954c) [0x56342d11c54c]
  0x56342d1980f8: /usr/sbin/varnishd(VAS_Fail+0x18) [0x56342d1980f8]
  0x56342d14473c: /usr/sbin/varnishd(pool_herder+0x88c) [0x56342d14473c]
  0x7f4c28a16ea7: /lib/x86_64-linux-gnu/libpthread.so.0(+0x7ea7) [0x7f4c28a16ea7]
  0x7f4c28936acf: /lib/x86_64-linux-gnu/libc.so.6(clone+0x3f) [0x7f4c28936acf]

This has been reported upstream in the past (https://github.com/varnishcache/varnish-cache/issues/3868), but as far as I know no fix has been made public for it.

The comment from Nils Goroll is a nice hint though:

Firstly, the intention behind this panic is a good one - to improve your system's uptime by restarting the varnish worker process if something goes terribly wrong. The respective parameter is thread_pool_watchdog.
The potential reasons for the watchdog barking are numerous: Your varnish worker did not run for some time, system resources are exhausted (in particular, memory), your vm got migrated. We really can't know.

Current status: 7.1.1-1.1~bpo11+wmf3, shipping backports of https://github.com/varnishcache/varnish-cache/pull/3947 and https://github.com/varnishcache/varnish-cache/pull/3827, deployed globally. Waiting for another occurrence of the crash (2025-04-15).

Event Timeline

The thread pool watchdog was introduced in Varnish 6.2; from their changelog:

We have added a “watchdog” for thread pools that will panic the worker process, causing it to restart, if scheduling tasks onto worker threads appears to be deadlocking. The length of time until the panic is set by the thread_pool_watchdog parameter. If this happens, it probably means that thread pools are too small, and you should consider increasing the parameters thread_pool_min, thread_pool_max and/or thread_pools
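The watchdog behavior described in the changelog can be sketched as a small model (a minimal sketch with hypothetical names; the real logic lives in pool_herder() in cache/cache_wrk.c): if the head of a pool's queue has not moved within thread_pool_watchdog seconds, the worker process panics.

```python
# Simplified model of the thread pool watchdog (hypothetical names and
# structure; not the actual varnishd implementation).
class PoolWatchdog:
    def __init__(self, thread_pool_watchdog=60.0):
        self.timeout = thread_pool_watchdog  # seconds, like the varnishd parameter
        self.last_queue_head = None
        self.stalled_since = None

    def check(self, queue_head, now):
        """Panic if the head of the queue has not moved within the timeout."""
        if queue_head is None or queue_head != self.last_queue_head:
            # Queue moved (or is empty): reset the stall clock.
            self.last_queue_head = queue_head
            self.stalled_since = now
            return
        if now - self.stalled_since >= self.timeout:
            raise RuntimeError("Worker Pool Queue does not move")

wd = PoolWatchdog(thread_pool_watchdog=60.0)
wd.check("task-1", now=0.0)
wd.check("task-1", now=30.0)      # same head, under the timeout: fine
try:
    wd.check("task-1", now=61.0)  # still not dequeued after 60s -> panic
except RuntimeError as e:
    print(e)  # prints "Worker Pool Queue does not move"
```

This matches the panic message seen on cp3066, where the assertion fires after the queue head sits still for the full timeout.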

From the cp3066 crash, we know that:

pools = {
  pool = 0x7f4c1c000640 {
    nidle = 0,
    nthr = 358,
    lqueue = 1
  },
  pool = 0x7f4c1c000000 {
    nidle = 34,
    nthr = 381,
    lqueue = 0
  },
},

The pool at 0x7f4c1c000640 had an element on the queue that wasn't dequeued within 60 seconds, which triggered the crash. What's strange is that the pool had only 358 threads while thread_pool_max is set to 12000, so something is preventing Varnish from spawning new threads under certain conditions.
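To make the oddity concrete, here is a toy model of the herder's "should we spawn a thread?" decision (hypothetical names and simplified logic, not the actual cache_wrk.c code; only thread_pool_max=12000 comes from our config, the other default is illustrative). With the state dumped in the panic, a new thread should have been bred:

```python
# Toy model of the herder's thread-breeding decision (hypothetical
# simplification; not the actual varnishd logic).
def should_breed(nidle, nthr, lqueue,
                 thread_pool_min=100,      # illustrative value
                 thread_pool_max=12000):   # our configured ceiling
    if nthr >= thread_pool_max:
        return False   # hard ceiling on pool size
    if nthr < thread_pool_min:
        return True    # keep the pool at its configured floor
    return lqueue > 0 and nidle == 0   # work queued and no idle worker

# State of pool 0x7f4c1c000640 from the panic output:
print(should_breed(nidle=0, nthr=358, lqueue=1))  # prints True
```

Under this model the herder has every reason to spawn: there is queued work, no idle thread, and plenty of headroom below thread_pool_max. That the real herder did not is exactly the mystery.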

Change #1135080 had a related patch set uploaded (by BCornwall; author: BCornwall):

[operations/puppet@production] Remove varnish-staging, add varnish6 components

https://gerrit.wikimedia.org/r/1135080

Change #1135080 merged by BCornwall:

[operations/puppet@production] Remove varnish-staging, add varnish6 components

https://gerrit.wikimedia.org/r/1135080

Mentioned in SAL (#wikimedia-operations) [2025-04-08T18:03:51Z] <brett> import varnish 6.0.13-1wm1 to component/varnish6 bullseye-wikimedia (T391334)

Mentioned in SAL (#wikimedia-operations) [2025-04-08T20:22:07Z] <brett> import libvmod-re2 1.5.3-4 to component/varnish6 bullseye-wikimedia (T391334)

Mentioned in SAL (#wikimedia-operations) [2025-04-08T20:25:49Z] <brett> import varnishkafka 1.1.0-4 to component/varnish6 bullseye-wikimedia (T391334)

Mentioned in SAL (#wikimedia-operations) [2025-04-08T20:37:07Z] <brett> import varnish-modules 0.15.0-3 to component/varnish6 bullseye-wikimedia (T391334)

Mentioned in SAL (#wikimedia-operations) [2025-04-08T20:51:33Z] <brett> import libvmod-querysort 0.4-2 to component/varnish6 bullseye-wikimedia (T391334)

Mentioned in SAL (#wikimedia-operations) [2025-04-08T21:19:24Z] <brett> import libvmod-netmapper 1.9-4 to component/varnish6 bullseye-wikimedia (T391334)

Should we decide to go forward with downgrading, Varnish 6 and the related packages have been imported into the component/varnish6 component for bullseye-wikimedia.

We'd perform the following:

  • Re-introduce the Varnish switches that were removed in https://gerrit.wikimedia.org/r/c/operations/puppet/+/1132765 (with varnish_ge_v7 set to "true" by default this time)
  • Introduce priority pinning of Varnish 6 components in /etc/apt/preferences.d/apt_pin_varnish6.pref for hosts with varnish_ge_v7:
Package: *
Pin: release c=component/varnish6
Pin-Priority: 1002
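Per apt_preferences(5), a pin priority above 1000 allows apt to select a version lower than the one currently installed, which is why 1002 (rather than, say, 990) is needed to force the downgrade. A toy model of that rule (illustrative only, nowhere near apt's full resolver):

```python
# Toy model of apt's candidate selection with pinning (illustrative only):
# a version lower than the installed one is eligible only if its pin
# priority exceeds 1000; otherwise the highest-priority version wins.
def candidate(installed, versions):
    """versions: list of (version, priority); version compared as a string
    for simplicity (good enough for this example)."""
    best = None
    for ver, prio in versions:
        if ver < installed and prio <= 1000:
            continue  # downgrades require priority > 1000
        if best is None or prio > best[1] or (prio == best[1] and ver > best[0]):
            best = (ver, prio)
    return best[0] if best else installed

# Varnish 7 installed; the varnish6 component pinned at 1002 wins:
print(candidate("7.1.1", [("7.1.1", 500), ("6.0.13", 1002)]))  # prints 6.0.13
```

With the more common priority 990 instead, the same call would keep 7.1.1 installed, since 990 is not high enough to permit a downgrade.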

Then start downgrading:

  • Switch hosts that we're downgrading to profile::cache::varnish::frontend::varnish_ge_v7: false and profile::cache::varnish::frontend::packages_component: component/varnish6
  • Run the sre.cdn.roll-upgrade-varnish cookbook. This will:
    • Run puppet on the host, causing the older VCL to be installed and the Varnish 6 pinning to be put in place (the packages themselves are not yet downgraded). This Puppet run will fail because Varnish cannot be restarted correctly while Varnish 7 is still installed
    • Explicitly install each Varnish package, which will be downgraded from component/varnish6 due to the pinning.
    • Re-run puppet to make any adjustments
    • Restart all Varnish-related services

FWIW varnish@cp3066 didn't report any thread creation failure before the crash: https://grafana.wikimedia.org/goto/K_XG9oANg?orgId=1


Mentioned in SAL (#wikimedia-operations) [2025-04-09T15:11:58Z] <vgutierrez> upgrading to varnish 7.1.1-1.1~bpo11+wmf3 in cp3073 (text) and cp3081 (upload) - T391334

Mentioned in SAL (#wikimedia-operations) [2025-04-14T10:35:55Z] <vgutierrez> upload varnish 7.1.1-1.1~bpo11+wmf3 to apt.wm.o (bullseye-wikimedia) - T391334

Mentioned in SAL (#wikimedia-operations) [2025-04-14T10:43:29Z] <vgutierrez> rolling upgrade to varnish 7.1.1-1.1~bpo11+wmf3 in ulsfo - T391334

Mentioned in SAL (#wikimedia-operations) [2025-04-14T11:24:13Z] <vgutierrez> upload varnishkafka 1.2.0-3 to apt.wm.o (bullseye-wikimedia) - T391334

Mentioned in SAL (#wikimedia-operations) [2025-04-14T13:17:30Z] <vgutierrez> rolling upgrade to varnish 7.1.1-1.1~bpo11+wmf3 in magru - T391334

Mentioned in SAL (#wikimedia-operations) [2025-04-15T07:04:28Z] <vgutierrez> rolling upgrade to varnish 7.1.1-1.1~bpo11+wmf3 in eqsin and codfw - T391334

Mentioned in SAL (#wikimedia-operations) [2025-04-15T10:52:26Z] <vgutierrez> rolling upgrade to varnish 7.1.1-1.1~bpo11+wmf3 in drmrs - T391334

Mentioned in SAL (#wikimedia-operations) [2025-04-15T11:07:39Z] <vgutierrez> rolling upgrade to varnish 7.1.1-1.1~bpo11+wmf3 in esams - T391334

Mentioned in SAL (#wikimedia-operations) [2025-04-15T14:25:37Z] <vgutierrez> rolling upgrade to varnish 7.1.1-1.1~bpo11+wmf3 in eqiad - T391334

Vgutierrez renamed this task from varnish 7.1.1 crash to varnish 7.1.1-1.1~bpo11+wmf1 crash.Apr 15 2025, 6:53 PM
Vgutierrez changed the task status from Open to Stalled.
Vgutierrez updated the task description. (Show Details)
BCornwall claimed this task.

It seems like we've had good fortune and not experienced another crash. I'm going to be bold and close this; if misfortune should hit us again, let's reopen it.

Thanks for the patches, @Vgutierrez !