Page MenuHomePhabricator

varnishd: Assert error in smp_oc_getobj(), storage/storage_persistent_silo.c line 417
Closed, ResolvedPublic

Description

This morning we got the following alert:

09:48 < icinga-wm> PROBLEM - Varnish HTTP maps-backend - port 3128 on cp1047 is CRITICAL: Connection refused

Indeed, the backend varnishd was dead on the machine:

 ● varnish.service - varnish (Varnish HTTP Accelerator)
   Loaded: loaded (/lib/systemd/system/varnish.service; enabled)
   Active: active (running) since Mon 2016-07-11 11:26:53 UTC; 1 months 1 days ago
 Main PID: 20529 (varnishd)
   CGroup: /system.slice/varnish.service
           └─20529 /usr/sbin/varnishd -P /run/varnish.pid -a :3128 -T 127.0.0.1:6083 -f /etc/varnish/wikimedia_maps-backend.vcl -p thread_pool_min=250 -p thread_pool_max=8000 -p thread_poo...

Aug 08 18:32:51 cp1047 varnishd[20529]: CLI telnet 127.0.0.1 45716 127.0.0.1 6083 Wr 200 Message from VCC-compiler:
                                        Unused sub cluster_be_recv_applayer_backend, defined:
                                        ('/etc/varnish/maps-backend.inc.vcl' Line 5 Pos 5)...
Aug 08 18:32:51 cp1047 varnishd[20529]: CLI telnet 127.0.0.1 45770 127.0.0.1 6083 Rd auth blabla
Aug 08 18:32:51 cp1047 varnishd[20529]: CLI telnet 127.0.0.1 45770 127.0.0.1 6083 Wr 200 -----------------------------
                                        Varnish Cache CLI 1.0
                                        -----------------------------...
Aug 08 18:32:51 cp1047 varnishd[20529]: CLI telnet 127.0.0.1 45770 127.0.0.1 6083 Rd ping
Aug 08 18:32:51 cp1047 varnishd[20529]: CLI telnet 127.0.0.1 45770 127.0.0.1 6083 Wr 200 PONG 1470681171 1.0
Aug 08 18:32:51 cp1047 varnishd[20529]: CLI telnet 127.0.0.1 45770 127.0.0.1 6083 Rd vcl.use 24125a4c-29a9-4702-b184-993780b4db4e
Aug 08 18:32:51 cp1047 varnishd[20529]: CLI telnet 127.0.0.1 45770 127.0.0.1 6083 Wr 200 VCL '24125a4c-29a9-4702-b184-993780b4db4e' now active
Aug 12 07:43:17 cp1047 varnishd[20529]: Child (20225) not responding to CLI, killing it.
Aug 12 07:43:17 cp1047 varnishd[20529]: Child (20225) died signal=6
Aug 12 07:43:17 cp1047 varnishd[20529]: Child (20225) Last panic at: Fri, 12 Aug 2016 07:43:17 GMT
                                        "Assert error in smp_oc_getobj(), storage/storage_persistent_silo.c line 417:
                                          Condition((o)->magic == 0x32851d42) not true....
Aug 12 07:43:17 cp1047 varnishd[20529]: Child cleanup complete
Aug 12 07:43:17 cp1047 varnishd[20529]: Child (17652) Started
Aug 12 07:43:17 cp1047 varnishd[20529]: Child (17652) Pushing vcls failed:
                                        VCL "boot" Failed initialization
                                        Message:...
Aug 12 07:43:17 cp1047 varnishd[20529]: Stopping Child
Aug 12 07:43:18 cp1047 varnishd[20529]: Child (17652) died signal=6
Aug 12 07:43:18 cp1047 varnishd[20529]: Child (17652) Last panic at: Fri, 12 Aug 2016 07:43:18 GMT
                                        "Assert error in BAN_Shutdown(), cache/cache_ban.c line 798:
                                          Condition((pthread_join(ban_thread, &status)) == 0) not true....
Aug 12 07:43:18 cp1047 varnishd[20529]: Child (17652) said Child starts
Aug 12 07:43:18 cp1047 varnishd[20529]: Child (17652) said Dropped 0 segments to make free_reserve
Aug 12 07:43:18 cp1047 varnishd[20529]: Child (17652) said Dropped 0 segments to make free_reserve
Aug 12 07:43:18 cp1047 varnishd[20529]: Child cleanup complete
Hint: Some lines were ellipsized, use -l to show in full.

A simple systemctl varnish restart didn't fix the problem. Noticing that persistent storage seemed to be part of the equation, I've stopped the service, removed /srv/sd*/varnish* and started varnish.service again. The first time, the procedure failed. The second time it worked.

Event Timeline

ema triaged this task as Medium priority.Aug 12 2016, 8:03 AM

Full panic log on cp1047:~ema/varnishd-backend-crash.log.

ema renamed this task from varnish-backend crashed on cp1047 (maps) to varnishd: Assert error in smp_oc_getobj(), storage/storage_persistent_silo.c line 417.Aug 12 2016, 9:54 AM

This error is happening only on maps nodes. We never noticed because usually varnishd starts a new child process if one dies. However, yesterday we installed varnish-modules which replaces libvmod_xkey. Thus, varnishd couldn't properly start a new child complaining that "This is no longer the same file seen by the VCL-compiler".

Aug 12 07:43:17 cp1047 varnishd[20529]: Child cleanup complete
Aug 12 07:43:17 cp1047 varnishd[20529]: Child (17652) Started
Aug 12 07:43:17 cp1047 varnishd[20529]: Child (17652) Pushing vcls failed:#012VCL "boot" Failed initialization#012Message:#012#011Loading VMOD header from /usr/lib/x86_64-linux-gnu/varnish/vmods/libvmod_header.so:#012This is no longer the same file seen by the VCL-compiler.
Aug 12 07:43:17 cp1047 varnishd[20529]: Stopping Child
Aug 12 07:43:18 cp1047 varnishd[20529]: Child (17652) died signal=6
Aug 12 07:43:18 cp1047 varnishd[20529]: Child (17652) Last panic at: Fri, 12 Aug 2016 07:43:18 GMT#012"Assert error in BAN_Shutdown(), cache/cache_ban.c line 798:#012  Condition((pthread_join(ban_thread, &status)) == 0) not true.#012thread = (cache-main)#012version = varnish-4.1.3 revision 5e3b6d2#012ident = Linux,4.4.0-1-amd64,x86_64,-junix,-sdeprecated_persistent,-sdeprecated_persistent,-smalloc,-hcritbit,epoll#012Backtrace:#012  0x434115: varnishd() [0x434115]#012  0x4189c7: varnishd(BAN_Shutdown+0xc7) [0x4189c7]#012  0x430465: varnishd(child_main+0x155) [0x430465]#012  0x4547f7: varnishd() [0x4547f7]#012  0x4550b3: varnishd() [0x4550b3]#012  0x45526e: varnishd() [0x45526e]#012  0x7f811b8a37cb: libvarnish.so(vev_schedule_one+0x29b) [0x7f811b8a37cb]#012  0x7f811b8a3a38: libvarnish.so(vev_schedule+0x28) [0x7f811b8a3a38]#012  0x45551b: varnishd(MGT_Run+0x14b) [0x45551b]#012  0x40fe7d: varnishd(main+0x12cd) [0x40fe7d]#012#012"
Aug 12 07:43:18 cp1047 varnishd[20529]: Child (17652) said Child starts
Aug 12 07:43:18 cp1047 varnishd[20529]: Child (17652) said Dropped 0 segments to make free_reserve
Aug 12 07:43:18 cp1047 varnishd[20529]: Child (17652) said Dropped 0 segments to make free_reserve
Aug 12 07:43:18 cp1047 varnishd[20529]: Child cleanup complete

Steps to reproduce while varnish.service is running fine with two processes, assuming varnish-modules is installed:

  • apt-get remove varnish-modules
  • apt-get install libvmod-header
  • kill -6 $PID_OF_CHILD

At this point varnish.service still thinks everything is OK, but only the parent varnishd will be running. Connections to :3128 won't work.

varnishd should arguably exit with an error if it fails to start its child properly: https://github.com/varnishcache/varnish-cache/blob/master/bin/varnishd/mgt/mgt_child.c#L440

Change 304466 had a related patch set uploaded (by Ema):
cache_maps: switch to file storage backend

https://gerrit.wikimedia.org/r/304466

Change 304466 merged by Ema:
cache_maps: switch to file storage backend

https://gerrit.wikimedia.org/r/304466

Mentioned in SAL [2016-08-12T13:54:28Z] <ema> cache_maps varnish backends rolling restart (T142810)

Mentioned in SAL [2016-08-16T10:52:26Z] <ema> rolling restart of v4 varnishes (T142810)

Change 306180 had a related patch set uploaded (by Ema):
cache_upload: switch to file storage backend on Varnish 4

https://gerrit.wikimedia.org/r/306180

Change 306180 merged by Ema:
cache_upload: switch to file storage backend on Varnish 4

https://gerrit.wikimedia.org/r/306180

BBlack claimed this task.

No longer relevant (see T142848)

Change 319609 had a related patch set uploaded (by Ema):
cache_text: switch to file storage backend on Varnish 4

https://gerrit.wikimedia.org/r/319609

Change 319609 merged by BBlack:
cache_text: switch to file storage backend on Varnish 4

https://gerrit.wikimedia.org/r/319609