Page MenuHomePhabricator

vhtcpd segfaulted and did not get restarted
Closed, ResolvedPublic

Description

vhctpd unexpectedly stopped running on cp1075 11 hours ago:

2019-12-15 20:44 <+icinga-wm> PROBLEM - Varnish HTCP daemon on cp1075 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 115 (vhtcpd), args vhtcpd https://wikitech.wikimedia.org/wiki/Varnish

The reason for this is a segmentation fault:

Dec 15 20:38:42 cp1075 kernel: vhtcpd[148095]: segfault at 0 ip 00007f17d82fc231 sp 00007fff7bddc828 error 4 in libc-2.24.so[7f17d81d3000+195000]

I have depooled the host at 07:52.

The unit logs do not show anything interesting, just that the process exited:

root@cp1075:~# systemctl status vhtcpd.service 
● vhtcpd.service - LSB: vhtcpd
   Loaded: loaded (/etc/init.d/vhtcpd; generated; vendor preset: enabled)
   Active: active (exited) since Wed 2019-09-25 00:02:22 UTC; 2 months 21 days ago
     Docs: man:systemd-sysv-generator(8)
    Tasks: 0 (limit: 41779)
   CGroup: /system.slice/vhtcpd.service

Dec 15 19:47:22 cp1075 vhtcpd[148095]: Purger1:  input: 15701889632 failed: 32 q_size: 1030 q_mem: 191766 q_max_size: 11100 q_max_mem: 3623507
Dec 15 20:02:22 cp1075 vhtcpd[148095]: start: 1569369742 uptime: 7070399 purgers: 2 recvd: 15968398902 bad: 2 filtered: 0
Dec 15 20:02:22 cp1075 vhtcpd[148095]: Purger0:  input: 15968398900 failed: 1355211 q_size: 264290641 q_mem: 40499198184 q_max_size: 264290641 q_max_mem: 40499198184
Dec 15 20:02:22 cp1075 vhtcpd[148095]: Purger1:  input: 15702753047 failed: 32 q_size: 977 q_mem: 211345 q_max_size: 11100 q_max_mem: 3623507
Dec 15 20:17:22 cp1075 vhtcpd[148095]: start: 1569369742 uptime: 7071300 purgers: 2 recvd: 15970859344 bad: 2 filtered: 0
Dec 15 20:17:22 cp1075 vhtcpd[148095]: Purger0:  input: 15970859342 failed: 1355211 q_size: 265942528 q_mem: 40675459219 q_max_size: 265942594 q_max_mem: 40675467672
Dec 15 20:17:22 cp1075 vhtcpd[148095]: Purger1:  input: 15703561602 failed: 32 q_size: 1165 q_mem: 202575 q_max_size: 11100 q_max_mem: 3623507
Dec 15 20:32:22 cp1075 vhtcpd[148095]: start: 1569369742 uptime: 7072200 purgers: 2 recvd: 15973436580 bad: 2 filtered: 0
Dec 15 20:32:22 cp1075 vhtcpd[148095]: Purger0:  input: 15973436578 failed: 1355211 q_size: 267722646 q_mem: 40879916804 q_max_size: 267722646 q_max_mem: 40879916804
Dec 15 20:32:22 cp1075 vhtcpd[148095]: Purger1:  input: 15704358720 failed: 32 q_size: 1079 q_mem: 162098 q_max_size: 11100 q_max_mem: 3623507

Note that (1) the status is active instead of inactive (or failed), and (2) the process did not get restarted by systemd. The unit specifically has Restart=no, so I assume this is intentional?

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald Transcript
ema triaged this task as High priority.Dec 16 2019, 8:05 AM
ema moved this task from Backlog to Caching on the Traffic board.

Mentioned in SAL (#wikimedia-operations) [2019-12-16T08:08:52Z] <ema> cp1075: manually start vhtcpd.service T240826

Mentioned in SAL (#wikimedia-operations) [2019-12-16T08:12:03Z] <ema> cp1075: wipe varnish-fe and ats-be caches due to missed purges T240826

The vhtcpd.service unit is auto-generated from the vthpcd sysvinit script by means of the systemd-sysv-generator, which simply printfs "Restart=No" to the systemd units it generates. We should create a native systemd unit (given that it's shipped via the vhtcpd deb package, we can also do this nicely gradually by means of a staged package update.

As to why systemd didn't detect the crashed service: With the information available in the sysv init and a forking daemon it cannot detect if the daemon crashed. There's a workaround of adding a

# pidfile: /var/run/vhtcpd.pid

comment below the "### END INIT INFO" section. That is parsed by the sysv-generator and adds a "PIDFile" option to the generated systemd unit (and with that systemd will supervise the PID specified there)

But let's simply create a native systemd unit instead of adding more tech debt :-)

Change 558138 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] vhtcpd: convert service to systemd

https://gerrit.wikimedia.org/r/558138

Change 558138 merged by Ema:
[operations/puppet@production] vhtcpd: convert service to systemd

https://gerrit.wikimedia.org/r/558138

Change 558474 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] vhtcpd: pass -r only once in systemd unit

https://gerrit.wikimedia.org/r/558474

Change 558474 merged by Ema:
[operations/puppet@production] vhtcpd: pass -r only once in systemd unit

https://gerrit.wikimedia.org/r/558474

ema claimed this task.

Now that systemd properly takes care of supervising vhtcpd.service, the service is automatically restarted upon SIGSEGV:

Dec 17 13:13:35 cp3050 systemd[1]: vhtcpd.service: Main process exited, code=killed, status=11/SEGV
Dec 17 13:13:35 cp3050 systemd[1]: vhtcpd.service: Unit entered failed state.
Dec 17 13:13:35 cp3050 systemd[1]: vhtcpd.service: Failed with result 'signal'.
Dec 17 13:13:35 cp3050 systemd[1]: vhtcpd.service: Service hold-off time over, scheduling restart.
Dec 17 13:13:35 cp3050 systemd[1]: Stopped vhtcpd.service.
Dec 17 13:13:35 cp3050 systemd[1]: Started vhtcpd.service.
Dec 17 13:13:35 cp3050 vhtcpd[269638]: libev timeout collection interval set to 0.100
Dec 17 13:13:35 cp3050 vhtcpd[269638]: Configuring purger 0 for 127.0.0.1:3128 with delay 0.000
Dec 17 13:13:35 cp3050 vhtcpd[269638]: Configuring purger 1 for 127.0.0.1:3127 with delay 1.000