varnishhtcpd occasionally stops responding to HTCP requests
Closed, Resolved, Public

Description

Sometimes, varnishhtcpd will stop responding to HTCP requests due to unexplained thread corruption issues. In a recent example, the daemon was logging "Can't call method "accept" on an undefined value at /usr/local/bin/varnishhtcpd line 71" on purge requests. varnishhtcpd spawns worker threads, and apparently, sometimes the workers go on strike, and that's what the picket signs say. ;-)

A workaround that Asher proposes is to modify the daemon to kill itself when it gets in that state, which should cause upstart to respawn.
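A rough sketch of that workaround, written in Python for illustration only and not taken from the daemon's source or any actual patch. It assumes the supervisor job is configured to respawn on exit. The names `serve_once`, `run`, `exit_fn` and `EXIT_BROKEN` are hypothetical.

```python
import os
import sys

EXIT_BROKEN = 1  # hypothetical exit status; any non-zero value signals failure


def serve_once(accept):
    """Call accept() once; return False if the listener looks broken."""
    try:
        accept()
        return True
    except Exception as exc:  # e.g. the listener object has gone away
        print(f"listener broken: {exc!r}", file=sys.stderr)
        return False


def run(accept, exit_fn=os._exit):
    """Serve until the listener breaks, then exit so the supervisor respawns us."""
    while serve_once(accept):
        pass
    exit_fn(EXIT_BROKEN)  # assumption: exit immediately rather than try to recover
```

Passing `exit_fn` as a parameter is only there so the behaviour can be exercised without ending the calling process.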

This problem was discovered while fixing the HTCP issues documented in the comments on bug 41130 (late December 2012 comments).


Version: unspecified
Severity: major

bzimport set Reference to bz43448.
RobLa-WMF created this task. Via Legacy, Dec 26 2012, 11:50 PM
Bawolff added a comment. Via Conduit, Dec 27 2012, 12:02 AM

[Sorry if slightly off-topic] Could we have some sort of monitoring of whether things actually get purged? squid/varnish purging suddenly not working seems to have happened quite a few times in the past (each time for a different reason), and we have no monitoring of it. (As far as I am aware, we don't even have any unit tests for it on the MW side.)

This is bad since:
* Most people don't have squid/varnish set up in their dev environment, so they don't notice the problem on local test wikis.
* The symptoms are gradual and usually pass unnoticed for some time.
* With the exception of images, the people primarily affected are anons, who are less likely to know how to report the issue.

Bawolff added a comment. Via Conduit, Dec 27 2012, 12:25 AM

(In reply to comment #1)
Moved "we should monitor HTCP purging" to a separate bug - bug 43449

Aklapper added a comment. Via Conduit, Jan 22 2013, 1:29 PM

FYI, more info posted on ops@ by Tim Starling ~6 hours ago:

varnishhtcpd daemon (listens on port 4827 for HTCP purges, and converts them to HTTP purges on localhost) deadlocked and stopped working on all upload hosts.
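To illustrate the second half of that pipeline (the HTTP purge on localhost), here is a small Python sketch. It is not the daemon's code. Decoding the HTCP packet is left out, and both the cache port (80) and the assumption that the local cache is configured to accept the PURGE method are mine. `purge_request_parts` and `send_purge` are hypothetical names.

```python
import http.client
from urllib.parse import urlsplit


def purge_request_parts(url):
    """Split a purged URL into the pieces of an HTTP PURGE request."""
    parts = urlsplit(url)
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    # The Host header carries the original site name, since we connect to localhost.
    return "PURGE", path, {"Host": parts.netloc}


def send_purge(url, port=80, timeout=2.0):
    """Send the PURGE to a cache on localhost; port 80 is an assumption."""
    method, path, headers = purge_request_parts(url)
    conn = http.client.HTTPConnection("localhost", port, timeout=timeout)
    try:
        conn.request(method, path, headers=headers)
        return conn.getresponse().status
    finally:
        conn.close()
```

Keeping the URL splitting separate from the network call makes the mapping easy to check without a running cache.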

Details: "Apparently the worker threads deadlock each other in malloc/realloc. Then the queue overflows and the main thread tries to exit. The main thread closes its HTCP listen socket and then joins in with the deadlock. So it never exits and upstart can't respawn it."
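The failure mode described here (queue overflow, then an exit that never completes) suggests exiting without waiting on the workers. Below is a Python sketch of that idea, not of the actual fix. The queue bound and the names `purges` and `enqueue_or_die` are hypothetical.

```python
import os
import queue

purges = queue.Queue(maxsize=1000)  # hypothetical bound on pending purges


def enqueue_or_die(q, item, exit_fn=os._exit):
    """Queue a purge; on overflow exit at once instead of shutting down cleanly."""
    try:
        q.put_nowait(item)
    except queue.Full:
        # os._exit ends the process without running cleanup handlers or
        # joining other threads, so deadlocked workers cannot hold it up.
        exit_fn(1)
```

In Python, a normal `sys.exit` from the main thread would still wait for non-daemon worker threads at interpreter shutdown, which is the same trap in a different runtime.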

tstarling added a comment. Via Conduit, Jan 23 2013, 11:24 AM

This should be fixed now. But the CPU usage is very high, so there may be some packet loss.
