varnishhtcpd occasionally stops responding to HTCP requests
Closed, Resolved (Public)
Assigned To

None

Authored By

	• RobLa-WMF
	Dec 26 2012, 11:50 PM

Description

Sometimes, varnishhtcpd will stop responding to HTCP requests due to unexplained thread corruption issues. In a recent example, the daemon was logging "Can't call method "accept" on an undefined value at /usr/local/bin/varnishhtcpd line 71" on purge requests. varnishhtcpd spawns worker threads, and apparently, sometimes the workers go on strike, and that's what the picket signs say. ;-)

A workaround that Asher proposes is to modify the daemon to kill itself when it gets into that state, which should cause upstart to respawn it.
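The proposed workaround can be sketched roughly as a heartbeat watchdog. This is only an illustration in Python, not the actual varnishhtcpd code (which is a separate script at /usr/local/bin/varnishhtcpd); the `Watchdog` class, its method names, and the timeout are all hypothetical. It assumes each worker reports in after handling a request, and that the upstart job has a `respawn` stanza so that a non-zero exit brings the daemon back.

```python
import os
import threading
import time


class Watchdog:
    """Hypothetical helper: tracks worker heartbeats and exits when one goes stale."""

    def __init__(self, timeout_s, exit_fn=os._exit, clock=time.monotonic):
        self.timeout_s = timeout_s
        self.exit_fn = exit_fn      # os._exit ends the process without joining threads
        self.clock = clock
        self._lock = threading.Lock()
        self._last_beat = {}

    def beat(self, worker_id):
        """Called by a worker each time it finishes handling a request."""
        with self._lock:
            self._last_beat[worker_id] = self.clock()

    def stale_workers(self):
        """Return the ids of workers that have not reported within the timeout."""
        now = self.clock()
        with self._lock:
            return sorted(w for w, t in self._last_beat.items()
                          if now - t > self.timeout_s)

    def check(self):
        """Exit with status 1 if any worker is stale, so the init system can respawn."""
        stale = self.stale_workers()
        if stale:
            self.exit_fn(1)
        return stale
```

The main loop would call `check()` periodically. The exit function is injectable only so the logic can be exercised without actually killing the process.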

This problem was discovered while fixing the HTCP issues documented in the comments on bug 41130 (late December 2012 comments).

Version: unspecified
Severity: major

Details

Reference: bz43448

Related Objects

Status	Assigned	Task
Open	None	T43371 Thumbnail/imagescaler (tracking)
Resolved	None	T46508 htcp cache purges for images do not seem to clear europe upload squid caches
Invalid	Aklapper	T43130 Invalidation of Varnish thumbnail cache sometimes doesn't work
Resolved	None	T45448 varnishhtcpd occasionally stops responding to HTCP requests

Event Timeline

• bzimport raised the priority of this task to High. Nov 22 2014, 1:08 AM

• bzimport added projects: WMF-General-or-Unknown, acl*sre-team.

• bzimport set Reference to bz43448.

• bzimport added a subscriber: Unknown Object (MLST).

• RobLa-WMF created this task. Dec 26 2012, 11:50 PM

[Sorry if slightly off-topic] Could we have some sort of monitoring of whether things actually get purged? Squid/Varnish purging suddenly not working seems to have happened quite a few times in the past (each time for a different reason), and we have no monitoring of it. (We don't even have any unit tests on the MW side for it, as far as I am aware.)

This is bad since:
* Most people don't have squid/varnish set up in their dev environment, so people don't notice on local test wikis.
* The symptoms are gradual, and usually pass unnoticed for some time.
* With the exception of images, the people primarily affected are anons, who are less likely to know how to report the issue.

(In reply to comment #1)
Moved "we should monitor HTCP purging" to a separate bug: bug 43449

FYI, more info posted on ops@ by Tim Starling ~6 hours ago:

varnishhtcpd daemon (listens on port 4827 for HTCP purges, and converts them to HTTP purges on localhost) deadlocked and stopped working on all upload hosts.
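For readers unfamiliar with the second half of that pipeline, here is a minimal sketch of turning a URL into an HTTP purge against a local cache. It is not taken from varnishhtcpd: the function names, the 127.0.0.1:80 target, and the use of the `PURGE` method are assumptions (Varnish only honours `PURGE` if its VCL is configured to), and parsing of the HTCP packet itself is left out.

```python
import http.client
from urllib.parse import urlsplit


def build_purge(url):
    """Return (method, path, headers) for purging `url` from a local cache."""
    parts = urlsplit(url)
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    # The Host header carries the original site name; the TCP connection
    # itself goes to the local cache (see send_purge).
    return "PURGE", path, {"Host": parts.netloc}


def send_purge(url, cache_host="127.0.0.1", cache_port=80, timeout=2.0):
    """Send the purge to the (assumed) local cache and return the HTTP status."""
    method, path, headers = build_purge(url)
    conn = http.client.HTTPConnection(cache_host, cache_port, timeout=timeout)
    try:
        conn.request(method, path, headers=headers)
        return conn.getresponse().status
    finally:
        conn.close()
```

Keeping `build_purge` separate from the network call makes the URL handling easy to check on its own.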

Details: "Apparently the worker threads deadlock each other in malloc/realloc. Then the queue overflows and the main thread tries to exit. The main thread closes its HTCP listen socket and then joins in with the deadlock. So it never exits and upstart can't respawn it."
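One way to avoid the "main thread joins the deadlock" step is to leave the process immediately on queue overflow, without any cleanup that could block on the stuck workers. The sketch below illustrates that idea in Python; it is an analogy only, not the change in the Gerrit patch, and `PurgeQueue` and its parameters are hypothetical names.

```python
import os
import queue


class PurgeQueue:
    """Hypothetical bounded queue that hard-exits the process on overflow."""

    def __init__(self, maxsize, exit_fn=os._exit):
        self._q = queue.Queue(maxsize=maxsize)
        # os._exit terminates at once: no thread joins, no cleanup handlers,
        # so stuck workers cannot keep the process alive.
        self._exit_fn = exit_fn

    def submit(self, item):
        """Enqueue a purge; on overflow, exit so the init system can respawn."""
        try:
            self._q.put_nowait(item)
            return True
        except queue.Full:
            self._exit_fn(1)
            return False   # only reached when exit_fn is a test stub

    def get(self, timeout=None):
        """Called by workers to fetch the next purge."""
        return self._q.get(timeout=timeout)
```

The trade-off is that queued purges are lost on exit, which is arguably no worse than a daemon that never comes back.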

https://gerrit.wikimedia.org/r/#/c/45302/

This should be fixed now. But the CPU usage is very high, so there may be some packet loss.
