Scripts depending on varnishlog.py maxing out CPU usage on cache_misc
Closed, ResolvedPublic
Actions

Assigned To

Authored By

	• ema
	Jun 6 2016, 3:30 PM

Description

varnishxcache, varnishxcps and other scripts depending on varnishlog.py are showing very high cpu usage on cache_misc machines (running varnish 4). On cache_maps systems, also running varnish 4, the CPU usage is normal.

Details

	Subject	Repo	Branch	Lines +/-
	varnishlog4.py: log errors in execute()	operations/puppet	production	+5 -0
	varnishlog4: default to request grouping	operations/puppet	production	+22 -0

Customize query in gerrit

Related Objects

Mentioned In: rOPUP3b011926fcf1: varnishlog4.py: log errors in execute()
rOPUPd9fc5db72389: varnishlog4.py: log errors in execute()
rOPUPc4bd8a20a5d7: varnishlog4: default to request grouping
rOPUP94be41eee654: varnishlog4.py: log errors in execute()
rOPUPfc9d61ff95d1: varnishlog4: default to request grouping

Event Timeline

• ema created this task.Jun 6 2016, 3:30 PM

Restricted Application added subscribers: Zppix, Aklapper. · View Herald TranscriptJun 6 2016, 3:30 PM

• ema triaged this task as High priority.Jun 6 2016, 3:34 PM

Which hosts were exhibiting this? The first one I looked at (cp4001) seems normal.

I found a few. It seems to be all the esams hosts, plus two of the eqiad hosts ( cp1051, cp1061 ).

The issue is reproducible on hosts running vhtcpd (eg: cp1061). Normal CPU usage on hosts *not* running vhtcpd (eg: cp1058).

So, besides the fact that we should check why vhtcpd is running on some misc nodes and not on others, this seems to be a scalability issue in the new varnishlog.py using python-varnishapi.

I've tried a simple ab -t60 http://cp1058.eqiad.wmnet/ and observed that varnihslog.py-dependant scripts quickly started increasing their CPU usage. That does not happen running ab against pinkunicorn, which of course uses the old varnishlog.py.

Change 293123 had a related patch set uploaded (by Ema):
varnishlog4.py: log errors in execute()

https://gerrit.wikimedia.org/r/293123

gerritbot added a project: Patch-For-Review.Jun 7 2016, 4:56 PM

It looks like the culprit might be lack of grouping at the varnishlog level.

The following experiment is currently ongoing on cp1061:

Added logging of errors to varnishlog.py
Hacked -g request into varnishxcache
Restarted all services depending on varnishlog.py: varnishxcache, varnishreqstats, varnishxcps, varnishstatsd

So far, the CPU usage of varnishxcache seems acceptable.

Change 293530 had a related patch set uploaded (by Ema):
varnishlog4: default to request grouping

https://gerrit.wikimedia.org/r/293530

Change 293530 merged by Ema:
varnishlog4: default to request grouping

https://gerrit.wikimedia.org/r/293530

• ema mentioned this in rOPUPfc9d61ff95d1: varnishlog4: default to request grouping.Jun 10 2016, 9:05 AM

Change 293123 merged by Ema:
varnishlog4.py: log errors in execute()

https://gerrit.wikimedia.org/r/293123

• ema mentioned this in rOPUP94be41eee654: varnishlog4.py: log errors in execute().Jun 10 2016, 9:05 AM