varnishxcache, varnishxcps, and other scripts depending on varnishlog.py are showing very high CPU usage on cache_misc machines (running varnish 4). On cache_maps systems, also running varnish 4, CPU usage is normal.
Related Objects
Mentioned In
- rOPUP3b011926fcf1: varnishlog4.py: log errors in execute()
- rOPUPd9fc5db72389: varnishlog4.py: log errors in execute()
- rOPUPc4bd8a20a5d7: varnishlog4: default to request grouping
- rOPUP94be41eee654: varnishlog4.py: log errors in execute()
- rOPUPfc9d61ff95d1: varnishlog4: default to request grouping
Event Timeline
I found a few. It seems to be all the esams hosts, plus two of the eqiad hosts (cp1051, cp1061).
The issue is reproducible on hosts running vhtcpd (e.g. cp1061). CPU usage is normal on hosts *not* running vhtcpd (e.g. cp1058).
So, besides the fact that we should check why vhtcpd is running on some misc nodes and not on others, this looks like a scalability issue in the new varnishlog.py, which uses python-varnishapi.
I've tried a simple ab -t60 http://cp1058.eqiad.wmnet/ and observed that the varnishlog.py-dependent scripts quickly started increasing their CPU usage. That does not happen when running ab against pinkunicorn, which of course uses the old varnishlog.py.
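The scaling intuition behind this can be sketched as follows. This is an illustrative back-of-the-envelope model, not actual varnishapi code, and the number of VSL records per request is an assumption:

```python
# Illustration (not real varnishapi code): without grouping, the Python
# callback fires once per VSL log record; with request grouping it fires
# once per request, so callback volume no longer scales with log verbosity.
RECORDS_PER_REQUEST = 30  # rough typical record count per request (assumption)

def callback_invocations(requests, grouped):
    """Number of Python callback invocations for a given request volume."""
    return requests if grouped else requests * RECORDS_PER_REQUEST

print(callback_invocations(1000, grouped=False))  # 30000
print(callback_invocations(1000, grouped=True))   # 1000
```

Under load (such as the ab run above), the ungrouped callback count grows with every extra log record per request, which is consistent with the CPU usage climbing as traffic increases.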
Change 293123 had a related patch set uploaded (by Ema):
varnishlog4.py: log errors in execute()
It looks like the culprit might be the lack of grouping at the varnishlog level.
The following experiment is currently ongoing on cp1061:
- Added logging of errors to varnishlog.py
- Hacked -g request into varnishxcache
- Restarted all services depending on varnishlog.py: varnishxcache, varnishreqstats, varnishxcps, varnishstatsd
So far, the CPU usage of varnishxcache seems acceptable.
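The "-g request" hack above can be sketched roughly like this. The helper name and option-list shape are illustrative, not the actual varnishxcache code; wrappers around the VSL API typically build an argv-like option list:

```python
# Hypothetical sketch: build an argv-style option list for a varnishlog-based
# tool. Passing "-g request" asks the VSL API to group records per request
# before handing them to the (comparatively slow) Python callback.
def build_vsl_opts(query=None, grouping="request"):
    opts = ["-g", grouping]    # group transactions by request
    if query:
        opts += ["-q", query]  # optional VSL query filter
    return opts

print(build_vsl_opts("ReqMethod eq GET"))
# ['-g', 'request', '-q', 'ReqMethod eq GET']
```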
Change 293530 had a related patch set uploaded (by Ema):
varnishlog4: default to request grouping
Grouping transactions by request solved the problem, confirmed on cp1051 and cp1061. Closing.