Page MenuHomePhabricator

Scripts depending on varnishlog.py maxing out CPU usage on cache_misc
Closed, ResolvedPublic

Description

varnishxcache, varnishxcps and other scripts depending on varnishlog.py are showing very high cpu usage on cache_misc machines (running varnish 4). On cache_maps systems, also running varnish 4, the CPU usage is normal.

Event Timeline

ema triaged this task as High priority.Jun 6 2016, 3:34 PM

Which hosts were exhibiting this? The first one I looked at (cp4001) seems normal.

I found a few. It seems to be all the esams hosts, plus two of the eqiad hosts ( cp1051, cp1061 ).

The issue is reproducible on hosts running vhtcpd (eg: cp1061). Normal CPU usage on hosts *not* running vhtcpd (eg: cp1058).

So, besides the fact that we should check why vhtcpd is running on some misc nodes and not on others, this seems to be a scalability issue in the new varnishlog.py using python-varnishapi.

I've tried a simple ab -t60 http://cp1058.eqiad.wmnet/ and observed that varnihslog.py-dependant scripts quickly started increasing their CPU usage. That does not happen running ab against pinkunicorn, which of course uses the old varnishlog.py.

Change 293123 had a related patch set uploaded (by Ema):
varnishlog4.py: log errors in execute()

https://gerrit.wikimedia.org/r/293123

It looks like the culprit might be lack of grouping at the varnishlog level.

The following experiment is currently ongoing on cp1061:

  • Added logging of errors to varnishlog.py
  • Hacked -g request into varnishxcache
  • Restarted all services depending on varnishlog.py: varnishxcache, varnishreqstats, varnishxcps, varnishstatsd

So far, the CPU usage of varnishxcache seems acceptable.

Change 293530 had a related patch set uploaded (by Ema):
varnishlog4: default to request grouping

https://gerrit.wikimedia.org/r/293530

Change 293530 merged by Ema:
varnishlog4: default to request grouping

https://gerrit.wikimedia.org/r/293530

Change 293123 merged by Ema:
varnishlog4.py: log errors in execute()

https://gerrit.wikimedia.org/r/293123

ema claimed this task.

Grouping transactions by requests solved the problem, confirmed on cp1051 and cp1061. Closing.