
python-varnishapi daemons seeing "Log overrun" constantly
Closed, Resolved · Public

Description

Example from cp1068 (sudo journalctl -u $service --since '1 hour ago'):

Nov 25 15:13:42 cp1068 varnishreqstats[24641]: Error in execute(): Log reacquired
Nov 25 15:13:45 cp1068 varnishreqstats[24641]: Error in execute(): Log overrun
Nov 25 15:13:45 cp1068 varnishreqstats[24641]: Error in execute(): Log reacquired
Nov 25 15:13:47 cp1068 varnishreqstats[24641]: Error in execute(): Log overrun
Nov 25 15:13:47 cp1068 varnishreqstats[24641]: Error in execute(): Log reacquired

Nov 25 15:16:59 cp1068 varnishxcache[19844]: Error in execute(): Log reacquired
Nov 25 15:17:02 cp1068 varnishxcache[19844]: Error in execute(): Log overrun
Nov 25 15:17:02 cp1068 varnishxcache[19844]: Error in execute(): Log reacquired
Nov 25 15:17:06 cp1068 varnishxcache[19844]: Error in execute(): Log overrun

Nov 25 15:16:44 cp1068 varnishxcps[19840]: Error in execute(): Log reacquired
Nov 25 15:16:47 cp1068 varnishxcps[19840]: Error in execute(): Log overrun
Nov 25 15:16:48 cp1068 varnishxcps[19840]: Error in execute(): Log reacquired
Nov 25 15:16:51 cp1068 varnishxcps[19840]: Error in execute(): Log overrun
Nov 25 15:16:51 cp1068 varnishxcps[19840]: Error in execute(): Log reacquired
Nov 25 15:16:55 cp1068 varnishxcps[19840]: Error in execute(): Log overrun

From https://info.varnish-software.com/blog/varnish-shared-memory-log-errors-and-solutions:

Cursor initialization failure: This error happens when varnishd is writing to the Shared Memory Log faster than the program that is trying to read from it. Since the VSL is a circular buffer Varnish can overrun it and if it happens, the program reading the VSL will end up with inconsistent data. To be honest I’ve never seen anyone affected by this issue, but the solution would be to have a persistent VSL so the slow program accessing it will have enough time to read it with consistency.
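To make the failure mode concrete, here is a toy illustration (plain Python, not Varnish code) of a fixed-size ring buffer where a fast writer laps a slow reader, which is essentially what "Log overrun" means for the VSL:

# Toy ring buffer: a fast writer laps a slow reader, which is the essence
# of the "Log overrun" condition. Purely illustrative, not Varnish code.
SIZE = 8
ring = [None] * SIZE
write_seq = 0  # total records written so far
read_seq = 0   # next record the reader expects

def write(record):
    global write_seq
    ring[write_seq % SIZE] = (write_seq, record)
    write_seq += 1

def read():
    global read_seq
    seq, record = ring[read_seq % SIZE]
    if seq != read_seq:
        # The writer has overwritten the slot we wanted: overrun.
        raise RuntimeError('Log overrun')
    read_seq += 1
    return record

for i in range(20):      # the writer produces 20 records...
    write('req-%d' % i)

try:
    read()               # ...before the reader gets a chance to catch up
except RuntimeError as e:
    print(e)             # -> Log overrun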

Varnish code major suspects:

https://github.com/varnishcache/varnish-cache/blob/4.1/lib/libvarnishapi/vsl_cursor.c#L130
https://github.com/varnishcache/varnish-cache/blob/4.1/lib/libvarnishapi/vsl_cursor.c#L201

Event Timeline

elukey triaged this task as High priority. Nov 25 2016, 4:09 PM
elukey renamed this task from varnishlog daemons seeing Log overrun constantly to python-varnishapi daemons seeing "Log overrun" constantly. Nov 25 2016, 4:19 PM

Change 323829 had a related patch set uploaded (by Ema):
varnish: make python varnish scripts skip PURGE requests

https://gerrit.wikimedia.org/r/323829

@elukey and I have been working on this a bit on Friday. Specifically, we tried bumping vsl_space:

vsl_space
       · Units: bytes

       · Default: 80M

       · Minimum: 1M

       · Flags: must_restart

    The amount of space to allocate for the VSL fifo buffer in the VSM memory segment.  If you make this too small,
    varnish{ncsa|log} etc will not be able to keep up.  Making it too large just costs memory resources.

Setting the value to 512M didn't seem to make much of a difference. We then noticed another varnishd flag, -l:

-l <vsl[,vsm]>
       Specifies  size  of  shmlog  file.  vsl is the space for the VSL records [80M] and vsm is the space for stats
       counters [1M]. Scaling suffixes like 'K' and 'M' can be used up to (G)igabytes.  Default is 81 Megabytes.

By restarting varnishd with -p vsl_space=512M -l 512M,1M we observed that the problem was still present, although it seemed to occur less often. Note that the /var/lib/varnish tmpfs is only 512M, so we had to increase it by hand (to 1G).

For the python scripts that do not need to take PURGE requests into account, it seems like a good idea to filter them out with a VSL query similar to the one used by varnishkafka: ReqMethod ne "PURGE".

I don't think we can do that for varnishreqstats though. Let's see if filtering out PURGE requests helps for the other scripts in the meantime.

Change 323829 merged by Ema:
varnish: make python varnish scripts skip PURGE requests

https://gerrit.wikimedia.org/r/323829

I've confirmed with stap that the overruns come from vslc_vsm_next, which in turn calls vslc_vsm_check: https://github.com/varnishcache/varnish-cache/blob/4.1/lib/libvarnishapi/vsl_cursor.c#L107

Increasing the vsl buffer size (-l) should improve the situation; we should try again.

See also: https://github.com/varnishcache/varnish-cache/issues/1873

Mentioned in SAL (#wikimedia-operations) [2016-11-30T12:55:17Z] <ema> bumping vsl log buffer on cp3032 (depooled) -- T151643

Quick summary from today's tests:

  • most of the time is spent calling _callBack() in varnishapi.py:879
  • regarding the two while 1: loops inside the _callBack() function:
    • most of the time is spent calling the bound functions in libvarnishapi.so.1, as expected
    • for 100k calls to execute() in varnishlog.py:94 there are ~3.1M calls to the first lines of the inner loop. I still need to check why this happens, but I think it is needed given how the API works. Not sure if there is room for improvement here.

Some speedup should be achievable on our side of the code by trying:

  • to port it to Python 3, which has a lot of performance improvements in general
  • to verify whether using collections.Counter instead of a plain dictionary for the default_counts in varnishreqstats improves performance (only on Python 3, where Counter was ported to C)
  • to inline the valid_http_methods() and is_valid_http_status() code into the only place where they are called, avoiding an extra function call; they are simple enough to fit in one line
  • to rewrite is_valid_http_status() for better performance; for example, by working directly on the string value I've got a 24% improvement so far, and it could surely be improved further (see the sketch below)
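As a rough sketch of the last two points (the function name comes from the bullets above, but its real signature and the exact counter keys in varnishreqstats are assumptions on my part):

from collections import Counter

# Hypothetical, simplified variants of the helpers mentioned above.

def is_valid_http_status(status):
    # Work on the string value directly instead of converting to int first;
    # here a "valid" status is assumed to be any 3-digit code from 1xx to 5xx.
    return len(status) == 3 and status.isdigit() and status[0] in '12345'

# Counter instead of a plain dict for the per-interval counts.
counts = Counter()

status = '204'
if is_valid_http_status(status):
    counts['client.status.' + status] += 1        # key names are made up here
    counts['client.status.%sxx' % status[0]] += 1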

But I'm not sure whether this would give enough of a speedup to cope with the traffic and avoid being overrun by Varnish's circular log.
The other obvious approach is of course to move away from Python, which is surely not the quickest option for this kind of workload, to some lower-level language with a good, optimized Varnish API library.
Another alternative would be to move as much as possible of the looping code in varnishapi.py to C, either manually as an extension module or with Cython.

Here are some raw results:

  • From cp1065 with cProfile, 1M calls
       16195753 function calls (16194418 primitive calls) in 65.830 seconds

 Ordered by: internal time

 ncalls  tottime  percall  cumtime  percall filename:lineno(function)
 187804   41.741    0.000   58.305    0.000 varnishapi.py:879(_callBack)
1000000    5.386    0.000   64.275    0.000 varnishapi.py:822(__cbMain)
1291791    4.427    0.000    5.535    0.000 varnishreqstats:148(count_vsl_entry_callback)
1291791    2.358    0.000    2.358    0.000 __init__.py:505(string_at)
1291791    1.806    0.000    8.042    0.000 varnishapi.py:635(VSL_DATA)
1291791    1.772    0.000    7.887    0.000 varnishlog.py:110(vap_callback)
1291963    1.703    0.000    1.703    0.000 {_codecs.utf_8_decode}
1291963    1.449    0.000    3.878    0.000 {method 'decode' of 'str' objects}
      1    0.814    0.814   65.795   65.795 varnishlog.py:94(execute)
     21    0.752    0.036    0.752    0.036 {time.sleep}
  • From cp1008 with line_profiler on _callBack() function in varnishapi.py:898
Timer unit: 1e-06 s

Total time: 37.6501 s
File: /home/volans/varnishapi.py
Function: _callBack at line 898

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   898                                               @do_profile(profiler=profiler)
   899                                               def _callBack(self, vsl, pt, fo):
   900     49201        61590      1.3      0.2          idx = -1
   901     49201        48493      1.0      0.1          while 1:
   902     98402        95028      1.0      0.3              idx += 1
   903     98402       126328      1.3      0.3              t = pt[idx]
   904     98402       140824      1.4      0.4              if not bool(t):
   905     49201        48341      1.0      0.1                  break
   906     49201        58135      1.2      0.2              tra = t[0]
   907     49201        47526      1.0      0.1              cbd = {
   908     49201        71409      1.5      0.2                  'level': tra.level,
   909     49201        60115      1.2      0.2                  'vxid': tra.vxid,
   910     49201        54834      1.1      0.1                  'vxid_parent': tra.vxid_parent,
   911     49201        54446      1.1      0.1                  'reason': tra.reason,
   912                                                       }
   913
   914
   915     49201        45454      0.9      0.1              while 1:
   916   3101494      7442718      2.4     19.8                  i = self.lva.VSL_Next(tra.c)
   917   3101494      3196813      1.0      8.5                  if i < 0:
   918                                                               return (i)
   919   3101494      2808849      0.9      7.5                  if i == 0:
   920     49201        50110      1.0      0.1                      break
   921   3052293      7921133      2.6     21.0                  if not self.lva.VSL_Match(self.vsl, tra.c):
   922   2757087      2664516      1.0      7.1                      continue
   923
   924                                                           # decode vxid type ...
   925    295206       554288      1.9      1.5                  ptr = tra.c[0].rec.ptr
   926    295206       494433      1.7      1.3                  cbd['length'] = ptr[0] & 0xffff
   927    295206       767859      2.6      2.0                  cbd['tag'] = self.VSL_TAG(ptr)
   928    295206       395277      1.3      1.0                  if ptr[1] & (1 << 30):
   929    295206       314171      1.1      0.8                      cbd['type'] = 'c'
   930                                                           elif ptr[1] & (1 << 31):
   931                                                               cbd['type'] = 'b'
   932                                                           else:
   933                                                               cbd['type'] = '-'
   934    295206       468138      1.6      1.2                  cbd['isbin'] = self.VSL_tagflags[cbd['tag']] & self.defi.SLT_F_BINARY
   935    295206       468974      1.6      1.2                  isbin = cbd['isbin'] == self.defi.SLT_F_BINARY or not self.dataDecode
   936    295206      2966224     10.0      7.9                  cbd['data'] = self.VSL_DATA(ptr, isbin)
   937
   938    295206       393203      1.3      1.0                  if self._cb:
   939    295206      5788628     19.6     15.4                      self._cb(self, cbd, self._priv)
   940     49201        42214      0.9      0.1          return(0)

With python-varnishapi, varnishlog.py and friends we're essentially solving the (already solved) problem of reading from VSM, and we're not doing that efficiently enough.

As a PoC, I've tried modifying varnishxcps by using varnishncsa instead. The modifications are pretty minimal and it seems to work just fine.

This is basically what my experiment boils down to:

cmd = ["varnishncsa", "-n", "frontend",
       "-q", 'ReqMethod ne "PURGE"',
       "-F", '%{X-Connection-Properties}i']

p = Popen(cmd, stdout=PIPE, bufsize=-1)

while True:
    line = p.stdout.readline()
    vsl_callback(line)

On an upload host in esams, cp3034, the script's CPU usage is ~2%, and varnishncsa's resource usage is similar. It seems to me this is an approach worth exploring further.
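For completeness, here is a slightly extended sketch of the same approach with periodic counter flushing; all names and the flush logic are illustrative assumptions, not the actual cachestats.py code:

import time
from collections import Counter
from subprocess import PIPE, Popen

# Illustrative sketch only: not the real implementation.
cmd = ["varnishncsa", "-n", "frontend",
       "-q", 'ReqMethod ne "PURGE"',
       "-F", '%{X-Connection-Properties}i']

counts = Counter()
last_flush = time.time()
FLUSH_INTERVAL = 10  # seconds, arbitrary for this example

p = Popen(cmd, stdout=PIPE, bufsize=-1)
while True:
    line = p.stdout.readline()
    if not line:
        break  # varnishncsa exited
    # The real per-daemon callback would parse the line here; we just count.
    counts[line.strip()] += 1
    if time.time() - last_flush >= FLUSH_INTERVAL:
        # The real script would format the counters and send them to statsd.
        counts.clear()
        last_flush = time.time()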

Change 326965 had a related patch set uploaded (by Ema):
varnishxcps: use varnishncsa to read log entries from the VSM

https://gerrit.wikimedia.org/r/326965

Change 326965 merged by Ema:
varnishxcps: use varnishncsa to read log entries from the VSM

https://gerrit.wikimedia.org/r/326965

Change 327504 had a related patch set uploaded (by Ema):
varnish cachestats.py: make key_prefix configurable

https://gerrit.wikimedia.org/r/327504

Change 327511 had a related patch set uploaded (by Ema):
varnish cachestats.py: strip final newline

https://gerrit.wikimedia.org/r/327511

Change 327513 had a related patch set uploaded (by Ema):
varnishmedia: port to cachestats.CacheStatsSender

https://gerrit.wikimedia.org/r/327513

Change 327516 had a related patch set uploaded (by Ema):
varnishxcps: subscribe to cachestats.py

https://gerrit.wikimedia.org/r/327516

Change 327516 merged by Ema:
varnishxcps: subscribe to cachestats.py

https://gerrit.wikimedia.org/r/327516

Change 327504 merged by Ema:
varnish cachestats.py: make key_prefix configurable

https://gerrit.wikimedia.org/r/327504

Change 327511 merged by Ema:
varnish cachestats.py: strip final newline

https://gerrit.wikimedia.org/r/327511

Change 327513 merged by Ema:
varnishmedia: port to cachestats.CacheStatsSender

https://gerrit.wikimedia.org/r/327513

Change 327733 had a related patch set uploaded (by Ema):
varnishrls: port to cachestats.CacheStatsSender

https://gerrit.wikimedia.org/r/327733

Change 327733 merged by Ema:
varnishrls: port to cachestats.CacheStatsSender

https://gerrit.wikimedia.org/r/327733

Change 328179 had a related patch set uploaded (by Ema):
varnishxcache: port to cachestats.CacheStatsSender

https://gerrit.wikimedia.org/r/328179

Change 328180 had a related patch set uploaded (by Ema):
varnish: remove varnishprocessor

https://gerrit.wikimedia.org/r/328180

Change 328368 had a related patch set uploaded (by Ema):
varnish cachestats.py: add support for defaults and key_prefix

https://gerrit.wikimedia.org/r/328368

Change 328180 merged by Ema:
varnish: remove varnishprocessor

https://gerrit.wikimedia.org/r/328180

Change 328368 merged by Ema:
varnish cachestats.py: add support for defaults and key_prefix

https://gerrit.wikimedia.org/r/328368

Change 328688 had a related patch set uploaded (by Ema):
WIP: varnishreqstats: port to cachestats.CacheStatsSender

https://gerrit.wikimedia.org/r/328688

Change 328179 merged by Ema:
varnishxcache: port to cachestats.CacheStatsSender

https://gerrit.wikimedia.org/r/328179

Because of the Log overrun issue we are actually losing quite a lot of information. I'm now comparing the values produced by the new implementation of varnishreqstats against the old one. On cp2019, a not-so-busy machine currently not affected by overruns, the metrics produced by the two implementations are very similar. The first column is the metric name, the second is the value from the "old" varnishreqstats using python-varnishapi, and the third is the value from the new version using varnishncsa:

client.status.ok 27259 27092
client.status.204 25516 25517
client.method.purge 25510 25511

On cp3040, which is currently quite busy and showing frequent overruns in the logs:

client.status.ok 44570 69630
client.status.204 16366 26958
client.method.purge 15872 26101

These values were obtained by running both implementations of varnishreqstats for 10 seconds and comparing the results. Two overruns were logged while running the python-varnishapi based implementation.

Change 328688 merged by Ema:
varnishreqstats: port to cachestats.CacheStatsSender

https://gerrit.wikimedia.org/r/328688

Change 330668 had a related patch set uploaded (by Ema):
varnishstatsd: port to cachestats.CacheStatsSender

https://gerrit.wikimedia.org/r/330668

Change 330668 merged by Ema:
varnishstatsd: port to cachestats.CacheStatsSender

https://gerrit.wikimedia.org/r/330668

Change 333296 had a related patch set uploaded (by Volans):
varnishstatsd: temporary fix to avoid crashes

https://gerrit.wikimedia.org/r/333296

Change 333296 merged by Volans:
varnishstatsd: temporary fix to avoid crashes

https://gerrit.wikimedia.org/r/333296

Change 333591 had a related patch set uploaded (by Ema):
varnishstatsd: exclude piped requests

https://gerrit.wikimedia.org/r/333591

Change 333591 merged by Ema:
varnishstatsd: exclude piped requests

https://gerrit.wikimedia.org/r/333591

Logged in to cp4006 and I can see "Jul 19 12:03:09 cp4006 varnishstatsd[2413]: Log overrun" in the syslog, so nope :(

Yeah, all other daemons have been fixed but varnishstatsd seems to still be affected by this issue.

Incidentally, while looking at entirely different stuff on esams recdns hosts, I've noticed that the vast majority of our DNS traffic there is due to cache hosts continuously asking for the A record of statsd.eqiad.wmnet.

So it looks like the varnishstatsd overruns occur mostly in ulsfo:

$ sudo grep -F 'Log overru' /srv/syslog/syslog.log | grep cp4 -c
20737
18:58:30 ema@wezen.codfw.wmnet:~
$ sudo grep -F 'Log overru' /srv/syslog/syslog.log | grep -v cp4 -c
568

A big difference between ulsfo and other DCs is that there's no local recursive DNS there.

That, together with the observation in the comment above about the huge number of DNS queries for statsd.eqiad.wmnet hitting our recursive DNS servers, made me try hardcoding the statsd server IP into varnishstatsd-default.service on cp4017, and sure enough the overruns stopped immediately.

For the record, I've sampled 5 seconds of DNS traffic on maerlant and acamar. On the former, 835 out of 864 requests concerned statsd.eqiad.wmnet. On acamar it's 2656 out of 5442.
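The gist of the fix, as a minimal sketch (the statsd port and function names here are assumptions; the real change lives in cachestats.py, see the patch below):

import socket

STATSD_HOST = 'statsd.eqiad.wmnet'  # hostname as mentioned above
STATSD_PORT = 8125                  # assumed port, the usual statsd default

# Resolve the statsd hostname once at startup and keep using the resulting
# IP, so that each UDP send does not trigger a fresh DNS lookup.
statsd_addr = (socket.gethostbyname(STATSD_HOST), STATSD_PORT)

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

def send_metric(name, value):
    # Sending to an IP address avoids the per-packet A record query that
    # was flooding the recursive DNS servers.
    sock.sendto(('%s:%s|c' % (name, value)).encode(), statsd_addr)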

Change 366564 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] varnish cachestats.py: cache statsd server IP

https://gerrit.wikimedia.org/r/366564

Change 366564 merged by Ema:
[operations/puppet@production] varnish cachestats.py: cache statsd server IP

https://gerrit.wikimedia.org/r/366564

ema claimed this task.

The last overrun was logged about half an hour ago.

Jul 21 14:58:04 cp4009 varnishstatsd[46915]: Log overrun

Closing.