Page MenuHomePhabricator

varnish-frontend-fetcherr: Assert error in vslc_vtx_next, 100% CPU usage
Closed, ResolvedPublic

Description

On all text@esams hosts, varnish-frontend-fetcherr is currently using 100% of one CPU core. There's this in the journal:

May 11 20:12:35 cp3058 varnish-frontend-fetcherr[26530]: Assert error in vslc_vtx_next(), vsl_dispatch.c line 285:
May 11 20:12:35 cp3058 varnish-frontend-fetcherr[26530]:   Condition(c->offset <= c->vtx->len) not true.

That looks like output from varnishlog, which is executed by BaseVarnishLogConsumer. We likely need to add some sort of error handling to main().

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald Transcript
ema triaged this task as Medium priority.May 19 2020, 7:13 AM
ema moved this task from Backlog to Caching on the Traffic board.

Change 597225 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] varnishlog: exit if process terminates

https://gerrit.wikimedia.org/r/597225

Change 597225 merged by Ema:
[operations/puppet@production] varnishlog: exit if process terminates

https://gerrit.wikimedia.org/r/597225

ema claimed this task.

Checking poll() and exiting when it returns something else than None seems to have fixed things. Tested by shooting varnishlog on cp3056:

May 19 08:46:12 cp3056 systemd[1]: varnish-frontend-fetcherr.service: Main process exited, code=exited, status=245/n/a
May 19 08:46:12 cp3056 systemd[1]: varnish-frontend-fetcherr.service: Failed with result 'exit-code'.
May 19 08:46:12 cp3056 systemd[1]: varnish-frontend-fetcherr.service: Consumed 10.113s CPU time.
May 19 08:46:17 cp3056 systemd[1]: varnish-frontend-fetcherr.service: Service RestartSec=5s expired, scheduling restart.
May 19 08:46:17 cp3056 systemd[1]: varnish-frontend-fetcherr.service: Scheduled restart job, restart counter is at 1.
May 19 08:46:17 cp3056 systemd[1]: Stopped Varnish frontend Fetch Errors.
May 19 08:46:17 cp3056 systemd[1]: varnish-frontend-fetcherr.service: Consumed 10.113s CPU time.
May 19 08:46:17 cp3056 systemd[1]: Started Varnish frontend Fetch Errors.
ssingh subscribed.
Apr 06 14:27:14 cp3064 varnishkafka[1513247]:   Condition(c->offset <= c->vtx->len) not true.
Apr 06 14:27:14 cp3064 systemd[1]: varnishkafka-eventlogging.service: Main process exited, code=killed, status=6/ABRT
Apr 06 14:27:14 cp3064 systemd[1]: varnishkafka-eventlogging.service: Failed with result 'signal'.
Apr 06 14:27:15 cp3064 systemd[1]: varnishkafka-eventlogging.service: Scheduled restart job, restart counter is at 5.
Apr 06 14:27:15 cp3064 systemd[1]: Stopped VarnishKafka eventlogging.
Apr 06 14:27:15 cp3064 systemd[1]: Started VarnishKafka eventlogging.
Apr 06 14:27:15 cp3064 varnishkafka[1513293]: Log overrun
Apr 06 14:27:15 cp3064 varnishkafka[1513293]: Log reacquired
Apr 06 14:27:15 cp3064 varnishkafka[1513293]: Log overrun
Apr 06 14:27:16 cp3064 varnishkafka[1513293]: Log reacquired
Apr 06 14:27:16 cp3064 varnishkafka[1513293]: Log overrun
Apr 06 14:27:16 cp3064 varnishkafka[1513293]: Log reacquired
Apr 06 14:27:16 cp3064 varnishkafka[1513293]: Assert error in vslc_vtx_next(), vsl_dispatch.c line 290:
Apr 06 14:27:16 cp3064 varnishkafka[1513293]:   Condition(c->offset <= c->vtx->len) not true.
Apr 06 14:27:16 cp3064 systemd[1]: varnishkafka-eventlogging.service: Main process exited, code=killed, status=6/ABRT
Apr 06 14:27:16 cp3064 systemd[1]: varnishkafka-eventlogging.service: Failed with result 'signal'.
Apr 06 14:27:16 cp3064 systemd[1]: varnishkafka-eventlogging.service: Scheduled restart job, restart counter is at 6.
Apr 06 14:27:16 cp3064 systemd[1]: Stopped VarnishKafka eventlogging.
Apr 06 14:27:16 cp3064 systemd[1]: varnishkafka-eventlogging.service: Start request repeated too quickly.
Apr 06 14:27:16 cp3064 systemd[1]: varnishkafka-eventlogging.service: Failed with result 'signal'.
Apr 06 14:27:16 cp3064 systemd[1]: Failed to start VarnishKafka eventlogging.
Apr 06 14:31:06 cp3064 systemd[1]: Started VarnishKafka eventlogging.

This happened today again, so we should probably check the reason for the resurgence of this error.

Removing inactive task assignee as this task got reopened after 3 years

Change 910005 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/puppet@production] varnish: bump size of varnish shared memory log to 160M

https://gerrit.wikimedia.org/r/910005

Change 910005 merged by Ssingh:

[operations/puppet@production] varnish: bump size of varnish shared memory log to 160M

https://gerrit.wikimedia.org/r/910005

BCornwall removed a project: SRE.
BCornwall moved this task from Backlog to Traffic team actively servicing on the Traffic board.

Change 917346 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/puppet@production] varnish: bump size of varnish shared memory log to 160M (ulsfo)

https://gerrit.wikimedia.org/r/917346

Change 917346 merged by Ssingh:

[operations/puppet@production] varnish: bump size of varnish shared memory log to 160M (ulsfo)

https://gerrit.wikimedia.org/r/917346

Mentioned in SAL (#wikimedia-operations) [2023-05-08T14:37:05Z] <sukhe> sudo cumin -b1 -s1200 'A:cp and A:ulsfo' 'varnish-frontend-restart': T253093

Change 917878 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/puppet@production] varnish: bump size of varnish shared memory log to 160M (esams)

https://gerrit.wikimedia.org/r/917878

Change 917878 merged by Ssingh:

[operations/puppet@production] varnish: bump size of varnish shared memory log to 160M (esams)

https://gerrit.wikimedia.org/r/917878

Mentioned in SAL (#wikimedia-operations) [2023-05-09T13:42:54Z] <sukhe> sudo cumin -b1 -s1200 'A:cp and A:esams' 'varnish-frontend-restart: T253093

Change 918507 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/puppet@production] varnish: bump size of varnish shared memory log to 160M (drmrs)

https://gerrit.wikimedia.org/r/918507

Change 918507 merged by Ssingh:

[operations/puppet@production] varnish: bump size of varnish shared memory log to 160M (drmrs)

https://gerrit.wikimedia.org/r/918507

Mentioned in SAL (#wikimedia-operations) [2023-05-10T16:02:20Z] <sukhe> sudo cumin -b1 -s1200 'A:cp and A:drmrs' 'varnish-frontend-restart': T253093

Change 919074 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/puppet@production] varnish: bump size of varnish shared memory log to 160M (codfw)

https://gerrit.wikimedia.org/r/919074

Change 919074 merged by Ssingh:

[operations/puppet@production] varnish: bump size of varnish shared memory log to 160M (codfw)

https://gerrit.wikimedia.org/r/919074

Mentioned in SAL (#wikimedia-operations) [2023-05-11T14:14:58Z] <sukhe> sudo cumin -b1 -s1200 'A:cp and A:codfw' 'varnish-frontend-restart': T253093

Change 919327 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/puppet@production] varnish: bump size of varnish shared memory log to 160M (eqiad)

https://gerrit.wikimedia.org/r/919327

Change 919327 merged by Ssingh:

[operations/puppet@production] varnish: bump size of varnish shared memory log to 160M (eqiad)

https://gerrit.wikimedia.org/r/919327

Mentioned in SAL (#wikimedia-operations) [2023-05-12T13:22:45Z] <sukhe> sudo cumin -b1 -s1200 'A:cp and A:eqiad' 'varnish-frontend-restart': T253093

We have bumped the size of the shared memory log and rolled out the changes to all sites by restarting varnish-frontend. Marking this as closed.

Change 958923 had a related patch set uploaded (by BBlack; author: BBlack):

[operations/puppet@production] varnish: unify vsl_size to new default

https://gerrit.wikimedia.org/r/958923

Change 958923 merged by BBlack:

[operations/puppet@production] varnish: unify vsl_size to new default

https://gerrit.wikimedia.org/r/958923