
Traffic (text) instability due to misbehaving cache server (cp1077), causing a 1.5-2% requests failing
Closed, ResolvedPublic

Assigned To: ema
Authored By: jcrespo · Mar 8 2019, 12:54 PM
Referenced Files
F28373209: Screenshot from 2019-03-12 10-01-00.png
Mar 12 2019, 9:33 AM
F28372980: Screenshot from 2019-03-12 09-30-53.png
Mar 12 2019, 8:40 AM
F28372982: Screenshot from 2019-03-12 09-30-29.png
Mar 12 2019, 8:40 AM
F28347569: varnishPURGEspikes.png
Mar 8 2019, 1:08 PM
F28347567: Screenshot 2019-03-08 at 14.05.19.png
Mar 8 2019, 1:05 PM
F28347557: 5XX.png
Mar 8 2019, 1:03 PM
F28347561: requests.png
Mar 8 2019, 1:03 PM
F28347559: cp1077.png
Mar 8 2019, 1:03 PM

Description

traffic_issues.png

5XX.png

Increase in traffic, believed to be purges (those seem recurring and unrelated):

requests.png

cp1077 was depooled just in case:

cp1077.png

Event Timeline

Restricted Application added a subscriber: Aklapper.

Screenshot 2019-03-08 at 14.05.19.png
It does indeed look like purge requests.


I updated the comment: those spikes seem to be recurring; there was another one at 8:00 that caused a traffic increase, but no errors then.

The spike of PURGE requests to the Varnish text frontends seems to be recurring. A view over 24 hours from https://grafana.wikimedia.org/d/000000180/varnish-http-requests?panelId=7&fullscreen&orgId=1&from=now-24h&to=now

varnishPURGEspikes.png
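
For reference, the method mix can also be checked on a cache host directly; a quick sketch, assuming shell access to a cp host with the standard Varnish 4/5 tooling (the "-n frontend" instance name is an assumption about the local setup):

# Live distribution of client request methods on the frontend instance:
varnishtop -n frontend -i ReqMethod

# Rough PURGE count over a ten-second window:
timeout 10 varnishncsa -n frontend -q 'ReqMethod eq "PURGE"' | wc -l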

So I guess we can dismiss those PURGE spikes as the root cause.

cp1077 effectively depooled at 13:09 UTC
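
For the record, depooling goes through conftool; a minimal sketch of the equivalent commands (the exact selector and the depool wrapper are assumptions about the local setup):

# Mark cp1077 as not pooled in conftool/etcd:
confctl select 'name=cp1077.eqiad.wmnet' set/pooled=no

# Hosts also carry a depool wrapper that does the same for all
# services on the host:
depool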

jcrespo renamed this task from Traffic (text) instability due to unknown cause, causing a 1.5-2% requests failing to Traffic (text) instability due to misbehaving cache server (cp1077), causing a 1.5-2% requests failing. · Mar 8 2019, 1:40 PM
ema triaged this task as Medium priority. (Edited) · Mar 12 2019, 8:40 AM

At the time of the issue, cp1077 was failing to fetch objects from its origin servers and was heavily affected by mbox lag (a growing backlog between objects handed to Varnish's expiry thread and objects the thread has actually processed).

Screenshot from 2019-03-12 09-30-53.png
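
The lag is visible in varnishstat; a minimal sketch of how to read it on the backend instance, assuming the standard Varnish 4 counter names:

# mbox lag = objects mailed to the expiry thread minus objects
# the expiry thread has processed so far:
varnishstat -1 -f MAIN.exp_mailed -f MAIN.exp_received |
  awk '{v[$1]=$2} END {print v["MAIN.exp_mailed"] - v["MAIN.exp_received"]}'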

varnish-be was not really managing to evict objects at that time.

Screenshot from 2019-03-12 09-30-29.png

This looks very much like the known Varnish scalability issue we have been dealing with for some time and working around by restarting varnish.service from cron. See, among others: T145661, T175803, and T181315. cp1077 was just a few hours away from its twice-weekly varnish restart, scheduled at 18:52 UTC.

Re-pooling the service caused the issue to show up again. For some reason, the cron jobs restarting varnish.service do not seem to have worked, although cron did log two runs, one on Mar 08 and one on Mar 12:

Mar 08 18:52:01 cp1077 CRON[275180]: (root) CMD (/usr/local/sbin/run-no-puppet /usr/local/sbin/varnish-backend-restart > /dev/null)
Mar 12 06:52:01 cp1077 CRON[7582]: (root) CMD (/usr/local/sbin/run-no-puppet /usr/local/sbin/varnish-backend-restart > /dev/null)
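
Those timestamps (Tue 06:52 and Fri 18:52) imply a crontab along these lines; this is a hypothetical reconstruction consistent with the log, not the actual puppet-managed entry:

# Twice-weekly varnish backend restarts (root's crontab):
52 6  * * 2 /usr/local/sbin/run-no-puppet /usr/local/sbin/varnish-backend-restart > /dev/null
52 18 * * 5 /usr/local/sbin/run-no-puppet /usr/local/sbin/varnish-backend-restart > /dev/null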

The number of cached backend objects did not decrease, indicating that the service was not actually restarted.

Screenshot from 2019-03-12 10-01-00.png
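
A direct way to verify the same thing from the host, assuming the standard counter name:

# Cached object count on the backend instance; a genuine restart
# empties the cache, so this should drop to ~0 right afterwards:
varnishstat -1 -f MAIN.n_object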

The journal confirms:

root@cp1077:~# journalctl --since='8 days ago' -u varnish.service | grep -B1 'Starting varnish' 
Mar 05 06:52:44 cp1077 systemd[1]: Stopped varnish (Varnish HTTP Accelerator).
Mar 05 06:53:30 cp1077 systemd[1]: Starting varnish (Varnish HTTP Accelerator)...
--
Mar 12 08:48:54 cp1077 systemd[1]: Stopped varnish (Varnish HTTP Accelerator).
Mar 12 08:49:44 cp1077 systemd[1]: Starting varnish (Varnish HTTP Accelerator)...

Change 495862 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] varnish-backend-restart: log execution to syslog

https://gerrit.wikimedia.org/r/495862

> Re-pooling the service caused the issue to show up again. For some reason, the cron jobs restarting varnish.service do not seem to have worked.

The reason is that cp1077 was depooled on Mar 08 at 13:09, and varnish-backend-restart exits silently if the service is depooled. Hence, the 18:52 cron job did not restart varnish.
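
To make the failure mode concrete, a minimal sketch of the guard logic; is-pooled, depool, and pool are hypothetical helper names standing in for the real conftool checks, and the syslog lines reflect the logging added by the patch above:

#!/bin/bash
# Sketch only; the real varnish-backend-restart lives in operations/puppet.
if ! is-pooled varnish-be; then
    # Before the patch this exit was silent, so a host depooled at cron
    # time (like cp1077 on Mar 08) was skipped with no trace in the logs.
    logger -t varnish-backend-restart "varnish-be depooled, skipping restart"
    exit 0
fi
logger -t varnish-backend-restart "depooling and restarting varnish-be"
depool
systemctl restart varnish.service
pool
logger -t varnish-backend-restart "varnish-be restarted and repooled"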

Mentioned in SAL (#wikimedia-operations) [2019-03-12T11:36:33Z] <ema> cp1077: repool varnish-be after service restart T217893

Change 495862 merged by Ema:
[operations/puppet@production] varnish-backend-restart: log execution to syslog

https://gerrit.wikimedia.org/r/495862

Change 495936 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] varnish-backend-restart: do not spam cron

https://gerrit.wikimedia.org/r/495936

Change 495936 merged by Ema:
[operations/puppet@production] varnish-backend-restart: do not spam cron

https://gerrit.wikimedia.org/r/495936

ema claimed this task.

Closing this as cp1077 is fine and back in service. For the general Varnish scalability issue, the solution we've identified is moving to ATS (ongoing, see T213263).