Page MenuHomePhabricator

Traffic (text) instability due to misbehaving cache server (cp1077), causing a 1.5-2% requests failing
Closed, ResolvedPublic

Description


Increase in traffic, believed to be purges (those seem recurring and unrelated):

cp1077 was depooled just in case:

Details

Related Gerrit Patches:
operations/puppet : productionvarnish-backend-restart: do not spam cron
operations/puppet : productionvarnish-backend-restart: log execution to syslog

Event Timeline

jcrespo created this task.Mar 8 2019, 12:54 PM
Restricted Application added a project: Operations. · View Herald TranscriptMar 8 2019, 12:54 PM
Restricted Application added a subscriber: Aklapper. · View Herald Transcript
jcrespo updated the task description. (Show Details)Mar 8 2019, 1:03 PM

it looks indeed like purge requests

jcrespo updated the task description. (Show Details)Mar 8 2019, 1:06 PM

it looks indeed like purge requests

I updated the comment- Those seem recurring, there was another at 8 causing a traffic increase, but no errors then.

hashar added a comment.Mar 8 2019, 1:08 PM

The spike of PURGE requests to the Varnish text frontends seems to be recurring. A view over 24 hours from https://grafana.wikimedia.org/d/000000180/varnish-http-requests?panelId=7&fullscreen&orgId=1&from=now-24h&to=now

So I guess we can dismiss those PURGE as being the root cause.

cp1077 effectively depooled at 13:09 UTC

jcrespo renamed this task from Traffic (text) instability due to unknown cause, causing a 1.5-2% requests failing to Traffic (text) instability due to misbehaving cache server (cp1077), causing a 1.5-2% requests failing.Mar 8 2019, 1:40 PM
ema triaged this task as Medium priority.EditedMar 12 2019, 8:40 AM

At the time of the issue, cp1077 was failing to fetch objects from its origin servers and was affected heavily by mbox lag.

varnish-be was not really managing to evict objects at that time.

This looks very much like the known Varnish scalability issue we've been dealing with for some time and trying to work around by restarting varnish.service in cron. See, among others: T145661, T175803, and T181315. cp1077 was just a few hours away from the by-weekly varnish restart, which was scheduled at 18:52 UTC.

Mentioned in SAL (#wikimedia-operations) [2019-03-12T08:46:33Z] <ema> cp1077 repooled T217893

Mentioned in SAL (#wikimedia-operations) [2019-03-12T08:48:07Z] <ema> restart varnish-be on cp1077 T217893

Mentioned in SAL (#wikimedia-operations) [2019-03-12T08:50:20Z] <ema> cp1077 depooled again T217893

ema added a comment.EditedMar 12 2019, 9:04 AM

Re-pooling the service caused the issue to show up again. For some reason, the cron jobs restarting varnish.service do not seem to have worked. Although cron logged two restarts, one on Mar 08 and one on Mar 12:

Mar 08 18:52:01 cp1077 CRON[275180]: (root) CMD (/usr/local/sbin/run-no-puppet /usr/local/sbin/varnish-backend-restart > /dev/null)
Mar 12 06:52:01 cp1077 CRON[7582]: (root) CMD (/usr/local/sbin/run-no-puppet /usr/local/sbin/varnish-backend-restart > /dev/null)

The number of cached backend objects did not decrease, indicating that the service was not actually restarted.

The journal confirms:

root@cp1077:~# journalctl --since='8 days ago' -u varnish.service | grep -B1 'Starting varnish' 
Mar 05 06:52:44 cp1077 systemd[1]: Stopped varnish (Varnish HTTP Accelerator).
Mar 05 06:53:30 cp1077 systemd[1]: Starting varnish (Varnish HTTP Accelerator)...
--
Mar 12 08:48:54 cp1077 systemd[1]: Stopped varnish (Varnish HTTP Accelerator).
Mar 12 08:49:44 cp1077 systemd[1]: Starting varnish (Varnish HTTP Accelerator)...

Change 495862 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] varnish-backend-restart: log execution to syslog

https://gerrit.wikimedia.org/r/495862

ema added a comment.Mar 12 2019, 11:33 AM

Re-pooling the service caused the issue to show up again. For some reason, the cron jobs restarting varnish.service do not seem to have worked.

The reason is that cp1077 was depooled on Mar 08 at 13:09, and varnish-backend-restart exits silently if the service is depooled. Hence, the 18:52 cron job did not restart varnish.

Mentioned in SAL (#wikimedia-operations) [2019-03-12T11:36:33Z] <ema> cp1077: repool varnish-be after service restart T217893

ema moved this task from Triage to Caching on the Traffic board.Mar 12 2019, 2:07 PM

Change 495862 merged by Ema:
[operations/puppet@production] varnish-backend-restart: log execution to syslog

https://gerrit.wikimedia.org/r/495862

Change 495936 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] varnish-backend-restart: do not spam cron

https://gerrit.wikimedia.org/r/495936

Change 495936 merged by Ema:
[operations/puppet@production] varnish-backend-restart: do not spam cron

https://gerrit.wikimedia.org/r/495936

ema closed this task as Resolved.Mar 13 2019, 8:48 AM
ema claimed this task.

Closing this as cp1077 is fine and back in service. For the general Varnish scalability issue, the solution we've identified is moving to ATS (ongoing, see T213263).