
Traffic (text) instability due to misbehaving cache server (cp1077), causing a 1.5-2% requests failing
Closed, ResolvedPublic

Assigned To: ema
Authored By: jcrespo · Mar 8 2019, 12:54 PM
Referenced Files
F28373209: Screenshot from 2019-03-12 10-01-00.png
Mar 12 2019, 9:33 AM
F28372980: Screenshot from 2019-03-12 09-30-53.png
Mar 12 2019, 8:40 AM
F28372982: Screenshot from 2019-03-12 09-30-29.png
Mar 12 2019, 8:40 AM
F28347569: varnishPURGEspikes.png
Mar 8 2019, 1:08 PM
F28347567: Screenshot 2019-03-08 at 14.05.19.png
Mar 8 2019, 1:05 PM
F28347557: 5XX.png
Mar 8 2019, 1:03 PM
F28347561: requests.png
Mar 8 2019, 1:03 PM
F28347559: cp1077.png
Mar 8 2019, 1:03 PM

Description

traffic_issues.png

5XX.png

Increase in traffic, believed to be purges (those seem recurring and unrelated):

requests.png

cp1077 was depooled just in case:

cp1077.png

Event Timeline

Restricted Application added a subscriber: Aklapper.

Screenshot 2019-03-08 at 14.05.19.png
It does indeed look like purge requests.


I updated the comment: those spikes seem to be recurring; there was another one at 8:00 that caused a traffic increase, but no errors then.

The spike of PURGE requests to the Varnish text frontends seems to be recurring. A view over 24 hours from https://grafana.wikimedia.org/d/000000180/varnish-http-requests?panelId=7&fullscreen&orgId=1&from=now-24h&to=now

varnishPURGEspikes.png
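
For reference, the method mix can also be checked on a cache host directly; a quick sketch, assuming shell access to a cp host with the standard Varnish 4/5 tooling (the "-n frontend" instance name is an assumption about the local setup):

# Live distribution of client request methods on the frontend instance:
varnishtop -n frontend -i ReqMethod

# Rough PURGE count over a ten-second window:
timeout 10 varnishncsa -n frontend -q 'ReqMethod eq "PURGE"' | wc -l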

So I guess we can dismiss those PURGE spikes as the root cause.

cp1077 effectively depooled at 13:09 UTC
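
For the record, depooling goes through conftool; a minimal sketch of the equivalent commands (the exact selector and the depool wrapper are assumptions about the local setup):

# Mark cp1077 as not pooled in conftool/etcd:
confctl select 'name=cp1077.eqiad.wmnet' set/pooled=no

# Hosts also carry a depool wrapper that does the same for all
# services on the host:
depool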

jcrespo renamed this task from Traffic (text) instability due to unknown cause, causing a 1.5-2% requests failing to Traffic (text) instability due to misbehaving cache server (cp1077), causing a 1.5-2% requests failing. · Mar 8 2019, 1:40 PM
ema triaged this task as Medium priority. (Edited) · Mar 12 2019, 8:40 AM

At the time of the issue, cp1077 was failing to fetch objects from its origin servers and was heavily affected by mbox lag (a growing backlog between objects handed to Varnish's expiry thread and objects the thread has actually processed).

Screenshot from 2019-03-12 09-30-53.png
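
The lag is visible in varnishstat; a minimal sketch of how to read it on the backend instance, assuming the standard Varnish 4 counter names:

# mbox lag = objects mailed to the expiry thread minus objects
# the expiry thread has processed so far:
varnishstat -1 -f MAIN.exp_mailed -f MAIN.exp_received |
  awk '{v[$1]=$2} END {print v["MAIN.exp_mailed"] - v["MAIN.exp_received"]}'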

varnish-be was not really managing to evict objects at that time.

Screenshot from 2019-03-12 09-30-29.png

This looks very much like the known Varnish scalability issue we have been dealing with for some time and working around by restarting varnish.service from cron. See, among others: T145661, T175803, and T181315. cp1077 was just a few hours away from its twice-weekly varnish restart, scheduled at 18:52 UTC.

Re-pooling the service caused the issue to show up again. For some reason, the cron jobs restarting varnish.service do not seem to have worked, although cron did log two runs, one on Mar 08 and one on Mar 12:

Mar 08 18:52:01 cp1077 CRON[275180]: (root) CMD (/usr/local/sbin/run-no-puppet /usr/local/sbin/varnish-backend-restart > /dev/null)
Mar 12 06:52:01 cp1077 CRON[7582]: (root) CMD (/usr/local/sbin/run-no-puppet /usr/local/sbin/varnish-backend-restart > /dev/null)
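
Those timestamps (Tue 06:52 and Fri 18:52) imply a crontab along these lines; this is a hypothetical reconstruction consistent with the log, not the actual puppet-managed entry:

# Twice-weekly varnish backend restarts (root's crontab):
52 6  * * 2 /usr/local/sbin/run-no-puppet /usr/local/sbin/varnish-backend-restart > /dev/null
52 18 * * 5 /usr/local/sbin/run-no-puppet /usr/local/sbin/varnish-backend-restart > /dev/null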

The number of cached backend objects did not decrease, indicating that the service was not actually restarted.

Screenshot from 2019-03-12 10-01-00.png
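
A direct way to verify the same thing from the host, assuming the standard counter name:

# Cached object count on the backend instance; a genuine restart
# empties the cache, so this should drop to ~0 right afterwards:
varnishstat -1 -f MAIN.n_object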

The journal confirms:

root@cp1077:~# journalctl --since='8 days ago' -u varnish.service | grep -B1 'Starting varnish' 
Mar 05 06:52:44 cp1077 systemd[1]: Stopped varnish (Varnish HTTP Accelerator).
Mar 05 06:53:30 cp1077 systemd[1]: Starting varnish (Varnish HTTP Accelerator)...
--
Mar 12 08:48:54 cp1077 systemd[1]: Stopped varnish (Varnish HTTP Accelerator).
Mar 12 08:49:44 cp1077 systemd[1]: Starting varnish (Varnish HTTP Accelerator)...

Change 495862 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] varnish-backend-restart: log execution to syslog

https://gerrit.wikimedia.org/r/495862

> Re-pooling the service caused the issue to show up again. For some reason, the cron jobs restarting varnish.service do not seem to have worked.

The reason is that cp1077 was depooled on Mar 08 at 13:09, and varnish-backend-restart exits silently if the service is depooled. Hence, the 18:52 cron job did not restart varnish.
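
To make the failure mode concrete, a minimal sketch of the guard logic; is-pooled, depool, and pool are hypothetical helper names standing in for the real conftool checks, and the syslog lines reflect the logging added by the patch above:

#!/bin/bash
# Sketch only; the real varnish-backend-restart lives in operations/puppet.
if ! is-pooled varnish-be; then
    # Before the patch this exit was silent, so a host depooled at cron
    # time (like cp1077 on Mar 08) was skipped with no trace in the logs.
    logger -t varnish-backend-restart "varnish-be depooled, skipping restart"
    exit 0
fi
logger -t varnish-backend-restart "depooling and restarting varnish-be"
depool
systemctl restart varnish.service
pool
logger -t varnish-backend-restart "varnish-be restarted and repooled"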

Mentioned in SAL (#wikimedia-operations) [2019-03-12T11:36:33Z] <ema> cp1077: repool varnish-be after service restart T217893

Change 495862 merged by Ema:
[operations/puppet@production] varnish-backend-restart: log execution to syslog

https://gerrit.wikimedia.org/r/495862

Change 495936 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] varnish-backend-restart: do not spam cron

https://gerrit.wikimedia.org/r/495936

Change 495936 merged by Ema:
[operations/puppet@production] varnish-backend-restart: do not spam cron

https://gerrit.wikimedia.org/r/495936

ema claimed this task.

Closing this as cp1077 is fine and back in service. For the general Varnish scalability issue, the solution we've identified is moving to ATS (ongoing, see T213263).