
varnish backends start returning 503s after ~6 days uptime
Closed, ResolvedPublic

Assigned To
Authored By
ema
Sep 14 2016, 4:59 PM
Referenced Files
F4472198: fetch_304-2.png
Sep 15 2016, 4:46 PM
F4472193: fetch_failed-1.png
Sep 15 2016, 4:46 PM
F4472191: fetch_304-1.png
Sep 15 2016, 4:46 PM
F4472200: fetch_failed-2.png
Sep 15 2016, 4:46 PM
F4471546: 503-plateaus-cp4007.png
Sep 15 2016, 1:19 PM

Description

We are currently debugging an issue with our cache_upload varnish backends, recently converted to Varnish 4: T131502.

The bug has to do with the file storage backend, to which we had to migrate because the persistent storage backend has been deprecated and we found it to be very unstable.

When the problem kicks in, the backend varnishd responds with 503s to all requests that trigger saving an object into file storage. Cache hits do not seem to be affected.

The following varnishlog entries are common to all 503s returned when varnish is in such state:

-- ExpKill       LRU_Fail
-- FetchError    Could not get storage

We've captured lots of such errors; an example can be found here: https://phabricator.wikimedia.org/P4033

While trying to mitigate the issue, we've tried multiple strategies including:

That seemed to help for a while, but the issue happened again today at 10:14 UTC on cp4005, and it lasted until 10:25 UTC, when I depooled the host and restarted varnishd.
The ganglia view for that timeframe might be helpful to find varnish metrics correlating with fetch_failed behavior.

Details

Repo | Branch | Lines +/-
operations/puppet | production | +19 -19
operations/puppet | production | +29 -0
operations/puppet | production | +64 -4
operations/puppet | production | +5 -5
operations/puppet | production | +0 -1
operations/puppet | production | +9 -2
operations/debs/varnish4 | debian-wmf | +63 -11
operations/debs/varnish4 | debian-wmf | +101 -0
operations/puppet | production | +1 -0
operations/puppet | production | +3 -3
operations/puppet | production | +71 -2
operations/puppet | production | +1 -1
operations/puppet | production | +6 -8
operations/puppet | production | +46 -0
operations/puppet | production | +1 -12
operations/puppet | production | +1 -1
operations/puppet | production | +0 -1
operations/puppet | production | +17 -71
operations/puppet | production | +6 -13
operations/puppet | production | +1 -7
operations/puppet | production | +2 -0
operations/puppet | production | +2 -0
operations/puppet | production | +1 -10
operations/puppet | production | +2 -0
operations/puppet | production | +2 -0
operations/puppet | production | +2 -0
operations/puppet | production | +2 -0
operations/puppet | production | +2 -0
operations/puppet | production | +1 -8
operations/puppet | production | +2 -0
operations/puppet | production | +2 -0
operations/puppet | production | +2 -0
operations/puppet | production | +9 -10
operations/puppet | production | +75 -20
operations/puppet | production | +4 -3

Event Timeline


Change 312489 had a related patch set uploaded (by Ema):
upload storage: transition cp4007 4013

https://gerrit.wikimedia.org/r/312489

Change 312490 had a related patch set uploaded (by Ema):
upload storage: finish ulsfo (cp4014 cp4015)

https://gerrit.wikimedia.org/r/312490

Change 312488 merged by Ema:
upload storage: transition cp4005 4006

https://gerrit.wikimedia.org/r/312488

Change 312489 merged by Ema:
upload storage: transition cp4007 4013

https://gerrit.wikimedia.org/r/312489

Change 312490 merged by Ema:
upload storage: finish ulsfo (cp4014 cp4015)

https://gerrit.wikimedia.org/r/312490

Change 312510 had a related patch set uploaded (by BBlack):
cache_upload: re-enable daily backend restarts

https://gerrit.wikimedia.org/r/312510

Change 312511 had a related patch set uploaded (by BBlack):
cache_upload: storage experiment is the new normal

https://gerrit.wikimedia.org/r/312511

Change 312512 had a related patch set uploaded (by BBlack):
cache_upload: removed unused hieradata key

https://gerrit.wikimedia.org/r/312512

Change 312510 merged by BBlack:
cache_upload: re-enable daily backend restarts

https://gerrit.wikimedia.org/r/312510

Change 312511 merged by BBlack:
cache_upload: storage experiment is the new normal

https://gerrit.wikimedia.org/r/312511

Change 312512 merged by BBlack:
cache_upload: removed unused hieradata key

https://gerrit.wikimedia.org/r/312512

Change 313802 had a related patch set uploaded (by BBlack):
upload storage: weekly restarts

https://gerrit.wikimedia.org/r/313802

Change 313802 merged by BBlack:
upload storage: weekly restarts

https://gerrit.wikimedia.org/r/313802

Change 314560 had a related patch set uploaded (by Ema):
upload storage: avoid cron restarts while rebooting

https://gerrit.wikimedia.org/r/314560

Change 314560 merged by BBlack:
upload storage: avoid cron restarts while rebooting

https://gerrit.wikimedia.org/r/314560

Change 337808 had a related patch set uploaded (by Ema):
varnish: icinga check for expiry mailbox lag

https://gerrit.wikimedia.org/r/337808

Change 337808 merged by Ema:
varnish: icinga check for expiry mailbox lag

https://gerrit.wikimedia.org/r/337808

Change 338123 had a related patch set uploaded (by Ema):
varnish: tune check_varnish_expiry_mailbox_lag alerting thresholds

https://gerrit.wikimedia.org/r/338123

Change 338123 merged by Ema:
varnish: tune check_varnish_expiry_mailbox_lag alerting thresholds

https://gerrit.wikimedia.org/r/338123

Change 340475 had a related patch set uploaded (by Ema):
[operations/puppet] varnish: increase check_varnish_expiry_mailbox_lag alerting threshold

https://gerrit.wikimedia.org/r/340475

Change 340475 merged by Ema:
varnish: increase check_varnish_expiry_mailbox_lag alerting threshold

https://gerrit.wikimedia.org/r/340475

This has started happening on cache_text as well.

The first occurrence I'm aware of on text was on 2017-03-30 between 07:05 and 07:10. The 503s came from cp3040's varnish-be, which had been running for 6 days at that point. Expiry mailbox lag spiked on cp3040 during that timeframe.

Today (2017-03-31) the issue happened again, with the spikes due to cp3032 and cp3041. varnish-be had been running on cp3032 for 5 days, while cp3041 was very close to its weekly cron restart (which has been triggered at the time of writing).

It looks like a large share of the objects ending up in bin0 (thus smaller than 16K) consists of items smaller than 1676 bytes. I've run the following to get a rough idea of the distribution (note the head -n 100000):

$ zgrep -h '"http_method":"GET","uri_host":"upload.wikimedia.org"' /srv/log/webrequest/archive/sampled-1000.json-20170418.gz |grep '"http_status":"200"'|grep -v '"response_size":0,'|jq -r -c .response_size | head -n 100000 | awk '$1 < 16384' | python ~ema/histogram.py 
# NumSamples = 75818; Min = 42.00; Max = 16383.00
# Mean = 4495.227822; Variance = 21975160.477061; SD = 4687.767110; Median 2432.500000
# each ∎ represents a count of 450
   42.0000 -  1676.1000 [ 33785]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
 1676.1000 -  3310.2000 [  7752]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
 3310.2000 -  4944.3000 [  6514]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎
 4944.3000 -  6578.4000 [  5699]: ∎∎∎∎∎∎∎∎∎∎∎∎
 6578.4000 -  8212.5000 [  4572]: ∎∎∎∎∎∎∎∎∎∎
 8212.5000 -  9846.6000 [  4262]: ∎∎∎∎∎∎∎∎∎
 9846.6000 - 11480.7000 [  3886]: ∎∎∎∎∎∎∎∎
11480.7000 - 13114.8000 [  3513]: ∎∎∎∎∎∎∎
13114.8000 - 14748.9000 [  3121]: ∎∎∎∎∎∎
14748.9000 - 16383.0000 [  2714]: ∎∎∎∎∎∎

Many of the objects ending up in bin0 (<16K) are actually very small: the single biggest bucket, roughly 45% of the sample, is < 1.7K. They might be causing contention/fragmentation due to the size mismatch between, e.g., CL ~150B and CL ~15K objects sharing the same bin. We could lower the bar for objects entering bin0, for instance putting only objects < 1K in there, but at this point it's not clear how beneficial it is to cache those small objects in the backends at all. If they're popular, they should already be cached at the frontend layer, especially given their small size.
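
For reference, the first bucket's share can be recomputed directly from the histogram counts above (a quick Python check, with the bucket counts copied by hand):

counts = [33785, 7752, 6514, 5699, 4572, 4262, 3886, 3513, 3121, 2714]
print(100.0 * counts[0] / sum(counts))  # ~44.6% of the sampled sub-16K responses are < 1676 bytes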

So the idea now is to avoid caching objects with size < N at the backend layer. We can start with N=256B or so at first, make sure this doesn't cause problems, and perhaps incrementally bump N up to ~1K.

However, deciding not to cache a certain category of objects means that we have to think about how to deal with serialization of concurrent requests. By default, varnish doesn't hit the origin server with multiple requests when concurrent requests for the same object come in. Instead, in the common case, it puts the concurrent requests "on hold", issues one single request to the origin server, caches the response and returns the object to all clients whose requests were waiting.

If we want to avoid caching certain objects, in this case those with CL < N, this smart behavior of varnish becomes detrimental when multiple requests for the same, not-to-be-cached object come in. Varnish has a mechanism called hit-for-pass (hfp) designed to solve exactly this problem: if a certain object should not be cached, varnish creates a hfp object instead to remember the no-cache decision, and avoids request serialization by sending concurrent requests straight to the origin server.

One solution would thus be simply using hfp for objects with CL < N. However, varnish stores hfp objects in a separate storage area called 'Transient storage', which could become bloated on backends handling requests for many objects with CL < N. The initial idea was to avoid hfp on the backend and instead use random backend selection on the frontend, to avoid consistent-hashing multiple requests for the same object onto one single backend and thus reduce the likelihood of serialization. Upon further reflection, this is impossible, given that when the request comes in at the frontend layer we don't yet know the response CL. So the alternatives at this point are simple: at the backend layer we either use hfp or we don't. Avoiding serialization is the pro of using hfp; possible transient storage bloat is the con.

Change 349188 had a related patch set uploaded (by Ema):
[operations/puppet@production] cache_upload: don't cache tiny objects at the backend layer

https://gerrit.wikimedia.org/r/349188

Change 349188 merged by Ema:
[operations/puppet@production] cache_upload: do not cache tiny objects at the backend layer

https://gerrit.wikimedia.org/r/349188

Change 349389 had a related patch set uploaded (by Ema):
[operations/puppet@production] cache_upload: increase small object threshold to 1024B

https://gerrit.wikimedia.org/r/349389

I've analyzed 1 minute of frontend traffic on cp2011: the hitrate for objects with CL < 1024 is 97.7%. It should be fine to increase N to 1024.

For reference, the procedure I've used to come up with the hitrate was:

$ timeout 60 varnishncsa -n frontend -q 'ReqMethod eq "GET" and RespStatus eq 200' -F '%{Content-Length}o %{Age}o %{X-Cache-Int}o' | awk '$1 > 0 && $1 < 1024' | tee 1024-frontend-hitrate.log
[...]
$ awk '{print $NF}' 1024-frontend-hitrate.log | grep -c hit
3927
$ awk '{print $NF}' 1024-frontend-hitrate.log | grep -c miss
60
$ python
Python 2.7.9 (default, Jun 29 2016, 13:08:31) 
[GCC 4.9.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> hit = 3927 ; miss = 60
>>> 100 * (hit / (0.0001 + hit + miss))
98.49510663418343

Hitrate is even a bit higher now. :)

Change 349389 merged by Ema:
[operations/puppet@production] cache_upload: increase small object threshold to 1024B

https://gerrit.wikimedia.org/r/349389

Mentioned in SAL (#wikimedia-operations) [2017-04-21T14:26:24Z] <ema> ban objects with CT < 1024 on codfw cache_upload T145661

@BBlack suggested that the possible underlying issue could be lock contention between the expiry thread and the worker threads. Indeed this seems confirmed by perf and systemtap.

In particular, I've used the following stap probe to instrument the varnishd backend process for 10 seconds and print, for each thread: thread id, lock (futex) address, the number of times the thread had to wait on that lock, and the average wait time in microseconds:

# Adapted from https://sourceware.org/systemtap/examples/process/futexes.stp
#
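# Run against the backend varnishd with e.g. `stap -x <backend varnishd pid> <this script>`,
# so that target() below matches the instrumented process.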
global exp_tid, lock_waits
global FUTEX_WAIT = 0 /*, FUTEX_WAKE = 1 */
global FUTEX_PRIVATE_FLAG = 128 /* linux 2.6.22+ */
global FUTEX_CLOCK_REALTIME = 256 /* linux 2.6.29+ */

probe process("/usr/sbin/varnishd").statement("exp_thread@cache/cache_expire.c:*") {
    if (pid() != target() || exp_tid) next

    exp_tid = tid()
}

probe syscall.futex.return {
    if (pid() != target()) next
    if ((@entry($op) & ~(FUTEX_PRIVATE_FLAG|FUTEX_CLOCK_REALTIME)) != FUTEX_WAIT) next

    t = tid()
    uaddr = @entry($uaddr)
    elapsed = gettimeofday_us() - @entry(gettimeofday_us())
    lock_waits[t, uaddr] <<< elapsed
}

probe timer.s(10) {
    exit()
}

probe end {
    foreach ([tid+, u_addr] in lock_waits) {
        if (tid == exp_tid)
            printf("exp_")

        printf("tid=%d lock=%p: %d %d us\n", tid, u_addr, @count(lock_waits[tid, u_addr]), @avg(lock_waits[tid, u_addr]))
    }
}

This is the output for the expiry thread on cp2017, where varnish-be has been running for ~3 days:

$ grep '^exp_tid' exp.log | sort -rn -k 3
exp_tid=36166 lock=0x7f31034996a8: 131 67330 us
exp_tid=36166 lock=0x7f3103499748: 18 12128 us
[...]

So in this case the lock the expiry thread waited on most often is 0x7f31034996a8 (131 times). Looking at how often the other threads (workers) waited on that same lock:

$ grep '^tid' exp.log | awk '/0x7f31034996a8/ {sum+=$3} END { print sum }'
2713

On cp2020, where varnish-be had been restarted 7 hours earlier, the lock most often contended by the expiry thread was waited on only 14 times:

$ grep '^exp_tid' exp.log | sort -rn -k 3
exp_tid=1412 lock=0x7fed0a8993d8: 14 13 us
exp_tid=1412 lock=0x7fed0a8995b8: 4 5 us
[...]

While in total all the workers waited on that lock only 21 times:

$ grep ^tid exp.log | awk '/0x7fed0a8993d8/ { sum+=$3 } END { print sum }'
21

The current line of thinking is that we could try to change the priority level of the expiry thread to make sure it always goes first when multiple threads are waiting on the same lock. I've prepped this patch https://phabricator.wikimedia.org/P5317 and installed a patched version of varnish on cp1008. Later today we'll try it out on a varnish-be in codfw (with weekly restarts disabled) and see if that helps.
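
As a rough illustration of the underlying mechanism (not the actual varnishd patch, which presumably does the equivalent in C for the expiry thread), moving a task into a realtime scheduling class on Linux looks like this; it requires CAP_SYS_NICE or a non-zero RLIMIT_RTPRIO, which is what the LimitRTPRIO=infinity systemd change below takes care of:

import os

def make_realtime(rt_priority=1):
    # pid 0 == the calling process. SCHED_RR tasks are preferred over normal
    # (SCHED_OTHER) tasks, so when several threads are blocked on the same
    # futex the realtime one is woken first. Python 3.3+, Linux only;
    # needs CAP_SYS_NICE or RLIMIT_RTPRIO > 0.
    os.sched_setscheduler(0, os.SCHED_RR, os.sched_param(rt_priority))

make_realtime()
print(os.sched_getscheduler(0) == os.SCHED_RR)  # True if the switch worked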

Mentioned in SAL (#wikimedia-operations) [2017-04-24T14:08:25Z] <ema> re-pooling cp2002's varnish-be with increased priority for expiry thread T145661

Change 349961 had a related patch set uploaded (by Ema):
[operations/puppet@production] varnish: set LimitRTPRIO=infinity in systemd unit file

https://gerrit.wikimedia.org/r/349961

Change 349961 merged by Ema:
[operations/puppet@production] varnish: set LimitRTPRIO=infinity in systemd unit file

https://gerrit.wikimedia.org/r/349961

Change 349965 had a related patch set uploaded (by Ema):
[operations/debs/varnish4@debian-wmf] 4.1.5-1wm2: add 0006-exp-thread-realtime.patch

https://gerrit.wikimedia.org/r/349965

Change 349965 merged by Ema:
[operations/debs/varnish4@debian-wmf] 4.1.5-1wm2: add 0006-exp-thread-realtime.patch

https://gerrit.wikimedia.org/r/349965

Mentioned in SAL (#wikimedia-operations) [2017-04-24T15:07:46Z] <ema> varnish 4.1.5-1wm2 uploaded to apt.w.o T145661

Mentioned in SAL (#wikimedia-operations) [2017-04-24T16:11:13Z] <ema> upgrade cp2017 varnish-be to varnish 4.1.5-1wm2, expiry thread lock/priority workaround T145661

Mentioned in SAL (#wikimedia-operations) [2017-04-24T17:35:15Z] <bblack> upgrade cp2024 varnish-be to varnish 4.1.5-1wm2, expiry thread lock/priority workaround T145661

Mentioned in SAL (#wikimedia-operations) [2017-04-25T07:14:53Z] <ema> upgrade cp3033 varnish-be to varnish 4.1.5-1wm2, expiry thread lock/priority workaround T145661

Change 350520 had a related patch set uploaded (by Ema; owner: Ema):
[operations/debs/varnish4@debian-wmf] 4.1.5-1wm3: add exp_thread_rt and exp_lck_inherit

https://gerrit.wikimedia.org/r/350520

Change 350520 merged by Ema:
[operations/debs/varnish4@debian-wmf] 4.1.5-1wm3: add exp_thread_rt and exp_lck_inherit

https://gerrit.wikimedia.org/r/350520

Mentioned in SAL (#wikimedia-operations) [2017-04-27T07:55:01Z] <ema> varnish 4.1.5-1wm3 uploaded to apt.w.o T145661

Change 350526 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] cache: hiera flag to enable expiry thread RT experiment

https://gerrit.wikimedia.org/r/350526

Change 350526 merged by Ema:
[operations/puppet@production] cache: hiera flag to enable expiry thread RT experiment

https://gerrit.wikimedia.org/r/350526

Some general updates on bin-sizing estimates: based on the graph data for available bytes in each bin, and on comparing how fast they initially reach zero after restarts on live servers, we can make a few tweaks. Bins 1 and 2 seem slightly oversized, while bins 0, 3, and 4 are slightly undersized. The proposed tweak to upload.pp based on the live data would be:

-    $bin0_size = floor($ssm * 0.03)
-    $bin1_size = floor($ssm * 0.20)
-    $bin2_size = floor($ssm * 0.43)
-    $bin3_size = floor($ssm * 0.30)
-    $bin4_size = floor($ssm * 0.04)
+    $bin0_size = floor($ssm * 0.04)
+    $bin1_size = floor($ssm * 0.18)
+    $bin2_size = floor($ssm * 0.40)
+    $bin3_size = floor($ssm * 0.32)
+    $bin4_size = floor($ssm * 0.06)

Rolling this out as a change would be tricky though: we'd need to ensure the storage files are deleted between the daemon's stop and start, instead of letting varnish assume it can mmap/fallocate/ftruncate each file as it iterates over them during initialization. Perhaps it would be wise to hack a storage wipe into the initscripts anyway for the general case? It might reset any FS-level oddities that impact performance, makes changes like this easier, etc. The cost is probably a slightly longer startup time on restart, but we already depool around those events.

Mentioned in SAL (#wikimedia-operations) [2017-04-27T14:26:35Z] <ema> varnish 4.1.5-1wm4 uploaded to apt.w.o T145661

Mentioned in SAL (#wikimedia-operations) [2017-04-27T14:34:49Z] <ema> upgrade upload-codfw to varnish 4.1.5-1wm4 T145661

I've re-run the binning analysis, with a few minor changes:

  1. Fresher data (61 days ending 2017-05-02)
  2. Included maps.wikimedia.org data alongside upload (it also always gives content-length, and we're merging the clusters soon)
  3. I created a new bin covering 0-1K. This is to help sort out the effect of recent optimizations where the backends explicitly refuse to cache sub-1K objects.

These results have 7 bins in total. Bins 1-5 below correspond to the original bins 0-4 that we deployed. Bin 6 is still the over-size 1G-16G bin, and the new Bin 0 covers 0-1K.

I'm copying the old results from above as well so we don't have to scroll so much for comparison:

OLD:

total size: 2521817293136
Bin 0 - Count: 10261768 Size: 75203013291 AvgSize: 7328.46555203743 SpacePct: 0.0298209602637317
Bin 1 - Count: 8391307 Size: 512972317454 AvgSize: 61131.3967483254 SpacePct: 0.203413751999493
Bin 2 - Count: 1062238 Size: 1097628006519 AvgSize: 1033316.45687595 SpacePct: 0.43525278754594
Bin 3 - Count: 108056 Size: 803513093533 AvgSize: 7436080.30588769 SpacePct: 0.318624626661114
Bin 4 - Count: 216 Size: 30565707048 AvgSize: 141507903 SpacePct: 0.0121205081475154
Bin 5 - Count: 1 Size: 1935155291 AvgSize: 1935155291 SpacePct: 0.000767365382205601

NEW:

bblack@oxygen:~/ulsizes$ zegrep -h '"http_method":"GET","uri_host":"(maps|upload)\.wikimedia\.org"' /srv/log/webrequest/archive/sampled-1000.json-20170*.gz | grep '"http_status":"200"'|grep -v '"response_size":0,'|jq -r -c '[.uri_path,.response_size]'| ./analyze.pl
total size: 2666887824600
Bin 0 - Count: 482066 Size: 273142748 AvgSize: 566.608613758282 SpacePct: 0.000102420036373659
Bin 1 - Count: 11822125 Size: 90469096997 AvgSize: 7652.52414409423 SpacePct: 0.0339230979880338
Bin 2 - Count: 9927706 Size: 609440600749 AvgSize: 61387.8574515603 SpacePct: 0.228521273046199
Bin 3 - Count: 1164538 Size: 1088004151626 AvgSize: 934279.647058318 SpacePct: 0.40796772237287
Bin 4 - Count: 100149 Size: 796129478811 AvgSize: 7949450.1074499 SpacePct: 0.2985237967144
Bin 5 - Count: 573 Size: 79972067391 AvgSize: 139567307.837696 SpacePct: 0.0299870383198419
Bin 6 - Count: 2 Size: 2599286278 AvgSize: 1299643139 SpacePct: 0.000974651522281355

OLD:

total reqs: 198300770
Bin 0 - Reqs: 153783578 ReqPct: 0.775506711345599
Bin 1 - Reqs: 41338117 ReqPct: 0.208461706931345
Bin 2 - Reqs: 2872423 ReqPct: 0.0144851832900094
Bin 3 - Reqs: 306242 ReqPct: 0.00154433086669305
Bin 4 - Reqs: 409 ReqPct: 2.06252350911194e-06
Bin 5 - Reqs: 1 ReqPct: 5.04284476555487e-09

NEW:

bblack@oxygen:~/ulsizes$ zegrep -h '"http_method":"GET","uri_host":"(maps|upload)\.wikimedia\.org"' /srv/log/webrequest/archive/sampled-1000.json-20170*.gz |grep '"http_status":"200"'|grep -v '"response_size":0,'|jq -r -c .response_size|./binhits.pl
total reqs: 196894144
Bin 0 - Reqs: 53246445 ReqPct: 0.270431836713234
Bin 1 - Reqs: 94771124 ReqPct: 0.481330333521753
Bin 2 - Reqs: 45988501 ReqPct: 0.233569673864958
Bin 3 - Reqs: 2660789 ReqPct: 0.0135138046563741
Bin 4 - Reqs: 226064 ReqPct: 0.00114814994193022
Bin 5 - Reqs: 1219 ReqPct: 6.1911440088335e-06
Bin 6 - Reqs: 2 ReqPct: 1.0157742426306e-08
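
analyze.pl and binhits.pl themselves aren't pasted here; as a minimal sketch (assuming the bin boundaries described above and one response size per line on stdin), the per-request binning done by something like binhits.pl could look like this in Python:

import bisect
import sys

# Upper bounds of bins 0..6: 0-1K, 1K-16K, 16K-256K, 256K-4M, 4M-64M, 64M-1G, 1G-16G.
BOUNDS = [1 << 10, 16 << 10, 256 << 10, 4 << 20, 64 << 20, 1 << 30, 16 << 30]

reqs = [0] * len(BOUNDS)
xfer = [0] * len(BOUNDS)
for line in sys.stdin:
    size = int(line)
    # exact handling of sizes falling on a boundary is a guess
    b = min(bisect.bisect_left(BOUNDS, size), len(BOUNDS) - 1)
    reqs[b] += 1
    xfer[b] += size

total_reqs = float(sum(reqs)) or 1.0
total_xfer = float(sum(xfer)) or 1.0
for b in range(len(BOUNDS)):
    print("Bin %d - Reqs: %d ReqPct: %.6f XferPct: %.6f"
          % (b, reqs[b], reqs[b] / total_reqs, xfer[b] / total_xfer))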

Reformatting this a bit for comparison using the old bin numbers, calling the new data's Bin0 "Bin-1" (and also including a combined figure in Bin0 for comparison).

Storage Size Percentages
(how much total space do unique objects within each bin take, relative to the whole):

Bin   | Size Range | OLD StoragePct | NEW StoragePct
Bin-1 | 0-1K       | (part of Bin0) | 0.01%
Bin0  | 1K-16K     | 3.0%           | 3.4% (3.41% w/ Bin-1)
Bin1  | 16K-256K   | 20.3%          | 22.9%
Bin2  | 256K-4M    | 43.5%          | 40.8%
Bin3  | 4M-64M     | 31.9%          | 29.6%
Bin4  | 64M-1G     | 1.2%           | 3.0%

Request Percentages
(how many total requests do we get for objects within each bin, relative to the whole):

Bin   | Size Range | OLD ReqPct     | NEW ReqPct
Bin-1 | 0-1K       | (part of Bin0) | 27.0%
Bin0  | 1K-16K     | 77.6%          | 48.1% (75.2% w/ Bin-1)
Bin1  | 16K-256K   | 20.8%          | 23.4%
Bin2  | 256K-4M    | 1.4%           | 1.4%
Bin3  | 4M-64M     | 0.2%           | 0.1%
Bin4  | 64M-1G     | ~0%            | ~0%

Byte Load Percentages
(how many bytes do we transfer per time slice from each bin, relative to the whole - should be a proxy for disk i/o load):

Bin   | Size Range | OLD XferPct    | NEW XferPct
Bin-1 | 0-1K       | (part of Bin0) | 0.4%
Bin0  | 1K-16K     | 12.6%          | 9.0% (9.4% w/ Bin-1)
Bin1  | 16K-256K   | 28.2%          | 35.1%
Bin2  | 256K-4M    | 33.1%          | 30.0%
Bin3  | 4M-64M     | 25.4%          | 22.4%
Bin4  | 64M-1G     | 0.6%           | 2.1%
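
The XferPct figures can be reproduced (to within rounding) from the raw per-bin request counts and average object sizes above, since bytes served from a bin ≈ Reqs × AvgSize. A quick Python check against the OLD data:

# OLD per-bin request counts and average object sizes, copied from the raw output above.
reqs = [153783578, 41338117, 2872423, 306242, 409, 1]
avg  = [7328.47, 61131.40, 1033316.46, 7436080.31, 141507903.0, 1935155291.0]

xfer = [r * a for r, a in zip(reqs, avg)]
total = sum(xfer)
for i, x in enumerate(xfer):
    print("Bin %d XferPct: %.1f%%" % (i, 100.0 * x / total))
# ~12.6%, 28.2%, 33.1%, 25.4%, 0.6%, 0.0%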

Today, 2017-05-04, this issue affected 4 out of 8 of the text-esams hosts roughly at the same time, resulting in a peak of ~600 failed requests per second.

host   | running since       | fetch failed timeframe | mailbox lag timeframe | peak mailbox lag | cached objects
cp3041 | 2017-04-28 06:45:16 | 07:03 - 07:20          | 06:58 - 07:21         | 658K             | 15.1M
cp3043 | 2017-04-29 04:41    | 06:58 - 07:14          | 06:55 - 07:14         | 653K             | 16.1M
cp3032 | 2017-04-29 21:57:19 | 06:53 - 07:06          | 06:50 - 07:07         | 552K             | 14.6M
cp3030 | 2017-04-30 14:45:21 | 06:41 - 06:49          | 06:37 - 06:50         | 327K             | 14.1M

Note that @Joe manually restarted varnish-be on cp3043 at 07:14 (thanks!). All other varnish backends managed to recover on their own.

There's a correlation between the beginning of fetch failures and the time when the host reaches thread_pool_max (12k) backend threads. See e.g. cp3041.

Unaffected hosts had been running for 2 days or less when the issues occurred, with the exception of cp3040:

host   | running since       | cached objects
cp3040 | 2017-04-27 08:40:05 | 23.8M
cp3042 | 2017-05-01 13:09:21 | 10.9M
cp3031 | 2017-05-02 11:33:21 | 9.4M
cp3033 | 2017-05-03 09:57:17 | 7.8M

So the surprise here seems to be cp3040. It only had a brief 503 spike around 07:15 despite its long uptime and then quickly recovered. The weekly cron job restarted its varnish-be about one hour later, at 08:21.

Change 351802 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] cache: stop expiry thread RT experiment on cp2024

https://gerrit.wikimedia.org/r/351802

Change 351802 merged by Ema:
[operations/puppet@production] cache: stop expiry thread RT experiment on cp2024

https://gerrit.wikimedia.org/r/351802

Mentioned in SAL (#wikimedia-operations) [2017-05-09T17:25:57Z] <elukey> executing varnish-backend-restart on cp1074 as attempt to mitigate "FetchError Could not get storage" and "ExpKill LRU_Fail" - T145661

Mentioned in SAL (#wikimedia-operations) [2017-05-09T17:29:20Z] <elukey> executing varnish-backend-restart on cp1072 as attempt to mitigate "FetchError Could not get storage" and "ExpKill LRU_Fail" - T145661

This is what we had before (copied from far above):

Bin  | Size Range | StoragePct | StorageSize (1 node) | Disk
Bin0 | 0-16K      | 3%         | 21.6G                | A
Bin1 | 16K-256K   | 20%        | 144G                 | B
Bin2 | 256K-4M    | 43%        | 309.6G               | A
Bin3 | 4M-64M     | 30%        | 216G                 | B
Bin4 | 64M-1G     | 4%         | 28.8G                | A

Based on the data from T145661#3233742 and going through the same basic pragmatic transformations, this seems like a good real-world set of values to switch to:

Bin  | Size Range | StoragePct | StorageSize (1 node) | Disk
Bin0 | 0-16K      | 4%         | 28.8G                | A
Bin1 | 16K-256K   | 23%        | 165.6G               | B
Bin2 | 256K-4M    | 40%        | 288.0G               | A
Bin3 | 4M-64M     | 27%        | 194.4G               | B
Bin4 | 64M-1G     | 6%         | 43.2G                | A
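
For reference, the StorageSize column follows from the percentages and the implied per-node file storage total (21.6G / 3% in the old table gives ~720G), and the new split keeps the two physical disks evenly loaded:

total_g = 720.0  # implied per-node file storage: 21.6G / 0.03
new_bins = [("Bin0", 0.04, "A"), ("Bin1", 0.23, "B"), ("Bin2", 0.40, "A"),
            ("Bin3", 0.27, "B"), ("Bin4", 0.06, "A")]

per_disk = {"A": 0.0, "B": 0.0}
for name, pct, disk in new_bins:
    size = total_g * pct
    per_disk[disk] += size
    print("%s: %.1fG on disk %s" % (name, size, disk))
print(per_disk)  # both disks end up at 360.0G, same as before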

Change 356394 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/puppet@production] cache_upload: rebalance storage bins

https://gerrit.wikimedia.org/r/356394

Change 356394 merged by Ema:
[operations/puppet@production] cache_upload: rebalance storage bins

https://gerrit.wikimedia.org/r/356394

Change 376751 had a related patch set uploaded (by Ema; owner: BBlack):
[operations/puppet@production] VCL: stabilize backend storage patterns

https://gerrit.wikimedia.org/r/376751

Change 376751 merged by BBlack:
[operations/puppet@production] VCL: stabilize backend storage patterns

https://gerrit.wikimedia.org/r/376751

Change 379512 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] VCL: Exp cache admission policy for varnish-be

https://gerrit.wikimedia.org/r/379512

BBlack claimed this task.

The cache admission policy change seems to have gotten us over this for now. We should probably wait for the Varnish5 upgrade ( T168529 ) before trying to undo any of the other related work (e.g. weekly restart crons or storage splitting).

Change 379512 abandoned by Ema:
VCL: Exp cache admission policy for varnish-be

https://gerrit.wikimedia.org/r/379512

Change 506090 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] cache: move check_varnish_expiry_mailbox_lag to backend profile

https://gerrit.wikimedia.org/r/506090

Change 506090 merged by Ema:
[operations/puppet@production] cache: move check_varnish_expiry_mailbox_lag to backend profile

https://gerrit.wikimedia.org/r/506090