
varnish backends start returning 503s after ~6 days uptime
Closed, ResolvedPublic

Assigned To
Authored By
ema
Sep 14 2016, 4:59 PM
Referenced Files
F4472198: fetch_304-2.png
Sep 15 2016, 4:46 PM
F4472193: fetch_failed-1.png
Sep 15 2016, 4:46 PM
F4472191: fetch_304-1.png
Sep 15 2016, 4:46 PM
F4472200: fetch_failed-2.png
Sep 15 2016, 4:46 PM
F4471546: 503-plateaus-cp4007.png
Sep 15 2016, 1:19 PM

Description

We are currently debugging an issue with our cache_upload varnish backends, recently converted to Varnish 4: T131502.

The bug has to do with the file storage backend, to which we had to migrate because the persistent storage backend has been deprecated and we found it to be very unstable.

When the problem kicks in, the backend varnishd responds with 503s to all requests that trigger saving an object into file storage. Cache hits do not seem to be affected.

The following varnishlog entries are common to all 503s returned when varnish is in such state:

-- ExpKill       LRU_Fail
-- FetchError    Could not get storage

We've captured lots of such errors; an example can be found here: https://phabricator.wikimedia.org/P4033

While trying to mitigate the issue, we've tried multiple strategies including:

That seemed to help for a while, but the issue happened again today at 10:14 UTC on cp4005, and it lasted until 10:25 UTC, when I depooled the host and restarted varnishd.
The ganglia view for that timeframe might be helpful to find varnish metrics correlating with fetch_failed behavior.

Details

Repo | Branch | Lines +/-
operations/puppet | production | +19 -19
operations/puppet | production | +29 -0
operations/puppet | production | +64 -4
operations/puppet | production | +5 -5
operations/puppet | production | +0 -1
operations/puppet | production | +9 -2
operations/debs/varnish4 | debian-wmf | +63 -11
operations/debs/varnish4 | debian-wmf | +101 -0
operations/puppet | production | +1 -0
operations/puppet | production | +3 -3
operations/puppet | production | +71 -2
operations/puppet | production | +1 -1
operations/puppet | production | +6 -8
operations/puppet | production | +46 -0
operations/puppet | production | +1 -12
operations/puppet | production | +1 -1
operations/puppet | production | +0 -1
operations/puppet | production | +17 -71
operations/puppet | production | +6 -13
operations/puppet | production | +1 -7
operations/puppet | production | +2 -0
operations/puppet | production | +2 -0
operations/puppet | production | +1 -10
operations/puppet | production | +2 -0
operations/puppet | production | +2 -0
operations/puppet | production | +2 -0
operations/puppet | production | +2 -0
operations/puppet | production | +2 -0
operations/puppet | production | +1 -8
operations/puppet | production | +2 -0
operations/puppet | production | +2 -0
operations/puppet | production | +2 -0
operations/puppet | production | +9 -10
operations/puppet | production | +75 -20
operations/puppet | production | +4 -3

Event Timeline


Change 312489 had a related patch set uploaded (by Ema):
upload storage: transition cp4007 4013

https://gerrit.wikimedia.org/r/312489

Change 312490 had a related patch set uploaded (by Ema):
upload storage: finish ulsfo (cp4014 cp4015)

https://gerrit.wikimedia.org/r/312490

Change 312488 merged by Ema:
upload storage: transition cp4005 4006

https://gerrit.wikimedia.org/r/312488

Change 312489 merged by Ema:
upload storage: transition cp4007 4013

https://gerrit.wikimedia.org/r/312489

Change 312490 merged by Ema:
upload storage: finish ulsfo (cp4014 cp4015)

https://gerrit.wikimedia.org/r/312490

Change 312510 had a related patch set uploaded (by BBlack):
cache_upload: re-enable daily backend restarts

https://gerrit.wikimedia.org/r/312510

Change 312511 had a related patch set uploaded (by BBlack):
cache_upload: storage experiment is the new normal

https://gerrit.wikimedia.org/r/312511

Change 312512 had a related patch set uploaded (by BBlack):
cache_upload: removed unused hieradata key

https://gerrit.wikimedia.org/r/312512

Change 312510 merged by BBlack:
cache_upload: re-enable daily backend restarts

https://gerrit.wikimedia.org/r/312510

Change 312511 merged by BBlack:
cache_upload: storage experiment is the new normal

https://gerrit.wikimedia.org/r/312511

Change 312512 merged by BBlack:
cache_upload: removed unused hieradata key

https://gerrit.wikimedia.org/r/312512

Change 313802 had a related patch set uploaded (by BBlack):
upload storage: weekly restarts

https://gerrit.wikimedia.org/r/313802

Change 313802 merged by BBlack:
upload storage: weekly restarts

https://gerrit.wikimedia.org/r/313802

Change 314560 had a related patch set uploaded (by Ema):
upload storage: avoid cron restarts while rebooting

https://gerrit.wikimedia.org/r/314560

Change 314560 merged by BBlack:
upload storage: avoid cron restarts while rebooting

https://gerrit.wikimedia.org/r/314560

Change 337808 had a related patch set uploaded (by Ema):
varnish: icinga check for expiry mailbox lag

https://gerrit.wikimedia.org/r/337808

Change 337808 merged by Ema:
varnish: icinga check for expiry mailbox lag

https://gerrit.wikimedia.org/r/337808

Change 338123 had a related patch set uploaded (by Ema):
varnish: tune check_varnish_expiry_mailbox_lag alerting thresholds

https://gerrit.wikimedia.org/r/338123

Change 338123 merged by Ema:
varnish: tune check_varnish_expiry_mailbox_lag alerting thresholds

https://gerrit.wikimedia.org/r/338123

Change 340475 had a related patch set uploaded (by Ema):
[operations/puppet] varnish: increase check_varnish_expiry_mailbox_lag alerting threshold

https://gerrit.wikimedia.org/r/340475

Change 340475 merged by Ema:
varnish: increase check_varnish_expiry_mailbox_lag alerting threshold

https://gerrit.wikimedia.org/r/340475

This has started happening on cache_text as well.

The first occurrence I'm aware of on text was on 2017-03-30 between 07:05 and 07:10. The 503s came from cp3040's varnish-be, which had been running for 6 days at that point. Expiry mailbox lag spiked on cp3040 during that timeframe.

Today (2017-03-31) the issue happened again, with the spikes due to cp3032 and cp3041. varnish-be had been running on cp3032 for 5 days, while cp3041 was very close to its weekly cron restart (which has been triggered at the time of writing).

It looks like a large share of the objects ending up in bin0 (thus smaller than 16K) consists of items smaller than 1676 bytes. I've run the following to get a rough idea of the distribution (note the head -n 100000):

$ zgrep -h '"http_method":"GET","uri_host":"upload.wikimedia.org"' /srv/log/webrequest/archive/sampled-1000.json-20170418.gz |grep '"http_status":"200"'|grep -v '"response_size":0,'|jq -r -c .response_size | head -n 100000 | awk '$1 < 16384' | python ~ema/histogram.py 
# NumSamples = 75818; Min = 42.00; Max = 16383.00
# Mean = 4495.227822; Variance = 21975160.477061; SD = 4687.767110; Median 2432.500000
# each ∎ represents a count of 450
   42.0000 -  1676.1000 [ 33785]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
 1676.1000 -  3310.2000 [  7752]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
 3310.2000 -  4944.3000 [  6514]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎
 4944.3000 -  6578.4000 [  5699]: ∎∎∎∎∎∎∎∎∎∎∎∎
 6578.4000 -  8212.5000 [  4572]: ∎∎∎∎∎∎∎∎∎∎
 8212.5000 -  9846.6000 [  4262]: ∎∎∎∎∎∎∎∎∎
 9846.6000 - 11480.7000 [  3886]: ∎∎∎∎∎∎∎∎
11480.7000 - 13114.8000 [  3513]: ∎∎∎∎∎∎∎
13114.8000 - 14748.9000 [  3121]: ∎∎∎∎∎∎
14748.9000 - 16383.0000 [  2714]: ∎∎∎∎∎∎

Many of the objects ending up in bin0 (<16K) are actually very small: the single biggest bucket, roughly 45% of the sample, is < 1.7K. They might be causing contention/fragmentation due to the size mismatch between, e.g., CL ~150B and CL ~15K objects sharing the same bin. We could lower the bar for objects entering bin0, for instance putting only objects < 1K in there, but at this point it's not clear how beneficial it is to cache those small objects in the backends at all. If they're popular, they should already be cached at the frontend layer, especially given their small size.
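
For reference, the first bucket's share can be recomputed directly from the histogram counts above (a quick Python check, with the bucket counts copied by hand):

counts = [33785, 7752, 6514, 5699, 4572, 4262, 3886, 3513, 3121, 2714]
print(100.0 * counts[0] / sum(counts))  # ~44.6% of the sampled sub-16K responses are < 1676 bytes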

So the idea now is to avoid caching objects with size < N at the backend layer. We can start with N=256B or so at first, make sure this doesn't cause problems, and perhaps incrementally bump N up to ~1K.

However, deciding not to cache a certain category of objects means that we have to think about how to deal with serialization of concurrent requests. By default, varnish doesn't hit the origin server with multiple requests when concurrent requests for the same object come in. Instead, in the common case, it puts the concurrent requests "on hold", issues one single request to the origin server, caches the response and returns the object to all clients whose requests were waiting.

If we want to avoid caching certain objects, in this case those with CL < N, this smart behavior of varnish becomes detrimental when multiple requests for the same, not-to-be-cached object come in. Varnish has a mechanism called hit-for-pass (hfp) designed to solve exactly this problem: if a certain object should not be cached, varnish creates a hfp object instead to remember the no-cache decision, and avoids request serialization by sending concurrent requests straight to the origin server.

One solution would thus be simply using hfp for objects with CL < N. However, varnish stores hfp objects in a separate storage area called 'Transient storage', which could become bloated on backends handling requests for many objects with CL < N. The initial idea was to avoid hfp on the backend and instead use random backend selection on the frontend, to avoid consistent-hashing multiple requests for the same object onto one single backend and thus reduce the likelihood of serialization. Upon further reflection, this is impossible, given that when the request comes in at the frontend layer we don't yet know the response CL. So the alternatives at this point are simple: at the backend layer we either use hfp or we don't. Avoiding serialization is the pro of using hfp; possible transient storage bloat is the con.

Change 349188 had a related patch set uploaded (by Ema):
[operations/puppet@production] cache_upload: don't cache tiny objects at the backend layer

https://gerrit.wikimedia.org/r/349188

Change 349188 merged by Ema:
[operations/puppet@production] cache_upload: do not cache tiny objects at the backend layer

https://gerrit.wikimedia.org/r/349188

Change 349389 had a related patch set uploaded (by Ema):
[operations/puppet@production] cache_upload: increase small object threshold to 1024B

https://gerrit.wikimedia.org/r/349389

I've analyzed 1 minute of frontend traffic on cp2011: the hitrate for objects with CL < 1024 is 97.7%. It should be fine to increase N to 1024.

For reference, the procedure I've used to come up with the hitrate was:

$ timeout 60 varnishncsa -n frontend -q 'ReqMethod eq "GET" and RespStatus eq 200' -F '%{Content-Length}o %{Age}o %{X-Cache-Int}o' | awk '$1 > 0 && $1 < 1024' | tee 1024-frontend-hitrate.log
[...]
$ awk '{print $NF}' 1024-frontend-hitrate.log | grep -c hit
3927
$ awk '{print $NF}' 1024-frontend-hitrate.log | grep -c miss
60
$ python
Python 2.7.9 (default, Jun 29 2016, 13:08:31) 
[GCC 4.9.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> hit = 3927 ; miss = 60
>>> 100 * (hit / (0.0001 + hit + miss))
98.49510663418343

Hitrate is even a bit higher now. :)

Change 349389 merged by Ema:
[operations/puppet@production] cache_upload: increase small object threshold to 1024B

https://gerrit.wikimedia.org/r/349389

Mentioned in SAL (#wikimedia-operations) [2017-04-21T14:26:24Z] <ema> ban objects with CT < 1024 on codfw cache_upload T145661

@BBlack suggested that the possible underlying issue could be lock contention between the expiry thread and the worker threads. Indeed this seems confirmed by perf and systemtap.

In particular, I've used the following stap probe to instrument the varnishd backend process for 10 seconds and print, for each thread: thread id, lock (futex) address, the number of times the thread had to wait on that lock, and the average wait time in microseconds:

# Adapted from https://sourceware.org/systemtap/examples/process/futexes.stp
#
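# Run against the backend varnishd with e.g. `stap -x <backend varnishd pid> <this script>`,
# so that target() below matches the instrumented process.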
global exp_tid, lock_waits
global FUTEX_WAIT = 0 /*, FUTEX_WAKE = 1 */
global FUTEX_PRIVATE_FLAG = 128 /* linux 2.6.22+ */
global FUTEX_CLOCK_REALTIME = 256 /* linux 2.6.29+ */

probe process("/usr/sbin/varnishd").statement("exp_thread@cache/cache_expire.c:*") {
    if (pid() != target() || exp_tid) next

    exp_tid = tid()
}

probe syscall.futex.return {
    if (pid() != target()) next
    if ((@entry($op) & ~(FUTEX_PRIVATE_FLAG|FUTEX_CLOCK_REALTIME)) != FUTEX_WAIT) next

    t = tid()
    uaddr = @entry($uaddr)
    elapsed = gettimeofday_us() - @entry(gettimeofday_us())
    lock_waits[t, uaddr] <<< elapsed
}

probe timer.s(10) {
    exit()
}

probe end {
    foreach ([tid+, u_addr] in lock_waits) {
        if (tid == exp_tid)
            printf("exp_")

        printf("tid=%d lock=%p: %d %d us\n", tid, u_addr, @count(lock_waits[tid, u_addr]), @avg(lock_waits[tid, u_addr]))
    }
}

This is the output for the expiry thread on cp2017, where varnish-be has been running for ~3 days:

$ grep '^exp_tid' exp.log | sort -rn -k 3
exp_tid=36166 lock=0x7f31034996a8: 131 67330 us
exp_tid=36166 lock=0x7f3103499748: 18 12128 us
[...]

So in this case the lock the expiry thread waited on most often is 0x7f31034996a8 (131 times). Looking at how often the other threads (workers) waited on that same lock:

$ grep '^tid' exp.log | awk '/0x7f31034996a8/ {sum+=$3} END { print sum }'
2713

On cp2020, where varnish-be had been restarted 7 hours earlier, the lock most often contended by the expiry thread was waited on only 14 times:

$ grep '^exp_tid' exp.log | sort -rn -k 3
exp_tid=1412 lock=0x7fed0a8993d8: 14 13 us
exp_tid=1412 lock=0x7fed0a8995b8: 4 5 us
[...]

While in total all the workers waited on that lock only 21 times:

$ grep ^tid exp.log | awk '/0x7fed0a8993d8/ { sum+=$3 } END { print sum }'
21

The current line of thinking is that we could try to change the priority level of the expiry thread to make sure it always goes first when multiple threads are waiting on the same lock. I've prepped this patch https://phabricator.wikimedia.org/P5317 and installed a patched version of varnish on cp1008. Later today we'll try it out on a varnish-be in codfw (with weekly restarts disabled) and see if that helps.
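
As a rough illustration of the underlying mechanism (not the actual varnishd patch, which presumably does the equivalent in C for the expiry thread), moving a task into a realtime scheduling class on Linux looks like this; it requires CAP_SYS_NICE or a non-zero RLIMIT_RTPRIO, which is what the LimitRTPRIO=infinity systemd change below takes care of:

import os

def make_realtime(rt_priority=1):
    # pid 0 == the calling process. SCHED_RR tasks are preferred over normal
    # (SCHED_OTHER) tasks, so when several threads are blocked on the same
    # futex the realtime one is woken first. Python 3.3+, Linux only;
    # needs CAP_SYS_NICE or RLIMIT_RTPRIO > 0.
    os.sched_setscheduler(0, os.SCHED_RR, os.sched_param(rt_priority))

make_realtime()
print(os.sched_getscheduler(0) == os.SCHED_RR)  # True if the switch worked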

Mentioned in SAL (#wikimedia-operations) [2017-04-24T14:08:25Z] <ema> re-pooling cp2002's varnish-be with increased priority for expiry thread T145661

Change 349961 had a related patch set uploaded (by Ema):
[operations/puppet@production] varnish: set LimitRTPRIO=infinity in systemd unit file

https://gerrit.wikimedia.org/r/349961

Change 349961 merged by Ema:
[operations/puppet@production] varnish: set LimitRTPRIO=infinity in systemd unit file

https://gerrit.wikimedia.org/r/349961

Change 349965 had a related patch set uploaded (by Ema):
[operations/debs/varnish4@debian-wmf] 4.1.5-1wm2: add 0006-exp-thread-realtime.patch

https://gerrit.wikimedia.org/r/349965

Change 349965 merged by Ema:
[operations/debs/varnish4@debian-wmf] 4.1.5-1wm2: add 0006-exp-thread-realtime.patch

https://gerrit.wikimedia.org/r/349965

Mentioned in SAL (#wikimedia-operations) [2017-04-24T15:07:46Z] <ema> varnish 4.1.5-1wm2 uploaded to apt.w.o T145661

Mentioned in SAL (#wikimedia-operations) [2017-04-24T16:11:13Z] <ema> upgrade cp2017 varnish-be to varnish 4.1.5-1wm2, expiry thread lock/priority workaround T145661

Mentioned in SAL (#wikimedia-operations) [2017-04-24T17:35:15Z] <bblack> upgrade cp2024 varnish-be to varnish 4.1.5-1wm2, expiry thread lock/priority workaround T145661

Mentioned in SAL (#wikimedia-operations) [2017-04-25T07:14:53Z] <ema> upgrade cp3033 varnish-be to varnish 4.1.5-1wm2, expiry thread lock/priority workaround T145661

Change 350520 had a related patch set uploaded (by Ema; owner: Ema):
[operations/debs/varnish4@debian-wmf] 4.1.5-1wm3: add exp_thread_rt and exp_lck_inherit

https://gerrit.wikimedia.org/r/350520

Change 350520 merged by Ema:
[operations/debs/varnish4@debian-wmf] 4.1.5-1wm3: add exp_thread_rt and exp_lck_inherit

https://gerrit.wikimedia.org/r/350520

Mentioned in SAL (#wikimedia-operations) [2017-04-27T07:55:01Z] <ema> varnish 4.1.5-1wm3 uploaded to apt.w.o T145661

Change 350526 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] cache: hiera flag to enable expiry thread RT experiment

https://gerrit.wikimedia.org/r/350526

Change 350526 merged by Ema:
[operations/puppet@production] cache: hiera flag to enable expiry thread RT experiment

https://gerrit.wikimedia.org/r/350526

Some general updates on bin-sizing estimates: based on the graph data for available bytes in each bin, and on comparing how fast they initially reach zero after restarts on live servers, we can make a few tweaks. Bins 1 and 2 seem slightly oversized, while bins 0, 3, and 4 are slightly undersized. The proposed tweak to upload.pp based on the live data would be:

-    $bin0_size = floor($ssm * 0.03)
-    $bin1_size = floor($ssm * 0.20)
-    $bin2_size = floor($ssm * 0.43)
-    $bin3_size = floor($ssm * 0.30)
-    $bin4_size = floor($ssm * 0.04)
+    $bin0_size = floor($ssm * 0.04)
+    $bin1_size = floor($ssm * 0.18)
+    $bin2_size = floor($ssm * 0.40)
+    $bin3_size = floor($ssm * 0.32)
+    $bin4_size = floor($ssm * 0.06)

Rolling this out as a change would be tricky though: we'd need to ensure the storage files are deleted between the daemon's stop and start, instead of letting varnish assume it can mmap/fallocate/ftruncate each file as it iterates over them during initialization. Perhaps it would be wise to hack a storage wipe into the initscripts anyway for the general case? It might reset any FS-level oddities that impact performance, makes changes like this easier, etc. The cost is probably a slightly longer startup time on restart, but we already depool around those events.

Mentioned in SAL (#wikimedia-operations) [2017-04-27T14:26:35Z] <ema> varnish 4.1.5-1wm4 uploaded to apt.w.o T145661

Mentioned in SAL (#wikimedia-operations) [2017-04-27T14:34:49Z] <ema> upgrade upload-codfw to varnish 4.1.5-1wm4 T145661

I've re-run the binning analysis, with a few minor changes:

  1. Fresher data (61 days ending 2017-05-02)
  2. Included maps.wikimedia.org data alongside upload (it also always gives content-length, and we're merging the clusters soon)
  3. I created a new bin covering 0-1K. This is to help sort out the effect of recent optimizations where the backends explicitly refuse to cache sub-1K objects.

These results have 7 bins in total. Bins 1-5 below correspond to the original bins 0-4 that we deployed. Bin 6 is still the over-size 1G-16G bin, and the new Bin 0 covers 0-1K.

I'm copying the old results from above as well so we don't have to scroll so much for comparison:

OLD:

total size: 2521817293136
Bin 0 - Count: 10261768 Size: 75203013291 AvgSize: 7328.46555203743 SpacePct: 0.0298209602637317
Bin 1 - Count: 8391307 Size: 512972317454 AvgSize: 61131.3967483254 SpacePct: 0.203413751999493
Bin 2 - Count: 1062238 Size: 1097628006519 AvgSize: 1033316.45687595 SpacePct: 0.43525278754594
Bin 3 - Count: 108056 Size: 803513093533 AvgSize: 7436080.30588769 SpacePct: 0.318624626661114
Bin 4 - Count: 216 Size: 30565707048 AvgSize: 141507903 SpacePct: 0.0121205081475154
Bin 5 - Count: 1 Size: 1935155291 AvgSize: 1935155291 SpacePct: 0.000767365382205601

NEW:

bblack@oxygen:~/ulsizes$ zegrep -h '"http_method":"GET","uri_host":"(maps|upload)\.wikimedia\.org"' /srv/log/webrequest/archive/sampled-1000.json-20170*.gz | grep '"http_status":"200"'|grep -v '"response_size":0,'|jq -r -c '[.uri_path,.response_size]'| ./analyze.pl
total size: 2666887824600
Bin 0 - Count: 482066 Size: 273142748 AvgSize: 566.608613758282 SpacePct: 0.000102420036373659
Bin 1 - Count: 11822125 Size: 90469096997 AvgSize: 7652.52414409423 SpacePct: 0.0339230979880338
Bin 2 - Count: 9927706 Size: 609440600749 AvgSize: 61387.8574515603 SpacePct: 0.228521273046199
Bin 3 - Count: 1164538 Size: 1088004151626 AvgSize: 934279.647058318 SpacePct: 0.40796772237287
Bin 4 - Count: 100149 Size: 796129478811 AvgSize: 7949450.1074499 SpacePct: 0.2985237967144
Bin 5 - Count: 573 Size: 79972067391 AvgSize: 139567307.837696 SpacePct: 0.0299870383198419
Bin 6 - Count: 2 Size: 2599286278 AvgSize: 1299643139 SpacePct: 0.000974651522281355

OLD:

total reqs: 198300770
Bin 0 - Reqs: 153783578 ReqPct: 0.775506711345599
Bin 1 - Reqs: 41338117 ReqPct: 0.208461706931345
Bin 2 - Reqs: 2872423 ReqPct: 0.0144851832900094
Bin 3 - Reqs: 306242 ReqPct: 0.00154433086669305
Bin 4 - Reqs: 409 ReqPct: 2.06252350911194e-06
Bin 5 - Reqs: 1 ReqPct: 5.04284476555487e-09

NEW:

bblack@oxygen:~/ulsizes$ zegrep -h '"http_method":"GET","uri_host":"(maps|upload)\.wikimedia\.org"' /srv/log/webrequest/archive/sampled-1000.json-20170*.gz |grep '"http_status":"200"'|grep -v '"response_size":0,'|jq -r -c .response_size|./binhits.pl
total reqs: 196894144
Bin 0 - Reqs: 53246445 ReqPct: 0.270431836713234
Bin 1 - Reqs: 94771124 ReqPct: 0.481330333521753
Bin 2 - Reqs: 45988501 ReqPct: 0.233569673864958
Bin 3 - Reqs: 2660789 ReqPct: 0.0135138046563741
Bin 4 - Reqs: 226064 ReqPct: 0.00114814994193022
Bin 5 - Reqs: 1219 ReqPct: 6.1911440088335e-06
Bin 6 - Reqs: 2 ReqPct: 1.0157742426306e-08
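
analyze.pl and binhits.pl themselves aren't pasted here; as a minimal sketch (assuming the bin boundaries described above and one response size per line on stdin), the per-request binning done by something like binhits.pl could look like this in Python:

import bisect
import sys

# Upper bounds of bins 0..6: 0-1K, 1K-16K, 16K-256K, 256K-4M, 4M-64M, 64M-1G, 1G-16G.
BOUNDS = [1 << 10, 16 << 10, 256 << 10, 4 << 20, 64 << 20, 1 << 30, 16 << 30]

reqs = [0] * len(BOUNDS)
xfer = [0] * len(BOUNDS)
for line in sys.stdin:
    size = int(line)
    # exact handling of sizes falling on a boundary is a guess
    b = min(bisect.bisect_left(BOUNDS, size), len(BOUNDS) - 1)
    reqs[b] += 1
    xfer[b] += size

total_reqs = float(sum(reqs)) or 1.0
total_xfer = float(sum(xfer)) or 1.0
for b in range(len(BOUNDS)):
    print("Bin %d - Reqs: %d ReqPct: %.6f XferPct: %.6f"
          % (b, reqs[b], reqs[b] / total_reqs, xfer[b] / total_xfer))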

Reformatting this a bit for comparison using the old bin numbers, calling the new data's Bin0 "Bin-1" (and also including a combined figure in Bin0 for comparison).

Storage Size Percentages
(how much total space do unique objects within each bin take, relative to the whole):

Bin   | Size Range | OLD StoragePct | NEW StoragePct
Bin-1 | 0-1K       | (part of Bin0) | 0.01%
Bin0  | 1K-16K     | 3.0%           | 3.4% (3.41% w/ Bin-1)
Bin1  | 16K-256K   | 20.3%          | 22.9%
Bin2  | 256K-4M    | 43.5%          | 40.8%
Bin3  | 4M-64M     | 31.9%          | 29.6%
Bin4  | 64M-1G     | 1.2%           | 3.0%

Request Percentages
(how many total requests do we get for objects within each bin, relative to the whole):

Bin   | Size Range | OLD ReqPct     | NEW ReqPct
Bin-1 | 0-1K       | (part of Bin0) | 27.0%
Bin0  | 1K-16K     | 77.6%          | 48.1% (75.2% w/ Bin-1)
Bin1  | 16K-256K   | 20.8%          | 23.4%
Bin2  | 256K-4M    | 1.4%           | 1.4%
Bin3  | 4M-64M     | 0.2%           | 0.1%
Bin4  | 64M-1G     | ~0%            | ~0%

Byte Load Percentages
(how many bytes do we transfer per time slice from each bin, relative to the whole - should be a proxy for disk i/o load):

Bin   | Size Range | OLD XferPct    | NEW XferPct
Bin-1 | 0-1K       | (part of Bin0) | 0.4%
Bin0  | 1K-16K     | 12.6%          | 9.0% (9.4% w/ Bin-1)
Bin1  | 16K-256K   | 28.2%          | 35.1%
Bin2  | 256K-4M    | 33.1%          | 30.0%
Bin3  | 4M-64M     | 25.4%          | 22.4%
Bin4  | 64M-1G     | 0.6%           | 2.1%
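
The XferPct figures can be reproduced (to within rounding) from the raw per-bin request counts and average object sizes above, since bytes served from a bin ≈ Reqs × AvgSize. A quick Python check against the OLD data:

# OLD per-bin request counts and average object sizes, copied from the raw output above.
reqs = [153783578, 41338117, 2872423, 306242, 409, 1]
avg  = [7328.47, 61131.40, 1033316.46, 7436080.31, 141507903.0, 1935155291.0]

xfer = [r * a for r, a in zip(reqs, avg)]
total = sum(xfer)
for i, x in enumerate(xfer):
    print("Bin %d XferPct: %.1f%%" % (i, 100.0 * x / total))
# ~12.6%, 28.2%, 33.1%, 25.4%, 0.6%, 0.0%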

Today, 2017-05-04, this issue affected 4 out of 8 of the text-esams hosts roughly at the same time, resulting in a peak of ~600 failed requests per second.

host   | running since       | fetch failed timeframe | mailbox lag timeframe | peak mailbox lag | cached objects
cp3041 | 2017-04-28 06:45:16 | 07:03 - 07:20          | 06:58 - 07:21         | 658K             | 15.1M
cp3043 | 2017-04-29 04:41    | 06:58 - 07:14          | 06:55 - 07:14         | 653K             | 16.1M
cp3032 | 2017-04-29 21:57:19 | 06:53 - 07:06          | 06:50 - 07:07         | 552K             | 14.6M
cp3030 | 2017-04-30 14:45:21 | 06:41 - 06:49          | 06:37 - 06:50         | 327K             | 14.1M

Note that @Joe manually restarted varnish-be on cp3043 at 07:14 (thanks!). All other varnish backends managed to recover on their own.

There's a correlation between the beginning of fetch failures and the time when the host reaches thread_pool_max (12k) backend threads. See e.g. cp3041.

Unaffected hosts had been running for 2 days or less when the issues occurred, with the exception of cp3040:

host   | running since       | cached objects
cp3040 | 2017-04-27 08:40:05 | 23.8M
cp3042 | 2017-05-01 13:09:21 | 10.9M
cp3031 | 2017-05-02 11:33:21 | 9.4M
cp3033 | 2017-05-03 09:57:17 | 7.8M

So the surprise here seems to be cp3040. It only had a brief 503 spike around 07:15 despite its long uptime and then quickly recovered. The weekly cron job restarted its varnish-be about one hour later, at 08:21.

Change 351802 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] cache: stop expiry thread RT experiment on cp2024

https://gerrit.wikimedia.org/r/351802

Change 351802 merged by Ema:
[operations/puppet@production] cache: stop expiry thread RT experiment on cp2024

https://gerrit.wikimedia.org/r/351802

Mentioned in SAL (#wikimedia-operations) [2017-05-09T17:25:57Z] <elukey> executing varnish-backend-restart on cp1074 as attempt to mitigate "FetchError Could not get storage" and "ExpKill LRU_Fail" - T145661

Mentioned in SAL (#wikimedia-operations) [2017-05-09T17:29:20Z] <elukey> executing varnish-backend-restart on cp1072 as attempt to mitigate "FetchError Could not get storage" and "ExpKill LRU_Fail" - T145661

This is what we had before (copied from far above):

Bin  | Size Range | StoragePct | StorageSize (1 node) | Disk
Bin0 | 0-16K      | 3%         | 21.6G                | A
Bin1 | 16K-256K   | 20%        | 144G                 | B
Bin2 | 256K-4M    | 43%        | 309.6G               | A
Bin3 | 4M-64M     | 30%        | 216G                 | B
Bin4 | 64M-1G     | 4%         | 28.8G                | A

Based on the data from T145661#3233742 and going through the same basic pragmatic transformations, this seems like a good real-world set of values to switch to:

Bin  | Size Range | StoragePct | StorageSize (1 node) | Disk
Bin0 | 0-16K      | 4%         | 28.8G                | A
Bin1 | 16K-256K   | 23%        | 165.6G               | B
Bin2 | 256K-4M    | 40%        | 288.0G               | A
Bin3 | 4M-64M     | 27%        | 194.4G               | B
Bin4 | 64M-1G     | 6%         | 43.2G                | A
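
For reference, the StorageSize column follows from the percentages and the implied per-node file storage total (21.6G / 3% in the old table gives ~720G), and the new split keeps the two physical disks evenly loaded:

total_g = 720.0  # implied per-node file storage: 21.6G / 0.03
new_bins = [("Bin0", 0.04, "A"), ("Bin1", 0.23, "B"), ("Bin2", 0.40, "A"),
            ("Bin3", 0.27, "B"), ("Bin4", 0.06, "A")]

per_disk = {"A": 0.0, "B": 0.0}
for name, pct, disk in new_bins:
    size = total_g * pct
    per_disk[disk] += size
    print("%s: %.1fG on disk %s" % (name, size, disk))
print(per_disk)  # both disks end up at 360.0G, same as before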

Change 356394 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/puppet@production] cache_upload: rebalance storage bins

https://gerrit.wikimedia.org/r/356394

Change 356394 merged by Ema:
[operations/puppet@production] cache_upload: rebalance storage bins

https://gerrit.wikimedia.org/r/356394

Change 376751 had a related patch set uploaded (by Ema; owner: BBlack):
[operations/puppet@production] VCL: stabilize backend storage patterns

https://gerrit.wikimedia.org/r/376751

Change 376751 merged by BBlack:
[operations/puppet@production] VCL: stabilize backend storage patterns

https://gerrit.wikimedia.org/r/376751

Change 379512 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] VCL: Exp cache admission policy for varnish-be

https://gerrit.wikimedia.org/r/379512

BBlack claimed this task.

The cache admission policy change seems to have gotten us over this for now. We should probably wait for the Varnish5 upgrade ( T168529 ) before trying to undo any of the other related work (e.g. weekly restart crons or storage splitting).

Change 379512 abandoned by Ema:
VCL: Exp cache admission policy for varnish-be

https://gerrit.wikimedia.org/r/379512

Change 506090 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] cache: move check_varnish_expiry_mailbox_lag to backend profile

https://gerrit.wikimedia.org/r/506090

Change 506090 merged by Ema:
[operations/puppet@production] cache: move check_varnish_expiry_mailbox_lag to backend profile

https://gerrit.wikimedia.org/r/506090