
8-10% response start regression (Varnish 5.1.3-1wm15 -> 6.0.6-1wm1)
Open, High, Public

Description

After seeing strong time-correlation between the response start regression and the rollout of Varnish 6 per DC for Eqsin on 2020-09-25:

And for Esams on 2020-09-29:

Varnish was rolled back to its previous version on 2020-10-05 on the Esams host cp3052. Once the cache was populated again, we can see that response start performance goes back to its pre-upgrade levels:

This isn't the case for the rest of the Esams hosts, which were not rolled back and whose response start stayed elevated:

This regression could be a side-effect of T264074: varnishkafka 1.1.0 CPU usage increase, which happens alongside the upgrade. For that issue as well, we can see that the rollback gets rid of the CPU spikes on cp3052:

Related Objects

Event Timeline


I would like to know what the Traffic team is doing, or planning on doing, about this investigation at the moment. Now that I've narrowed it down to this Varnish update with very high confidence, this reaches the limits of what I can find on my own. The Traffic team has been driving this update and has in-depth knowledge of Varnish and the edge cache stack; their expertise is required to narrow it down further.

My actions and findings are clearly documented here, and I'm working to gather more information with T264987: Add cache response type and response size as new dimensions to navtiming_responsestart_by_host_seconds prometheus metric, which is going to take some time to get reviewed, merged and deployed. Is this extra information blocking the Traffic team from further investigating what changed in Varnish 6 that's causing this issue? Is anyone still trying to fix T264074: varnishkafka 1.1.0 CPU usage increase with the hope that it will make this regression go away at the same time?

The Traffic team's investigation shouldn't stop at the first other source of data that suggests that nothing has changed, like self-reported ats-tls latency. This regression affects all our frontend metrics: https://performance.wikimedia.org/#!/year

Varnish 6 as it's been deployed makes the experience worse for our users. Now that the regression has been going on for a while, we can also see that it has caused a dip in user satisfaction ratio (this is measured with a survey asking thousands of users their opinion daily):

That's a 2% reduction in people who think that pages load "fast enough", meaning that the effect is definitely big enough for users to notice.

Debating whether this is a magnitude of regression we can afford to make permanent when balanced with other factors should happen after the regression is fully understood and reasonable attempts have been made at fixing it if we have any leads. The Performance team can inform the debate of whether we can afford such a regression being made permanent and the Traffic team, as owners of the systems involved, will decide what to do and will be accountable for the final decision.

Currently, however, I don't have confirmation that the Traffic team has a good understanding as to why this is happening and I see no updates indicating where and how they're looking to figure this out. Maybe the fix is as simple as a new configuration setting in Varnish 6. But we're not getting any information here that suggests that the Traffic team is looking at things of that nature. Maybe the Traffic team doesn't know how to figure out that Varnish 6 is causing this and is out of ideas, in which case it should be stated, so that everyone else can know whether we've reached the limits of what can be found out here. It helps to share which dead ends have been encountered as well, in order to reduce duplicated efforts and maybe come up with alternative ideas on what to look into.

If the Traffic team is denying that anything is wrong, which also seems like a possibility given the nature of the few responses we've seen on this task that were actually about the subject matter, then Traffic has to explain what we're seeing on cp3052, why all frontend metrics are going up across the board as a result of this update (as demonstrated by extremely strong temporal correlation to the staged deployment) and why user sentiment is dropping.

It's not for every web performance regression that we get such a strong signal pointing in a single direction. It is however pretty common to see conflicting data when trying to look deeper inside our infrastructure, like when you're looking at ats-tls-reported latency. That's generally when we find that this internal monitoring is only a partial view, is possibly incorrect or misleading and that maybe we're not measuring some things that we should be. There should be some actionables filed as subtasks if the Traffic team identifies blind spots in the internal monitoring of the edge cache. Besides T264987, are there things we're not currently measuring that the Traffic team thinks could help with this investigation?

Lastly, Varnish 6.0 is 5 minor versions behind the current latest stable. Has upgrading Varnish further been considered? Maybe there are too many breaking VCL changes to upgrade to 6.1 easily as an experiment, but it would be helpful to know if it's been looked into. A further upgrade is something we could try out on cp3052, for example. If the current investigation has reached a dead end for the Traffic team, this is probably a good candidate for a next step, hoping that we're currently hitting a bug that's been fixed in later versions.

With hey on cp3052 (Varnish 5.1.3-1wm15) and cp3054 (6.0.6-1wm1) I obtained the following two latency distributions in milliseconds when benchmarking cache hits for enwiki's Main_Page (~80k in size):

./hey -n 1000 -c 10 -H "X-Forwarded-Proto: https" -host "en.wikipedia.org" http://localhost/wiki/Main_Page

| percentile | 5.1.3-1wm15 | 6.0.6-1wm1 |
| p10 | 1.6 | 1.6 |
| p25 | 2.0 | 2.0 |
| p50 | 2.5 | 2.4 |
| p75 | 3.2 | 3.1 |
| p90 | 4.9 | 4.7 |
| p95 | 6.6 | 6.7 |
| p99 | 10.9 | 11.5 |

Latencies for the page about U.C. Sampdoria, which is larger (~268K), look like this:

| percentile | 5.1.3-1wm15 | 6.0.6-1wm1 |
| p10 | 2.3 | 2.3 |
| p25 | 2.9 | 2.8 |
| p50 | 3.5 | 3.2 |
| p75 | 4.3 | 3.8 |
| p90 | 5.7 | 4.9 |
| p95 | 8.5 | 6.0 |
| p99 | 15.2 | 10.5 |

Results for Accept-Encoding: gzip are similarly unremarkable.

The data does not seem to show significant differences between the performance of Varnish 5 and Varnish 6 when it comes to this sort of cache hit. As a next step we can try to observe the various Varnish milestone timestamps on miss/pass traffic, and see if there's any difference there.
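Something along these lines should do it, assuming we filter on the X-Cache-Status response header as in the other varnishlog commands on this task (just a sketch, the exact query may need tuning):

# Sketch: dump all Varnish milestone timestamps for frontend misses
# (swap 'miss' for 'pass' to look at pass traffic)
varnishlog -n frontend -q "RespHeader eq 'X-Cache-Status: miss'" -i Timestamp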

Thanks for trying that out. I ran the same command for much longer with higher concurrency. For Main_Page, I saw no difference.

Then I tried our go-to large article, with the benchmark running concurrently on both hosts.

./hey -n 1000000 -c 1000 -H "X-Forwarded-Proto: https" -host "en.wikipedia.org" http://localhost/wiki/Barack_Obama

A significant difference emerged. You can feel it right away seeing how long it takes to progress through the benchmark. The total runtime for cp3054 was 25% longer.

cp3052

Summary:
  Total:	538.3917 secs
  Slowest:	3.6410 secs
  Fastest:	0.0098 secs
  Average:	0.5122 secs
  Requests/sec:	1857.3839

Status code distribution:
  [200]	1000000 responses

Response time histogram:
  0.010 [1]	|
  0.373 [408769]	|∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  0.736 [380391]	|∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  1.099 [141824]	|∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  1.462 [40537]	|∎∎∎∎
  1.825 [17190]	|∎∎
  2.189 [8190]	|∎
  2.552 [1721]	|
  2.915 [527]	|
  3.278 [482]	|
  3.641 [368]	|

Latency distribution:
  10% in 0.0878 secs
  25% in 0.2483 secs
  50% in 0.4448 secs
  75% in 0.6846 secs
  90% in 0.9694 secs
  95% in 1.2226 secs
  99% in 1.8680 secs

cp3054

Summary:
  Total:	676.5448 secs
  Slowest:	3.6392 secs
  Fastest:	0.0100 secs
  Average:	0.6690 secs
  Requests/sec:	1478.0987

Status code distribution:
  [200]	1000000 responses

Response time histogram:
  0.010 [1]	|
  0.373 [168978]	|∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  0.736 [475582]	|∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  1.099 [263823]	|∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  1.462 [70445]	|∎∎∎∎∎∎
  1.825 [16243]	|∎
  2.187 [3935]	|
  2.550 [790]	|
  2.913 [166]	|
  3.276 [29]	|
  3.639 [8]	|

Latency distribution:
  10% in 0.2956 secs
  25% in 0.4492 secs
  50% in 0.6256 secs
  75% in 0.8387 secs
  90% in 1.0774 secs
  95% in 1.2484 secs
  99% in 1.6502 secs

Maybe this issue only affects larger payloads?

Is the Barack Obama article too big to be stored in Varnish? I can't remember what the limit is.

I've checked with curl and it does get cached by both hosts:

X-Cache: cp3060 miss, cp3052 hit/1000011
X-Cache-Status: hit-front

X-Cache: cp3060 miss, cp3054 hit/1000019
X-Cache-Status: hit-front

Oddly, cp3054 has Content-Length defined in the headers I get back and not cp3052?
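For reference, a check along these lines shows the caching headers (a sketch; the exact invocations used are shown further down in the thread):

# Sketch: print only the caching-related response headers for a local request
curl -s -o /dev/null -D - -H "X-Forwarded-Proto: https" -H "Host: en.wikipedia.org" \
  http://localhost/wiki/Barack_Obama | grep -i '^X-Cache'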

Thanks for trying that out. I ran the same command for much longer with higher concurrency. For Main_Page, I saw no difference.

Then I tried our go-to large article, with the benchmark running concurrently on both hosts.

./hey -n 1000000 -c 1000 -H "X-Forwarded-Proto: https" -host "en.wikipedia.org" http://localhost/wiki/Barack_Obama

Oh, it seems that you've tried that between 07:25 and 07:35? During that timeframe, p75 webperf on cp3052 (Varnish 5) was bad, up to 4.3 seconds (!). See the usual dashboard, but with the rate computed over 5 minutes instead of 1 hour:

I don't see the same effect on cp3054, interestingly. Also during the benchmark we've had a brief 502 spike on cp3052, while that did not happen on cp3054.

This might indicate a difference in the way v5 and v6 handle significant rates of concurrent requests with large responses. Can the work currently in progress on splitting webperf by cache_status (T264987) be extended to also include response size?

While I do think the data is very interesting, we should move the benchmarks elsewhere to avoid disrupting production traffic (I'm not blaming Gilles for trying the command, just to be very clear). Perhaps deployment-cache-text06.deployment-prep.eqiad.wmflabs might be a good labs instance to use for this purpose, hoping that its limited resources do not end up interfering with the results.

On a different angle: while looking at various Varnish internal metrics, I've noticed that there's a very visible decrease in LRU lock operations on Varnish 6 compared to Varnish 5. See increase(varnish_lock_operations{instance=~"cp305[26]:9331",target="lru"}[5m]); the difference after the downgrade of cp3052 is quite clear:

Going through the various commits that led to the 6.0.x series, it seems to me that the most likely reason is https://github.com/varnishcache/varnish-cache/issues/2527. If anything, less lock contention should result in better performance on v6. However, as an experiment I've tried disabling lru_interval (ie: setting it to 0) on cp3054. Let's see what difference that makes performance-wise, if any. So far, I don't see anything interesting at all, so this is likely a red herring.
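For reference, a runtime change like that can be made via varnishadm (a sketch; the actual change on cp3054 may have been applied differently, e.g. through puppet-managed varnishd parameters):

# Set lru_interval to 0 on the frontend instance and confirm the new value
sudo varnishadm -n frontend param.set lru_interval 0
sudo varnishadm -n frontend param.show lru_interval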

I tried a bunch of different benchmark settings before I ended up using those, some of which definitely had too much concurrency and caused errors even on the benchmark responses themselves. So I'm not sure if you're seeing the failed attempts where the settings were really excessive or the final run on the Barack Obama article. It's probably worth running again. The fact that these latest settings led to 100% 200 responses contrary to the excessive ones leads me to believe that they're safe.

My shell history on those hosts isn't timestamped, so I can't confirm based on that.

I can definitely add response size as another per-host dimension.

If it was indeed the last benchmark run, what you're saying is that cp3052 favored the benchmark requests over others, while cp3054 remained more fair in regards to the rest of the traffic? I think that even moving this to WMCS we're always going to run into problems like that with benchmarks. We're more likely to discover bias caused by the benchmark itself than track down the root cause of what we're seeing in organic traffic.

I've checked with curl and it does get cached by both hosts:

[...]

Oddly, cp3054 has Content-Length defined in the headers I get back and not cp3052?

Interesting, which curl command exactly gives you no CL on 3052?

I see CL being returned by Varnish on both hosts:

09:43:05 ema@cp3054.esams.wmnet:~
$ curl -v -H "X-Forwarded-Proto: https" -H "Host: en.wikipedia.org" localhost/wiki/Barack_Obama 2>&1 | grep Length
< Content-Length: 1479288
09:43:19 ema@cp3052.esams.wmnet:~
$ curl -v -H "X-Forwarded-Proto: https" -H "Host: en.wikipedia.org" localhost/wiki/Barack_Obama 2>&1 | grep Length
< Content-Length: 1479288

It was:

gilles@cp3052:/home/ema$ curl -I -H "X-Forwarded-Proto: https" -H "Host: en.wikipedia.org" http://localhost/wiki/Barack_Obama | grep Content-Length
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
gilles@cp3052:/home/ema$

gilles@cp3054:/home/ema$ curl -I -H "X-Forwarded-Proto: https" -H "Host: en.wikipedia.org" http://localhost/wiki/Barack_Obama | grep Content-Length
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0 1444k    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
Content-Length: 1479263
gilles@cp3054:/home/ema$

I guess the difference is that I did a HEAD request. It's still reproducible right now.

I guess the difference is that I did a HEAD request. It's still reproducible right now.

OK, I've tried to see if other responses without a body, like 304s to conditional requests, also showed any interesting difference, but they don't. So this might just be a change in interpretation of RFCs when it comes to HEAD, but nothing relevant for this task I think.

I've captured 30 minutes of data using varnishlog simultaneously on cp3052 and cp3054, using 4 variants of this command for hit-front, hit-local, miss and pass:

varnishlog -n frontend -q "RespHeader eq 'X-Cache-Status: hit-front' and ReqURL ~ '^/wiki/*'" -i Timestamp | grep Resp | awk '{print $5}' > cp3052-hit-front.log

The above recorded the total response time from Varnish's perspective (duration shown on last Timestamp entry available for the transaction), for hit-front responses on /wiki/* URLs.

Then I looked at percentiles on each dataset. In the tables below, the Difference row is the relative change on cp3054 compared to cp3052.
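The percentile extraction itself is nothing fancy; a rough sketch of the kind of post-processing applied to each file (the exact tooling isn't shown here):

# Sketch: summary statistics over one of the captured timing files
sort -n cp3052-hit-front.log | awk '
  { v[NR] = $1; sum += $1 }
  END {
    printf "samples=%d min=%.6f mean=%.6f max=%.6f\n", NR, v[1], sum/NR, v[NR]
    n = split("25 50 75", p, " ")
    for (i = 1; i <= n; i++) {
      idx = int(NR * p[i] / 100); if (idx < 1) idx = 1
      printf "p%s=%.6f\n", p[i], v[idx]
    }
  }'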

hit-front

| host | samples | min | p25 | median | mean | p75 | max | stddev |
| cp3052 | 905671 | 0.000000 | 0.000254 | 0.000291 | 0.001017 | 0.000336 | 5.420407 | 0.0305825 |
| cp3054 | 889287 | 0.000000 | 0.000256 | 0.000291 | 0.000929 | 0.000334 | 8.007157 | 0.0303075 |
| Difference | | | 0.8% | 0% | -9.5% | -0.59% | 47% | -0.9% |

hit-local

| host | samples | min | p25 | median | mean | p75 | max | stddev |
| cp3052 | 20480 | 0.000611 | 0.003231 | 0.006207 | 0.036319 | 0.050616 | 1.945326 | 0.07614508 |
| cp3054 | 24120 | 0.000602 | 0.002979 | 0.005785 | 0.037372 | 0.041039 | 1.973892 | 0.08716184 |
| Difference | | -1.5% | -8.5% | -7.3% | 2.9% | -23.3% | 1.5% | 14.5% |

miss

| host | samples | min | p25 | median | mean | p75 | max | stddev |
| cp3052 | 243029 | 0.1162 | 0.3125 | 0.4136 | 0.4876 | 0.5500 | 7.2766 | 0.319049 |
| cp3054 | 251676 | 0.1161 | 0.2954 | 0.4033 | 0.4749 | 0.5459 | 4.9899 | 0.2933685 |
| Difference | | -0.08% | -5.8% | -2.5% | -2.7% | -0.7% | -45.8% | -10.9% |

pass

| host | samples | min | p25 | median | mean | p75 | max | stddev |
| cp3052 | 67811 | 0.1164 | 0.1914 | 0.3325 | 0.4667 | 0.5523 | 7.4085 | 0.5117037 |
| cp3054 | 64170 | 0.1159 | 0.1784 | 0.2968 | 0.4295 | 0.5392 | 4.9915 | 0.3971842 |
| Difference | | -0.4% | -7.3% | -12% | -8.7% | -2.4% | -48.4% | -28.8% |

I don't know if the time reported by Varnish includes the transmission time to the client, but as far as its own telemetry goes, cp3054 looks faster or at least as performant.

Seeing this, I pulled the RUM data for cp3052 and cp3054 for some offline processing, using this HQL query that reproduces the filtering we do when generating the Prometheus metric that ends up on the per-host dashboard. It looks at data from 2020-10-07 onwards:

SELECT event.responsestart - event.fetchstart FROM event.navigationtiming WHERE year = 2020 AND month = 10 AND day > 6 AND recvfrom = 'cp3052.esams.wmnet' AND event.isOversample = false AND event.responsestart - event.fetchstart >= 0;
| host | samples | min | p25 | median | mean | p75 | max | stddev |
| cp3052 | 128452 | 0 | 108 | 257 | 469 | 520 | 148264 | 1343 |
| cp3054 | 128176 | 0 | 115 | 285 | 483 | 543 | 625664 | 2129 |
| Difference | | | 0.9% | 10.9% | 2.9% | 4.4% | 333% | 58.5% |

I've also looked at the same data, this time including the oversamples. That extra data isn't included in the Prometheus per-host metrics.

| host | samples | min | p25 | median | mean | p75 | max | stddev |
| cp3052 | 670108 | 0 | 178 | 403 | 597 | 650 | 245149 | 1392 |
| cp3054 | 674341 | 0 | 189 | 420 | 604 | 660 | 997972 | 2184 |
| Difference | | | 6.2% | 4.2% | 1.2% | 1.5% | 307% | 56.9% |

Percentiles are a little higher for cp3054, less pronounced when including the geographically-biased oversamples. The maximum value and standard deviation are a lot higher in both cases.

It's possible that the extra time comes from something Varnish doesn't measure. It's unclear to me whether the last Timestamp in a Varnish response includes the time it took to actually ship the bytes to the client (ats-tls in this case?) and have them acknowledged.

I've captured 30 minutes of data using varnishlog simultaneously on cp3052 and cp3054, using 4 variants of this command for hit-front, hit-local, miss and pass:

varnishlog -n frontend -q "RespHeader eq 'X-Cache-Status: hit-front'" -q "ReqURL ~ '^/wiki/*'" -i Timestamp | grep Resp | awk '{print $5}' > cp3052-hit-front.log

[...]

It's possible that the extra time comes from something Varnish doesn't measure. It's unclear to me whether the last Timestamp in a Varnish response includes the time it took to actually ship the bytes to the client (ats-tls in this case?) and have them acknowledged.

The command above prints the 5th field of the Resp timestamp as reported by Varnish. From vsl(7):

The reported fields show the absolute time of the event, the time spent since the start of the task and the time spent since the last timestamp was logged.

Log lines look like this:

-   Timestamp      Resp: 1603186764.780741 0.000476 0.000081

And Resp is:

Resp   Delivery of response to the client finished.

So that would be the time spent between the start of the task and the full delivery of the response to the client including the whole body. The client is ats-tls for https traffic, the actual user-agent for http.
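As a side note, both the Process timestamp (discussed just below) and Resp can be captured in one pass, which makes the pure delivery component visible as their difference; a sketch along the lines of the earlier commands:

# Sketch: per-request delivery time (Resp minus Process) for frontend hits
varnishlog -n frontend \
  -q "RespHeader eq 'X-Cache-Status: hit-front' and ReqURL ~ '^/wiki/'" \
  -i Timestamp | awk '/ Process:/ { p = $5 } / Resp:/ { print $5 - p }'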

My understanding is that the RUM data instead shows the time spent between the beginning of the client request, event.fetchstart, and the moment when the first bytes are received, event.responsestart. It seems to me that if we want to look into Varnish's own telemetry and compare things with event.responsestart - event.fetchstart, then we should use the so called Process timestamp:

Process
       Processing finished, ready to deliver the client response.

SELECT event.responsestart - event.fetchstart FROM event.navigationtiming WHERE year = 2020 AND month = 10 AND day > 6 AND recvfrom = 'cp3052.esams.wmnet' AND event.isOversample = false AND event.responsestart - event.fetchstart >= 0;

I've noticed that on nodes with Varnish 6 the worst time_firstbyte values reported by ats-tls are very often around 26 seconds, and they're due to etherpad. Can you try this once again, but excluding Host: etherpad.wikimedia.org?

Change 635298 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] varnish: fix websockets on 6.x

https://gerrit.wikimedia.org/r/635298

SELECT event.responsestart - event.fetchstart FROM event.navigationtiming WHERE year = 2020 AND month = 10 AND day > 6 AND recvfrom = 'cp3052.esams.wmnet' AND event.isOversample = false AND event.responsestart - event.fetchstart >= 0;

I've noticed that on nodes with Varnish 6 the worst time_firstbyte values reported by ats-tls are very often around 26 seconds, and they're due to etherpad. Can you try this once again, but excluding Host: etherpad.wikimedia.org?

We don't collect RUM data on etherpad.wikimedia.org. As for Varnish data, it checks for ReqURL ~ '^/wiki/*'. As far as I know, etherpad URLs all start with /p/.

Fair point about TTFB from Varnish's perspective, I'll re-run the collection of data I did before, this time on Process instead of Resp.

Change 635298 merged by Ema:
[operations/puppet@production] varnish: fix websockets on 6.x

https://gerrit.wikimedia.org/r/635298

Here's the same data collected with commands like the following (using Process), over a 30 minute period.

varnishlog -n frontend -q "RespHeader eq 'X-Cache-Status: hit-front' and ReqURL ~ '^/wiki/*'" -i Timestamp | grep Process | awk '{print $5}' > cp3052-hit-front.log

hit-front

| host | samples | min | p25 | median | mean | p75 | max | stddev |
| cp3052 | 1175096 | 0.000138 | 0.000216 | 0.000247 | 0.000697 | 0.000278 | 4.269078 | 0.01909279 |
| cp3054 | 1133226 | 0.000000 | 0.000221 | 0.000249 | 0.001107 | 0.000279 | 11.836655 | 0.03249044 |
| Difference | -3.7% | -100% | 2.3% | 0.8% | 58.8% | 0.4% | 177.2% | 70.2% |

hit-local

| host | samples | min | p25 | median | mean | p75 | max | stddev |
| cp3052 | 32768 | 0.000549 | 0.001042 | 0.001237 | 0.014784 | 0.001808 | 1.724109 | 0.05559903 |
| cp3054 | 32312 | 0.000581 | 0.001074 | 0.001290 | 0.015297 | 0.001834 | 1.555266 | 0.0528903 |
| Difference | -1.4% | 5.8% | 3.1% | 4.3% | 3.7% | 1.4% | -10.8% | -5.1% |

miss

| host | samples | min | p25 | median | mean | p75 | max | stddev |
| cp3052 | 282624 | 0.1162 | 0.2816 | 0.3493 | 0.4139 | 0.4357 | 7.0794 | 0.2720957 |
| cp3054 | 329045 | 0.1161 | 0.2829 | 0.3450 | 0.4053 | 0.4297 | 4.2102 | 0.2469989 |
| Difference | 16.4% | -0.1% | 0.5% | -1.2% | -2.1% | -1.4% | -68.1% | -10.16% |

pass

| host | samples | min | p25 | median | mean | p75 | max | stddev |
| cp3052 | 83285 | 0.1160 | 0.2023 | 0.3160 | 0.3917 | 0.4350 | 5.7782 | 0.3779185 |
| cp3054 | 99214 | 0.0000 | 0.1815 | 0.3058 | 0.3630 | 0.4315 | 4.6277 | 0.2776744 |
| Difference | 19.1% | -100% | -11.4% | -3.3% | -7.9% | -0.8% | -24.8% | -36.1% |

cp3054 seems to be consistently a little faster for miss and pass, and overall a little slower for hit-front and hit-local. But I still can't see anything in the order of the extra tens of milliseconds seen on clients.

I thought I could also look at RUM data all the way to responseEnd, which isn't information we have per host on Grafana. This can show us whether the higher TTFB might be compensated by subsequent faster delivery of data.

SELECT event.responseend - event.fetchstart FROM event.navigationtiming WHERE year = 2020 AND month = 10 AND day > 6 AND recvfrom = 'cp3052.esams.wmnet' AND event.isOversample = false AND event.responseend - event.fetchstart >= 0;

There were a few freak high values that I filtered out (I removed anything greater than 10,000,000, i.e. 2.7 hours...).

Without geo oversamples

| host | samples | min | p25 | median | mean | p75 | max | stddev |
| cp3052 | 276804 | 0 | 128 | 306 | 589 | 618 | 7610298 | 14665 |
| cp3054 | 277133 | 0 | 146 | 335 | 621 | 642 | 7866575 | 16909 |
| Difference | n/a | 0% | 14% | 9.5% | 5.4% | 3.9% | 3.4% | 15.3% |

With geo oversamples

| host | samples | min | p25 | median | mean | p75 | max | stddev |
| cp3052 | 1464285 | 0 | 210 | 461 | 698 | 767 | 7610298 | 6939 |
| cp3054 | 1466906 | 0 | 222 | 483 | 713 | 778 | 7866575 | 8016 |
| Difference | n/a | 0% | 5.7% | 4.8% | 2.1% | 1.4% | 3.4% | 15.5% |

It doesn't seem like data is delivered faster, as the difference remains all the way to responseend.

I stumbled on T266040 while looking at something unrelated, but now I'm remembering that earlier in this ticket, there was some mention of a possible correlation with larger response sizes here in this report as well. The stuff in T266040 would have some statistical negative effect primarily on objects >= 256KB (the hieradata-controlled large_objects_cutoff), and a secondary effect on overall backend caching efficiency for everything else (by wasting space on pointless duplications of content). AFAIK what we're looking at in that ticket isn't something that changed from V5 to V6, though... but I could be wrong, or it could be that some other subtle change in V6 behavior exacerbated the impact.

Seeing that for some reason in my 30 minute test cp3054 was getting significantly more miss and pass requests than cp3052, I've just realised that I don't think I've looked at frontend hit rate before/after the Varnish 6 deployment.

Looking at the frontend hit rate, I think the change observed there might explain this regression. It's particularly striking for eqsin before/after the Varnish 6 migration (looking at 2xx responses):

It seems like we lost about 2% of frontend hits at that point.

It's more subtle for esams (although might be more visible by sampling per hour/day and not by 5m chunk as the graph currently does), but it does happen on the later date for that cluster's Varnish 6 migration:

I think that a 1-2% frontend hit rate loss on 2xx responses could very well explain what we're seeing in aggregate client side performance.

Comparing percentages for that 30-minute test, which was only looking at hit-front/hit-local/miss/pass for requests to /wiki/ URLs (approximation of wiki article URLs) since October 7:

| host | hit-front | hit-local | miss | pass |
| cp3052 | 74.6% | 2.1% | 18% | 5.3% |
| cp3054 | 71.1% | 2.1% | 20.6% | 6.2% |
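For reference, an alternative way to sample this split directly on a host would be something like the following (a sketch; it assumes varnishncsa is available on the cp hosts alongside varnishlog):

# Sketch: sample the share of each X-Cache-Status value for 5 minutes
timeout 300 varnishncsa -n frontend -F '%{X-Cache-Status}o' \
  | sort | uniq -c | sort -rn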

I'm not sure how frontend servers are picked to serve requests (hashed by IP? URL?), but this seems like a noteworthy piece of data, which might be further confirmation of what seems to be a cluster-wide phenomenon in my previous comment.

This would suggest that while per response type our Varnish 6 deployment performs the same as or better than Varnish 5, it might have more trouble keeping as many hot items in cache as previously? Do we have a dashboard/stats about how many objects Varnish frontends currently have in memory?

I've found the dashboard for total objects, and it seems like as many objects are stored now as there were before the Varnish 6 deployment. Maybe it could be an issue of objects that are stored but not found when access is attempted? I'm thinking of URL encoding issues, for example. Something we were doing in VCL that subtly stopped working in some edge cases in Varnish 6, for example.

It's a bit of a far-fetched argument, but if it is encoding-related, it might explain why eqsin is being affected more, as Asian traffic is more likely to be dealing with UTF-8 characters in URLs. Maybe this theory can be verified by looking at URL characters in wiki URLs at the Varnish level and checking whether the presence of some specific characters in the URL results in a higher rate of misses.
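A rough sketch of how this could be checked, by sampling miss URLs on both hosts and counting how many contain percent-encoded bytes (reusing the filtering from the earlier varnishlog captures):

# Sketch: collect miss URLs for 30 minutes, then count percent-encoded ones
timeout 1800 varnishlog -n frontend \
  -q "RespHeader eq 'X-Cache-Status: miss' and ReqURL ~ '^/wiki/'" \
  -i ReqURL | grep ReqURL | awk '{print $3}' > miss-urls.log
grep -c '%' miss-urls.log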

I'm not sure how frontend servers are picked to serve requests (hashed by IP? URL?), but this seems like a noteworthy piece of data, which might be further confirmation of what seems to be a cluster-wide phenomenon in my previous comment.

Frontend servers are hashed based on client source IP (so the same user should always land on the same frontend).

Backends are hashed based on content URL (mostly).

Change 635298 merged by Ema:
[operations/puppet@production] varnish: fix websockets on 6.x

https://gerrit.wikimedia.org/r/635298

Other than fixing websockets, this change deployed 2020-10-20 at 14:33 has improved things for the very worst requests (though they might not show up in RUM as they were mostly etherpad from what I could see). The absolute-worst time_firstbytes reported by ats-tls yesterday were around 26 seconds on Varnish 6 vs ~7 seconds on Varnish 5. Today there's no clear difference between v5 and v6.

Seeing that for some reason in my 30 minute test cp3054 was getting significantly more miss and pass requests than cp3052, I've just realised that I don't think I've looked at frontend hit rate before/after the Varnish 6 deployment.

Looking at the frontend hit rate, I think the change observed there might explain this regression.

I did notice the hitrate difference some weeks ago, but mentally dismissed it given that I could not find a correlation between time periods when the hitrate on v6 is higher than on v5 and what the RUM graph looks like. For example, between 2020-10-20T12:58 and 2020-10-20T18:43, the hitrate on cp3058 was ~1% higher than on cp3052 and yet cp3052 seems to perform much better. It is however a very interesting angle to take in the investigation of course; I'll look into this next.

I found that there's a significant difference between the number of n_objecthead on v5 and v6:

In particular, on v6 n_objecthead is roughly the same as n_objectcore (which is desirable). On v5 there's ~2x more objecthead. This is due to a bugfix, or various patches really, in v6 (see https://github.com/varnishcache/varnish-cache/issues/2754) but it seemed like a very clear change worth pointing out. Perhaps the fixes incidentally broke some Vary-related feature?
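For reference, the raw counters can also be read directly on a host with varnishstat:

# Current values of the object accounting counters on the frontend instance
varnishstat -n frontend -1 -f MAIN.n_objecthead -f MAIN.n_objectcore -f MAIN.n_object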

As @dpifke pointed out in our team meeting yesterday, there's also the possibility that v5 was returning hits on things that it shouldn't have, that are now correctly returned as misses. It would be really interesting to figure out if there's any pattern in what tends to be a hit on v5 vs v6. Understanding that delta might let us know if we're looking at a bug or a bugfix.

As @dpifke pointed out in our team meeting yesterday, there's also the possibility that v5 was returning hits on things that it shouldn't have, that are now correctly returned as misses. It would be really interesting to figure out if there's any pattern in what tends to be a hit on v5 vs v6. Understanding that delta might let us know if we're looking at a bug or a bugfix.

Yeah, following various rabbit holes we found for example that when an object is out of grace but in keep, starting with 6.0.1 that's considered a miss instead of a hit. This is visible in the varnish_main_cache_hit_grace difference between v5 and v6. It does not really explain the difference in frontend hits, but it's something.

cp3054 seems to be consistently a little faster for miss and pass, and overall a little slower for hit-front and hit-local. But I still can't see anything in the order of the extra tens of milliseconds seen on clients.

It is possible that the telemetry of both ats-tls and varnish-fe are actually accurate. However, they mostly capture the situation on the local node, while RUM data includes delays introduced over the WAN. Perhaps that is the area where v5 and v6 subtly differ. I'm gonna focus on trying to reproduce the issue from the perspective of a client, targeting specific nodes. See the instructions on Wikitech, but the general idea is:

sudo setcap 'cap_net_bind_service=+ep' /usr/bin/ssh
ssh -L 443:localhost:443 cp3052.esams.wmnet
curl --resolve en.wikipedia.org:443:127.0.0.1 https://en.wikipedia.org/wiki/Main_Page

The curl above is just for illustration purposes, it may or may not be useful to measure full connection establishments.
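If we go down this route, curl's own timing breakdown might be handy to separate connection setup from time-to-first-byte (again just a sketch):

# Sketch: per-request timing breakdown over the SSH tunnel
curl --resolve en.wikipedia.org:443:127.0.0.1 -s -o /dev/null \
  -w 'connect=%{time_connect} tls=%{time_appconnect} ttfb=%{time_starttransfer} total=%{time_total}\n' \
  https://en.wikipedia.org/wiki/Barack_Obama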

The telemetry of both ATS-TLS and Varnish is also blind to app-level, OS-level or hardware-level delays and buffering internal to our systems. They both report timestamps before writing to their client socket, but offer no visibility about when those bytes are actually received by the next layer.

That's why I suggested that we collect data about the ATS-TLS timestamps regarding its communication with Varnish, as it would allow us to see if there is a gap between the time Varnish is about to send data and when ATS-TLS has actually received it.

Before we dig into that, the "total time" we're going to collect from ATS-TLS should be enough to know whether it can contain the extra latency or not. If it can't possibly contain it, then there's no point investigating delays or buffering between ATS-TLS and Varnish or other parts of the "total" timeframe.

That being said, I still think that the drop of hit rate is a more likely explanation than individual request performance.

Before we dig into that, the "total time" we're going to collect from ATS-TLS should be enough to know whether it can contain the extra latency or not. If it can't possibly contain it, then there's no point investigating delays or buffering between ATS-TLS and Varnish or other parts of the "total" timeframe.

Fair enough.

That being said, I still think that the drop of hit rate is a more likely explanation than individual request performance.

Even if there seems to be no clear correlation between better v6 hitrate and RUM performance as mentioned in T264398#6567470?

I think that a major caveat with that 6-hour window counter-example is that it's looking at hit rate over everything and RUM data is only looking at wiki articles. The "live" 30-minute window experiment was filtering the URLs in varnishlog.

Also, looking at a 1 hour sampling size over a 6 hour period might be a bit extreme in terms of rounding. Switch to 5 minute sampling, and while the difference in hit rate stays consistent, the response start difference becomes very noisy and the reverse correlation to hit rate difference isn't there:

For reference, a 5-minute window during that particular period contains an average of 244 samples per host for RUM data:

SELECT COUNT(*) FROM event.navigationtiming WHERE year = 2020 AND month = 10 AND UNIX_TIMESTAMP(dt, "yyyy-MM-dd'T'HH:mm:ss'Z'") >= 1603198731 AND UNIX_TIMESTAMP(dt, "yyyy-MM-dd'T'HH:mm:ss'Z'") <= 1603219577 AND recvfrom = 'cp3052.esams.wmnet'

44075

Since we do have hit/miss data in RUM Hive for browsers that support Server Timing, let's look at it in the pre-v6 period and more recent times:

SELECT COUNT(*), PERCENTILE(nt.event.responsestart, 0.5) AS median_responsestart, nt.recvfrom, st.event.description
FROM event.servertiming AS st
JOIN event.navigationtiming AS nt ON st.event.pageviewtoken = nt.event.pageviewtoken
WHERE nt.year = 2020 AND nt.month = 9 AND nt.day >= 10 AND nt.day <= 24
AND st.year = 2020 AND st.month = 9 AND st.day >= 10 AND st.day <= 24
AND nt.event.isoversample = false
AND nt.recvfrom LIKE '%.esams.wmnet'
GROUP BY nt.recvfrom, st.event.description
LIMIT 100;
| Host | hit-front rate 2020-09-10 -> 2020-09-24 | hit-front rate 2020-10-07 -> 2020-10-21 |
| cp3050 | 72.8% | 68.4% |
| cp3052 | 72.6% | 72.8% |
| cp3054 | 72.8% | 68.4% |
| cp3056 | 72.3% | 68.4% |
| cp3058 | 72.8% | 68.4% |
| cp3060 | 72.5% | 68.3% |
| cp3062 | 72.8% | 68.4% |
| cp3064 | 72.6% | 68.6% |

Now, as one would expect given how requests are load balanced to hosts by IP, the total amount of requests sampled by each host over a 2-week period is very consistent. Which means that we can see what those requests that used to be hit-front became. Looking at the 2020-10-07 -> 2020-10-21 period:

| Host | hit-front | hit-local | miss | pass |
| cp3052 | 72.8% | 0.9% | 21.6% | 4.7% |
| cp3054 | 68.4% | 1.2% | 25.6% | 4.8% |

In other words, a 4% increase of misses and a possible 0.3% increase of backend hits.

Furthermore, let's look at the median responseStart data for the different response types, still during the 2020-10-07 -> 2020-10-21 period:

| Host | hit-front | hit-local | miss | pass |
| cp3052 | 234 | 316 | 629 | 753 |
| cp3054 | 234 | 329 | 629 | 748 |

I think this is solid confirmation that the global RUM performance change that started this investigation isn't a performance change of individual Varnish response types, but a lowered frontend hit rate for wiki articles. Varnish 6 still serves frontend hits, backend hits, misses and passes as fast as it used to. It's just serving 4% less frontend hits, in favor of misses when dealing with wiki pages.

I've compared 1+ million miss requests on cp3052 and cp3054 and looking at the most frequent miss URLs, there's no distinguishable pattern. Miss URLs are mostly a very flat and long tail, though.

Given that the number of changes between 5.1.3 and 6.0.6 is considerable, I was thinking of following this "bisect-like" approach: package Varnish 6.0.2, try it out on a node, see if there's any difference. If 6.0.2 performs better, then the regression happened between 6.0.3 and 6.0.6, otherwise earlier than 6.0.2.

I tried installing 6.0.2 on cp4032, and to my surprise I found out that 6.0.6 and 6.0.2 are not binary compatible:

[...]
Oct 27 09:43:37 cp4032 varnish-frontend[41162]:         VMOD wants ABI version 7.1
Oct 27 09:43:37 cp4032 varnish-frontend[41162]:         varnishd provides ABI version 7.0
[...]

The ABI change is not documented anywhere in the changelog, other than with the following wording in 6.0.4:

* New VRT functions for handling STRANDS (split strings) in
  VMODs. `vmod_blob` now uses strands internally.

At this point we should rebuild the whole circus (varnish-modules and our VMODs) against a varnish version earlier than 6.0.4 if we want to follow the bisect approach.

I tried installing 6.0.2 on cp4032, and to my surprise I found out that 6.0.6 and 6.0.2 are not binary compatible:

Macro facepalm:

Mentioned in SAL (#wikimedia-operations) [2020-10-27T15:25:33Z] <ema> cp4032: downgrade varnish to 6.0.4 T264398

With T266567 out of the way, we can now try different Varnish 6 versions, at least as long as they're VRT-compatible.

Mentioned in SAL (#wikimedia-operations) [2020-10-28T10:25:00Z] <ema> A:cp (except cp3052, running varnish 5) upgrade libvmod-netmapper to 1.9-1 T266567 T264398

Mentioned in SAL (#wikimedia-operations) [2020-10-29T08:53:57Z] <vgutierrez> A:cp (except cp3052, running varnish 5) upgrade libvmod-netmapper to 1.9-1 T266567 T264398

| Host | hit-front rate 2020-09-10 -> 2020-09-23 | hit-front rate 2020-10-29 -> 2020-11-02 |
| cp4027 | 70.3% | 64.4% |
| cp4028 | 72.2% | 64.2% |
| cp4029 | 71.2% | 63.8% |
| cp4030 | 71.3% | 64.3% |
| cp4031 | 71.9% | 63.9% |
| cp4032 | 72.1% | 65.4% |

We should probably wait to have at least a full week of post-warmup data for confirmation. This early result suggests that 6.0.4 might be doing a little better, but definitely not "5.1.3 better"...

We should probably wait to have at least a full week of post-warmup data for confirmation. This early result suggests that 6.0.4 might be doing a little better, but definitely not "5.1.3 better"...

At this point we should probably have enough data to draw some conclusions, though ulsfo might not be the best cluster given how relatively idle it is. Looking at the RUM data I don't see any shocking difference between cp4031 and cp4032. Tomorrow I'll revert the change on cp4032 so that we fix T268243. This is anyway a useful test in that it leads to two observations:

  1. when testing a different version of varnish, let's say 6.0.4, on a given host, it would be useful to also restart the service on a 6.0.6 host to compare behavior with roughly the same number of objects
  2. we should rebuild all packages every time we test a different Varnish version (T264398#6580942 and T268243)

Mentioned in SAL (#wikimedia-operations) [2020-11-24T09:13:19Z] <ema> cp4032: switch back to varnish 6.0.6-1wm2 after T264398 experiment, fix T268243

I pulled the November data for the experiment, for reference:

| Host | hit-front rate 2020-11-01 -> 2020-11-23 |
| cp4027 | 67.23% |
| cp4028 | 66.71% |
| cp4029 | 66.82% |
| cp4030 | 66.99% |
| cp4031 | 66.84% |
| cp4032 | 67.12% |

Indeed, no notable difference between cp4032 and the other hosts.

Varnish 6.0.7 is behaving well in terms of functionality on cp4032 (T268736).

I've rebooted cp4031 shortly after the upgrade of cp4032 for unrelated reasons (T256467), and that gives us the chance to compare the hitrate on both hosts. So far the frontend hitrate on 6.0.7 looks about 3% higher than on 6.0.6:

Having confirmed that 6.0.7 at least does not seem to be worse than 6.0.6, later today I'll upgrade a text@esams node too.

Mentioned in SAL (#wikimedia-operations) [2020-11-30T15:34:13Z] <ema> cp3054: upgrade varnish to 6.0.7-1wm1 T268736 T264398

Mentioned in SAL (#wikimedia-operations) [2020-12-07T16:43:04Z] <ema> deployment-cache-text06: downgrade varnish to 5.2.1-1wm1 T264398

It took a while to package and rebuild libvmod-re2, libvmod-netmapper, varnish-modules, and varnishkafka against varnish 5.2.1. Long story short, we now have the following packages available on deneb and working with Varnish 5.2.1:

| package | version |
| libvmod-re2 | 1.4.5-1 |
| libvmod-netmapper | 1.10-1 |
| varnish-modules | 0.15.0-1+wmf2 |
| varnishkafka | 1.1.0-2 |

I'm not going to upload them to apt.wikimedia.org as the plan is to test 5.2.1 on one single host. Instead, I've copied them to deployment-cache-text06, downgraded, and restarted varnish, so now https://en.wikipedia.beta.wmflabs.org/ is served by 5.2.1 and looks fine. Tomorrow I'll downgrade a text@esams host so we can see if the regression was introduced as early as 5.2.1 or if we have to look any further.

Change 646964 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] cache: downgrade Varnish on cp3054 to 5.2.1-1wm1

https://gerrit.wikimedia.org/r/646964

Change 646964 merged by Ema:
[operations/puppet@production] cache: downgrade Varnish on cp3054 to 5.2.1-1wm1

https://gerrit.wikimedia.org/r/646964

Mentioned in SAL (#wikimedia-operations) [2020-12-08T09:56:33Z] <ema> cp3054: downgrade varnish to 5.2.1-1wm1 T264398

Change 646776 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] Revert "cache: downgrade Varnish on cp3054 to 5.2.1-1wm1"

https://gerrit.wikimedia.org/r/646776

Change 646776 merged by Ema:
[operations/puppet@production] Revert "cache: downgrade Varnish on cp3054 to 5.2.1-1wm1"

https://gerrit.wikimedia.org/r/646776

Mentioned in SAL (#wikimedia-operations) [2020-12-08T10:13:51Z] <ema> cp3054: upgrade varnish back to 6.0.7-1wm1 T264398

I've downgraded Varnish to version 5.2.1-1wm1 on cp3054 and had to revert due to an issue with varnishlog. After the downgrade, I've noticed that systemd-journald was maxing out a CPU and suppressing millions of messages.

Dec 08 10:10:12 cp3054 systemd-journald[24968]: Suppressed 2584898 messages from varnish-frontend-hospita
Dec 08 10:10:14 cp3054 systemd-journald[24968]: Forwarding to syslog missed 1967838 messages.
Dec 08 10:10:42 cp3054 systemd-journald[24968]: Suppressed 2491469 messages from varnish-frontend-hospita
Dec 08 10:10:44 cp3054 systemd-journald[24968]: Forwarding to syslog missed 2049042 messages.

In particular, the following command produced a flood of "Log overrun" messages:

/usr/bin/varnishlog -g raw -q 'Backend_health ~ "Back healthy|Went sick"' -n frontend

That's the command used by varnish-frontend-hospital.service.

It's likely that 5.2.1 is affected by T264074, we need to backport our fix to 5.2.x too. However I don't think that during the 6.0.x upgrades we ran into this specific varnishlog issue, there might be more work to do in that regard too.

It's likely that 5.2.1 is affected by T264074, we need to backport our fix to 5.2.x too. However I don't think that during the 6.0.x upgrades we ran into this specific varnishlog issue, there might be more work to do in that regard too.

It looks like 5.2.1 had a ton of VSL-related issues. The following issues look relevant, #2586 in particular:

The list of VSM-related issues affecting 5.2.1 according to upstream's changelog is: 2430, 2470, 2518, 2535, 2541, 2545, 2546. I suspect they forgot to list some important changes given that 2586 isn't mentioned in changes.rst at all though it looks precisely like the infinite "Log overrun" problem mentioned yesterday.

Anyhow, for now I've applied:

The patch supposedly fixing https://github.com/varnishcache/varnish-cache/issues/2518, instead, does not apply cleanly and requires porting work.

OK the amount of work needed to get 5.2.1 in a usable state really seems excessive. Let's give a try to 6.0.0, which is the version immediately after 5.2.1 and according to the changelog should address the VSM bugs introduced in 5.2.1. I've rebuilt all the dependencies, here the versions for reference:

| package | version |
| varnish | 6.0.0-1wm1 |
| varnish-modules | 0.15.0-1+wmf2 |
| varnishkafka | 1.1.0-2 |
| libvmod-re2 | 1.5.3-1 |
| libvmod-netmapper | 1.9-2 |

Mentioned in SAL (#wikimedia-operations) [2020-12-09T16:10:42Z] <ema> deployment-cache-text06: deploy varnish 6.0.0-1wm1 T264398

Change 647615 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] cache: downgrade Varnish on cp3054 to 6.0.0-1wm1

https://gerrit.wikimedia.org/r/647615

Change 647615 merged by Ema:
[operations/puppet@production] cache: downgrade Varnish on cp3054 to 6.0.0-1wm1

https://gerrit.wikimedia.org/r/647615

Mentioned in SAL (#wikimedia-operations) [2020-12-10T09:28:46Z] <ema> cp3054: downgrade varnish to 6.0.0-1wm1 T264398

Initial results of the 6.0.0 experiment on cp3054 are encouraging: for the past 12 hours performance has been in line with 5.1.3 on cp3052:

Let's keep the experiment running during the weekend and re-evaluate on Monday.

Change 649278 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] ATS: change connect timeouts

https://gerrit.wikimedia.org/r/649278

Change 649278 merged by Ema:
[operations/puppet@production] ATS: change ats-be connect timeouts

https://gerrit.wikimedia.org/r/649278

Varnish 6.0.0 does not seem to be affected by the regression, here is the average webperf_navtiming_responsestart_by_host_seconds (p75) during the past 10 days:

| host | version | p75 |
| cp3054 | 6.0.0 | 487ms |
| cp3052 | 5.1.3 | 489ms |
| cp3060 | 6.0.7 | 508ms |
| cp3056 | 6.0.7 | 510ms |
See the relevant dashboard. I will try 6.0.1 next.

Mentioned in SAL (#wikimedia-operations) [2021-01-04T12:52:04Z] <ema> deployment-cache-text06: try out varnish 6.0.1-1wm1 T264398

6.0.1-1wm1 has now been working fine for a day on the beta cluster, upgrading cp3054.

Mentioned in SAL (#wikimedia-operations) [2021-01-05T09:21:35Z] <ema> cp3054: upgrade varnish to 6.0.1-1wm1 T264398

Over the past 2 days, 6.0.1 has been performing worse than 5.1.3, and similarly to 6.0.7. This would seem to indicate that the regression was introduced between 6.0.0 and 6.0.1.

| host | version | p75 |
| cp3052 | 5.1.3 | 504ms |
| cp3056 | 6.0.7 | 523ms |
| cp3060 | 6.0.7 | 526ms |
| cp3054 | 6.0.1 | 526ms |

Awesome, glad to see that the bisecting paid off! Still 361 commits between those 2 versions, though 😕

Awesome, glad to see that the bisecting paid off! Still 361 commits between those 2 versions, though 😕

Yeah, although most of those are, from a cursory look at the commit log, trivial and clearly not related (eg: h2, documentation, VTC tests). As previously noted in T264398#6568605 it seems possible that the fairly invasive changes introduced with PR 2705 might be involved in this regression. I've now built 6.0.1-1wm2 packages with P13928 applied to revert upstream PR 2705. We can give it a try next week on cp3054!

Mentioned in SAL (#wikimedia-operations) [2021-03-23T13:15:34Z] <ema> cp3054: install varnishkafka built explicitly against varnish 6.0.1-1wm2 to fix broken dpkg status T264398

The Response Time By Host dashboard shows no data; the last data points are from February 22. Is there anything we can do to fix it?

T277769 needs to be completed for the dashboard to be restored. Essentially the host data has to go through a new pipeline. The EventGate collection part is ready to merge, next I'll write the changes to the navtiming daemon. Since the data overwhelmed Grafana when looking at a whole DC, I think I'll change it to be just broken down by host, with the ability to select multiple hosts.

T277769 needs to be completed for the dashboard to be restored. Essentially the host data has to go through a new pipeline. The EventGate collection part is ready to merge, next I'll write the changes to the navtiming daemon.

Ack, thanks!

Since the data overwhelmed Grafana when looking at a whole DC, I think I'll change it to be just broken down by host, with the ability to select multiple hosts.

Sounds good, I found dashboards using webperf_navtiming_responsestart_by_host_seconds_bucket such as this one particularly useful, so it would be great to be able again to perform comparisons like webperf_navtiming_responsestart_by_host_seconds_bucket{host=~"cp3050"} vs webperf_navtiming_responsestart_by_host_seconds_bucket{host=~"cp3054"}.