
8-10% response start regression (Varnish 5.1.3-1wm15 -> 6.0.6-1wm1)
Open, High, Public

Description

After seeing a strong time correlation between the response start regression and the per-DC rollout of Varnish 6, for Eqsin on 2020-09-25:

And for Esams on 2020-09-29:

Varnish was rolled back to its previous version on 2020-10-05 on the Esams host cp3052. Once the cache had been repopulated, we can see that response start performance goes back to its pre-upgrade levels:

Which isn't the case for the rest of the Esams hosts, which were not rolled back and whose response start stayed elevated:

This regression could be a side-effect of T264074: varnishkafka 1.1.0 CPU usage increase, which happens alongside the upgrade. For that issue as well, we can see that the rollback gets rid of the CPU spikes on cp3052:

Event Timeline

Restricted Application added a project: Operations. · View Herald Transcript · Fri, Oct 2, 8:43 AM
Restricted Application added a subscriber: Aklapper. · View Herald Transcript

The most logical suspect is the rollout of Varnish 6 (T263557: Upgrade production cache nodes to Varnish 6).

It seems to affect North America before Europe, and the timing lines up with the rollouts of codfw and esams:


After the change, it stays elevated

Gilles triaged this task as High priority. · Fri, Oct 2, 8:58 AM
ema added a subscriber: ema. · Fri, Oct 2, 9:03 AM

I've improved the per-DC/host dashboard: https://grafana.wikimedia.org/d/M7xQ_BeWk/response-time-by-host

The change is clearly visible on Esams on 2020-09-29 and on Eqsin on 2020-09-25, which corresponds exactly to the Varnish 6 rollout for those DCs.

Gilles assigned this task to ema. · Fri, Oct 2, 10:45 AM

@ema as discussed on IRC, it seems sensible to roll back the change on at least one host on Esams for a few days next week to verify that this is what's causing the issue.

ema added a comment. · Fri, Oct 2, 11:12 AM

@ema as discussed on IRC, it seems sensible to roll back the change on at least one host on Esams for a few days next week to verify that this is what's causing the issue.

Indeed. If, as we expect, the downgrade does bring response-time-by-host back to the previous levels, the next step is to benchmark varnish 5.1.3-1wm15 vs 6.0.6-1wm1 locally and try to reproduce the regression with synthetic traffic.

BBlack added a subscriber: BBlack. · Fri, Oct 2, 1:01 PM

Just throwing in some random points/counterpoints to ponder:

  • It's possible it does take more than a day or three for the frontend caches to settle into an optimal pattern and compare well against the previous week, especially given there's both daily and time-of-week variations in the natural data and it's been less than a full week so far. Do we yet have any way to distinguish the impact of cache misses when looking at this (e.g. compare ttfb increase only on synth+hit)? Either way, a few days of rollback comparison on one host will tell us something useful about this.
  • We're also battling elevated CPU utilization since the V6 upgrade in T264074 - could be related, directly or indirectly? It's enough to look like it bumps the host's overall utilization by a few percent on average, and we're getting up into a critical range there in general, with peaks a bit above half.
  • Drilling into the per-host data in esams, they don't all show the pattern quite as clearly as the aggregate does. Do their individual patterns have any correlation with per-host rollout timings at least?
BBlack added a comment. · Fri, Oct 2, 1:12 PM

Eh maybe a few more to think about too:

  • The train this week caused a fair amount of churn with the rollout + rollback of 1.36.0-wmf.11. Is there any chance the train issues could also have perturbed stats enough to cause the disturbances seen here (or at least, could have delayed caches reaching optimal-ish state as fast as we'd expect)? I don't think it ever reached group2, but I could be wrong. A large impact on group1 might be enough to cause some issues, though.
  • Also, service proxy changes on the inside.

Again, in both cases above (and really in general), reliably splitting up any ttfb impact by generating layer (frontend synth/hit, be synth/hit, applayer) would be super-helpful.

The terminal-layer data in the hit/miss graph might correlate in interesting ways and help with that, but I have no idea if anything useful is visible there yet: https://grafana.wikimedia.org/d/000000500/varnish-caching?viewPanel=1&orgId=1&refresh=15m&var-cluster=All&var-site=esams&var-status=1&var-status=2&var-status=3&var-status=4&var-status=5&from=now-7d&to=now

@BBlack we don't see elevated latency that lasts for days like that on train rollouts and rollbacks. Train rollbacks are a frequent event. We're now at 6 days of consistently higher latency across the board.

Would adding the information currently exposed as x-cache-status (and in server-timing, where I can actually get it) as a third dimension on the per-host dashboard be what you want?
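
For reference, here's a quick way to see both pieces of information on a live response (a rough sketch; the exact header contents vary per request and per host):

curl -sI https://en.wikipedia.org/wiki/Main_Page | grep -iE '^(x-cache-status|server-timing):'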

Gilles added a comment (Edited). · Mon, Oct 5, 8:09 AM

As for per-host on esams, it is very clear on every host. Even clearer if you zoom out to 30 days and switch to a 1day rolling average:

ema moved this task from Triage to Bug Reports on the Traffic board. · Mon, Oct 5, 9:31 AM

Change 632201 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] cache: downgrade Varnish on cp3052 to 5.1.3-1wm15

https://gerrit.wikimedia.org/r/632201

Change 632201 merged by Ema:
[operations/puppet@production] cache: downgrade Varnish on cp3052 to 5.1.3-1wm15

https://gerrit.wikimedia.org/r/632201

Mentioned in SAL (#wikimedia-operations) [2020-10-05T10:28:54Z] <ema> cp3052: depool and downgrade varnish to 5.1.3-1wm15 T264398

Mentioned in SAL (#wikimedia-operations) [2020-10-05T10:32:13Z] <ema> cp3052: pool with varnish 5.1.3-1wm15 T264398

ema added a comment. · Mon, Oct 5, 10:52 AM

Varnish downgraded on cp3052. I've made a new dashboard comparing response time on cp3052 (v5) vs cp3054 (v6): https://grafana.wikimedia.org/d/EiAVq3FGz/t264398?viewPanel=2&orgId=1

Let's wait a bit for the cache on cp3052 to refill and see.

Gilles moved this task from Inbox to Radar on the Performance-Team board. · Mon, Oct 5, 6:37 PM
Gilles edited projects, added Performance-Team (Radar); removed Performance-Team.
CDanis added a subscriber: CDanis. · Mon, Oct 5, 7:06 PM

Seems like the cache is warm now and the host is faster than its peer:

ema added a subscriber: mark. · Wed, Oct 7, 9:08 AM

It seems that this task is starting to follow the communication pattern that emerged in T238494, which wasn't great. I'm flagging it early on to try to avoid the unpleasant back and forth we had back then, which did not take us very far (CC: @mark).

I don't have solutions, but I see problems (from my side of course, YMMV). The pattern I see is as follows (T=Traffic, P=Performance):

  1. T does some important, long overdue work
  2. P notices a correlation between the work and browser-reported TTFB going up
  3. P files a high-priority task with a screenshot of the graph and no analysis
  4. P asks to revert some/all of the changes if the graph does not go back to previous values
  5. T changes something to see if it helps
  6. P pastes a new screenshot of the graph showing that it does not
  7. T gets defensive
  8. Back to point 4 or 5

This highlights a few problems in different areas of the Traffic (all of sre?) - Performance-Team relationship, including:

  • Communication between the teams
  • Expectation management
  • Definition of responsibilities

Some of the questions I asked/ask myself are:

  • Did we explain why we had to do the work? If so, shouldn't it be clear that we're likely not going to go back to a huge pile of tech debt to make slow (TTFB ~550ms) requests 50ms faster?
  • What is the job of P in the context of this issue? Is it looking at a graph and pasting screenshots to phab? That does not seem like a valuable use of their time. Shouldn't they provide more content to the conversation? There's a ton of work that could be done, such as looking at different percentiles, different cache lookup status, digging through changelogs/commit logs, trying synthetic benchmarks on a development environment, ... Or is that not their responsibility?
  • What do phab priorities actually mean? It's clear that "unbreak now" essentially means that we must stop what we're doing and work on the UBN task (let's forget for a minute outages and other unbreak now tickets one might be working on). But what about all other priorities? Do they reflect the relative importance for users, for the author of the task, for the assignee of the task? Should a higher priority task have precedence over lower priority ones? What if the lower priority task is OKR work?
  • Are all performance regressions high priority? Does that mean that I should drop what I'm doing and investigate, as if we were in an outage? Am I supposed to do that alone?
  • What is a performance regression? "Slower than a week ago for 10 minutes in a row"? One hour? Are all regressions the same? Looking at a curve and observing that it's going up or down does not seem like a sound approach.
  • There's often a trade-off between reliability, performance and complexity. Which aspect do we need to prioritize?
  • Is the data we're looking at even reliable? What if, like in the case of this task, TTFB as perceived by the TLS terminator is fine, while TTFB as reported by clients is not?

When it comes to the very last point: (a) I currently see that Varnish 5 running on cp3052 does better than v6 on cp3050, cp3056, cp3058, and cp3060, while comparing v5 with v6 on cp3054, cp3062, and cp3064 I do not see anything conclusive. (b) Even looking at a node where browser-reported TTFB is much better on v5 (516 vs 605 ms):

By zooming into TTFB as reported by ats-tls I see the following:

  • hit-front: 33ms for both
  • hit-local: 34ms for both
  • int-front: ~34ms for both
  • miss/pass: v5 333ms, v6 326ms (faster!)

So, do we actually have a performance issue on Varnish 6?

What I would love us to have is better communication between the teams, and (high priority ;)) a well-defined SLO based on reliable metrics, clear rules for who should do what, and when.

It seems normal to me to paste screenshots of findings, create tasks when a significant regression is witnessed, verify the cause via a targeted rollback. Performance regressions are always complicated to analyse due to the organic nature of our traffic and the fact that our systems and code are constantly evolving. It's not the Performance team's responsibility to make decisions about what the Traffic team will do about identified performance regressions, merely to seek to fully understand its nature, magnitude and potential causes. When those regressions don't get addressed as a deliberate decision, we also have to keep track of them long term.

I really don't understand what I did wrong here. I helped improve the per-host dashboard so we could see things better, and I offered to build more instrumentation for Brandon that could help (and got no response). I spent hours digging through the data to get the best understanding possible. In fact, I spent a lot of time verifying my claim before even creating the task; I work very hard not to bother people about false positives.

When someone introduces a performance regression in MediaWiki code, I try to locate the source of the problem exactly the same way, involve people who likely caused it so they can verify that their code was the source of it, and they make a decision about what to do. Most recent example: T263832: Major performance regression on mobile site associated with 1.36.0-wmf.10. This involved code that people actually wrote themselves, and they don't complain about being pinged about it; we quickly worked on a fix together.

I understand that it's not a comfortable position to deploy software you don't write and discover that it potentially introduces problems like this, but at no point in this task did I request that this upgrade be cancelled or reverted globally; the point of the host revert is to narrow down the investigation, which is complicated on its own. I never blamed you or your team for anything negative, I've only left factual comments trying to figure out what's going on.

Can't we just focus on the data and keep trying to track this down and coming up with potential solutions if there are any? That's all I'm trying to do here.

Gilles added a comment (Edited). · Wed, Oct 7, 10:15 AM

On the topic of reliable metrics, our RUM monitoring is reliable, battle-tested and the ultimate truth about the performance users really experience. While it can't tell us where the issue comes from between the user's device and the bits coming from our backends, when the effect is this big, it's not noise. Something is going on; if you're not finding it in ats-tls-measured latency, it means that there are more rocks to turn over to find the source, not that it's a fluke.

Google collects the same RUM metrics to make an assessment about our performance (that's their system called CrUX), which they work into their search rankings. They've declared publicly that web performance will gradually get more weight in their ranking considerations in the future. A 10% regression across the board is a big deal. It's the kind of extra latency that can make us go into a lower bucket of quality of experience from Google's perspective. Our Google search ranking is Wikipedia's lifeline; the vast majority of our traffic comes from Google. If we start sinking in their rankings, it will have devastating effects on Wikipedia. So when our RUM performance gets significantly worse, it's potential fuel for a serious problem down the line and it has to be taken seriously. It can be challenging to figure out what on our end causes those RUM changes, and that's what we're doing here. As far as I'm concerned, we've reached a major point in the investigation because we've reproduced the problem (or its inverse, I guess) on a specific host. This means that we can stop looking for wild external explanations; it's extremely likely to be something coming from our own infrastructure.

while comparing v5 with v6 on cp3054, cp3062, and cp3064 I do not see anything conclusive

This is what I see computing the difference between cp3052 and those hosts with a 1d sampling size/rolling average over the last 10 days:

This is the same kind of magnitude of negative difference as with all the other hosts you mentioned.

greg added a subscriber: greg. · Wed, Oct 7, 3:12 PM
Gilles renamed this task from Elevated latency starting 2020-09-28 to 8-10% response start regression (Varnish 5.1.3-1wm15 -> 6.0.6-1wm1). · Wed, Oct 7, 4:00 PM
Gilles updated the task description. (Show Details)

@ema @BBlack before I build it, I want to confirm that some complementary information you're looking for is the ability to break down the RUM response start metric by "hit-front", "hit-local", etc. response type in addition to DC and host.

Gilles updated the task description. (Show Details) · Wed, Oct 7, 4:19 PM
BBlack added a comment. · Wed, Oct 7, 4:49 PM

@ema @BBlack before I build it, I want to confirm that some complementary information you're looking for is the ability to break down the RUM response start metric by "hit-front", "hit-local", etc. response type in addition to DC and host.

Yes, I think those would be very enlightening dimensions on the data, for sure. I think it's probably also useful to look at more than just p75. Is the effect more or less pronounced at the p90 or p99 level? What about the p50/p10 end of things, or even the simple mean?


But more importantly, I'd like to rewind to the meta-conversation @ema was trying to start, and maybe take a different angle of attack:

The crux of the issue is that, as things stand today, we don't have any clearly-defined SLO for the Traffic layer that all stakeholders are aligned on. There are a lot of metrics, and some of them can provide hints that something at some layer may be in need of some adjustment, to some subjective measure or standard. That any given metric data is objective and factual in nature, at its own level, is only the beginning of the conversation about that data, not the end of it.

Which of these metrics say something important from the user perspective, and which do not? Which help us identify the pieces and layers of our infrastructure responsible for a metric moving in an undesirable direction? What is a reasonable standard for how those metrics should behave? From the Traffic perspective, I think we still don't even have reliable metrics that are specific to our layers of the stack (e.g. a metric for the latency impact of just our traffic edge layer of the infrastructure, in isolation from everything else, which can say something like "the p50 +latency in the traffic layers is consistently ~0.5ms, but the p95 jumped from 63ms to 340ms starting yesterday"). Once we had one or more such highly-specific metrics, then the next task would be deciding on an SLO we can all agree on, e.g.: "p95 request processing latency in the traffic layers should be in the range of 3-12 milliseconds".

What we have today (if you'll allow a little hyperbolic story-telling license on my part) is a highly subjective reading of the tea leaves. The tea leaves in question may be objective metric data in some sense, but there is a ton of different data we can look at, and we don't easily know which components of our stack are implicated by a change in any of the data, we're not tracking every change (good or bad), we don't have an objective standard about the acceptability of any change, etc. The stories we end up with are basically of the nature "I noticed some graph somewhere, which can be affected by many things, seems to have diverged from normalcy to some degree, and I think I can guess at a correlated event in your layer which might have been causative, and I think this whole incident rises to some subjective level that it deserves attention and/or fixing".

I don't doubt that smart people can in fact find important things to fix by reading tea leaves like this some of the time (we've done a lot of it in SRE over the years!), but it can be frustratingly subjective in every way (interpretation, prioritization, etc), especially when it crosses inter-team barriers and there's little shared context between the teams. I think a really good use of our time in the coming quarter(s) would be to start figuring out how to define the reliable metrics we need, and then merging this into the OKR work that's coming up at the broader SRE level about defining real SLOs we can agree on and rely on in general. If we solve that problem, all the rest of this melts away and becomes pretty trivial and non-combative because the subjectivity has been stripped from the process.

I'm happy to work with the Traffic team to increase knowledge on web performance, which metrics matter and why, help you define your own SLOs, etc. We'll also have a professional trainer in February give another round of long form in-depth training that people from your team will be able to sign up for.

In the meantime I would appreciate it if we could work together as peers and stop belittling each other's expertise by calling it "reading tea leaves" and "highly subjective". The Performance team and I have an excellent track record of finding the root cause of performance regressions, confirmed by the fixes that followed.

I would like this task to remain focused on what it should be: fully understanding a confirmed significant performance regression caused by a version upgrade. Once it's fully understood, you can make your own assessment about what to do with it.

I've added a dropdown to pick the percentile on https://grafana.wikimedia.org/d/M7xQ_BeWk/response-time-by-host
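
To give an idea of the kind of query such a dropdown implies, here is a minimal sketch against the Prometheus HTTP API; it assumes the per-host metric is the navtiming_responsestart_by_host_seconds histogram and uses prometheus.example as a placeholder endpoint (both are assumptions, not the dashboard's actual query):

# Sketch only: p90 response start per host over the last day.
curl -sG 'http://prometheus.example/api/v1/query' \
  --data-urlencode 'query=histogram_quantile(0.9, sum by (le, host) (rate(navtiming_responsestart_by_host_seconds_bucket[1d])))'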

Here's what it looks like for cp3052 over the period of interest:

I've also made the comparison window configurable, as a 7-day comparison starts being misleading now that the regression is more than 7 days old.

ema added a comment. · Thu, Oct 8, 8:23 AM

I really don't understand what I did wrong here

It was never my intention to offend you, and given that I did, I want to apologize. I know that you're acting in good faith. I know that you're a professional with extensive experience. I know that you know what you're doing.

Can't we just focus on the data and keep trying to track this down and coming up with potential solutions if there are any? That's all I'm trying to do here.

Focusing on the technical aspects is the opposite of what I want to do. Sure, we can work on the technical issue. Sure, this task is not the right place to discuss team dynamics, expectation management, and responsibilities. However, I think that the problems mentioned in T264398#6524392 are real, and important. I'm happy to move the discussion anywhere else, but I don't know where.

Gilles added a comment (Edited). · Thu, Oct 8, 9:03 AM

I don't think that this discussion is appropriate in a public forum. An email thread seems like an OK starting point, and/or a meeting. Your team needs a mailing list; emails I send to Brandon go unanswered half the time, the last example being dated September 16. Maybe because my reports are dismissed wholesale by default and as such don't deserve the courtesy of a response, I suppose.

mark added a comment. · Thu, Oct 8, 11:52 AM

Hi all,

I recommend we limit the conversations on this task to the technical aspects of this particular regression and its investigation, going forward. I will follow up with everyone on the discussion on team dynamics, expectations and procedures in private, by email and in meetings. Thanks! :)

I would like to know what the Traffic team is doing or planning on doing about this investigation at the moment. Now that I've narrowed it down to this Varnish update with very high confidence, this reaches the limits of what I can find on my own. The Traffic team has been driving this update and has in-depth knowledge of Varnish and the edge cache stack; their expertise is required to narrow it down further.

My actions and findings are clearly documented here. I'm working to gather more information with T264987: Add cache response type and response size as new dimensions to navtiming_responsestart_by_host_seconds prometheus metric, which is going to take some time to get reviewed, merged and deployed. Is this extra information blocking the Traffic team from further investigating what changed in Varnish 6 that's causing this issue? Is anyone still trying to fix T264074: varnishkafka 1.1.0 CPU usage increase with the hope that it will make this regression go away at the same time?

The Traffic team's investigation shouldn't stop at the first other source of data that suggests that nothing has changed, like self-reported ats-tls latency. This regression affects all our frontend metrics: https://performance.wikimedia.org/#!/year

Varnish 6 as it's been deployed makes the experience worse for our users. Now that the regression has been going on for a while, we can also see that it has caused a dip in user satisfaction ratio (this is measured with a survey asking thousands of users their opinion daily):

That's a 2% reduction in people who think that pages load "fast enough". Meaning that the effect is definitely big enough for users to notice.

Debating whether this is a magnitude of regression we can afford to make permanent when balanced with other factors should happen after the regression is fully understood and reasonable attempts have been made at fixing it if we have any leads. The Performance team can inform the debate of whether we can afford such a regression being made permanent and the Traffic team, as owners of the systems involved, will decide what to do and will be accountable for the final decision.

Currently, however, I don't have confirmation that the Traffic team has a good understanding as to why this is happening and I see no updates indicating where and how they're looking to figure this out. Maybe the fix is as simple as a new configuration setting in Varnish 6. But we're not getting any information here that suggests that the Traffic team is looking at things of that nature. Maybe the Traffic team doesn't know how to figure out that Varnish 6 is causing this and is out of ideas, in which case it should be stated, so that everyone else can know whether we've reached the limits of what can be found out here. It helps to share which dead ends have been encountered as well, in order to reduce duplicated efforts and maybe come up with alternative ideas on what to look into.

If the Traffic team is denying that anything is wrong, which also seems like a possibility given the nature of the few responses we've seen on this task that were actually about the subject matter, then Traffic has to explain what we're seeing on cp3052, why all frontend metrics are going up across the board as a result of this update (as demonstrated by extremely strong temporal correlation to the staged deployment) and why user sentiment is dropping.

It's not for every web performance regression that we get such a strong signal pointing in a single direction. It is however pretty common to see conflicting data when trying to look deeper inside our infrastructure, like when you're looking at ats-tls-reported latency. That's generally when we find that this internal monitoring is only a partial view, is possibly incorrect or misleading and that maybe we're not measuring some things that we should be. There should be some actionables filed as subtasks if the Traffic team identifies blind spots in the internal monitoring of the edge cache. Besides T264987, are there things we're not currently measuring that the Traffic team thinks could help with this investigation?

Lastly, Varnish 6.0 is 5 minor versions behind the current latest stable. Has upgrading Varnish further been considered? Maybe there are too many breaking VCL changes to upgrade to 6.1 easily as an experiment, but it would be helpful to know if it's been looked into. A further upgrade is something we could try out on cp3052, for example. If the current investigation has reached a dead end for the Traffic team, this is probably a good candidate for a next step, hoping that we're currently hitting a bug that's been fixed in later versions.

ema added a comment. · Fri, Oct 9, 3:13 PM

With hey on cp3052 (Varnish 5.1.3-1wm15) and cp3054 (6.0.6-1wm1) I obtained the following two latency distributions in milliseconds when benchmarking cache hits for enwiki's Main_Page (~80k in size):

./hey -n 1000 -c 10 -H "X-Forwarded-Proto: https" -host "en.wikipedia.org" http://localhost/wiki/Main_Page

        5.1.3-1wm15   6.0.6-1wm1
p10     1.6           1.6
p25     2.0           2.0
p50     2.5           2.4
p75     3.2           3.1
p90     4.9           4.7
p95     6.6           6.7
p99     10.9          11.5

Latencies for the page about U.C. Sampdoria, which is larger (~268K), look like this:

        5.1.3-1wm15   6.0.6-1wm1
p10     2.3           2.3
p25     2.9           2.8
p50     3.5           3.2
p75     4.3           3.8
p90     5.7           4.9
p95     8.5           6.0
p99     15.2          10.5

Results for Accept-Encoding: gzip are similarly unremarkable.

The data does not seem to show significant differences between the performance of Varnish 5 and Varnish 6 when it comes to this sort of cache hit. As a next step we can try to observe the various Varnish milestone timestamps on miss/pass traffic, and see if there's any difference there.
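
A possible starting point for that step, sketched below with an arbitrary VSL query (adjust as needed), is to dump the milestone timestamps for miss/pass traffic only:

varnishlog -n frontend -q 'RespHeader ~ "X-Cache-Status: (miss|pass)"' -i Timestamp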

Gilles added a comment (Edited). · Mon, Oct 12, 7:39 AM

Thanks for trying that out. I ran the same command for much longer with higher concurrency. For Main_Page, I saw no difference.

Then I tried our go-to large article, with the benchmark running concurrently on both hosts.

./hey -n 1000000 -c 1000 -H "X-Forwarded-Proto: https" -host "en.wikipedia.org" http://localhost/wiki/Barack_Obama

A significant difference emerged. You can feel it right away by watching how long the benchmark takes to progress. The total runtime for cp3054 was 25% longer.

cp3052

Summary:
  Total:	538.3917 secs
  Slowest:	3.6410 secs
  Fastest:	0.0098 secs
  Average:	0.5122 secs
  Requests/sec:	1857.3839

Status code distribution:
  [200]	1000000 responses

Response time histogram:
  0.010 [1]	|
  0.373 [408769]	|∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  0.736 [380391]	|∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  1.099 [141824]	|∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  1.462 [40537]	|∎∎∎∎
  1.825 [17190]	|∎∎
  2.189 [8190]	|∎
  2.552 [1721]	|
  2.915 [527]	|
  3.278 [482]	|
  3.641 [368]	|

Latency distribution:
  10% in 0.0878 secs
  25% in 0.2483 secs
  50% in 0.4448 secs
  75% in 0.6846 secs
  90% in 0.9694 secs
  95% in 1.2226 secs
  99% in 1.8680 secs

cp3054

Summary:
  Total:	676.5448 secs
  Slowest:	3.6392 secs
  Fastest:	0.0100 secs
  Average:	0.6690 secs
  Requests/sec:	1478.0987

Status code distribution:
  [200]	1000000 responses

Response time histogram:
  0.010 [1]	|
  0.373 [168978]	|∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  0.736 [475582]	|∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  1.099 [263823]	|∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  1.462 [70445]	|∎∎∎∎∎∎
  1.825 [16243]	|∎
  2.187 [3935]	|
  2.550 [790]	|
  2.913 [166]	|
  3.276 [29]	|
  3.639 [8]	|

Latency distribution:
  10% in 0.2956 secs
  25% in 0.4492 secs
  50% in 0.6256 secs
  75% in 0.8387 secs
  90% in 1.0774 secs
  95% in 1.2484 secs
  99% in 1.6502 secs

Maybe this issue only affects larger payloads?

Is the Barack Obama article too big to be stored in Varnish? I can't remember what the limit is.

I've checked with curl and it does get cached by both hosts:

X-Cache: cp3060 miss, cp3052 hit/1000011
X-Cache-Status: hit-front

X-Cache: cp3060 miss, cp3054 hit/1000019
X-Cache-Status: hit-front

Oddly, cp3054 has Content-Length defined in the headers I get back, but cp3052 doesn't?

ema added a comment. · Mon, Oct 12, 8:27 AM

Thanks for trying that out. I ran the same command for much longer with higher concurrency. For Main_Page, I saw no difference.

Then I tried our go-to large article, with the benchmark running concurrently on both hosts.

./hey -n 1000000 -c 1000 -H "X-Forwarded-Proto: https" -host "en.wikipedia.org" http://localhost/wiki/Barack_Obama

Oh, it seems that you've tried that between 07:25 and 07:35? During that timeframe, p75 webperf on cp3052 (Varnish 5) was bad, up to 4.3 seconds (!). See the usual dashboard, but with the rate computed over 5 minutes instead of 1 hour:

I don't see the same effect on cp3054, interestingly. Also during the benchmark we've had a brief 502 spike on cp3052, while that did not happen on cp3054.

This might indicate a difference in the way v5 and v6 handle significant rates of concurrent requests with large responses. Can the work currently in progress on splitting webperf by cache_status (T264987) be extended to also include response size?

While I do think the data is very interesting, we should move the benchmarks elsewhere to avoid disrupting production traffic (I'm not blaming Gilles for trying the command, just to be very clear). Perhaps deployment-cache-text06.deployment-prep.eqiad.wmflabs might be a good labs instance to use for this purpose, hoping that its limited resources do not end up interfering with the results.

On a different angle: while looking at various Varnish internal metrics, I've noticed that there's a very visible decrease in LRU lock operations on Varnish 6 compared to Varnish 5. See increase(varnish_lock_operations{instance=~"cp305[26]:9331",target="lru"}[5m]); the difference after the downgrade of cp3052 is quite clear:

Going through the various commits that led to the 6.0.x series, it seems to me that the most likely reason is https://github.com/varnishcache/varnish-cache/issues/2527. If anything, less lock contention should result in better performance on v6. However, as an experiment I've tried disabling lru_interval (i.e. setting it to 0) on cp3054. Let's see what difference that makes performance-wise, if any. So far, I don't see anything interesting at all, so this is likely a red herring.
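
For reference, the parameter can be inspected and flipped at runtime on the frontend instance along these lines (a sketch; the actual change may well have gone through config management instead):

sudo varnishadm -n frontend param.show lru_interval
sudo varnishadm -n frontend param.set lru_interval 0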

Gilles added a comment (Edited). · Mon, Oct 12, 8:53 AM

I tried a bunch of different benchmark settings before I ended up using those, some of which definitely had too much concurrency and caused errors even on the benchmark responses themselves. So I'm not sure if you're seeing the failed attempts where the settings were really excessive, or the final run on the Barack Obama article. It's probably worth running again. The fact that these latest settings led to 100% 200 responses, unlike the excessive ones, leads me to believe that they're safe.

My shell history on those hosts isn't timestamped, so I can't confirm based on that.

I can definitely add response size as another per-host dimension.

Gilles added a comment (Edited). · Mon, Oct 12, 9:04 AM

If it was indeed the last benchmark run, what you're saying is that cp3052 favored the benchmark requests over others, while cp3054 remained more fair in regards to the rest of the traffic? I think that even moving this to WMCS we're always going to run into problems like that with benchmarks. We're more likely to discover bias caused by the benchmark itself than track down the root cause of what we're seeing in organic traffic.

ema added a comment. · Mon, Oct 12, 9:47 AM

I've checked with curl and it does get cached by both hosts:

[...]

Oddly, cp3054 has Content-Length defined in the headers I get back, but cp3052 doesn't?

Interesting, which curl command exactly gives you no CL on 3052?

I see CL being returned by Varnish on both hosts:

09:43:05 ema@cp3054.esams.wmnet:~
$ curl -v -H "X-Forwarded-Proto: https" -H "Host: en.wikipedia.org" localhost/wiki/Barack_Obama 2>&1 | grep Length
< Content-Length: 1479288
09:43:19 ema@cp3052.esams.wmnet:~
$ curl -v -H "X-Forwarded-Proto: https" -H "Host: en.wikipedia.org" localhost/wiki/Barack_Obama 2>&1 | grep Length
< Content-Length: 1479288

It was:

gilles@cp3052:/home/ema$ curl -I -H "X-Forwarded-Proto: https" -H "Host: en.wikipedia.org" http://localhost/wiki/Barack_Obama | grep Content-Length
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
gilles@cp3052:/home/ema$

gilles@cp3054:/home/ema$ curl -I -H "X-Forwarded-Proto: https" -H "Host: en.wikipedia.org" http://localhost/wiki/Barack_Obama | grep Content-Length
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0 1444k    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
Content-Length: 1479263
gilles@cp3054:/home/ema$
Gilles added a comment (Edited). · Mon, Oct 12, 10:24 AM

I guess the difference is that I did a HEAD request. It's still reproducible right now.

ema added a comment. · Mon, Oct 12, 12:39 PM

I guess the difference is that I did a HEAD request. It's still reproducible right now.

OK, I've tried to see if other responses without a body, like 304s to conditional requests, also showed any interesting difference, but they don't. So this might just be a change in the interpretation of RFCs when it comes to HEAD, but nothing relevant for this task, I think.

Gilles added a comment (Edited). · Tue, Oct 13, 12:42 PM

I've captured 30 minutes of data using varnishlog simultaneously on cp3052 and cp3054, using 4 variants of this command for hit-front, hit-local, miss and pass:

varnishlog -n frontend -q "RespHeader eq 'X-Cache-Status: hit-front' and ReqURL ~ '^/wiki/*'" -i Timestamp | grep Resp | awk '{print $5}' > cp3052-hit-front.log

The above recorded the total response time from Varnish's perspective (duration shown on last Timestamp entry available for the transaction), for hit-front responses on /wiki/* URLs.

Then I looked at percentiles on each dataset. The lower number is bolded.
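
One way to compute such percentiles from the captured files is sketched below (not necessarily the exact method used here):

# Each capture file contains one duration per line.
sort -n cp3052-hit-front.log | awk '{v[NR]=$1} END {printf "p25=%s median=%s p75=%s max=%s\n", v[int(NR*0.25)], v[int(NR*0.50)], v[int(NR*0.75)], v[NR]}'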

hit-front

host         samples   min        p25        median     mean       p75        max        stddev
cp3052       905671    0.000000   0.000254   0.000291   0.001017   0.000336   5.420407   0.0305825
cp3054       889287    0.000000   0.000256   0.000291   0.000929   0.000334   8.007157   0.0303075
Difference             -          0.8%       0%         -9.5%      -0.59%     47%        -0.9%

hit-local

host         samples   min        p25        median     mean       p75        max        stddev
cp3052       20480     0.000611   0.003231   0.006207   0.036319   0.050616   1.945326   0.07614508
cp3054       24120     0.000602   0.002979   0.005785   0.037372   0.041039   1.973892   0.08716184
Difference             -1.5%      -8.5%      -7.3%      2.9%       -23.3%     1.5%       14.5%

miss

host         samples   min       p25       median    mean      p75       max       stddev
cp3052       243029    0.1162    0.3125    0.4136    0.4876    0.5500    7.2766    0.319049
cp3054       251676    0.1161    0.2954    0.4033    0.4749    0.5459    4.9899    0.2933685
Difference             -0.08%    -5.8%     -2.5%     -2.7%     -0.7%     -45.8%    -10.9%

pass

host         samples   min       p25       median    mean      p75       max       stddev
cp3052       67811     0.1164    0.1914    0.3325    0.4667    0.5523    7.4085    0.5117037
cp3054       64170     0.1159    0.1784    0.2968    0.4295    0.5392    4.9915    0.3971842
Difference             -0.4%     -7.3%     -12%      -8.7%     -2.4%     -48.4%    -28.8%

I don't know if the time reported by Varnish includes the transmission time to the client, but as far as its own telemetry goes, cp3054 looks faster or at least as performant.

Seeing this, I pulled the RUM data for cp3052 and cp3054 for some offline processing, using this HQL query that reproduces the filtering we do when generating the Prometheus metric that ends up on the per-host dashboard. It looks at data from 2020-10-07 onwards:

SELECT event.responsestart - event.fetchstart FROM event.navigationtiming WHERE year = 2020 AND month = 10 AND day > 6 AND recvfrom = 'cp3052.esams.wmnet' AND event.isOversample = false AND event.responsestart - event.fetchstart >= 0;
host         samples   min   p25     median   mean    p75     max       stddev
cp3052       128452    0     108     257      469     520     148264    1343
cp3054       128176    0     115     285      483     543     625664    2129
Difference                   0.9%    10.9%    2.9%    4.4%    333%      58.5%

I've also looked at the same data, this time including the oversamples. That extra data isn't included in the Prometheus per-host metrics.

host         samples   min   p25     median   mean    p75     max       stddev
cp3052       670108    0     178     403      597     650     245149    1392
cp3054       674341    0     189     420      604     660     997972    2184
Difference                   6.2%    4.2%     1.2%    1.5%    307%      56.9%

Percentiles are a little higher for cp3054, less pronounced when including the geographically-biased oversamples. The maximum value and standard deviation are a lot higher in both cases.

It's possible that the extra time comes from something Varnish doesn't measure. It's unclear to me whether the last Timestamp in a Varnish response includes the time it took to actually ship the bytes to the client (ats-tls in this case?) and have them acknowledged.

ema added a comment. · Tue, Oct 20, 10:06 AM

I've captured 30 minutes of data using varnishlog simultaneously on cp3052 and cp3054, using 4 variants of this command for hit-front, hit-local, miss and pass:

varnishlog -n frontend -q "RespHeader eq 'X-Cache-Status: hit-front'" -q "ReqURL ~ '^/wiki/*'" -i Timestamp | grep Resp | awk '{print $5}' > cp3052-hit-front.log

[...]

It's possible that the extra time comes from something Varnish doesn't measure. It's unclear to me whether the last Timestamp in a Varnish response includes the time it took to actually ship the bytes to the client (ats-tls in this case?) and have them acknowledged.

The command above prints the 5th field of the Resp timestamp as reported by Varnish. From vsl(7):

The reported fields show the absolute time of the event, the time spent since the start of the task and the time spent since the last timestamp was logged.

Log lines look like this:

-   Timestamp      Resp: 1603186764.780741 0.000476 0.000081

And Resp is:

Resp   Delivery of response to the client finished.

So that would be the time spent between the start of the task and the full delivery of the response to the client including the whole body. The client is ats-tls for https traffic, the actual user-agent for http.

My understanding is that the RUM data instead shows the time spent between the beginning of the client request, event.fetchstart, and the moment when the first bytes are received, event.responsestart. It seems to me that if we want to look into Varnish's own telemetry and compare things with event.responsestart - event.fetchstart, then we should use the so called Process timestamp:

Process
       Processing finished, ready to deliver the client response.
ema added a comment. · Tue, Oct 20, 1:08 PM
SELECT event.responsestart - event.fetchstart FROM event.navigationtiming WHERE year = 2020 AND month = 10 AND day > 6 AND recvfrom = 'cp3052.esams.wmnet' AND event.isOversample = false AND event.responsestart - event.fetchstart >= 0;

I've noticed that on nodes with Varnish 6 the worst time_firstbyte values reported by ats-tls are very often around 26 seconds, and they're due to etherpad. Can you try this once again, but excluding Host: etherpad.wikimedia.org?

Change 635298 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] varnish: fix websockets on 6.x

https://gerrit.wikimedia.org/r/635298

SELECT event.responsestart - event.fetchstart FROM event.navigationtiming WHERE year = 2020 AND month = 10 AND day > 6 AND recvfrom = 'cp3052.esams.wmnet' AND event.isOversample = false AND event.responsestart - event.fetchstart >= 0;

I've noticed that on nodes with Varnish 6 the worst time_firstbyte values reported by ats-tls are very often around 26 seconds, and they're due to etherpad. Can you try this once again, but excluding Host: etherpad.wikimedia.org?

We don't collect RUM data on etherpad.wikimedia.org. As for the Varnish data, it checks for ReqURL ~ '^/wiki/*'. As far as I know, etherpad URLs all start with /p/.

Fair point about TTFB from Varnish's perspective, I'll re-run the collection of data I did before, this time on Process instead of Resp.

Change 635298 merged by Ema:
[operations/puppet@production] varnish: fix websockets on 6.x

https://gerrit.wikimedia.org/r/635298

Gilles added a comment (Edited). · Tue, Oct 20, 3:23 PM

Here's the same data collected with commands like the following (using Process), over a 30 minute period.

varnishlog -n frontend -q "RespHeader eq 'X-Cache-Status: hit-front' and ReqURL ~ '^/wiki/*'" -i Timestamp | grep Process | awk '{print $5}' > cp3052-hit-front.log

hit-front

host         samples   min        p25        median     mean       p75        max         stddev
cp3052       1175096   0.000138   0.000216   0.000247   0.000697   0.000278   4.269078    0.01909279
cp3054       1133226   0.000000   0.000221   0.000249   0.001107   0.000279   11.836655   0.03249044
Difference   -3.7%     -100%      2.3%       0.8%       58.8%      0.4%       177.2%      70.2%

hit-local

host         samples   min        p25        median     mean       p75        max        stddev
cp3052       32768     0.000549   0.001042   0.001237   0.014784   0.001808   1.724109   0.05559903
cp3054       32312     0.000581   0.001074   0.001290   0.015297   0.001834   1.555266   0.0528903
Difference   -1.4%     5.8%       3.1%       4.3%       3.7%       1.4%       -10.8%     -5.1%

miss

host         samples   min       p25       median    mean      p75       max       stddev
cp3052       282624    0.1162    0.2816    0.3493    0.4139    0.4357    7.0794    0.2720957
cp3054       329045    0.1161    0.2829    0.3450    0.4053    0.4297    4.2102    0.2469989
Difference   16.4%     -0.1%     0.5%      -1.2%     -2.1%     -1.4%     -68.1%    -10.16%

pass

host         samples   min       p25       median    mean      p75       max       stddev
cp3052       83285     0.1160    0.2023    0.3160    0.3917    0.4350    5.7782    0.3779185
cp3054       99214     0.0000    0.1815    0.3058    0.3630    0.4315    4.6277    0.2776744
Difference   19.1%     -100%     -11.4%    -3.3%     -7.9%     -0.8%     -24.8%    -36.1%

cp3054 seems to be consistently a little faster for miss and pass, and overall a little slower for hit-front and hit-local. But I still can't see anything in the order of the extra tens of milliseconds seen on clients.

I thought I could also look at RUM data all the way to responseEnd, which isn't information we have per host on Grafana. This can show us whether the higher TTFB might be compensated by subsequent faster delivery of data.

SELECT event.responseend - event.fetchstart FROM event.navigationtiming WHERE year = 2020 AND month = 10 AND day > 6 AND recvfrom = 'cp3052.esams.wmnet' AND event.isOversample = false AND event.responseend - event.fetchstart >= 0;

There were a few freak high values that I filtered out (I removed anything greater than 10,000,000, i.e. 2.7 hours...)

Without geo oversamples

host         samples   min   p25    median   mean   p75    max       stddev
cp3052       276804    0     128    306      589    618    7610298   14665
cp3054       277133    0     146    335      621    642    7866575   16909
Difference   n/a       0%    14%    9.5%     5.4%   3.9%   3.4%      15.3%

With geo oversamples

host         samples   min   p25    median   mean   p75    max       stddev
cp3052       1464285   0     210    461      698    767    7610298   6939
cp3054       1466906   0     222    483      713    778    7866575   8016
Difference   n/a       0%    5.7%   4.8%     2.1%   1.4%   3.4%      15.5%

It doesn't seem like data is delivered faster, as the difference remains all the way to responseend.

I stumbled on T266040 while looking at something unrelated, but now I'm remembering that earlier in this ticket there was some mention of a possible correlation with larger response sizes as well. The stuff in T266040 would have some statistical negative effect primarily on objects >= 256KB (the hieradata-controlled large_objects_cutoff), and a secondary effect on overall backend caching efficiency for everything else (by wasting space on pointless duplications of content). AFAIK what we're looking at in that ticket isn't something that changed from V5 to V6, though... but I could be wrong, or it could be that some other subtle change in V6 behavior exacerbated the impact.

Seeing that for some reason in my 30 minute test cp3054 was getting significantly more miss and pass requests than cp3052, I've just realised that I don't think I've looked at frontend hit rate before/after the Varnish 6 deployment.

Looking at the frontend hit rate, I think the change observed there might explain this regression. It's particularly striking for eqsin before/after the Varnish 6 migration (looking at 2xx responses):

It seems like we lost about 2% of frontend hits at that point.

It's more subtle for esams (although might be more visible by sampling per hour/day and not by 5m chunk as the graph currently does), but it does happen on the later date for that cluster's Varnish 6 migration:

I think that a 1-2% frontend hit rate loss on 2xx responses could very well explain what we're seeing in aggregate client side performance.
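
As a quick on-host cross-check of those graphs, the raw hit/miss counters can also be read straight from varnishstat (a sketch; the Grafana panels are built from the exported versions of these counters):

sudo varnishstat -n frontend -1 -f MAIN.cache_hit -f MAIN.cache_miss -f MAIN.cache_hitpass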

Comparing percentages for that 30-minute test, which was only looking at hit-front/hit-local/miss/pass for requests to /wiki/ URLs (approximation of wiki article URLs) since October 7:

host      hit-front   hit-local   miss    pass
cp3052    74.6%       2.1%        18%     5.3%
cp3054    71.1%       2.1%        20.6%   6.2%

I'm not sure how frontend servers are picked to serve requests (hashed by IP? URL?), but this seems like a noteworthy piece of data, which might be further confirmation of what seems to be a cluster-wide phenomenon in my previous comment.

This would suggest that while, per response type, our Varnish 6 deployment performs the same as or better than Varnish 5, it might have more trouble keeping as many hot items in cache as previously? Do we have a dashboard/stats about how many objects Varnish frontends currently have in memory?
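
Short of a dedicated dashboard, the object counts can at least be sampled directly on a host (a sketch; n_lru_nuked is included because evictions would be the other tell-tale sign):

sudo varnishstat -n frontend -1 -f MAIN.n_object -f MAIN.n_objecthead -f MAIN.n_lru_nuked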

Gilles added a comment (Edited). · Tue, Oct 20, 5:37 PM

I've found the dashboard for total objects, and it seems like as many objects are stored now as there were before the Varnish 6 deployment. Maybe it could be an issue of objects that are stored but not found when access is attempted? I'm thinking of URL encoding issues, for example: something we were doing in VCL that subtly stopped working in some edge cases in Varnish 6.

It's a bit of a far-fetched argument, but if it is encoding-related, it might explain why eqsin is being affected more, as Asian traffic is more likely to be dealing with UTF-8 characters in URLs. Maybe this theory can be verified by looking at URL characters in wiki URLs at the Varnish level and checking whether the presence of some specific characters in the URL results in a higher rate of misses.
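
A rough way to test that theory on a cache host is sketched below, assuming varnishncsa is available there; the ten-minute sample window and the format string are arbitrary choices:

# Record cache status and URL path for /wiki/ requests, then compare the miss
# rate of URLs containing percent-encoded characters against the rest.
timeout 600 varnishncsa -n frontend -q 'ReqURL ~ "^/wiki/"' -F '%{X-Cache-Status}o %U' > /tmp/wiki-status.log
awk '{k = ($2 ~ /%[0-9A-Fa-f][0-9A-Fa-f]/) ? "encoded" : "plain"; n[k]++; if ($1 == "miss") m[k]++} END {for (k in n) printf "%s: %d requests, %.1f%% miss\n", k, n[k], 100*m[k]/n[k]}' /tmp/wiki-status.log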

I'm not sure how frontend servers are picked to serve requests (hashed by IP? URL?), but this seems like a noteworthy piece of data, which might be further confirmation of what seems to be a cluster-wide phenomenon in my previous comment.

Frontend servers are hashed based on client source IP (so the same user should always land on the same frontend).

Backends are hashed based on content URL (mostly).

ema added a comment. · Wed, Oct 21, 10:06 AM

Change 635298 merged by Ema:
[operations/puppet@production] varnish: fix websockets on 6.x

https://gerrit.wikimedia.org/r/635298

Other than fixing websockets, this change deployed 2020-10-20 at 14:33 has improved things for the very worst requests (though they might not show up in RUM as they were mostly etherpad from what I could see). The absolute-worst time_firstbytes reported by ats-tls yesterday were around 26 seconds on Varnish 6 vs ~7 seconds on Varnish 5. Today there's no clear difference between v5 and v6.

Seeing that for some reason in my 30 minute test cp3054 was getting significantly more miss and pass requests than cp3052, I've just realised that I don't think I've looked at frontend hit rate before/after the Varnish 6 deployment.

Looking at the frontend hit rate, I think the change observed there might explain this regression.

I did notice the hitrate difference some weeks ago, but mentally dismissed it given that I could not find a correlation between time periods when the hitrate on v6 is higher than on v5 and the way the RUM graph looks. For example, between 2020-10-20T12:58 and 2020-10-20T18:43, the hitrate on cp3058 was ~1% higher than on cp3052 and yet cp3052 seems to perform much better. It is however a very interesting angle to take in the investigation, of course; I'll look into this next.

ema added a comment. · Wed, Oct 21, 1:18 PM

I found that there's a significant difference between the number of n_objecthead on v5 and v6:

In particular, on v6 n_objecthead is roughly the same as n_objectcore (which is desirable). On v5 there's ~2x more objecthead. This is due to a bugfix, or various patches really, in v6 (see https://github.com/varnishcache/varnish-cache/issues/2754) but it seemed like a very clear change worth pointing out. Perhaps the fixes incidentally broke some Vary-related feature?

As @dpifke pointed out in our team meeting yesterday, there's also the possibility that v5 was returning hits on things that it shouldn't have, that are now correctly returned as misses. It would be really interesting to figure out if there's any pattern in what tends to be a hit on v5 vs v6. Understanding that delta might let us know if we're looking at a bug or a bugfix.

ema added a comment. · Wed, Oct 21, 3:29 PM

As @dpifke pointed out in our team meeting yesterday, there's also the possibility that v5 was returning hits on things that it shouldn't have, that are now correctly returned as misses. It would be really interesting to figure out if there's any pattern in what tends to be a hit on v5 vs v6. Understanding that delta might let us know if we're looking at a bug or a bugfix.

Yeah, following various rabbit holes we found, for example, that starting with 6.0.1 an object that is out of grace but still within keep is considered a miss instead of a hit. This is visible in the varnish_main_cache_hit_grace difference between v5 and v6. It doesn't really explain the difference in frontend hits, but it's something.

ema added a comment. · Thu, Oct 22, 9:13 AM

cp3054 seems to be consistently a little faster for miss and pass, and overall a little slower for hit-front and hit-local. But I still can't see anything in the order of the extra tens of milliseconds seen on clients.

It is possible that the telemetry of both ats-tls and varnish-fe is actually accurate. However, they mostly capture the situation on the local node, while RUM data includes delays introduced over the WAN. Perhaps that is the area where v5 and v6 subtly differ. I'm gonna focus on trying to reproduce the issue from the perspective of a client, targeting specific nodes. See the instructions on Wikitech, but the general idea is:

sudo setcap 'cap_net_bind_service=+ep' /usr/bin/ssh
ssh -L 443:localhost:443 cp3052.esams.wmnet
curl --resolve en.wikipedia.org:443:127.0.0.1 https://en.wikipedia.org/wiki/Main_Page

The curl above is just for illustration purposes, it may or may not be useful to measure full connection establishments.
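
For what it's worth, curl's write-out timings give a cheap per-request TTFB measurement through such a tunnel (a sketch; connection setup is included unless the request is looped over a kept-alive session):

curl --resolve en.wikipedia.org:443:127.0.0.1 -so /dev/null -w 'ttfb=%{time_starttransfer}s total=%{time_total}s\n' https://en.wikipedia.org/wiki/Barack_Obama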

The telemetry of both ATS-TLS and Varnish is also blind to app-level, OS-level or hardware-level delays and buffering internal to our systems. They both report timestamps before writing to their client socket, but offer no visibility about when those bytes are actually received by the next layer.

That's why I suggested that we collect data about the ATS-TLS timestamps regarding its communication with Varnish, as it would allow us to see if there is a gap between the time Varnish is about to send data and when ATS-TLS has actually received it.

Before we dig into that, the "total time" we're going to collect from ATS-TLS should be enough to know whether it can contain the extra latency or not. If it can't possibly contain it, then there's no point investigating delays or buffering between ATS-TLS and Varnish or other parts of the "total" timeframe.

That being said, I still think that the drop of hit rate is a more likely explanation than individual request performance.

ema added a comment. · Thu, Oct 22, 9:45 AM

Before we dig into that, the "total time" we're going to collect from ATS-TLS should be enough to know whether it can contain the extra latency or not. If it can't possibly contain it, then there's no point investigating delays or buffering between ATS-TLS and Varnish or other parts of the "total" timeframe.

Fair enough.

That being said, I still think that the drop of hit rate is a more likely explanation than individual request performance.

Even if there seems to be no clear correlation between better v6 hitrate and RUM performance as mentioned in T264398#6567470?

Gilles added a comment (Edited). · Thu, Oct 22, 11:13 AM

I think that a major caveat with that 6-hour window counter-example is that it's looking at hit rate over everything and RUM data is only looking at wiki articles. The "live" 30-minute window experiment was filtering the URLs in varnishlog.

Also, looking at a 1 hour sampling size over a 6 hour period might be a bit extreme in terms of rounding. Switch to 5 minute sampling, and while the difference in hit rate stays consistent, the response start difference becomes very noisy and the reverse correlation to hit rate difference isn't there:

For reference, a 5-minute window during that particular period contains an average of 244 samples per host for RUM data:

SELECT COUNT(*) FROM event.navigationtiming WHERE year = 2020 AND month = 10 AND UNIX_TIMESTAMP(dt, "yyyy-MM-dd'T'HH:mm:ss'Z'") >= 1603198731 AND UNIX_TIMESTAMP(dt, "yyyy-MM-dd'T'HH:mm:ss'Z'") <= 1603219577 AND recvfrom = 'cp3052.esams.wmnet'

44075

Since we do have hit/miss data in RUM Hive for browsers that support Server Timing, let's look at it in the pre-v6 period and more recent times:

SELECT COUNT(*), PERCENTILE(nt.event.responsestart, 0.5) AS median_responsestart, nt.recvfrom, st.event.description
FROM event.servertiming AS st
JOIN event.navigationtiming AS nt ON st.event.pageviewtoken = nt.event.pageviewtoken
WHERE nt.year = 2020 AND nt.month = 9 AND nt.day >= 10 AND nt.day <= 24
AND st.year = 2020 AND st.month = 9 AND st.day >= 10 AND st.day <= 24
AND nt.event.isoversample = false
AND nt.recvfrom LIKE '%.esams.wmnet'
GROUP BY nt.recvfrom, st.event.description
LIMIT 100;
Host      hit-front rate 2020-09-10 -> 2020-09-24   hit-front rate 2020-10-07 -> 2020-10-21
cp3050    72.8%                                     68.4%
cp3052    72.6%                                     72.8%
cp3054    72.8%                                     68.4%
cp3056    72.3%                                     68.4%
cp3058    72.8%                                     68.4%
cp3060    72.5%                                     68.3%
cp3062    72.8%                                     68.4%
cp3064    72.6%                                     68.6%

Now, as one would expect given how requests are load balanced to hosts by IP, the total amount of requests sampled by each host over a 2-week period is very consistent. Which means that we can see what those requests that used to be hit-front became. Looking at the 2020-10-07 -> 2020-10-21 period:

Host      hit-front   hit-local   miss    pass
cp3052    72.8%       0.9%        21.6%   4.7%
cp3054    68.4%       1.2%        25.6%   4.8%

In other words, a 4% increase of misses and a possible 0.3% increase of backend hits.

Furthermore, let's look at the median responseStart data for the different response types, still during the 2020-10-07 -> 2020-10-21 period:

Host      hit-front   hit-local   miss   pass
cp3052    234         316         629    753
cp3054    234         329         629    748

I think this is solid confirmation that the global RUM performance change that started this investigation isn't a performance change of individual Varnish response types, but a lowered frontend hit rate for wiki articles. Varnish 6 still serves frontend hits, backend hits, misses and passes as fast as it used to. It's just serving 4% fewer frontend hits, in favor of misses, when dealing with wiki pages.

I've compared 1+ million miss requests on cp3052 and cp3054 and looking at the most frequent miss URLs, there's no distinguishable pattern. Miss URLs are mostly a very flat and long tail, though.
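
One way to produce that kind of tally is sketched below (not necessarily how the comparison above was done; the 30-minute window is arbitrary):

timeout 1800 varnishncsa -n frontend -q 'RespHeader eq "X-Cache-Status: miss"' -F '%U' | sort | uniq -c | sort -rn | head -50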

ema added a comment (Edited). · Tue, Oct 27, 10:21 AM

Given that the amount of changes between 5.1.3 and 6.0.6 is considerable, I was thinking of following this "bisect-like" approach: package Varnish 6.0.2, try it out on a node, see if there's any difference. If 6.0.2 performs better, then the regression happened between 6.0.3 and 6.0.6, otherwise it happened in 6.0.2 or earlier.

I tried installing 6.0.2 on cp4032, and to my surprise I found out that 6.0.6 and 6.0.2 are not binary compatible:

[...]
Oct 27 09:43:37 cp4032 varnish-frontend[41162]:         VMOD wants ABI version 7.1
Oct 27 09:43:37 cp4032 varnish-frontend[41162]:         varnishd provides ABI version 7.0
[...]

The ABI change is not documented anywhere in the changelog, other than with the following wording in 6.0.4:

* New VRT functions for handling STRANDS (split strings) in
  VMODs. `vmod_blob` now uses strands internally.

At this point we should rebuild the whole circus (varnish-modules and our VMODs) against a varnish version earlier than 6.0.4 if we want to follow the bisect approach.

I tried installing 6.0.2 on cp4032, and to my surprise I found out that 6.0.6 and 6.0.2 are not binary compatible:

Macro facepalm:

Mentioned in SAL (#wikimedia-operations) [2020-10-27T15:25:33Z] <ema> cp4032: downgrade varnish to 6.0.4 T264398

ema added a comment. · Tue, Oct 27, 3:30 PM

With T266567 out of the way, we can now try different Varnish 6 versions, at least as long as they're VRT-compatible.
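
Roughly, each bisect step on a canary host would look something like the sketch below; the package version string (6.0.4-1wm1) and the exact commands are illustrative only, and in practice the change goes through puppet and depool/pool as was done for cp3052:

sudo depool                                                     # drain the host first
sudo apt-get install -y --allow-downgrades varnish=6.0.4-1wm1   # hypothetical version string
sudo systemctl restart varnish-frontend.service
sudo pool                                                       # put it back in service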