Verify that hit/miss stats in WebRequest are correct
Closed, Resolved, Public


In order to make a decision regarding MCS storage model simplification, it would be nice to have an estimate of the Varnish hit ratio for MCS endpoints. We can gather the data using a fairly simple Hadoop query. However, in a conversation with @JAllemandou during the All Hands, we established that we need to double-check whether the hit/miss data in the webrequest table is correct.

@JAllemandou could you please point me in the right direction for checking that?

Event Timeline

Hey @Pchelolo - I think talking to the traffic team should be the way to go here.
I ran a query to get results for x_cache fields for 1 hour of webrequest (in spark):

spark.sql("select x_cache, count(1) as c from wmf.webrequest where webrequest_source = 'text' and year = 2019 and month = 2 and day = 14 and hour = 7 group by x_cache order by c desc").show(50, false)

|x_cache                 |c      |
|cp3040 int              |2219832|
|cp3042 int              |1959479|
|cp3041 int              |1957246|
|cp3032 int              |1804951|
|cp3030 int              |1775279|
|cp3033 int              |1745421|
|cp5011 int              |955867 |
|cp5009 int              |946152 |
|cp5008 int              |938290 |
|cp5007 int              |925700 |
|cp5012 int              |902610 |
|cp1079 int              |459220 |
|cp4027 int              |371752 |
|cp1077 int              |345987 |
|cp1089 int              |333299 |
|cp1087 int              |327630 |
|cp1081 int              |313944 |
|cp1075 int              |302295 |
|cp1085 int              |287920 |
|cp1083 int              |277699 |
|cp4030 int              |262280 |
|cp4029 int              |230621 |
|cp4028 int              |227874 |
|cp4031 int              |225774 |
|cp4032 int              |216522 |
|cp1075 pass, cp1079 pass|182577 |
|cp1077 pass, cp1079 pass|181598 |
|cp1085 pass, cp1079 pass|176049 |
|cp1089 pass, cp1079 pass|173175 |
|cp1081 pass, cp1079 pass|168986 |
|cp1087 pass, cp1079 pass|158815 |
|cp1079 pass, cp1079 pass|158121 |
|cp1083 pass, cp1079 pass|149125 |
|cp1077 pass, cp1083 pass|142885 |
|cp1075 pass, cp1083 pass|142499 |
|cp1085 pass, cp1083 pass|136956 |
|cp1089 pass, cp1083 pass|134096 |
|cp1077 pass, cp1087 pass|132829 |
|cp1075 pass, cp1087 pass|131839 |
|cp1081 pass, cp1083 pass|130396 |
|cp1077 pass, cp1077 pass|129506 |
|cp1075 pass, cp1077 pass|129395 |
|cp1085 pass, cp1087 pass|126947 |
|cp1085 pass, cp1077 pass|126462 |
|cp1089 pass, cp1077 pass|125114 |
|cp1089 pass, cp1087 pass|124477 |
|cp1081 pass, cp1077 pass|122288 |
|cp1079 pass, cp1083 pass|121160 |
|cp1081 pass, cp1087 pass|120678 |
|cp1087 pass, cp1083 pass|120093 |

@BBlack do you have any concerns related to the hit/miss data sent to webrequest?

fdans moved this task from Incoming to Radar on the Analytics board.
jbond triaged this task as Medium priority. Mar 4 2019, 4:47 PM

The raw data should be accurate. I had thought we were already sending the summarized X-Cache-Status to Hadoop as well, but apparently not. It might be useful to get that going in another ticket, because it saves dealing with some of the complexity below. In the meantime:

The raw X-Cache data is subject to some interpretation to determine status. The code which interprets X-Cache into X-Cache-Status in Varnish is here: .

If you're just interested in the status and not the layer and other deep details, the cp hostnames are irrelevant, and the basic logic for parsing X-Cache into a final status is to classify by substring matches in the correct precedence order, like so:

if (X-Cache ~ "hit") {
        Status = "hit";
} elsif (X-Cache ~ "int") {
        Status = "int";
} elsif (X-Cache ~ "miss") {
        Status = "miss";
} elsif (X-Cache ~ "pass") {
        Status = "pass";
} else {
        Status = "unknown";
}
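
For offline analysis of the raw x_cache values, the same precedence can be sketched in Python. This is only an illustration of the substring logic described above, not the production VCL: the function name classify_x_cache is hypothetical, and the sample values containing hit and miss are made up for the example.

```python
# Precedence order taken from the pseudocode above: hit > int > miss > pass.
PRECEDENCE = ("hit", "int", "miss", "pass")


def classify_x_cache(x_cache):
    """Collapse a raw X-Cache value into a single status string.

    Hypothetical helper: plain substring matching, first match wins.
    """
    value = x_cache or ""
    for status in PRECEDENCE:
        if status in value:
            return status
    return "unknown"


if __name__ == "__main__":
    samples = [
        "cp3040 int",                 # taken from the query output above
        "cp1075 pass, cp1079 pass",   # taken from the query output above
        "cp1075 hit/3, cp1079 miss",  # made-up value: hit outranks miss
        "-",                          # made-up value with no known token
    ]
    for sample in samples:
        print(sample, "->", classify_x_cache(sample))
```

Because the check stops at the first match, a value that mentions both hit and miss is reported as hit, which is the behaviour the precedence order is meant to guarantee.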

unknown means there's a bug in our VCL related to tracking the status. This does sometimes happen, but usually not at a sufficient rate to cause analysis issues, so you're probably best off ignoring these unless they're significant in the subset you're looking at. int covers responses that were internally synthesized by Varnish, which usually means things like HTTP -> HTTPS redirects and client and/or Varnish transient hard errors of various kinds; these are probably also best ignored unless there's a reason they're significant for you. pass is traffic that the Varnish layer has decided is uncacheable by its nature, while miss means Varnish thinks it was potentially cacheable but didn't have a hit in the cache storage. There are different valid viewpoints on what one means by "hitrate" in different contexts, but two obvious ones that might be useful here:

  1. hit / (hit + miss) - This is usually what we call "True Hitrate", as it represents the rate achieved out of what it was even possible to cache.
  2. (hit + int) / (hit + int + miss + pass) - This means something more like "Out of all public requests, what percentage of them didn't generate an internal request to the application layer (regardless of the reason why)?"
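
Both definitions are simple ratios over per-status request counts. The sketch below assumes the counts have already been aggregated into a dict keyed by status. The function names and the numbers are hypothetical, chosen only to make the arithmetic easy to follow.

```python
def true_hitrate(counts):
    """hit / (hit + miss): share of cacheable requests served from cache."""
    hit = counts.get("hit", 0)
    miss = counts.get("miss", 0)
    cacheable = hit + miss
    return hit / cacheable if cacheable else None


def frontend_offload_rate(counts):
    """(hit + int) / (hit + int + miss + pass): share of requests that
    did not generate a request to the application layer."""
    served = counts.get("hit", 0) + counts.get("int", 0)
    total = served + counts.get("miss", 0) + counts.get("pass", 0)
    return served / total if total else None


if __name__ == "__main__":
    # Made-up counts, not real webrequest data.
    counts = {"hit": 600, "int": 100, "miss": 200, "pass": 100, "unknown": 5}
    print(true_hitrate(counts))           # 600 / 800
    print(frontend_offload_rate(counts))  # 700 / 1000
```

Requests classified as unknown are left out of both ratios, and each function returns None rather than dividing by zero when the relevant counts are empty.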

Depending on the point of view you're looking from, either of these could be called the "hitrate". The first, hit / (hit + miss), is a number that could at least theoretically approach 100% and represents the true hitrate of cacheable objects.
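
The classification can also be pushed into the query itself, so that counts come back already grouped by status. The sketch below runs a CASE expression against an in-memory SQLite table purely so that it is runnable on its own. SQLite is a stand-in here, the table name x_cache_counts and the helper status_counts are hypothetical, and the expression has not been run against wmf.webrequest in Spark. Rows marked as made-up are not real data.

```python
import sqlite3

# CASE branches are evaluated in order, which gives the required
# precedence: hit, then int, then miss, then pass.
STATUS_SQL = """
SELECT CASE
         WHEN x_cache LIKE '%hit%'  THEN 'hit'
         WHEN x_cache LIKE '%int%'  THEN 'int'
         WHEN x_cache LIKE '%miss%' THEN 'miss'
         WHEN x_cache LIKE '%pass%' THEN 'pass'
         ELSE 'unknown'
       END AS status,
       SUM(c) AS requests
FROM x_cache_counts
GROUP BY status
ORDER BY requests DESC
"""


def status_counts(rows):
    """Load (x_cache, count) rows into SQLite and sum counts per status."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE x_cache_counts (x_cache TEXT, c INTEGER)")
    conn.executemany("INSERT INTO x_cache_counts VALUES (?, ?)", rows)
    result = dict(conn.execute(STATUS_SQL).fetchall())
    conn.close()
    return result


if __name__ == "__main__":
    rows = [
        ("cp3040 int", 2219832),               # from the output above
        ("cp1075 pass, cp1079 pass", 182577),  # from the output above
        ("cp1077 pass, cp1079 pass", 181598),  # from the output above
        ("cp1075 hit/3, cp1079 miss", 1000),   # made-up row
        ("cp1075 miss, cp1079 miss", 250),     # made-up row
    ]
    print(status_counts(rows))
```

Doing the grouping in SQL keeps the result small enough to inspect by hand, at the cost of duplicating the precedence rules outside Varnish.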

Is there anything to do here? :-)

Pchelolo claimed this task.

Thank you, everyone! :)