
[Research] Investigate 10% regression for ResourceLoader cache hit rate
Closed, ResolvedPublic

Description

https://grafana.wikimedia.org/dashboard/db/resourceloader?from=now-30d


  • Ratio of If-None-Match requests (potential 304s) dropped from 35% to 25%.
  • Percentage of 304 responses from INM requests dropped from 87% to 77%.
  • HTTP 200 responses increased.
  • HTTP 304 responses dropped.

Event Timeline

Krinkle created this task.Nov 5 2015, 12:39 AM
Krinkle raised the priority of this task from to Needs Triage.
Krinkle updated the task description.
Krinkle added a subscriber: Krinkle.
Restricted Application added subscribers: StudiesWorld, Aklapper.Nov 5 2015, 12:39 AM
Krinkle renamed this task from [Research] Investigate 10% in ResourceLoader cache hit rate drop to [Research] Investigate 10% regression for ResourceLoader cache hit rate.Nov 5 2015, 12:39 AM
Krinkle set Security to None.
Krinkle updated the task description.Nov 5 2015, 7:13 PM
ori triaged this task as High priority.Nov 9 2015, 7:43 PM
ori moved this task from Inbox to Blocked or Needs-CR on the Performance-Team board.
Krinkle claimed this task.Nov 16 2015, 7:50 PM

Looking through Varnish logs for TxStatus:200 responses to ^/w/load.php requests, almost all of them are requests for either stylesheets (which don't have &version=) or the startup module.

I'm not sure why these are more commonly getting 200 instead of 304, but I think the problem lies outside ResourceLoader. The requests in question also don't have an If-None-Match header. In other words, it's not so much that the request is a cache miss (when it could have been a cache hit), but rather that the request URL for "the stylesheet" varies more often than it did before.

I can reproduce this by browsing to these urls:

The main stylesheet URL is different each time. I think this is an unintentional regression from the recent move toward fragmenting our modules and conditionally loading them based on page context. A few modules that are not loaded on all of the above:

  • ext.inputBox.styles (only on main page)
  • mediawiki.ui.button, mediawiki.ui.checkbox, mediawiki.ui.input (only on talk page)
  • mediawiki.raggett (only on main and talk page)
  • mediawiki.feedlink (only on history page)
  • mediawiki.action.edit.styles (only on edit page)

This is essentially because we do batching. And there is no localStorage de-duplication for stylesheets. Using JavaScript for these styles is tricky as it would depend on the startup module, and we'd likely flash without styles before the first proper render.
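To illustrate why batching makes the stylesheet URL vary per page, here is a minimal sketch. It assumes a simplified URL format (module names sorted and joined with "|"), which is not ResourceLoader's actual packing scheme, and the two "common" module names are only placeholders for whatever is loaded on every page. The page-specific modules are the ones listed above.

```python
from urllib.parse import urlencode

# Placeholder for style modules loaded on every page (illustrative assumption).
COMMON_STYLES = ["mediawiki.legacy.shared", "skins.vector.styles"]

# Context-specific style modules, taken from the list in this comment.
PAGE_EXTRAS = {
    "main": ["ext.inputBox.styles", "mediawiki.raggett"],
    "talk": [
        "mediawiki.ui.button",
        "mediawiki.ui.checkbox",
        "mediawiki.ui.input",
        "mediawiki.raggett",
    ],
    "history": ["mediawiki.feedlink"],
    "edit": ["mediawiki.action.edit.styles"],
}


def batched_stylesheet_url(page):
    """Build one batched stylesheet URL for a page (simplified format)."""
    modules = sorted(COMMON_STYLES + PAGE_EXTRAS[page])
    query = urlencode({"modules": "|".join(modules), "only": "styles"})
    return "/w/load.php?" + query


urls = {page: batched_stylesheet_url(page) for page in PAGE_EXTRAS}
distinct = len(set(urls.values()))

for page, url in urls.items():
    print(page, url)
print("distinct URLs:", distinct)
```

Because the whole batch is a single URL, any difference in the module set produces a URL the browser has not cached yet, even though most of the payload is identical across pages.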

Disabling batching is also an option for SPDY. But for that to work we'll probably want to vary on SPDY/non-SPDY, as adding 16 requests to the page is quite significant for non-SPDY users. We'll also want to consider the impact on HTML payload size.

Related:

ori awarded a token.Nov 17 2015, 5:31 PM
This comment was removed by Krinkle.

I don't know why I didn't see this earlier, but I now believe I wrongly categorised this change in traffic as a regression.

In the first graph in the opening post I focussed on If-None-Match request volume, as well as that volume as a percentage of the total traffic. That going down sounds negative without context, because one tends to assume those potential cache hits have become cache misses.

In fact, it was quite the opposite. While If-None-Match request rate went down from 8K/sec to 4K/sec, the total request rate also went down from 25K/sec to 21K/sec. So this traffic didn't turn from potential cache-hit (If-None-Match request) to a cache-miss (200 OK response), but rather it disappeared entirely.
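The arithmetic behind this can be checked with a quick sketch. It uses only the rounded rates quoted in the paragraph above; the variable names are just for illustration.

```python
# Rounded request rates (requests/sec) as quoted in this comment.
before = {"total": 25_000, "inm": 8_000}
after = {"total": 21_000, "inm": 4_000}


def non_inm(sample):
    """Requests without an If-None-Match header."""
    return sample["total"] - sample["inm"]


inm_drop = before["inm"] - after["inm"]
total_drop = before["total"] - after["total"]

print("INM drop:", inm_drop)
print("total drop:", total_drop)
print("non-INM before:", non_inm(before))
print("non-INM after:", non_inm(after))
print("INM share before: {:.1%}".format(before["inm"] / before["total"]))
print("INM share after: {:.1%}".format(after["inm"] / after["total"]))
```

The drop in If-None-Match requests equals the drop in total requests, and the non-INM rate is unchanged. If those requests had turned into unconditional 200 responses, the total would have stayed flat and the non-INM rate would have risen instead.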

https://grafana.wikimedia.org/dashboard/db/resourceloader?from=1444432085708&to=1449619385708


Since no such change is observed in other metrics, this isn't caused by lower traffic overall (e.g. fewer page views). The drop is exclusively observed in this area and additionally reflected in the 200 OK / 304 Not Modified response breakdown:

So where did those requests go? They're no longer made. The resources are presumably still demanded by the content, but they now more often hit local browser caches (and localStorage), so the HTTP request itself is never recorded on our end. Yay for better cache hits and less thrashing.

The only mystery left is the 304 satisfaction rate (which is scoped to INM requests and should not have changed). That metric dropped by about 10% as well. We can only assume that one of the well-cacheable resources got even better, so it now enjoys local cache hits more often and no longer participates in the 304 realm as much. Its absence skews the satisfaction rate toward that of the remaining INM requests.
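A small sketch of how that skew could arise. All numbers here are hypothetical, chosen only so that the starting rate matches the 87% from the task description; they are not measurements.

```python
def satisfaction_rate(groups):
    """Overall share of INM requests answered with 304.

    groups: list of (inm_requests, share_answered_with_304) pairs.
    """
    total = sum(n for n, _ in groups)
    satisfied = sum(n * rate for n, rate in groups)
    return satisfied / total


# Hypothetical: resource A revalidates well (95%), everything else less so (79%).
before = [(1000, 0.95), (1000, 0.79)]

# Hypothetical: A is now mostly served from the local cache, so it sends far
# fewer INM requests. Its own 304 rate and the rest of the traffic are unchanged.
after = [(200, 0.95), (1000, 0.79)]

rate_before = satisfaction_rate(before)
rate_after = satisfaction_rate(after)

print("before: {:.1%}".format(rate_before))
print("after: {:.1%}".format(rate_after))
```

No individual resource got worse in this example; the aggregate rate falls only because the best-performing resource contributes fewer INM requests to the average.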

It's important to remember that aside from 200 and 304 requests, there is another category of traffic: Requests never made. Moving things into that category is awesome.

Krinkle closed this task as Resolved.Dec 9 2015, 12:10 AM
Peter added a subscriber: Peter.Dec 9 2015, 7:07 AM