
Verify traffic to static resources from past branches does indeed drain
Closed, ResolvedPublic


Follows-up T46570: Time prior to removal of old wmfbranch directories from cluster MUST be higher than longest cache of ANY kind; leads to missing resources

We've been removing wmf branches regularly, but I don't feel very confident in how we're doing this right now. Even if not on a regular basis, we should put some monitoring in place to verify how long it takes for traffic to an old wmf branch to drain after having rolled out a new branch to all wikis.

It should take about 30 days (given there are hardcoded references in the cached html).

Get data to answer:

  • How fast do we roll over? Do the majority of requests hit fresh versions within the first few days, or do they keep going to old versions until the cache expires?
  • How long is the tail of leftovers? (presumably only from third-party html caches)
  • Do all our own caches really expire within 30 days, or does the post-30-day tail contain requests with referrers from our own domains as well?

Related Objects

Event Timeline

Krinkle raised the priority of this task from to Needs Triage.
Krinkle updated the task description. (Show Details)
Krinkle added projects: Varnish, Deployments.
Krinkle added a subscriber: Krinkle.

Who should be working on this? I have no idea how to get this data.

Presumably by adding a tail subscriber to the varnish stream.

Basically we'd collect counts for requests to /(w/)?static/1.*wmf* and group by directory name and the reported http status code.
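A minimal sketch of that aggregation (just the counting pipeline; in practice it would be fed from a varnish log tail). The "URL STATUS" input format and the sample request lines here are invented for illustration:

```shell
# Sketch: count requests grouped by wmf branch and HTTP status code.
# Input lines are "URL STATUS", as a varnish tail subscriber might emit;
# the sample requests below are made up.
count_by_branch() {
  awk 'match($1, /1\.[0-9]+wmf[0-9]+/) {
    print substr($1, RSTART, RLENGTH), $2
  }' | sort | uniq -c | sort -rn
}

count_by_branch <<'EOF'
/static/1.26wmf14/resources/assets/poweredby_mediawiki_88x31.png 200
/w/static/1.26wmf14/resources/assets/poweredby_mediawiki_88x31.png 200
/static/1.26wmf1/resources/assets/poweredby_mediawiki_88x31.png 404
EOF
```

The awk `match()` pulls the branch name out of the URL regardless of whether the path starts with /static/ or /w/static/, so both prefixes count toward the same branch.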

Hey guys, so we are trying to get away from servicing ad-hoc data requests, but we're more than happy to show you around the infrastructure so you can get what you need. Other teams have analysts that go through us and we just help them navigate the infrastructure. If you designate someone to be your analyst, just ping me on this task and I'll help out.

Adding Release-Engineering since Deployment-Systems was removed. This is related to the weekly deployment train and how train operators may be removing wmf branches from the servers prematurely, causing 404 errors.

@Krinkle: I don't think anyone has been removing branches prematurely.

Also, I am proposing that we switch to long lived branches (see T89945) which should negate the problem entirely.

To investigate this, @Krinkle and I collected 10 minutes' worth of requests to poweredby_mediawiki_*.png on cp1065 using the following command:

$ timeout --foreground 600 varnishlog -nfrontend -mRxURL:"resources/assets/poweredby_mediawiki" \
    | grep --line-buffered RxURL > 5_min_poweredby_reqs.txt

We then counted requests for each branch like this:

$ grep -Po '1.\d+wmf\d+' 5_min_poweredby_reqs.txt | sort | uniq -c | sort -rn
  19341 1.26wmf14
    551 1.26wmf15
     19 1.26wmf13
      8 1.26wmf4
      5 1.26wmf7
      3 1.26wmf1
      2 1.26wmf12
      2 1.26wmf10
      1 1.26wmf8
      1 1.26wmf5
      1 1.26wmf11

Sorted differently (branch order):

1.26wmf1 3
1.26wmf4 8
1.26wmf5 1
1.26wmf7 5
1.26wmf8 1
1.26wmf10 2
1.26wmf11 1
1.26wmf12 2
1.26wmf13 19
1.26wmf14 19341
1.26wmf15 551

The 3 requests to 1.26wmf1 are for a branch that was replaced (no longer live on any server) on April 22nd, i.e. 90 days ago: three times the 30-day expiry we have.

I think it's clear that we need to abandon the practice of branching & changing URL prefixes each week.

One naive solution would be to replace old branches with symlinks to a current branch. This would mostly solve the problem of missing resources but could result in brokenness when images are resized or removed.
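A rough sketch of what that symlink replacement could look like (the directory layout, root path, and branch names here are invented for the demo; this is not the actual deploy tooling):

```shell
# Sketch: replace retired branch directories with symlinks to the current
# branch so stale /static/<branch>/ URLs resolve instead of 404ing.
# STATIC_ROOT and the branch names are invented for this demo.
set -e
STATIC_ROOT=$(mktemp -d)
CURRENT=1.26wmf15

mkdir -p "$STATIC_ROOT/$CURRENT" "$STATIC_ROOT/1.26wmf13" "$STATIC_ROOT/1.26wmf14"

for dir in "$STATIC_ROOT"/1.*wmf*; do
  branch=$(basename "$dir")
  [ "$branch" = "$CURRENT" ] && continue   # keep the live branch
  [ -L "$dir" ] && continue                # already converted
  rm -rf "$dir"
  ln -s "$CURRENT" "$dir"                  # old URL now serves current assets
done
```

As noted above, this only papers over missing files: a resized or removed image would still come back wrong (or 404) through the symlink, just less visibly.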

Raising priority and adding to the Performance-Team workboard. The fact that our front ends seem to be serving expired, stale cache objects is concerning. The cause of this must be determined (cache whitewashing, malfunctioning expiry logic, or something else) and resolved.

Aside from occasional HTTP 404 errors for static assets, this problem keeps coming back because it violates the fundamental assumption we keep echoing around: that "stuff" is not cached for more than 30 days. That it's fine to:

  • Remove back-compat CSS for cached HTML 30 days after deployment because the skin html will have turned over by then.
  • Deploy changes to the ResourceLoader startup module after warming up OutputPage HTML with forward-compat syntax (e.g. the RLQ feature recently) because the front-end caches will have rolled over by then.
  • Remove old wmf branches 30 days after switching the branch off in PHP, because text caches will have rolled over by then and cached html will no longer reference those files.

Specifically, the second point is the reason for this priority raise, because it means we can't trust our data and are knowingly breaking pages for a small set of users in unpredictable ways.
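The 30-day invariant above can be spot-checked mechanically against real response headers. A hedged helper sketch (the function name is mine, not part of any existing tooling): parse max-age out of a Cache-Control header, e.g. one fetched with `curl -sI`, and verify it stays within 30 days:

```shell
# Sketch: verify a Cache-Control header obeys the 30-day cap.
# 30 days = 30 * 24 * 3600 = 2592000 seconds.
max_age_ok() {
  age=$(printf '%s\n' "$1" | grep -oE 'max-age=[0-9]+' | head -n1 | cut -d= -f2)
  [ -n "$age" ] && [ "$age" -le 2592000 ]
}

# Example header value (illustrative):
max_age_ok 'Cache-Control: public, max-age=300, s-maxage=300' \
  && echo "within 30-day cap"
```

Note this treats a missing max-age as a failure, which is the conservative reading for this invariant; a heuristic pass would also need to consider s-maxage and Expires.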

Krinkle triaged this task as High priority.Aug 5 2015, 7:29 PM
Krinkle edited projects, added Performance-Team; removed Analytics-Backlog.

Change 229714 had a related patch set uploaded (by BBlack):
Cap all cache objects to max 30 days

Change 229714 merged by BBlack:
Cap all cache objects to max 30 days

Can we confirm that this is now working as expected?

I don't appear to have access to cp1065, otherwise I would try to reproduce @ori's results myself.

I'm pretty sure the 30 day cap works, but I'm also not really confident that was the problem to begin with. It only would've solved the case where MW itself was setting cacheability headers that were significantly greater than 30 days. The default case, and all known cases of MW headers I've seen before, were all explicitly 30 days or less anyways.

Other possible scenarios:

  1. These aren't necessarily legit requests based on our own pages - should correlate Referer if possible (e.g. 3rd-party links to our resources)
  2. External caches (although if we're talking about links straight out of article content, I'm pretty sure we set CC headers back to zero for consumers outside WMF, for purging purposes)
  3. Perhaps some cache headers are allowing browser-internal caching significantly past 30 days and/or until browser restart, and people tend to leave desktop browsers open for a very long time
  4. Of course, varnish could have some horrible bug deep in object TTL handling
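For scenario 1, a sketch of how that Referer correlation could be tallied (the "URL REFERER" input format, the internal-domain pattern, and the sample lines are my invention; in practice the lines would come from the varnish log):

```shell
# Sketch: classify requests by Referer into internal / external / none,
# given lines of "URL REFERER" ("-" meaning no Referer header).
classify_referers() {
  awk '{
    if ($2 == "-" || $2 == "")          print "none"
    else if ($2 ~ /wiki[mp]edia\.org/)  print "internal"
    else                                print "external"
  }' | sort | uniq -c | sort -rn
}

classify_referers <<'EOF'
/static/1.26wmf18/resources/assets/poweredby_mediawiki_88x31.png https://en.wikipedia.org/wiki/Example
/static/1.26wmf4/resources/assets/poweredby_mediawiki_88x31.png -
/static/1.26wmf4/resources/assets/poweredby_mediawiki_88x31.png https://example.com/page
EOF
```

Requests to old branches landing in the "internal" bucket would point at our own caches; "external"/"none" would point at third parties like the facebookexternalhit traffic seen below in the thread's later samples.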

Reproduction based on @ori's earlier commands on cp1065 again, sorted in version order:

    2 1.25wmf1
    1 1.25wmf4
    1 1.25wmf6
    4 1.25wmf10
    2 1.25wmf12
    1 1.25wmf13
    3 1.25wmf14
    4 1.25wmf15
    3 1.25wmf16
    2 1.25wmf17
    5 1.25wmf18
    1 1.25wmf19
    2 1.25wmf20
    5 1.25wmf21
    5 1.25wmf22
    5 1.25wmf23
    2 1.25wmf24
    5 1.26wmf4
    2 1.26wmf5
    3 1.26wmf6
    5 1.26wmf7
    2 1.26wmf8
    3 1.26wmf9
    1 1.26wmf10
   69 1.26wmf16
  338 1.26wmf17
21732 1.26wmf18
    6 1.26wmf19

I even get 1.25 results, not just old 1.26 results. It could be that I'm logging at a very different time of day than the original results. Will take a look at referer next and see if that sorts any of this out...

So, our legit requests for recent versions that we expect all have an obvious Referer header of ours, as in:

146 RxURL        c /static/1.26wmf18/resources/assets/poweredby_mediawiki_88x31.png
146 RxHeader     c Referer:

I took another 10-minute sample, looking only at 1.26wmfN where N is single-digit, and capturing the other relevant headers (Host, Referer, User-Agent). None of them had Referer set, and they all had a UA of facebookexternalhit:

# timeout --foreground 600 varnishlog -nfrontend '-mRxURL:1\.26wmf\d/resources/assets/poweredby_mediawiki'     | egrep --line-buffered -i 'RxURL|Referer:|Host:|User-Agent:'
   64 RxURL        c /static/1.26wmf7/resources/assets/poweredby_mediawiki_88x31.png
   64 RxHeader     c User-Agent: facebookexternalhit/1.0 (+
   64 RxHeader     c Host:
   13 RxURL        c /static/1.26wmf4/resources/assets/poweredby_mediawiki_88x31.png
   13 RxHeader     c User-Agent: facebookexternalhit/1.1 (+
   13 RxHeader     c Host:
   39 RxURL        c /static/1.26wmf4/resources/assets/poweredby_mediawiki_88x31.png
   39 RxHeader     c User-Agent: facebookexternalhit/1.1 (+
   39 RxHeader     c Host:
  232 RxURL        c /static/1.26wmf4/resources/assets/poweredby_mediawiki_88x31.png
  232 RxHeader     c User-Agent: facebookexternalhit/1.0 (+
  232 RxHeader     c Host:
  184 RxURL        c /static/1.26wmf7/resources/assets/poweredby_mediawiki_88x31.png
  184 RxHeader     c User-Agent: facebookexternalhit/1.1 (+
  184 RxHeader     c Host:
   23 RxURL        c /static/1.26wmf7/resources/assets/poweredby_mediawiki_88x31.png
   23 RxHeader     c Host:
   23 RxHeader     c User-Agent: facebookexternalhit/1.0 (+
  141 RxURL        c /static/1.26wmf4/resources/assets/poweredby_mediawiki_88x31.png
  141 RxHeader     c User-Agent: facebookexternalhit/1.1 (+
  141 RxHeader     c Host:
   35 RxURL        c /static/1.26wmf4/resources/assets/poweredby_mediawiki_88x31.png
   35 RxHeader     c Host:
   35 RxHeader     c User-Agent: facebookexternalhit/1.1 (+
   74 RxURL        c /static/1.26wmf3/resources/assets/poweredby_mediawiki_88x31.png
   74 RxHeader     c User-Agent: facebookexternalhit/1.0 (+
   74 RxHeader     c Host:
  124 RxURL        c /static/1.26wmf9/resources/assets/poweredby_mediawiki_88x31.png
  124 RxHeader     c Host:
  124 RxHeader     c User-Agent: facebookexternalhit/1.0 (+
    3 RxURL        c /static/1.26wmf7/resources/assets/poweredby_mediawiki_88x31.png
    3 RxHeader     c User-Agent: facebookexternalhit/1.1 (+
    3 RxHeader     c Host:
  156 RxURL        c /static/1.26wmf5/resources/assets/poweredby_mediawiki_88x31.png
  156 RxHeader     c Host:
  156 RxHeader     c User-Agent: facebookexternalhit/1.0 (+
  112 RxURL        c /static/1.26wmf4/resources/assets/poweredby_mediawiki_88x31.png
  112 RxHeader     c User-Agent: facebookexternalhit/1.1 (+
  112 RxHeader     c Host:
   17 RxURL        c /static/1.26wmf9/resources/assets/poweredby_mediawiki_88x31.png
   17 RxHeader     c User-Agent: facebookexternalhit/1.0 (+
   17 RxHeader     c Host:
   59 RxURL        c /static/1.26wmf9/resources/assets/poweredby_mediawiki_88x31.png
   59 RxHeader     c Host:
   59 RxHeader     c User-Agent: facebookexternalhit/1.0 (+
   77 RxURL        c /static/1.26wmf6/resources/assets/poweredby_mediawiki_88x31.png
   77 RxHeader     c Host:
   77 RxHeader     c User-Agent: facebookexternalhit/1.0 (+
   32 RxURL        c /static/1.26wmf9/resources/assets/poweredby_mediawiki_88x31.png
   32 RxHeader     c Host:
   32 RxHeader     c User-Agent: facebookexternalhit/1.1 (+
   35 RxURL        c /static/1.26wmf5/resources/assets/poweredby_mediawiki_88x31.png
   35 RxHeader     c User-Agent: facebookexternalhit/1.0 (+
   35 RxHeader     c Host:
   15 RxURL        c /static/1.26wmf9/resources/assets/poweredby_mediawiki_88x31.png
   15 RxHeader     c Host:
   15 RxHeader     c User-Agent: facebookexternalhit/1.1 (+
   30 RxURL        c /static/1.26wmf8/resources/assets/poweredby_mediawiki_88x31.png
   30 RxHeader     c User-Agent: facebookexternalhit/1.0 (+
   30 RxHeader     c Host:
   32 RxURL        c /static/1.26wmf8/resources/assets/poweredby_mediawiki_88x31.png
   32 RxHeader     c Host:
   32 RxHeader     c User-Agent: facebookexternalhit/1.0 (+
   13 RxURL        c /static/1.26wmf8/resources/assets/poweredby_mediawiki_88x31.png
   13 RxHeader     c Host:
   13 RxHeader     c User-Agent: facebookexternalhit/1.0 (+
   63 RxURL        c /static/1.26wmf4/resources/assets/poweredby_mediawiki_88x31.png
   63 RxHeader     c User-Agent: facebookexternalhit/1.0 (+
   63 RxHeader     c Host:
   15 RxURL        c /static/1.26wmf4/resources/assets/poweredby_mediawiki_88x31.png
   15 RxHeader     c Host:
   15 RxHeader     c User-Agent: facebookexternalhit/1.0 (+


So... these are all what? People clicking on old links in FB where FB helpfully cached the page content for them past our expiry settings?

Honestly, I have no idea, but probably something like that...

Actually, it can't be as simple as FB caching and serving outdated text to a browser, which then fetches the outdated image link, or the UA would still look like a browser UA. It's more likely that FB is rendering the page on their end (as in proxying the images and everything), and also caching objects we've said not to cache.

My thoughts here at this point are:

  1. We should delete the branches automatically and regularly, with some buffer-time past our max cache expiry for reasonable corner cases. A good starting point might be to delete them 40 days after the last wiki moves off of them (if we can calculate that easily). I think we're better off making any resulting problems more-obvious and then fixing those problems, rather than trying to hide the problem by keeping them around (or by doing some convoluted hack to rewrite/redirect them to more-current versions of the same files).
  2. I don't think we should care about things like facebookexternalhit in all of this. They're supposedly checking for 206, and they've hopefully worked out that they need to re-fetch the original page if they start getting 404s. If not, it's their problem not ours: they're the ones not obeying the fact that we asked them not to cache this stuff...
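The 40-day rule in point 1 could be sketched like this (GNU `date -d` assumed; the example date is taken from the wmf1 observation earlier in this task, and how "last wiki moved off" would actually be computed is the open question in the comment):

```shell
# Sketch: a branch is deletable once 40+ days have passed since the last
# wiki moved off it (30-day cache max plus buffer). Requires GNU date -d.
branch_deletable() {
  last_active=$(date -d "$1" +%s)
  now=$(date +%s)
  days=$(( (now - last_active) / 86400 ))
  [ "$days" -ge 40 ]
}

branch_deletable '2015-04-22' && echo "1.26wmf1: safe to delete"
```

A cron job wrapping this around the list of on-disk branch directories would make the deletion "automatic and regular" as proposed, with the failure mode being an obvious 404 rather than silently stale content.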

Brandon's conclusions seem right to me. What is left to do here? ("Delete the branches automatically and regularly" is still outstanding AFAIK, but that's another task.)

If the only source of traffic to old wmf branches' w/static directories is from Facebook and similar third parties, then this task can be closed as far as I'm concerned. However, if our own proxy caches and/or our (direct) users' browser caches maintain html references to old branches, then we still have to figure out why that is and fix it.

mmodell claimed this task.