
Last-Modified header makes pages uncacheable in frontend caches when the page was last modified more than $wgSquidMaxage ago
Closed, Declined · Public

Description

The Last-Modified HTTP header, used to check if a cached page has been modified since it was cached, is wrongly set under certain circumstances:

Conditions:

In OutputPage::checkLastModified, the following was added due to T46570:

if ( $config->get( 'UseSquid' ) ) {
	// T46570: the core page itself may not change, but resources might
	$modifiedTimes['sepoch'] = wfTimestamp( TS_MW, time() - $config->get( 'SquidMaxage' ) );
}

This adds another timestamp to the set from which the most recent one is later chosen for Last-Modified. The problem is that it produces a date that moves every second: it is always $wgSquidMaxage seconds before "now".

The default value of $wgSquidMaxage is 18000 seconds (5 hours). If a page was last modified more than 5 hours ago, the $modifiedTimes['sepoch'] timestamp will likely be the most recent one, and the Last-Modified header will keep moving every second, making each request with an If-Modified-Since header return the full page again instead of a 304 Not Modified response.
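
To make the effect concrete, here is a standalone sketch (illustrative only, not code from MediaWiki itself) showing that a value computed as "now minus $wgSquidMaxage" is different on every call:

$squidMaxage = 18000; // default $wgSquidMaxage (5 hours)
echo gmdate( 'YmdHis', time() - $squidMaxage ), "\n"; // e.g. 20180101070000
sleep( 1 );
echo gmdate( 'YmdHis', time() - $squidMaxage ), "\n"; // 20180101070001, already newer than the previous value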

How should it work: to prevent a cache stampede, it's reasonable to distribute page expirations across a range of time instead of using a fixed time for each page. But the calculation shouldn't be a function that returns a time that moves every second. Ideally it should move every $wgSquidMaxage seconds.

Possible solution:

Calculate how many "ages" have passed since the page's last modification timestamp, and make this date increment once every "age":

$timeLastMod = wfTimestamp( TS_UNIX, $modifiedTimes['page'] );
// Number of whole $wgSquidMaxage periods since the last modification
$timeSinceLastMod = time() - $timeLastMod;
$ages = floor( $timeSinceLastMod / $wgSquidMaxage );
// Snap "sepoch" to the start of the current period, so it only moves once per period
$modifiedTimes['sepoch'] = wfTimestamp( TS_MW, $timeLastMod + $wgSquidMaxage * $ages );

Assuming $wgSquidMaxage is 30 days, if the page was last modified on August 1st, this timestamp will be incremented to September 1st, October 1st, and so on, up to the most recent such date before today.

Event Timeline

Change 385418 had a related patch set uploaded (by Martineznovo; owner: Martineznovo):
[mediawiki/core@master] Change "sepoch" Last-Modified calculations constant in periods of time

https://gerrit.wikimedia.org/r/385418

I might not be understanding the problem properly, but isn't the correct fix to the original problem just to adjust the cache expiry time returned to Squid to 30 days after the If-Modified-Since date, and not tamper with the Last-Modified header? So if the article was last modified on March 12, and on March 22 Squid sends an If-Modified-Since: March 12, reply with a 304 that has a cache expiry time of 20 days instead of 30. Then Last-Modified can be kept correct.

> So if the article was last modified on March 12, and on March 22 Squid sends an If-Modified-Since: March 12, reply with a 304 that has a cache expiry time of 20 days instead of 30. Then Last-Modified can be kept correct.

This is my proposed solution. Of course, on April 22 the cache expiry should also be 20 days.

I don't think it matches the proposed solution in the first post. I meant to say that if Squid obtained the page at midnight March 1, with a 30-day expiry, and then requests it again at midnight March 2, it should receive a 29-day expiry. Then if it requests it again at midnight March 3, it should get a 28-day expiry. And so on, until if it requests it at midnight March 30, it should get a 1-day expiry, and if it requests it at 23:59:59 March 30, it should get a 1-second expiry. In other words, the original resource will never get cached for longer than the original 30 days.
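
A tiny sketch of the arithmetic being described (hypothetical; MediaWiki cannot compute this today because, as noted in the next comment, If-Modified-Since carries the page's modification time, not the time the proxy retrieved the object):

$originalMaxage = 30 * 24 * 3600; // 30-day expiry, for illustration
$retrievedAt = strtotime( '2018-03-01 00:00:00 UTC' ); // when the proxy cached the object
$now = strtotime( '2018-03-02 00:00:00 UTC' ); // when it revalidates
$remaining = max( 0, $originalMaxage - ( $now - $retrievedAt ) ); // 29 days left
// The object is never kept past the original 30-day window.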

I didn't formulate it correctly in my last post, though. We can't use If-Modified-Since, because that's when the page was updated, not when the proxy retrieved it. I guess to implement this we'd need to patch Varnish(/Squid). The simplest way would probably be to hack it to ignore the Expires header on a 304 and keep the original expiration date. It makes sense not to follow HTTP's semantics for Expires insofar as we're not respecting HTTP's caching semantics in general, because we're acting as though the cached resource is up-to-date when it's not really up-to-date. Note that this solution completely avoids cache stampedes.

I feel like I'm missing something, though. Does this change actually disable reverse proxy caching entirely for pages on the Wikimedia cluster that weren't edited in the last 30 days? That should have caused a huge CPU spike. It seems unlikely that this went unnoticed for five years.

Varnish or Squid won't "revalidate" cached content (i.e. send requests with If-Modified-Since) while the s-maxage is still current, so this doesn't disable caching at all. This is not a problem for WMF. The problem is when one uses a reverse proxy that doesn't support PURGE requests and configures it to "revalidate" on each request, sending If-Modified-Since headers from the cached content every time.

I'm now using Varnish, so I don't really care about this anymore. Also, I've overridden this logic myself by implementing the OutputPageCheckLastModified hook.
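
For reference, a minimal LocalSettings.php sketch of such an override (illustrative only; it assumes the OutputPageCheckLastModified hook passes the $modifiedTimes array by reference and reuses the period-snapping calculation from the task description):

$wgHooks['OutputPageCheckLastModified'][] = function ( array &$modifiedTimes ) {
	global $wgSquidMaxage;
	if ( !isset( $modifiedTimes['page'] ) ) {
		return;
	}
	$timeLastMod = (int)wfTimestamp( TS_UNIX, $modifiedTimes['page'] );
	// How many whole $wgSquidMaxage periods have passed since the last edit
	$ages = (int)floor( ( time() - $timeLastMod ) / $wgSquidMaxage );
	// Snap "sepoch" to the start of the current period instead of "now - maxage"
	$modifiedTimes['sepoch'] = wfTimestamp( TS_MW, $timeLastMod + $wgSquidMaxage * max( 0, $ages ) );
};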

The idea of returning a shorter expiry based on the age of the object in a reverse-proxy cache would depend on MediaWiki knowing the age of said cached object, which currently isn't sent as part of a normal HTTP request with If-Modified-Since headers.

Enabling UseSquid without actually using a cache proxy that adheres to s-maxage is bound to cause issues, given such a proxy isn't supposed to revalidate during the maxage period.

When that is not enabled, MediaWiki does not permit public caching of the HTML, given there'd be no way to purge it. However, it does allow caching in public proxies and in browsers, so long as they revalidate on every request. It seems to me that that logic is working fine. No matter how recently or how long ago a page was last modified, the browser can cache it, and upon refresh with If-Modified-Since it gets the appropriate 304 Not Modified response.

From what I can tell, the setup only breaks when incorrectly enabling UseSquid, which adds (time - maxage) to the mix and thus causes this moving barrier that makes it impossible to get a 304 response for pages last modified more than maxage ago. But as mentioned, it should work fine when UseSquid is disabled.

Change 385418 abandoned by Krinkle:
Make "sepoch" Last-Modified calculations constant in periods of time

Reason:
Closing for now, given the issue is not with a caching proxy that adheres to maxage rules, but with a private cache that revalidates (like a browser). With that mode, UseSquid should be disabled, which avoids the issue.

If the moving-barrier bug also happens with UseSquid disabled, then I propose re-opening this patch or creating a new one.

https://gerrit.wikimedia.org/r/385418

Krinkle triaged this task as Medium priority. Apr 20 2018, 1:35 AM
Krinkle removed a project: Patch-For-Review.
Krinkle moved this task from General to HTTP Cache on the MediaWiki-libs-BagOStuff board.

@Krinkle You abandoned the patch but didn't close the task... why? If you're sure this isn't going anywhere, feel free to close it as declined/invalid.

As I said, I'm using the hook to calculate it the way I feel it should work, even though I'm using Varnish now. Because if I restart Varnish, I don't want a web crawler hitting gazillions of pages to fill the cache and then have Varnish expire them all at the same time every $wgSquidMaxage period. For me, having an expiration time that moves every second is conceptually wrong.

Change 385418 restored by Krinkle:
Make "sepoch" Last-Modified calculations constant in periods of time

https://gerrit.wikimedia.org/r/385418

Change 427919 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[mediawiki/core@master] OutputPage: Factor out CdnCacheEpoch logic and cover with tests

https://gerrit.wikimedia.org/r/427919

I've kept this task open to follow up and understand in what way the moving timestamp is observable in a problematic way, given that if maxage is honoured, it cannot be observed; and if maxage is not honoured (e.g. always revalidate), then a client would make a request each time no matter what Last-Modified says (and UseSquid would be off, and thus no moving timestamp).

Imagine a scenario with a Varnish/Squid cache, and a wiki with lots of visits each day on different pages and very few edits. Please correct me if I'm wrong.

  • $wgSquidMaxage is set to 5 hours (the default); no edit has been made in (for example) the last 24 hours.
  • 01-01-2018 00:00:00h: Varnish/Squid is restarted, cache is empty.
    • Page views start to populate the cache. There is a constant flow of visits, evenly distributed, that covers about 100 different pages in a short period of time (5 minutes, for example).
    • Each page is fetched from MediaWiki once to populate the cache, stored with Last-Modified: 31-12-2017 19:00:00 and an s-maxage of 5 hours.
  • 01-01-2018 05:00:00h: The 100 pages of the example are expired.
    • The constant flow of requests reaches those 100 pages within 5 minutes, causing 100 new requests to go through MediaWiki during those 5 minutes.
  • ...

Of course, this is probably not the most common situation. However, on big wikis like Wikipedia it may actually be possible.

If, instead of a Last-Modified that moves every second, the date moves in constant blocks of $wgSquidMaxage periods, this scenario is not impossible, but it becomes harder to hit. It would require those 100 pages to be edited at nearly the same time (this could be caused by a common template edit, though).

The stampede after a Varnish/Squid restart is unavoidable. However, it may recur at constant intervals in a scenario similar to the one I've described.

Change 427919 merged by jenkins-bot:
[mediawiki/core@master] OutputPage: Factor out CdnCacheEpoch logic and cover with tests

https://gerrit.wikimedia.org/r/427919

@Ciencia_Al_Poder The idea of spreading out the expiry, based on either randomness or based on the page's last modified timestamp, makes sense. However, I don't see how changing the Last-Modified timestamp would change the fact that the 100 objects will still expire after 5 hours due to the s-maxage set, and if those 100 objects are requested again after that, they will still result in 100 requests to MW.

To achieve what you describe, it seems to me like you'd want to nudge the maxage value.

You're right about also changing maxage.

This normally shouldn't be a long-term problem, because the cache should not be completely cleared or expired in normal operation. If the cache is cleared once, you'll have cache stampedes during the first couple of $wgSquidMaxage periods, but it should even out over time due to the randomness of request distribution.

Put it this way: if your site can't function with an empty cache, then you have to make sure your cache is never emptied even without this issue. If it can function with an empty cache, you'll survive a few cache stampedes. It's not the end of the world.

If this is actually a real-world problem for anyone, then the right fix is making the maxage random, not fiddling with Last-Modified.
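
As a rough illustration of that idea (a hypothetical sketch, not an endorsed implementation: it assumes the BeforePageDisplay hook and OutputPage::setCdnMaxage() are available, and the jitter range is arbitrary):

$wgHooks['BeforePageDisplay'][] = function ( OutputPage $out ) {
	global $wgSquidMaxage;
	// Keep at least half the configured max-age and add up to 50% random jitter,
	// so objects cached at the same moment don't all expire at the same moment.
	$jitter = mt_rand( 0, (int)( $wgSquidMaxage / 2 ) );
	$out->setCdnMaxage( (int)( $wgSquidMaxage / 2 ) + $jitter );
};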

As I understand it, the Last-Modified constantly moving behind the scenes should not matter, because there is maxage involved. A browser or cache proxy is not meant to revalidate upon every request; that would only make sense if maxage=0 were set. Given maxage is generally set to wgSquidMaxage, the cache proxy will (and should) blindly reuse the cached response for that amount of time. Then, after that, it will check in again.

The reason the wgSquidMaxage factor was added to the list of timestamps is to ensure that untracked variation of page views rolls over within the wgSquidMaxage time period. For example, when making changes to the skin layout, changing site configuration variables, or adding/removing extensions, you can assume that after wgSquidMaxage has passed, all objects in your cache proxy have rolled over at least once and no longer qualify for renewal without being regenerated/retransferred from a backend first.

If you are finding that there are significant peaks and have ideas for avoiding them, or if you do find constant renewal within a short period for the same unchanged pages, then please do re-open this or file a new task, as that is indeed not meant to happen.