
During deployment old servers may populate new cache URIs
Closed, ResolvedPublic

Description

@Spage wrote:

E3 deployed several extensions with changed modules today. GettingStarted had a new RL module 'ext.gettingstarted.openTask' for a new JS file resources/ext.gettingstarted.openTask.js. scap finished at 4:19 PDT [00:19:44].
50 minutes after scap finished I visited a page that should have had the new code, but didn't.

My hypothesis: during scap, some user requested a page that ran the new PHP code that adds the new module, so her browser made the request to bits.wikimedia.org for modules including the new module. At that time the bits server didn't yet have the JS file, so the server response reports "missing". But then RL happily caches that response for X minutes, where X is a long time, well over 50 minutes.

This might be a dupe of T39812 , "Module cache should be invalidated (newer timestamp) when lesser old items are removed or added (scripts, style, messages)"

It turned out the original report above was indeed T39812. However, there is a larger issue here.

During deployment, any change that introduces a new URL that is deterministically versioned and served with a large max-age risks polluting the cache indefinitely.

The timeline is as follows:

  • Start deployment (scap, sync-file, sync-dir).
  • Server A receives new code deployment.
  • Client 1 requests a page.
  • Server A responds to client 1 with the page, built with the new code.
  • Client 1 receives page from server A and requests any secondary resources (e.g. referenced script, style or image). – This resource has URIs like /static/file-name?{hash} or /load.php?module=name&{hash}.
  • Server B, which has not yet received the new code, responds to client 1 and provides the requested resource based on its name.
  • Server B receives new code deployment.

Given we have Varnish in front of the web servers, this means the outdated resource from server B is now cached at the URL containing the new hash. As such, even once the deployment is over, it persists in the cache indefinitely.

When adding new static files, server B would respond with 404, and this repairs itself eventually once the file exists. However, for an update to existing static files or any change to JS/CSS modules, this can cause the Varnish cache to effectively renew old content under new version URIs.
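
To make the failure mode concrete, here is a minimal sketch (Python, with made-up module names, hashes and server labels; not the actual MediaWiki or Varnish code) of how a long-lived cache keyed on a deterministically versioned URL ends up pinning stale content:

```python
# Minimal sketch of the race; not actual MediaWiki or Varnish code.
# Assumption: resource URLs embed a version derived from the deployed code,
# and the cache (Varnish) keys responses on the full URL.

cache = {}  # simulated Varnish frontend: URL -> cached response body

def serve(server_code_version, url):
    """Simulated backend: responds with whatever its deployed code produces."""
    return f"response for {url} built from {server_code_version}"

server_a = "new-code"   # already received the deployment
server_b = "old-code"   # not yet synced

# 1. Server A (new code) renders the page and embeds a resource URL
#    versioned with the new hash.
page_html = serve(server_a, "/wiki/Example")
versioned_url = "/load.php?modules=example&version=NEWHASH"

# 2. The client follows that URL; the request happens to hit server B, which
#    still runs the old code.  Varnish stores the (stale) response under the
#    new-hash URL with a long max-age.
cache[versioned_url] = serve(server_b, versioned_url)

# 3. The deployment finishes, but the stale response is already pinned to the
#    new-hash URL and keeps being served from cache until it expires or is purged.
print(cache[versioned_url])  # -> "response for ... built from old-code"
```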

Related incident: https://wikitech.wikimedia.org/wiki/Incident_documentation/20160707-Echo

Details

Reference
bz45877

Event Timeline

bzimport raised the priority of this task to High.Nov 22 2014, 1:29 AM
bzimport set Reference to bz45877.
bzimport added a subscriber: Unknown Object (MLST).

(In reply to comment #0)

Is the workaround to avoid these damaging situations to touch every JS and CSS file that one deploys before running scap?

Isn't touching startup.js enough IIRC?

Yep, sounds like a dupe. Thank you for tracking it down.

  • This bug has been marked as a duplicate of bug 37812.

(In reply to comment #1)

(In reply to comment #0)

Is the workaround to avoid these damaging situations to touch every JS and CSS file that one deploys before running scap?

Isn't touching startup.js enough IIRC?

Yeah, because that will (as a side effect) invalidate the startup module request entirely (not just the manifest). When it re-calculates the manifest, it will see that the module has in fact changed, so the next user getting the startup module will get a higher timestamp, and thus a different bits URL, and thus the working code.

However, this may not work for the reason you think. Let me clarify just in case. The timestamp the module gets after touching startup.js is not related to the time you touch startup.js. That just forces ResourceLoader to re-check the timestamps of the files in the module (which you didn't touch by touching startup.js). The timestamp will still be higher because, as Matthew correctly described, the previous time ResourceLoader was asked to check the timestamp (due to that unfortunate visitor during scap) some of the files weren't there yet.
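
Roughly speaking (a simplified sketch, not the actual ResourceLoader implementation), the module version behaves like the newest mtime among the module's files, with not-yet-synced files simply not contributing:

```python
import os

def module_version(file_paths):
    """Simplified stand-in for the module's version/timestamp: the newest
    mtime among its files, ignoring files that have not been synced yet."""
    mtimes = [os.path.getmtime(p) for p in file_paths if os.path.exists(p)]
    return max(mtimes) if mtimes else 0

# During the unlucky request, one of the module's files was not on this
# server yet, so the computed version was low (or the module was reported
# missing).  Touching startup.js does not feed into this max(); it only
# invalidates the cached result, forcing the computation to run again once
# the sync has finished, at which point the real, newer mtimes are seen.
```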

(In reply to comment #2)

Yep, sounds like a dupe. Thank you for tracking it down.

  • This bug has been marked as a duplicate of bug 37812.

No, this is not a dupe of that bug.

Bug 37812 is about the case where you properly deploy a change but the timestamp (no matter how many times you invalidate the startup manifest) does not change, because none of the components ResourceLoader checks have changed. Bug 37812 happens when the change you make only involves removing older files from the array, or adding new files that were modified longer ago, causing the max() to still be the same.

This bug (bug 45877) is about the race condition where one server is already embedding request URLs in the page (or mw.loader.load calls) while the module in question is not yet available on the Apache server that the request will be made to.

e.g.
en.wikipedia.org --> srv123 @r12 --> outputs html mw.loader.load('foo')
-> bits.wikimedia.org --> srv214 @r11 --> mw.loader.state('foo', 'missing');

Where that exact url (with that timestamp) will get cached for 30 days.

Now it won't be broken for 30 days, because the startup manifest that en.wikipedia.org request contexts work with (from bits) will be rebuilt every 5 minutes.

I'm not sure what the proper solution is for this problem. Perhaps syncing to bits first, though that might bring the opposite problem, which is likely less visible but might be equally problematic.
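
For context, the asymmetry comes from the two cache lifetimes mentioned above: the startup manifest is cached briefly, while an exact versioned module URL is cached for a long time. A rough sketch of that policy, using the figures from this discussion (the real header logic in ResourceLoader differs in detail):

```python
# Rough sketch of the cache policy described in this thread (illustrative only).
STARTUP_MAXAGE = 5 * 60               # startup manifest: rebuilt every ~5 minutes
VERSIONED_MAXAGE = 30 * 24 * 60 * 60  # exact versioned module URL: ~30 days

def cache_control(url_pins_exact_version):
    """Pick a max-age depending on whether the URL pins an exact version."""
    maxage = VERSIONED_MAXAGE if url_pins_exact_version else STARTUP_MAXAGE
    return f"public, max-age={maxage}, s-maxage={maxage}"

# The manifest heals within minutes, but a bad response stored under a
# versioned URL can outlive the deployment by weeks.
```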

S Page did the digging and filed this bug report.

I misinterpreted bug 37812 as about adding a file in general (including adding the actual file when it's already in the manifest). Upon reading more closely, I see it's only about the registry.

(In reply to Krinkle from comment #3)

This bug (bug 45877) is about the race condition where one server is already embedding request URLs in the page (or mw.loader.load calls) while the module in question is not yet available on the Apache server that the request will be made to.

e.g.
en.wikipedia.org --> srv123 @r12 --> outputs html mw.loader.load('foo')
-> bits.wikimedia.org --> srv214 @r11 --> mw.loader.state('foo', 'missing');

Where that exact url (with that timestamp) will get cached for 30 days.

Now it won't be broken for 30 days, because the startup manifest that en.wikipedia.org request contexts work with (from bits) will be rebuilt every 5 minutes.

I'm not sure what the proper solution is for this problem. Perhaps syncing to bits first, though that might bring the opposite problem, which is likely less visible but might be equally problematic.

Rephrasing bug summary to more accurately reflect this.

The different relevant scenarios:

  1. index.php server first, load.php server second

So whenever a change of any kind is deployed (new module, or a change to an existing module), it is possible that the code might arrive on one server first (e.g. the server serving the html, with the module load queue), and then the client makes a subsequent request to load.php for those modules (handled by a server that doesn't yet have the code).

In that case, the response will be module.state=missing, which is fine and degrades gracefully. This is not a broken state, simply the old state prior to this particular deployment. Once the deployment is finished, the next time a client requests the startup module from load.php, the module will be in there with a higher timestamp, so the module=missing response won't be served again.

  2. index.php server and load.php/startup first, load.php/module second

First the HTML server gets it, ensuring the module is in the load queue (if not already), and sometime before or after this the server handling the startup module gets it too. The client will make a request for the module with the newer version number in the URL, which gets handled by a server that we haven't synced to yet. That server responds with an old version of the module (or module=missing if it's a new module).

This situation doesn't resolve itself within 5 minutes, because the bad response is stuck in the frontend Varnish cache at the correct URL.

Touching startup.js won't help either.

This can only be resolved by changing the module again (touch any of the relevant module's included files), syncing that and hoping you don't get the same race condition (it's rare but I'd say it happens 1/20 times).

(In reply to Krinkle from comment #5)

The different relevant scenarios:

  1. index.php server first, load.php server second

(..)

  2. index.php server and load.php/startup first, load.php/module second

(..)

  3. Server is in a Frankenstein state

This affects both ResourceLoader (load.php) and the main MediaWiki request handling in general. Due to how we use rsync, I believe it is indeed possible at any moment during the sync action for the MediaWiki install to contain files it shouldn't, miss files it needs, and have mixed newer and older versions of some files.

The window for this is relatively small as it relates to a single server as opposed to the entire cluster.

This can lead to random PHP errors (which are not as visible since those aren't cached, but they do get served to the public, mostly to logged-in users I think, unless a page gets purged of course).

This can lead to responses from load.php that will be cached and can make all kinds of weird concatenations between incompatible files. This might resolve itself if the older file was synced first (by the next request the sync is finished, and there'll be a newer file). But if the newer file was synced first, it won't fix itself (same as #2).

CC-ing Tim Starling. I discussed this with him a few months back and he had a few thoughts on optimising scap that would shorten the window during which this bug can occur.

Using a more atomic scap, we can reduce the window during which #3 can happen. #2 can still happen, though, because atomic scap only helps a single server maintain a consistent state; between servers there will still be a delay of many seconds to 1-2 minutes, depending on how long the sync action takes across the cluster.

Something like git-deploy would mitigate both.

For #3: near impossible, because it works with working copies (fetch internally first, then check out).

For #2: it first instructs all servers to fetch, and when that is finished, all of them to check out. So there's a much shorter window during which different servers have different active directories, because copy time is taken out of the equation.
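
A minimal sketch of that fetch-then-checkout idea (illustrative Python pseudocode with hypothetical server methods, not the actual git-deploy implementation):

```python
# Illustrative sketch of the fetch-then-checkout idea; not git-deploy itself.
# The server methods below are hypothetical placeholders.

def deploy(servers):
    # Phase 1: every server downloads the new code into a staging area.
    # This is the slow part (copy time), but nothing goes live yet.
    for server in servers:
        server.fetch_new_code()

    # Phase 2: only once *all* servers have the code locally does each one
    # switch its active directory.  The switch itself is fast, so the window
    # during which different servers serve different versions shrinks from
    # "however long the copy takes" to roughly "one checkout".
    for server in servers:
        server.checkout_new_code()
```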

Is this really a problem with the atomicity of the scap process in general?

Making the MW deploy process more atomic is a goal for the "next generation scap" project. It would be good to have some use cases such as this one documented in the talk page for the deploy system requirements [0] or somewhere nearby, so we can try to squash ugly edge-case problems such as this one related to the front-end caching architecture.

One random idea that just popped into my head would be to create some system for telling the front-end Varnishes that things will be changing soon and that they should not cache any new responses until given the all-clear sign.

[0]: https://www.mediawiki.org/wiki/Talk:Deployment_tooling/Notes/Deployment_system_requirements
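
Sketching that idea very roughly (hypothetical; these names don't correspond to any real Varnish or scap interface): the deploy tool would raise a flag before syncing and clear it afterwards, and the cache layer would pass responses through without storing them while the flag is up.

```python
# Hypothetical sketch of the "hold off caching during deployment" idea.
# None of these names correspond to real Varnish or scap interfaces.

deployment_in_progress = False  # would be toggled by the deploy tooling

def handle_backend_response(url, response, cache):
    if deployment_in_progress:
        # Pass the response through to the client but do not store it, so a
        # half-deployed backend cannot pin stale content under a new URL.
        return response
    cache[url] = response
    return response
```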

(In reply to Krinkle from comment #6)

This affects both ResourceLoader (load.php) and the main MediaWiki request handling in general. Due to how we use rsync, I believe it is indeed possible at any moment during the sync action for the MediaWiki install to contain files it shouldn't, miss files it needs, and have mixed newer and older versions of some files.

The window for this state on a given host should be very short. We give rsync the --delete-delay and --delay-updates flags which defer removing or changing files until all changes are present on the local host.
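
For reference, those flags stage changes locally and apply deletions and updates only at the end of the transfer, keeping the mixed old/new window on a single host small. An illustrative invocation (paths are placeholders, not the exact command scap runs):

```python
import subprocess

# Illustrative only -- not the exact command scap runs; source and destination
# paths are placeholders.  --delay-updates stages changed files and moves them
# into place at the end of the transfer; --delete-delay defers deletions until
# the transfer is complete.
subprocess.run([
    "rsync", "-a", "--delay-updates", "--delete-delay",
    "rsync://deployment-host/common/", "/srv/mediawiki/",
], check=True)
```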

Krinkle renamed this task from ResourceLoader: Bad cache stuck due to race condition with scap between load.php and index.php server to Bad cache stuck due to race condition with scap between different web servers.Nov 25 2014, 8:54 PM
Krinkle updated the task description. (Show Details)
Krinkle removed a project: Future-Release.
Krinkle set Security to None.
Krinkle removed a subscriber: Unknown Object (MLST).

It may be short, but it just broke nearly all JavaScript again on 1.25wmf11.

During the deployment, the server responding to bits.wikimedia.org/www.mediawiki.org/load.php?modules=startup started responding with the new 1.25wmf11 code, making the client request modules=user.defaults&version=20141208T185727Z (new in 1.25wmf11).

That request got handled by a web server still on 1.25wmf10. Its response is cached by Varnish with a nice long expiry of 30 days (as instructed by ResourceLoader).

Thus, long after the deployment was done, all users were getting the broken user.defaults response. This resulted in cascading failures, since a lot of modules depend on it.


As long as syncing takes longer than it takes for one user to make two web requests (one based on the other), this will eventually break things again. Whenever we sync-file, whenever we create a new wmf-branch, and whenever we deploy a wmf-branch to the next set of wikis; each time we roll the dice.

This seems important but has no assignee; can someone take this? Otherwise I will probably set it to normal priority, as it has been sitting.

@greg, since this involves scap, is this a Release Engineering thing? Would you be willing to kind of shepherd this to that effect?

I think this is really a bug about the fact that we don't depool/repool servers during the rolling update process. Fixing it would be awesome but is non-trivial, as we would really need to make pybal aware of the deployment process.

So we think this is inherent to the design of our deploy mechanisms?

@Joe, is there a ticket somewhere for revamping deploy that this would be appropriately rolled into?

chasemp lowered the priority of this task from High to Medium.Jan 8 2015, 3:19 PM

Seems like no meaningful update in a month, no assignee, and no direct actionable item that I can see. I am dropping priority here. If someone takes this, or if it's more important, please intervene and get the ball rolling.

So we think this is inherent to the design of our deploy mechanisms?

If so, this report should be moved from MediaWiki-ResourceLoader to Wikimedia-Deployment-systems.

It's an over-optimisation in ResourceLoader which creates a bug exposed by the way we deploy.

Krinkle renamed this task from Bad cache stuck due to race condition with scap between different web servers to During deployment old servers may populate new cache URIs.EditedJun 25 2015, 7:01 PM
Krinkle updated the task description. (Show Details)
Krinkle edited projects, added Performance-Team; removed MediaWiki-ResourceLoader.

It's not an over-optimisation in RL. I've rephrased the summary to more clearly define the problem.

Krinkle renamed this task from During deployment old servers may populate new cache URIs to During deployment old servers may populate new cache URIs (tracking).Nov 3 2015, 7:36 PM
Danny_B renamed this task from During deployment old servers may populate new cache URIs (tracking) to During deployment old servers may populate new cache URIs.Jul 16 2016, 3:10 AM
Danny_B removed a project: Tracking-Neverending.
Krinkle claimed this task.

Change 514193 had a related patch set uploaded (by Bartosz Dziewoński; owner: Bartosz Dziewoński):
[operations/mediawiki-config@master] Remove unused preference 'T47877-buster'

https://gerrit.wikimedia.org/r/514193

Change 514193 merged by jenkins-bot:
[operations/mediawiki-config@master] Remove unused preference 'T47877-buster'

https://gerrit.wikimedia.org/r/514193

Mentioned in SAL (#wikimedia-operations) [2019-06-07T00:00:49Z] <jforrester@deploy1001> Synchronized wmf-config/CommonSettings.php: Remove unused preference T47877-buster (duration: 00m 47s)