InstantCommons can render a wiki completely unavailable during an outage.
Closed, Invalid · Public

Description

During both the DDoS and the parent task, we noticed that third-party wikis can be taken down if InstantCommons is unavailable.

By "taken down", I mean that responses time out completely, not just that images are missing.

For me, it also affected non-InstantCommons wikis because Varnish depooled stuff.

This should fail gracefully. I originally thought this was fixed because, during T272215: High latency on appservers, images just didn't appear for us.

Event Timeline

For the record, I know that the parser cache and Squid proxies could both mitigate this, but this task assumes a standard install of InstantCommons.

As far as I can tell, https://gerrit.wikimedia.org/g/mediawiki/core/+/fe289e90e1a2040a7caa8146c2c09562ccfedcc5/includes/filerepo/ForeignAPIRepo.php#506 defaults to $wgHTTPTimeout, which is 25 (seconds I assume). That seems a bit high for making API queries IMO, but reasonable for downloading images. Maybe a lower timeout should be set for those queries, which I assume is what hung your wikis?
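To illustrate the suggestion above, here is a minimal Python sketch (not MediaWiki code) of using a shorter timeout for metadata queries than for file downloads, and of treating a slow or unreachable remote repo as "no image" rather than as a fatal error. The names `fetch_or_none`, `API_QUERY_TIMEOUT` and `DOWNLOAD_TIMEOUT`, and the 5-second value, are assumptions made for illustration only.

```python
import urllib.request

# Hypothetical values for illustration; only the 25 s figure comes from the thread.
API_QUERY_TIMEOUT = 5.0   # seconds: metadata lookups should fail fast
DOWNLOAD_TIMEOUT = 25.0   # seconds: file transfers may legitimately take longer


def fetch_or_none(url, timeout, opener=urllib.request.urlopen):
    """Return the response body, or None if the remote repo is slow or down."""
    try:
        with opener(url, timeout=timeout) as response:
            return response.read()
    except OSError:
        # URLError, timeouts and connection errors are all OSError subclasses.
        # Degrade gracefully: the caller can render the page without the image.
        return None
```

A caller would pass `API_QUERY_TIMEOUT` for metadata requests and `DOWNLOAD_TIMEOUT` for file transfers, and skip the image whenever the result is `None`.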

For me, it also affected non-InstantCommons wikis because Varnish depooled stuff.

This seems like a configuration issue on your side: whatever health check you're using for Varnish shouldn't depend upon InstantCommons.

I will have a look at which wiki Varnish checks, but I think most of our wikis probably have it on. I agree, though, and looking at health checks that use a more minimalist wiki is on my list of follow-ups from our incident.

$wgHTTPConnectTimeout is 5 sec if left at default value, and that's what should matter here (unless you don't have curl installed and are falling back to the pure PHP implementation, which is a bad idea in production).

The total timeout from MediaWiki/curl is 30 seconds, which should stop this, but then I smartly remembered that we have Varnish, which also has a 30-second timeout on mw* -> cp*, akin to the WMF's setup.

I'd guess that Varnish must have a lower timeout than MediaWiki. Despite me asking our team about this 18 months ago and being told it didn't exist, I guess it does, and it's in our config.
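As a rough way to reason about the two limits mentioned above, the following Python sketch compares an application timeout with a proxy timeout and reports what a client would see if an upstream call hangs for longer than both. The function name `client_outcome` and the outcome strings are hypothetical, and the model assumes the proxy returns an error as soon as its own limit is reached.

```python
def client_outcome(app_timeout_s: float, proxy_timeout_s: float) -> str:
    """Say what a client sees when an upstream call hangs past both limits.

    Simplified model: the proxy errors out the moment its own limit is hit.
    """
    if app_timeout_s < proxy_timeout_s:
        # The application abandons the upstream call first and can still answer.
        return "page without remote images"
    # The proxy stops waiting before (or at the same moment as) the application.
    return "proxy error"
```

With both limits at 30 seconds, as described in this thread, the sketch returns "proxy error", since the application has no time left to respond once its own limit expires. That would be consistent with pages timing out completely rather than just missing images.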