Create an easy-to-deploy kill switch for every self-contained MediaWiki functionality
Closed, ResolvedPublic

Description

While trying to work around T160914, it was not obvious how to disable a particular special page from the code. A "kill switch" would allow people unfamiliar with the code ("first incident responders") to quickly deploy a configuration option that disables specific functionality, without huge code refactoring. I am thinking of two kinds of functionality in particular: special pages, which can contain complex functionality, and API calls.

Those switches would only be used in an emergency, to preserve the reliability of the site.

As the scope of this ticket is very large, it is reduced in the following way:

  • Agree on the requirement and, if it is agreed, document it in the mediawiki.org guidelines as enforced for new code: "any new functionality has to have an easy kill switch"
  • Set up a consistent way to do it for every functionality (similar option name, same HTTP return code, helper functions, etc.)
  • If it exists already, publicize it in the operations documentation so ops are aware of those switches without reading the entire code
  • See if it is feasible to implement for special pages and API calls, and outline a plan, without necessarily executing it

Event Timeline

That would be a lot of switches.

If an API module needs to be disabled, the most straightforward method would probably be to replace it with ApiDisabled or ApiQueryDisabled, as appropriate, via the relevant globals ($wgAPIModules for action=foo, $wgAPIPropModules for action=query&prop=foo, $wgAPIListModules for action=query&list=foo, $wgAPIMetaModules for action=query&meta=foo) or hooks ('ApiMain::moduleManager' for action=foo, 'ApiQuery::moduleManager' for any query submodule).
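As a rough sketch of the globals approach described above, a LocalSettings.php fragment might look like the following. The module name `foo` is the placeholder used in the comment, not a real module, and `allpages` is the list module discussed in this task. Whether disabling a given module is safe for a wiki is a separate question that this fragment does not answer.

```php
<?php
// LocalSettings.php fragment (sketch). 'foo' is a placeholder module name;
// substitute the module that actually needs to be disabled.

// action=foo: replace the top-level module with the ApiDisabled stub.
$wgAPIModules['foo'] = 'ApiDisabled';

// action=query&list=allpages: query submodules use ApiQueryDisabled instead.
$wgAPIListModules['allpages'] = 'ApiQueryDisabled';

// The same pattern applies to prop= and meta= submodules.
$wgAPIPropModules['foo'] = 'ApiQueryDisabled';
$wgAPIMetaModules['foo'] = 'ApiQueryDisabled';
```

Requests to a module replaced this way should fail with an API error rather than run the underlying query.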

I also note that taking out some API modules or special pages could be extremely disruptive to the wikis. Special:AllPages wasn't too bad to take out (I think), but API list=allpages is also used as the equivalent of Special:PrefixIndex and Special:ProtectedPages and is currently used around 1.2 million times per day. Just under half of those list=allpages hits involve the apfilterredir parameter, but still weren't problematic (probably because they didn't also use descending order; I haven't run numbers to check that).

Legoktm subscribed.

Maybe we should have a SpecialDisabledPage class similar to the ApiDisabled module? To disable a special page, something like $wgSpecialPages['AllPages'] = 'SpecialDisabledPage'; would be used.

So I am more interested in you deciding, if the task makes sense, on a design that requires the least amount of changes. My only request is that it should be idiot-proof. Stupid ops like me see a problem that is killing the wikis, there is no one around, and we would use the kill switch only as a last resort (when it is either that or all wikis go down). Also, of course I understand that this would be a huge scope, so I would be happy, as I said in the description, with documentation that requires it for new code only, so that new functionality has such a safeguard. Most extensions will have that, if only because they require schema changes, but sometimes it is not 100% clear to me. In an emergency, there is not much time to start reading MediaWiki documentation :-)

For context: https://blog.toggl.com/2016/12/developers-explained-with-lightbulbs/ :-)

The effects of disabling an API module can range from barely noticeable to catastrophic (e.g. if it's something the mobile REST services were relying on). It's not something that should be done by someone completely unfamiliar with the code, IMO. (Special pages are less problematic.)

Usually these problems are generated by a single user doing something weird (but not an intentional DoS attack), so the obvious solution would be to block that user. IMO that's a more promising direction to build capacity for.

@Tgr That would be T160920 expanded to API calls. If you believe that is easier, I am on board!

It is easier, but it's not what I had in mind. When we have single users taking up orders of magnitude more resources than everyone else (and basically doing an unintentional DoS attack), the response should be limited to those users. Basically, if it is not easy right now to identify the IP addresses responsible for high DB load and to throttle or block those addresses, we need tools for that. If it is already easy, we just need to document how it's done.

I believe this should be renamed to "Approve a policy to not allow new features to be deployed to production unless they have a kill switch".

DisabledSpecialPage exists now, FWIW. The syntax is

$wgSpecialPages[<special page name>] = DisabledSpecialPage::getCallback( <special page name>, <optional message key or object> );
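Filled in for the page discussed in this task, the syntax above might look like the sketch below. The key `AllPages` is copied from the earlier comment and its exact capitalization should be checked against how the page is registered. The message key in the commented-out variant is hypothetical and would have to be defined separately.

```php
<?php
// LocalSettings.php fragment (sketch), following the syntax quoted above.
// The key 'AllPages' is copied from this thread; verify the exact key
// (including capitalization) under which the page is registered.
$wgSpecialPages['AllPages'] = DisabledSpecialPage::getCallback( 'AllPages' );

// Variant with the optional second argument. 'allpages-disabled-emergency'
// is a hypothetical message key, not one that ships with MediaWiki.
// $wgSpecialPages['AllPages'] = DisabledSpecialPage::getCallback(
//     'AllPages',
//     'allpages-disabled-emergency'
// );
```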

Not sure what would be a good place for documenting (I'll add it to the developer manual, but that's probably not where someone would look in an emergency).

I still think what's proposed here is fundamentally infeasible though. People should not be in the business of disabling things without having any understanding of what they do; that can easily turn a partial outage into a larger one.

Krinkle claimed this task.
Krinkle subscribed.
  • API modules have a feature flag and always have; this is non-optional.
  • Every extension has a feature flag and always has (apart from a handful of exceptions for very old and stable extensions that we enable globally). This is generally non-optional and has been included for new extensions by default since at least 2012.
  • Some expensive queries are throttled by PoolCounter. If there are other queries we know to be expensive, I suggest specific tasks are filed for those. For new functionality, this would be raised during performance review.

I've documented the last point at https://www.mediawiki.org/wiki/Wikimedia_Performance_Team/Backend_performance#Rate_limiting
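For the PoolCounter point, throttling is configured per pool through `$wgPoolCounterConf`. The fragment below is only a sketch of what such an entry can look like: the pool name, the client class name (which depends on the PoolCounter extension version) and all numbers are illustrative assumptions, not Wikimedia's production settings.

```php
<?php
// LocalSettings.php fragment (sketch). All values are illustrative.
$wgPoolCounterConf = [
	// Pool name: the type of work being throttled (assumed example).
	'ArticleView' => [
		// Client class provided by the PoolCounter extension; the exact
		// class name varies between extension versions (assumption).
		'class' => 'PoolCounter_Client',
		// Seconds a request waits for a slot before giving up.
		'timeout' => 15,
		// Maximum number of workers doing the same work concurrently.
		'workers' => 2,
		// Maximum number of requests allowed to queue for a slot.
		'maxqueue' => 50,
	],
];
```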

Where is the feature flag for API modules?

If an API module needs to be disabled, replacing it with ApiDisabled or ApiQueryDisabled as appropriate via the appropriate globals ($wgAPIModules, $wgAPIPropModules, $wgAPIListModules, $wgAPIMetaModules) or hooks […]

https://www.mediawiki.org/wiki/Manual:$wgAPIModules (Doxygen doc, Examples in Codesearch).

Thanks, sorry I didn't see that earlier.