Create an easy-to-deploy kill switch for every self-contained MediaWiki functionality
Open, Needs Triage · Public

Description

While trying to work around T160914, it was not straightforward or obvious how to disable a certain special page from the code. A "kill switch" would allow people unfamiliar with the code (first incident responders) to quickly deploy a configuration option that disables specific functionality, where that can be done without huge code refactoring. I am thinking especially of two areas: special pages, which can contain complex functionality, and API calls.

Those would only be enabled in an emergency for the reliability of the site.

As the scope of this ticket is very large, it is reduced in the following way:

  • Agree on the requirement and, if agreed, document it in the mediawiki.org guidelines as enforced for new code: "any new functionality has to have an easy kill switch"
  • Set up a uniform way to do it for every piece of functionality (similar option name, same HTTP return code, helper functions, etc.)
  • If it exists already, publicize it in the operations documentation so ops are aware of those switches without reading the entire code
  • See if it is feasible to implement for special pages and API calls, and outline a plan, without necessarily executing it
jcrespo created this task. Mar 21 2017, 10:22 AM
revi added a subscriber: revi. Mar 21 2017, 10:52 AM
Anomie added a subscriber: Anomie. Mar 21 2017, 1:44 PM

That would be a lot of switches.

If an API module needs to be disabled, replacing it with ApiDisabled or ApiQueryDisabled as appropriate via the appropriate globals ($wgAPIModules for action=foo, $wgAPIPropModules for action=query&prop=foo, $wgAPIListModules for action=query&list=foo, $wgAPIMetaModules for action=query&meta=foo) or hooks ('ApiMain::moduleManager' for action=foo, 'ApiQuery::moduleManager' for any query submodule) would probably be the most straightforward method.
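As a rough illustration of the globals route described above, a LocalSettings.php fragment might look like the following. The module names `foo` and `allpages` are only the placeholders used in this comment, and this is a sketch rather than a tested deployment recipe.

```php
// LocalSettings.php fragment (sketch). 'foo' is a placeholder module name.

// Disable action=foo by swapping in the core ApiDisabled module.
$wgAPIModules['foo'] = 'ApiDisabled';

// Disable action=query&list=allpages; query submodules use ApiQueryDisabled.
$wgAPIListModules['allpages'] = 'ApiQueryDisabled';

// prop= and meta= submodules are handled the same way via their own globals.
$wgAPIPropModules['foo'] = 'ApiQueryDisabled';
$wgAPIMetaModules['foo'] = 'ApiQueryDisabled';
```

Because each line is a plain assignment, reverting the switch after the incident is just a matter of removing the line again.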

I also note that taking out some API modules or special pages could be extremely disruptive to the wikis. Special:AllPages wasn't too bad to take out (I think), but API list=allpages is also used as the equivalent of Special:PrefixIndex and Special:ProtectedPages and is currently used around 1.2 million times per day. Just under half of those list=allpages hits involve the apfilterredir parameter, but still weren't problematic (probably because they didn't also use descending order; I haven't run numbers to check that).

Legoktm added a subscriber: Legoktm.

Maybe we should have a SpecialDisabledPage class similar to the ApiDisabled module? Then, to disable a special page, something like $wgSpecialPages['AllPages'] = 'SpecialDisabledPage'; would be used.
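To make the proposal a bit more concrete, here is a minimal sketch of what such a class could look like. SpecialDisabledPage does not exist in MediaWiki core; the class name comes from this comment, while the fixed internal page name, the 503 status code and the message text are assumptions made only for illustration. It would have to be loaded inside MediaWiki (for example through $wgAutoloadClasses) and is not runnable on its own.

```php
<?php
// Hypothetical class: not part of MediaWiki core, sketched by analogy
// with the ApiDisabled module mentioned above.
class SpecialDisabledPage extends UnlistedSpecialPage {
	public function __construct() {
		// A class-name entry in $wgSpecialPages is instantiated without
		// arguments, so a fixed internal name is used (assumption).
		parent::__construct( 'Disabled' );
	}

	public function execute( $subPage ) {
		$this->setHeaders();
		$out = $this->getOutput();
		// 503 is an assumed choice; the task only asks that all kill
		// switches share the same HTTP return code.
		$out->setStatusCode( 503 );
		// Plain, escaped notice text (placeholder wording).
		$out->addHTML( Html::element(
			'p',
			[],
			'This special page has been temporarily disabled.'
		) );
	}
}
```

With a class like this available, the one-line assignment from the comment above would be the whole kill switch for a given special page.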

So I am more interested in you deciding (if the task makes sense) on a design that requires the least amount of changes. My only request is that it should be idiot-proof. Stupid ops like me see a problem that is killing the wikis, there is no one around, and we would use the kill switch only as a last resort (when it is either that or all wikis go down). Also, of course I understand that this will be a huge scope, so I would be happy, as I said in the description, with documentation that requires it for new code only, so that new functionality has such a safeguard. Most extensions will have that, if only because they require schema changes, but sometimes it is not 100% clear to me. In an emergency, there is not much time to start reading MediaWiki documentation :-)

For context: https://blog.toggl.com/2016/12/developers-explained-with-lightbulbs/ :-)

Tgr added a subscriber: Tgr. Mar 21 2017, 5:58 PM

The effects of disabling an API module can range from barely noticeable to catastrophic (e.g. if it's something the mobile REST services were relying on). It's not something that should be done by someone completely unfamiliar with the code, IMO. (Special pages are less problematic.)

Usually these problems are generated by a single user doing something weird (but not an intentional DoS attack), so the obvious solution would be to block that user. IMO that's a more promising direction to build capacity for.

@Tgr That would be T160920 expanded to API calls. If you believe that is easier, I am on board!

Tgr added a comment. Mar 21 2017, 8:12 PM

> @Tgr That would be T160920 expanded to API calls. If you believe that is easier, I am on board!

It is easier, but it's not what I had in mind. When we have single users taking up orders of magnitude more resources than everyone else (and basically doing an unintentional DoS attack), the response should be limited to those users. Basically, if it is not easy right now to identify the IP addresses responsible for high DB load and to throttle or block those addresses, we need tools for that. If it is already easy, we just need to document how it's done.