New Service Request Security API Gateway
Closed, Declined (Public)

Description

Name: Security API Gateway

Description: There is an increasing need for a centralized Wikimedia service capable of making available certain security-related APIs for various MediaWiki extensions, services, external applications and users. Due to certain sensitive elements (sensitive data, commercially-licensed data, etc.) this service would need to live within Wikimedia production, have some variety of general authn/z mechanism and be highly available. Initial API candidates would likely be feed options related to the protected task T265845 and T250227.

Timeline: Tentatively by the end of Q3 2022 (March 2022)

Point person(s): @sbassett, @Reedy, @Mstyles, @STran

Technologies: Likely service-runner and various Node.js glue code to manage ingestion/consumption and authn/z layers.

Request flow diagram: To be created, though these will likely be minimal as this would be more of a stand-alone API service with its own authn/z.

n.b. keeping Service-deployment-requests untagged for now as this effort is currently very early in the initial planning stage (more proof-of-concept, minimum-viable-product) and might require an RFC or similar technical discussion phase.

Related Objects

Status Subtype Assigned Task
Declined None
Resolved Mstyles
Resolved Mstyles
Resolved None
Resolved STran
Resolved Mstyles
Duplicate STran
Resolved Mstyles
Declined sbassett
Resolved sbassett
Declined None
Open None
Resolved sbassett
Resolved STran
Resolved sbassett
Resolved Mstyles
Resolved Marostegui
Resolved Mstyles
Resolved kostajh
Resolved Aklapper
Resolved sbassett

Event Timeline

I suppose two initial questions/decisions would be:

  1. Should we architect the (likely small but important) collection of security-related APIs/services as single, individual services or as part of a gateway, as the task suggests?
  2. Is Wikimedia production even the right place for this? I think it could be, but I don't have a great answer on that. wmcs/toolforge can't work for a few security/privacy/TOU-related reasons. And there is now precedent for certain Important™ things to live elsewhere (WME/Okapi at AWS) but that presents a different set of challenges, while removing others.

Thanks for filing this :) Sidenote: the name api-gateway is already in use (naming things is hard!!), it would be nice if this one could be slightly different.

this would be more of a stand-alone API service with its own authn/z.

I am curious what the advantages of separate authn/z are versus fronting it with a MediaWiki extension that takes care of authn/z before proxying the request to this service.

I suppose two initial questions/decisions would be:

  1. Should we architect the (likely small but important) collection of security-related APIs/services as single, individual services or as part of a gateway, as the task suggests?

I think it depends on how you expect clients to use it: will each one likely be calling just one service, or will they all want multiple? Would clients benefit from the gateway aggregating multiple services in one response? Another thing to consider is how we would make changes in the future: are all the clients under shared maintenance, so that we can expect maintainers to adapt quickly to new things, or will it require long deprecation periods?

My suggestion would be to build it as a gateway where each API/service is its own route, which is how Shellbox works. Then it's trivial to have one gateway deployment handle everything, or if we decide we want to split them into different deployments (e.g. want to dedicate more resources to one specific service), it'll be pretty straightforward to spin up a new deployment in k8s using the same codebase (e.g. shellbox-constraints, shellbox-syntaxhighlight, etc.).
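A minimal sketch of the route-per-service idea, assuming a hypothetical registry of handlers keyed by service name. None of the names here (`ROUTES`, `make_gateway`, the `denylist` and `reputation` services) come from Shellbox or any existing Wikimedia codebase; they only illustrate how one codebase could back either a single deployment or several narrower ones. Python is used for brevity, even though the task description anticipates Node.js.

```python
from typing import Callable, Dict, Iterable, Optional

Handler = Callable[[str], str]

# Hypothetical services; each one owns its own route prefix.
ROUTES: Dict[str, Handler] = {
    "denylist": lambda rest: f"denylist handled {rest}",
    "reputation": lambda rest: f"reputation handled {rest}",
}


def make_gateway(enabled: Optional[Iterable[str]] = None) -> Callable[[str], str]:
    """Build a dispatcher serving all services, or only the enabled subset."""
    names = set(ROUTES) if enabled is None else set(enabled)
    unknown = names - set(ROUTES)
    if unknown:
        raise ValueError(f"unknown services: {sorted(unknown)}")
    active = {name: ROUTES[name] for name in names}

    def dispatch(path: str) -> str:
        # The first path segment selects the service; the rest is passed through.
        service, _, rest = path.strip("/").partition("/")
        handler = active.get(service)
        if handler is None:
            return "404 unknown service"
        return handler(rest)

    return dispatch


# One deployment serving everything, and one dedicated to a single service.
combined = make_gateway()
dedicated = make_gateway(["denylist"])
```

Under these assumptions, splitting a service into its own deployment would only change which names are passed to `make_gateway`, not the handler code.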

  2. Is Wikimedia production even the right place for this? I think it could be, but I don't have a great answer on that. wmcs/toolforge can't work for a few security/privacy/TOU-related reasons. And there is now precedent for certain Important™ things to live elsewhere (WME/Okapi at AWS) but that presents a different set of challenges, while removing others.

My understanding is that Wikimedia Enterprise is using AWS because SRE couldn't/didn't want to provide contractual level SLAs (https://meta.wikimedia.org/wiki/Wikimedia_Enterprise/FAQ#Why_are_you_using_externally-operated_cloud_infrastructure/AWS) - I don't think that's applicable here, so I wouldn't treat it as a precedent. Otherwise I think production k8s is a pretty good fit for this. Things you get "for free": logging to logstash, monitoring and alerting, metrics to grafana, security monitoring via debmonitor, and probably more.

@sbassett before I can give any opinion on your proposal and your questions, I'd need to better understand (maybe through one practical example?) what the request flow would be, and what the intent for this software is.

From what I read, it does indeed look like we could just deploy a second instance of api-gateway with a different configuration of routes. But maybe I'm missing something here.

So, can you describe one practical use-case for this in some detail? It will help me ensure I don't give you bad advice :)

Thanks for filing this :) Sidenote: the name api-gateway is already in use (naming things is hard!!), it would be nice if this one could be slightly different.

Is security-api-gateway too close? security-api? wikimedia-security-api? I guess I'd hate to get too far away from simple names that precisely describe what the thing is.

I am curious what the advantages of separate authn/z are versus fronting it with a MediaWiki extension that takes care of authn/z before proxying the request to this service.

Probably none for right now, at least for initial use-cases. Though in the future it might provide us with increased flexibility for allowing integrations with new apps, tools, external organizations, etc. I know there are also some ideas floating around in regards to decoupling auth from mediawiki and other apps/services, though that's obviously well beyond the scope of this work for now.

I think it depends on how you expect clients to use it: will each one likely be calling just one service, or will they all want multiple? Would clients benefit from the gateway aggregating multiple services in one response? Another thing to consider is how we would make changes in the future: are all the clients under shared maintenance, so that we can expect maintainers to adapt quickly to new things, or will it require long deprecation periods?

My suggestion would be to build it as a gateway where each API/service is its own route, which is how Shellbox works. Then it's trivial to have one gateway deployment handle everything, or if we decide we want to split them into different deployments (e.g. want to dedicate more resources to one specific service), it'll be pretty straightforward to spin up a new deployment in k8s using the same codebase (e.g. shellbox-constraints, shellbox-syntaxhighlight, etc.).

The route(s)-per-service concept makes sense to me and would allow a lot of flexibility in spinning up or down various security-related services. So, sample routes following a pattern along the lines of: /service-name/version/endpoint1/whatever. Things like Shellbox certainly make sense as stand-alone services, but there are a handful of security-related services that could live under a single, monolithic api gateway IMO.
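The route pattern mentioned above could be parsed along these lines. This is only a sketch: the `v1`-style version segment, the lowercase service-name rule, and the `Route` and `parse_route` names are assumptions made for illustration, not an agreed convention.

```python
import re
from typing import NamedTuple, Optional


class Route(NamedTuple):
    service: str
    version: str
    endpoint: str


# Assumed shape: /<service-name>/<version>/<endpoint...>, with versions like "v1".
_ROUTE_RE = re.compile(
    r"^/(?P<service>[a-z][a-z0-9-]*)/(?P<version>v\d+)/(?P<endpoint>.+)$"
)


def parse_route(path: str) -> Optional[Route]:
    """Split a request path into service, version and endpoint, or return None."""
    match = _ROUTE_RE.match(path)
    if match is None:
        return None
    return Route(match["service"], match["version"], match["endpoint"])
```

Keeping the version in the path would let an individual service change its interface without affecting the routes of the other services behind the same gateway.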

My understanding is that Wikimedia Enterprise is using AWS because SRE couldn't/didn't want to provide contractual level SLAs (https://meta.wikimedia.org/wiki/Wikimedia_Enterprise/FAQ#Why_are_you_using_externally-operated_cloud_infrastructure/AWS) - I don't think that's applicable here, so I wouldn't treat it as a precedent. Otherwise I think production k8s is a pretty good fit for this. Things you get "for free": logging to logstash, monitoring and alerting, metrics to grafana, security monitoring via debmonitor, and probably more.

Yes, a Wikimedia production service does probably make the most sense for something like this, given the dissimilar requirements for this and something like WME.

@sbassett before I can give any opinion on your proposal and your questions, I'd need to better understand (maybe through one practical example?) what the request flow would be, and what the intent for this software is.

From what I read, it does indeed look like we could just deploy a second instance of api-gateway with a different configuration of routes. But maybe I'm missing something here.

So, can you describe one practical use-case for this in some detail? It will help me ensure I don't give you bad advice :)

Hey @Joe -

Sure, the first use-case for the security api gateway would be a couple of simple routes for clients to access data from a commercially-licensed deny-list. Much of the background for this is discussed, at length, within T265845 (I just subbed you there). I would envision a couple of basic routes: search, which could take an IP address, CIDR block, etc. and return results, and perhaps diff, which would provide a list of current IP addresses based upon the vendor's updated feed (daily or 5-minute intervals, depending upon the product). This data would be consumable by various Wikimedia bots, MW extensions and potentially other, related security tooling down the road.

I understand that there is a similar use-case for MaxMind data within Wikimedia production, but that approach appeared to be a bit too inflexible for how the community and various WMF teams might want to consume relevant data. If that is not the case, then that approach might obviate the need for a security api service like this, but I would want to ensure the flexibility to quickly add or remove security-related services and provide convenient access to these services for various applications and tools.
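To make the two routes a little more concrete, here is a rough sketch of the logic that might sit behind them, using Python's standard `ipaddress` module. Everything in it is an assumption: the vendor feed format is not described in this task, the function names `load_feed`, `search` and `diff` are hypothetical, and the diff is interpreted here as "entries added and removed between two snapshots", which is only one possible reading. The sample addresses are reserved documentation ranges, not real feed data.

```python
import ipaddress
from typing import Iterable, List, Set, Tuple, Union

Network = Union[ipaddress.IPv4Network, ipaddress.IPv6Network]


def load_feed(entries: Iterable[str]) -> Set[Network]:
    """Parse feed entries (single addresses or CIDR blocks) into networks."""
    # strict=False accepts entries with host bits set, e.g. "203.0.113.5/24".
    return {ipaddress.ip_network(entry, strict=False) for entry in entries}


def search(feed: Set[Network], query: str) -> List[str]:
    """Return feed entries that overlap the queried address or CIDR block."""
    target = ipaddress.ip_network(query, strict=False)
    hits = [
        net for net in feed
        if net.version == target.version and net.overlaps(target)
    ]
    return sorted(str(net) for net in hits)


def diff(previous: Set[Network], current: Set[Network]) -> Tuple[List[str], List[str]]:
    """Return (added, removed) entries between two feed snapshots."""
    added = sorted(str(net) for net in current - previous)
    removed = sorted(str(net) for net in previous - current)
    return added, removed


# Documentation-only address ranges (RFC 5737 / RFC 3849), not real feed data.
yesterday = load_feed(["192.0.2.0/24", "198.51.100.7"])
today = load_feed(["192.0.2.0/24", "203.0.113.0/25", "2001:db8::/32"])
```

A linear scan like this would probably be too slow for a large feed; a real service would more likely use a prefix tree or a database index, but the request and response shapes could stay the same.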

sbassett triaged this task as Medium priority. Oct 18 2021, 4:17 PM
sbassett updated the task description.

@Reedy mentioned that we probably should create a Phab project for this work, at some point.

Sounds good. I'll decline this one in favor of that one. Not sure what sub-tasks here are still relevant, though I imagine those can be declined or added to the new task at some point.