API keys
Open, Needs Triage, Public

Description

Require client API keys. Not strictly required for the Parsoid REST API, but it's important for later REST API work, so we may want to think about it now.

Ideally, 2-legged OAuth 2.0 bearer tokens for this particular API. Basically, a "magic cookie" provided out-of-band that identifies the client.
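For illustration, a bearer token in the RFC 6750 sense is just an opaque string sent in the `Authorization` header. The sketch below only builds such a request without sending it; the endpoint URL and the token value are hypothetical placeholders, not anything this task defines.

```python
import urllib.request


def build_request(url: str, token: str) -> urllib.request.Request:
    """Attach a client token, obtained out-of-band, as a bearer credential."""
    req = urllib.request.Request(url)
    # RFC 6750 header form: "Authorization: Bearer <token>"
    req.add_header("Authorization", f"Bearer {token}")
    return req


# Hypothetical endpoint and token, for illustration only; nothing is sent.
req = build_request(
    "https://api.example.org/v1/page/Example",
    "hypothetical-client-token",
)
print(req.get_header("Authorization"))
```

Whoever holds the string can present it, which is why the token has to be provisioned out-of-band and kept private by the client.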

Event Timeline

Anomie added a subscriber: Anomie. Apr 17 2019, 1:30 AM

This task asserts that it's important, but does not state why.

Our other APIs rely on MediaWiki's standard account authentication without excess tokens and such. Considering many of our clients are open source and tend to run in untrusted environments (such as browser JS), what would this gain us?

Tgr added subscribers: bd808, Tgr. Apr 19 2019, 4:58 PM

There was a push for API keys as a prerequisite for some vague monetization ideas when the previous ED was trying to reshuffle things; that was opposed vigorously and eventually got dropped. I don't remember the details, I think @bd808 was asked to work on it since he was doing API analytics at the time, so maybe he has pointers to the old discussions.

Currently we have user agents as a very ad-hoc client identification mechanism, and AFAIK we don't even make much use of that.

bd808 added a comment. Apr 19 2019, 7:57 PM

I don't know if any of the past discussions are actually persisted in a meaningful form, sadly. My recollection is that the core idea of requiring some sort of API token was not the main point of contention. What I was strongly against at the time was doing so solely to create a "fast lane" for registered users providing monetary compensation.

Introducing a strict requirement for authenticated access/API tokens to existing APIs would obviously be a compatibility-breaking change, but I can personally see possible advantages for both API consumers and providers in having some more reliable method to authorize and monitor API usage. It is a fairly common operational support question to ask "how can I contact the person who is doing X", where 'X' is usually an extreme rate of requests or errors in using one of our many API endpoints. It might also help with the long-stalled task T102079: Metrics about the use of the Wikimedia web APIs, where one of the blockers to the "Ranking of user agents" is that User-Agent information is considered PII and thus not something we can easily use in request correlation.
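As a rough sketch of the "who is doing X" lookup, assuming requests were logged with a registered client key: the registry, the key names, and the contact addresses below are all hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Optional, Tuple

# Hypothetical registry, filled in when a client registers for a key.
CONTACTS = {
    "key-tool-a": "tool-a-maintainer@example.org",
    "key-tool-b": "tool-b-maintainer@example.org",
}


def heaviest_clients(
    request_keys: Iterable[str], top_n: int = 1
) -> List[Tuple[str, int, Optional[str]]]:
    """Return (key, request count, contact) for the busiest client keys."""
    counts = Counter(request_keys)
    return [(key, n, CONTACTS.get(key)) for key, n in counts.most_common(top_n)]


# One client key per logged request.
log = ["key-tool-a", "key-tool-b", "key-tool-a", "key-tool-a"]
print(heaviest_clients(log))
```

Because the correlation key here is a registered identifier rather than the User-Agent string, this kind of ranking would not need to touch the field that is treated as PII.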

Introducing an authentication requirement for a new collection of API endpoints would be a less disruptive change for consumers as they could be made aware of it from day 0 rather than relying on broadcast messages and warnings in the API responses to announce a compatibility change.

Anomie added a comment (edited). Apr 19 2019, 8:08 PM

The difficulty with all of that is that it assumes clients can and will keep their keys private. In practice, I think we'd likely find that malicious actors would just copy the client key out of AWB, Pywikibot, MediaWiki's client JavaScript, or the like. Or else every user of AWB, Pywikibot, and so on would have to individually register their own key, making it little different from just using the authenticated account plus the user agent to identify users.

I note that OAuth 1.0a has the same problem, BTW, while IIRC OAuth 2 just says "don't actually rely on these keys for identifying 'public' clients" without really providing any solution.

Many, but not all, of our current clients are Open Source or Web-based. We would like to have more clients, some of which will not be Open Source or Web-based.

We would like to have an idea of how much traffic is being used per client. Yes, monetizing this relationship is part of the discussion.

For me, another aspect is getting a formal agreement with syndicators of our content that requires not only license compliance but also good practices like good-faith efforts to enable editing.

I think that it's reasonable to allow some unauthenticated traffic (no identified client, no identified user). It's likely we'd have a lower rate limit per T221162 for unauthenticated requests.
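A minimal sketch of such tiered limits, using a fixed-window counter: the limit values, the tier names, and the choice to fall back to the IP address for unidentified traffic are assumptions for illustration, not decisions recorded in T221162.

```python
from collections import defaultdict
from typing import Optional

# Hypothetical per-window limits; real values would be a policy decision.
LIMITS = {"anonymous": 10, "identified": 100}


class FixedWindowLimiter:
    """Count requests per identity in fixed time windows."""

    def __init__(self, window_seconds: int = 60) -> None:
        self.window = window_seconds
        # (identity, window index) -> requests seen; never pruned in this sketch.
        self.counts = defaultdict(int)

    def allow(self, client_key: Optional[str], ip: str, now: float) -> bool:
        # Requests without a client key fall into the lower tier, keyed by IP.
        tier = "identified" if client_key else "anonymous"
        identity = client_key or ip
        bucket = (identity, int(now // self.window))
        if self.counts[bucket] >= LIMITS[tier]:
            return False
        self.counts[bucket] += 1
        return True


limiter = FixedWindowLimiter()
results = [limiter.allow(None, "192.0.2.1", now=0.0) for _ in range(11)]
print(results.count(True), results[-1])
```

A production limiter would need shared state across servers and expiry of old windows; the point here is only that the tier is chosen by whether a client key is present.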

oauth.com has a pretty good discussion on how to use API keys without client secrets in the context of browser-based apps, https://www.oauth.com/oauth2-servers/single-page-apps/ .

> We would like to have an idea of how much traffic is being used per client. Yes, monetizing this relationship is part of the discussion.

I couldn't find much past discussion about monetization, but the thread starting at https://lists.wikimedia.org/pipermail/wikimedia-l/2016-January/081126.html did turn up.

> I think that it's reasonable to allow some unauthenticated traffic (no identified client, no identified user). It's likely we'd have a lower rate limit per T221162 for unauthenticated requests.

I suspect you'll have a hard time striking a balance between continuing to allow our volunteers to create the tools that help them work and forcing these "syndicators" you're worried about to use your monetized mechanism.

I don't think the public REST API that this task is part of is a good place to try to draw that line. It seems more likely to me that you'd merely push people to not use the REST API.

> oauth.com has a pretty good discussion on how to use API keys without client secrets in the context of browser-based apps, https://www.oauth.com/oauth2-servers/single-page-apps/ .

https://www.oauth.com/oauth2-servers/single-page-apps/#security-considerations is exactly what I was referring to about it not really providing a solution. Even requiring that the client use a pre-registered redirect_uri for some measure of security, as mentioned there, will fail if the malicious app reusing some other app's key is running a webview or the like where it's not limited by the same-origin policy.

Remember, the attack scenario here isn't to fool the end user into thinking the malicious app is actually the copied app. We can probably assume the "end user" is well aware. It's to fool MediaWiki into thinking it's the copied app, so as to take advantage of the copied app's higher rate limits or to avoid being easily blocked, as revoking the copied API key would also affect users of the copied app.
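To make that concrete, here is a toy model of the server-side check for a public client (one with no client secret). The client ID, the redirect URI, and the function names are hypothetical, and the check is deliberately reduced to what the server can actually see.

```python
from dataclasses import dataclass

# Hypothetical registry of public clients: client_id -> registered redirect URI.
REGISTERED = {"example-tool-id": "https://tool.example.org/callback"}


@dataclass(frozen=True)
class AuthRequest:
    client_id: str
    redirect_uri: str


def server_accepts(req: AuthRequest) -> bool:
    """All the server can verify: the redirect URI matches the registration."""
    return REGISTERED.get(req.client_id) == req.redirect_uri


def genuine_app_request() -> AuthRequest:
    return AuthRequest("example-tool-id", "https://tool.example.org/callback")


def impostor_request() -> AuthRequest:
    # Both values are copied from the genuine app's published source. The
    # impostor intercepts the redirect itself, for example in a webview.
    return AuthRequest("example-tool-id", "https://tool.example.org/callback")


print(server_accepts(genuine_app_request()), server_accepts(impostor_request()))
print(genuine_app_request() == impostor_request())
```

The two requests are equal field for field, so any rate limit or revocation applied to the key falls on both apps alike.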

Tgr added a comment. Apr 24 2019, 12:50 AM

I'd be skeptical about there being much overlap between the clients which use a high enough request rate to cause us significant problems or incur nontrivial costs, and the clients which can be monetized. Google and co have the engineering chops to optimize retrieval rates, set up local copies of the data, etc. It's the odd community bot / university researcher / Turkey mirror that would be caught by the limits.

Also, I think it's good to consider the realities of WMF engineering. Right now, creating developer accounts (which would presumably be the means to register an API key) has been disabled for several weeks due to a spam situation we are not well equipped to deal with. If such a scenario blocked API access (or high-rate API access, for some value of "high rate") entirely, it would be much more disruptive (and it already is pretty disruptive). We don't have an API team, we don't have an auth team, our partnership team is not really tech-focused... API monetization might be blocked on organizational gaps as much as technical ones.