
Make OAuth work in Multi-DC active/active mode
Closed, Resolved, Public

Description

OAuth is almost the last thing left on Redis. It will have to move off for multi-DC.

  • Review the OAuth spec and extension and extract data store requirements
  • Implement those requirements in WMF production
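+------------------------------------+----------+-------------------+---------+-----------+
| Name                               | TTL      | Delete on consume | Traffic | Solution  |
+------------------------------------+----------+-------------------+---------+-----------+
| OAuth 1.0 request tokens           | 10 mins  | yes               | low     | mainstash |
| OAuth 1.0 consumer & callback data | 10 mins  | no                | low     | mainstash |
| OAuth 1.0 nonces                   | 5 mins   | no                | high    | mcrouter  |
| OAuth 2.0 auth codes               | 4 hours  | yes               | low     | mainstash |
| OAuth 2.0 refresh tokens           | 365 days | yes               | low     | mainstash |
+------------------------------------+----------+-------------------+---------+-----------+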

Event Timeline

mw:OAuth/For Developers is a good source for most of this.

OAuth 1.0

Typical request flow:

  • A request token and an initial access token are created by a GET request to Special:OAuth/initiate.
  • The request token is fetched a short time later during a GET request to Special:OAuth/authorize
  • If there is an existing authorization in DB_REPLICA, Special:OAuth/authorize may redirect to the callback.
  • If there was no existing authorization, the user will submit the form, causing a POST request to Special:OAuth/authorize. This POST request connects to DB_PRIMARY and inserts the authorization, then redirects the user to the callback.
  • The client proceeds with a request to the protected resource using the access token.

So a user near a DC could create an access token with a GET request, and use it in any API request <100ms later.

The OAuth 1.0a spec recommends (with SHOULD not MUST) nonce consumption on every access to a protected resource, to prevent replay attacks. I think it's not worth delaying a response for cross-DC synchronous nonce consumption, since the protection it gives is insignificant when HTTPS is used. It is weak protection in any case, since an attacker can beat the client in a race to use a nonce. MediaWiki implements this requirement by checking the return value of BagOStuff::add() in MWOAuthDataStore::lookup_nonce().
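The add-if-absent idea behind that nonce check can be sketched as follows. This is a minimal illustration, not the extension's code: `TinyCache`, `nonce_is_fresh` and the key layout are hypothetical names, and the in-memory dict only stands in for a cache whose add operation fails when the key already exists.

```python
import time


class TinyCache:
    """In-memory stand-in for a cache with add-if-absent semantics (hypothetical)."""

    def __init__(self, clock=time.monotonic):
        self._items = {}  # key -> expiry timestamp
        self._clock = clock

    def add(self, key, ttl):
        """Store the key only if it is absent or expired; return True if stored."""
        now = self._clock()
        expiry = self._items.get(key)
        if expiry is not None and expiry > now:
            return False  # key still present: this nonce was seen before
        self._items[key] = now + ttl
        return True


NONCE_TTL = 5 * 60  # seconds, matching the 5 minute nonce TTL in this task


def nonce_is_fresh(cache, consumer_key, token, nonce):
    """Return True the first time a (consumer, token, nonce) triple is seen."""
    key = f"nonce:{consumer_key}:{token}:{nonce}"  # assumed key layout
    return cache.add(key, NONCE_TTL)
```

A check like this only detects replays that reach the same cache, which is the trade-off accepted above when skipping synchronous cross-DC consumption.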

An OAuth 2.0 review is coming.

OAuth 2.0

  • The client initiates authorization with a GET request to /w/rest.php/oauth2/authorize. If user interaction is required, it redirects to Special:OAuth/approve (getApprovalRedirectResponse). If the request is approved without interaction, it redirects to the callback URL. There are multiple grant types, but typically AuthCodeGrant calls issueAuthCode which calls persistNewAuthCode() which calls CacheRepository::set() which calls BagOStuff::add().
  • The client then posts the auth code to /w/rest.php/oauth2/access_token. AuthCodeGrant::respondToAccessTokenRequest() verifies the auth code by fetching it from the cache. Then it creates an access token and a refresh token. The access token is written to DB_PRIMARY (AccessTokenRepository). The refresh token is written to the BagOStuff after a non-atomic uniqueness check (RefreshTokenRepository). I don't know why access tokens, with an expiry time of 4 hours, are stored permanently in the database, whereas refresh tokens, with an expiry time of a year, are stored in a BagOStuff.
  • The client can get new access and refresh tokens by posting the refresh token to /w/rest.php/oauth2/access_token. RefreshTokenGrant responds by revoking the old tokens and issuing new ones.

There is no nonce consumption. The revocation UI does not explicitly revoke tokens, but resource requests using access tokens are validated against the database in AccessTokenEntity::confirmClientUsable().
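The refresh step in the flow above (the old refresh token is consumed and a new one is issued) can be sketched like this. It is a simplified illustration under assumptions: `TokenStore` and `rotate_refresh_token` are hypothetical names, a dict stands in for the BagOStuff, and access token handling is left out.

```python
import secrets


class TokenStore:
    """Dict-backed stand-in for the key-value store holding refresh tokens."""

    def __init__(self):
        self._data = {}

    def put(self, token, payload):
        self._data[token] = payload

    def consume(self, token):
        """Return the payload and delete it, or None if unknown or already used."""
        return self._data.pop(token, None)


def rotate_refresh_token(store, old_token):
    """Exchange a refresh token for a new one; the old one stops working."""
    payload = store.consume(old_token)
    if payload is None:
        raise ValueError("refresh token is unknown or already consumed")
    new_token = secrets.token_urlsafe(32)
    store.put(new_token, payload)
    return new_token
```

Because the old token is deleted when it is consumed, presenting it a second time fails, which is why this data needs a store that is reasonably consistent across DCs.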

TTL summary

  • OAuth 1.0 request tokens: 10 minutes, deleted when consumed
  • OAuth 1.0 access tokens: 10 minutes
  • OAuth 1.0 callback data: 10 minutes
  • OAuth 1.0 nonces: 5 minutes
  • OAuth 1.0 authorizations: stored in the DB forever
  • OAuth 2.0 auth codes: 4 hours
  • OAuth 2.0 refresh tokens: 365 days, deleted when consumed.
  • OAuth 2.0 access tokens: valid for 4 hours, but stored forever in the DB. Deleted when refreshed.

BagOStuff usage summary

[edit: moved to task description]

I'm not sure we actually enforce OAuth 2 access token expiry.

> So a user near a DC could create an [OAuth 1] access token with a GET request, and use it in any API request <100ms later.

IIRC, OAuth 1 access tokens are stored as part of the authorization (oauth_accepted_consumer DB table) and only created when the authorization form is POSTed. Only the request token and the nonce are stored in memcached for OAuth 1.

> I'm not sure we actually enforce OAuth 2 access token expiry.

I almost filed a bug along those lines, because oaat_expires is written but never read, and T265075 was not very convincing. But there's an expiry time in the JWT bearer token which is parsed out and used for validity by the library. So I think it is enforced.
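As a rough illustration of expiry carried inside the bearer token, the sketch below reads the standard `exp` claim from a JWT payload. It is not the library's code, and it deliberately skips signature verification, which any real validation must perform first; `jwt_expiry` and `is_expired` are hypothetical helper names.

```python
import base64
import json
import time


def jwt_expiry(token):
    """Return the 'exp' claim (Unix seconds) from a JWT payload, or None."""
    payload_b64 = token.split(".")[1]
    # JWT segments are base64url without padding; restore it before decoding.
    padded = payload_b64 + "=" * (-len(payload_b64) % 4)
    claims = json.loads(base64.urlsafe_b64decode(padded))
    return claims.get("exp")


def is_expired(token, now=None):
    """True if the token carries an 'exp' claim that is not in the future.

    NOTE: no signature check is done here; this only illustrates the expiry test.
    """
    exp = jwt_expiry(token)
    now = time.time() if now is None else now
    return exp is not None and now >= exp
```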

> IIRC, OAuth 1 access tokens are stored as part of the authorization (oauth_accepted_consumer DB table) and only created when the authorization form is POSTed. Only the request token and the nonce are stored in memcached for OAuth 1.

That's true, thanks. It was confusing me that there are three keys written to memcached in MWOAuthDataStore::new_request_token(), but it seems to be just three items of information relating to the request token, broken out into three keys instead of stored in an array.

I confirmed the traffic estimate I made during the code review by capturing Redis traffic to mc1038.

  • 10000 captured packets over 3.3 seconds
  • 5803 were decoded by tcpdump
  • 5492 were ChronologyProtector
  • 211 were OAuth nonces
  • 98 were CentralAuth sessions
  • 1 was an OAuth request token
  • 1 was api-token-blacklist from CentralAuthTokenSessionProvider
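To make the proportions explicit, the decoded packet counts above can be turned into shares and rough per-second rates. The counts and the 3.3 second window come from the capture. The rates cover only the decoded packets on this one host, so treat them as a lower bound rather than a measurement of total traffic.

```python
decoded = {
    "ChronologyProtector": 5492,
    "OAuth nonces": 211,
    "CentralAuth sessions": 98,
    "OAuth request token": 1,
    "api-token-blacklist": 1,
}
total = sum(decoded.values())  # matches the 5803 packets decoded by tcpdump
capture_seconds = 3.3

# Percentage of decoded packets per category.
shares = {name: 100 * n / total for name, n in decoded.items()}

for name, n in decoded.items():
    print(f"{name:22s} {n:5d}  {shares[name]:6.2f}%  {n / capture_seconds:7.1f}/s")
```

Nonces outnumber request tokens by 211 to 1 in this sample, which supports treating nonces as the only high-traffic OAuth item.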

So I think the solution is going to be to split the configuration, with nonces in mcrouter and the rest in MySQL mainstash.
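The proposed split can be written down as a small policy table. This is only a sketch: the item-type names and the `StorePolicy` and `backend_for` helpers are hypothetical, while the TTLs, delete-on-consume flags and backends are copied from the table in the task description.

```python
from dataclasses import dataclass

MINUTE = 60
HOUR = 60 * MINUTE
DAY = 24 * HOUR


@dataclass(frozen=True)
class StorePolicy:
    ttl_seconds: int
    delete_on_consume: bool
    backend: str  # "mcrouter" or "mainstash"


# Hypothetical item-type names; values taken from the task description table.
POLICIES = {
    "oauth1_request_token": StorePolicy(10 * MINUTE, True, "mainstash"),
    "oauth1_consumer_callback": StorePolicy(10 * MINUTE, False, "mainstash"),
    "oauth1_nonce": StorePolicy(5 * MINUTE, False, "mcrouter"),
    "oauth2_auth_code": StorePolicy(4 * HOUR, True, "mainstash"),
    "oauth2_refresh_token": StorePolicy(365 * DAY, True, "mainstash"),
}


def backend_for(item_type):
    """Return the name of the backend that should hold this kind of item."""
    return POLICIES[item_type].backend
```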

One potential complication for multi-DC I can think of is that the OAuth 1 /initiate and /authorize requests are close in time (the application initiates the second as soon as the first is finished). The first writes to the token store BagOStuff, the second needs to be able to read that value from the token store, and they might end up in different DCs (the first comes from the application server, the second from the user). The OAuth spec recommends POST for the /initiate and /token requests, but does not mandate it, and I'm not sure that it would make a difference for token store replication in any case.

If the request token data hasn't been replicated yet when /authorize tries to read it, I don't think that's tragic (the user will see an error, and a refresh will make the auth form work), but it's not ideal. I'm not sure how the replication time for whatever new medium the token store will use compares to the time between the /initiate and /authorize requests (if they go to different DCs, the user and the server must be far from each other or from the DCs).

> One potential complication for multi-DC I can think of is that the OAuth 1 /initiate and /authorize requests are close in time

We can route Special:OAuth/initiate, Special:OAuth/authorize and /w/rest.php/oauth2/authorize to the primary DC in ATS. In general, if a GET request produces a token, and a POST request consumes it, we can avoid complications due to cross-DC replication by sending both to the primary DC. We have such routing already for CentralAuth.
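The routing rule described here could look roughly like the following predicate. It is a sketch in Python rather than actual ATS configuration: `wants_primary_dc` is a hypothetical name, and the two URL forms it recognizes for special pages (`/wiki/<title>` and `/w/index.php?title=<title>`) are assumptions about how the pages are addressed.

```python
from urllib.parse import parse_qs, unquote, urlsplit

# Endpoints named in the comment above.
PRIMARY_ONLY_TITLES = ("Special:OAuth/initiate", "Special:OAuth/authorize")
PRIMARY_ONLY_PATHS = ("/w/rest.php/oauth2/authorize",)


def wants_primary_dc(url):
    """Return True if the request should be pinned to the primary DC."""
    parts = urlsplit(url)
    path = unquote(parts.path)
    if path in PRIMARY_ONLY_PATHS:
        return True
    if path.startswith("/wiki/"):
        # Assumed pretty URL form: /wiki/<title>
        title = path[len("/wiki/"):]
    else:
        # Assumed index.php form: /w/index.php?title=<title>
        title = parse_qs(parts.query).get("title", [""])[0]
    return title in PRIMARY_ONLY_TITLES
```

Pinning both the token-producing and the token-consuming request to one DC sidesteps replication lag for the short-lived request token.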

Ideally we don't want special case routing for subsequent resource requests. But UseDC cookies should be sent after the access token is written to the DB, so the client should be OK as long as it supports cookies.

> Ideally we don't want special case routing for subsequent resource requests. But UseDC cookies should be sent after the access token is written to the DB, so the client should be OK as long as it supports cookies.

The cookies will be sent to the end user's browser (as the DB write happens when the authorization form is posted by the user), but the subsequent requests are made by the application server, which is a separate client with no access to those cookies (except for some OAuth 2 public apps), and coming from a different IP. From the POV of our network infrastructure, there is nothing linking the two requests.

That said, routing the various authorization-related OAuth endpoints to primary should mostly avoid any potential problems. OAuth 2 bearer tokens are self-contained: all the server needs to verify them is the public key used to sign/encrypt the token (revocation checks do rely on the DB, but a small delay in revocations should be insignificant). OAuth 1 does rely on DB lookups, and the logic for it seems slightly broken (MWOAuthDataStore::lookup_token() does primary fallback, but SessionProvider::provideSessionInfo() repeats the same DB read right after without a fallback), but I imagine [/authorize response + redirected request to application server + /token request + /token response + resource request] is enough time for the DB write to replicate.
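The primary fallback pattern mentioned for the OAuth 1 lookup can be illustrated with a minimal sketch. Plain dicts stand in for the replica and primary databases, and `lookup_with_primary_fallback` is a hypothetical name, not the extension's function.

```python
def lookup_with_primary_fallback(key, replica, primary):
    """Read from the replica first; on a miss, retry against the primary."""
    value = replica.get(key)
    if value is None:
        # The row may exist but not have replicated yet, so ask the primary.
        value = primary.get(key)
    return value


# A write that has reached the primary but not yet the replica.
primary_db = {"access-token-abc": {"user": 42}}
replica_db = {}

with_fallback = lookup_with_primary_fallback("access-token-abc", replica_db, primary_db)
replica_only = replica_db.get("access-token-abc")
```

Here `with_fallback` finds the row while `replica_only` is `None`, which is the kind of mismatch a second replica-only read right after a fallback read can produce.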

Change 816882 had a related patch set uploaded (by Tim Starling; author: Tim Starling):

[mediawiki/extensions/OAuth@master] Configure the nonce cache separately from the session cache

https://gerrit.wikimedia.org/r/816882

Change 816884 had a related patch set uploaded (by Tim Starling; author: Tim Starling):

[operations/mediawiki-config@master] Set cache types for OAuth multi-DC

https://gerrit.wikimedia.org/r/816884

Change 817086 had a related patch set uploaded (by Tim Starling; author: Tim Starling):

[operations/puppet@production] Multi-DC routing special cases for OAuth

https://gerrit.wikimedia.org/r/817086

Change 816882 merged by jenkins-bot:

[mediawiki/extensions/OAuth@master] Configure the nonce cache separately from the session cache

https://gerrit.wikimedia.org/r/816882

Change 817860 had a related patch set uploaded (by Tim Starling; author: Tim Starling):

[mediawiki/extensions/OAuth@wmf/1.39.0-wmf.21] Configure the nonce cache separately from the session cache

https://gerrit.wikimedia.org/r/817860

Change 817861 had a related patch set uploaded (by Tim Starling; author: Tim Starling):

[mediawiki/extensions/OAuth@wmf/1.39.0-wmf.22] Configure the nonce cache separately from the session cache

https://gerrit.wikimedia.org/r/817861

Change 817860 merged by jenkins-bot:

[mediawiki/extensions/OAuth@wmf/1.39.0-wmf.21] Configure the nonce cache separately from the session cache

https://gerrit.wikimedia.org/r/817860

Change 817861 merged by jenkins-bot:

[mediawiki/extensions/OAuth@wmf/1.39.0-wmf.22] Configure the nonce cache separately from the session cache

https://gerrit.wikimedia.org/r/817861

Mentioned in SAL (#wikimedia-operations) [2022-07-28T01:11:17Z] <tstarling@deploy1002> Synchronized php-1.39.0-wmf.22/extensions/OAuth: New config var for T313578, not yet used (duration: 03m 39s)

Mentioned in SAL (#wikimedia-operations) [2022-07-28T01:18:32Z] <tstarling@deploy1002> Synchronized php-1.39.0-wmf.21/extensions/OAuth: New config var for T313578, not yet used (duration: 03m 23s)

Change 816884 merged by jenkins-bot:

[operations/mediawiki-config@master] Set cache types for OAuth multi-DC

https://gerrit.wikimedia.org/r/816884

Mentioned in SAL (#wikimedia-operations) [2022-07-28T01:28:04Z] <tstarling@deploy1002> Synchronized wmf-config/InitialiseSettings.php: move OAuth token storage T313578 (duration: 03m 04s)

After ~14 minutes:

MariaDB [mainstash]> select count(*),sum(length(value)) from objectstash where keyname like 'OAUTH%';
+----------+--------------------+
| count(*) | sum(length(value)) |
+----------+--------------------+
|     1020 |              74771 |
+----------+--------------------+

db1141 looks healthy. memcached looks fine -- mcrouter was already doing 9k adds per second so an extra 500/s is not obvious in the dashboards. The expected drop in set commands is visible in the Redis dashboard.

I also confirmed that OAuth-enabled tools are still appearing in Commons RC.

Change 817086 merged by Tim Starling:

[operations/puppet@production] Multi-DC routing special cases for OAuth

https://gerrit.wikimedia.org/r/817086

Mentioned in SAL (#wikimedia-operations) [2022-07-29T00:48:16Z] <TimStarling> slowly restarting (with batch 1 sleep 5) trafficserver on text caches to fully deploy g 817086 T313578

Change 901333 had a related patch set uploaded (by Gergő Tisza; author: Gergő Tisza):

[operations/puppet@production] multi-dc: Use primary for OAuth for both URL forms

https://gerrit.wikimedia.org/r/901333

Change 901333 merged by Ssingh:

[operations/puppet@production] multi-dc: Use primary for OAuth for both URL forms

https://gerrit.wikimedia.org/r/901333

Mentioned in SAL (#wikimedia-operations) [2023-03-23T16:59:43Z] <sukhe> rolling out CR 901333 to A:cp-text T313578