To limit the impacts of scraping, we want to empower infrastructure layers in front of MediaWiki (e.g. Varnish or Envoy) to make throttling decisions that take into account the identity of the client (e.g. is it a trusted bot?). This requires being able to partially interpret session tokens used by MediaWiki, before MediaWiki processes the request.
(Partially as in: we need to be able to verify that a given session token was valid in the recent past, and which user it belonged to. We don't necessarily need to be able to verify that it is still valid.)
For an overview of MediaWiki session tokens, see T392633: [WE5.5.3 research spike] Inventory of current MediaWiki session authentication mechanisms.
There are multiple ways in which this could be done:
Session info endpoint
Create a session info endpoint in MediaWiki. This could be a REST API, or a PHP entry point for a slight speedup. It runs Setup.php (which, among other things, sets up SessionManager and uses it to identify the user based on the session providers various MediaWiki extensions have registered, and autocreates the user if needed), and returns basic information about the user (central user ID, whether they have the highratelimit right, whether the request is performed through an OAuth app etc).
Infrastructure layers before MediaWiki (e.g. Varnish or Envoy) can fork an incoming request to this endpoint to obtain information relevant for throttling. Since MediaWiki request initialization is relatively expensive, this cannot be done for every request, and it would be that layer's responsibility to persist that information during subsequent requests from that client, e.g. in the form of an edge-managed session cookie.
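A minimal sketch of the edge-side persistence described above, in Python for illustration only (a real implementation would live in the Varnish or Envoy configuration). The `SessionInfo` fields, the `lookup` callable standing in for the forked request to the session info endpoint, and the use of an in-memory cache keyed by something like the value of the edge-managed cookie are all assumptions, not existing interfaces.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple


@dataclass(frozen=True)
class SessionInfo:
    # Field names are illustrative, not an existing MediaWiki response format.
    central_user_id: Optional[int]
    high_rate_limit: bool
    via_oauth: bool


class EdgeSessionCache:
    """Caches session info per client so the endpoint is not hit on every request."""

    def __init__(
        self,
        lookup: Callable[[str], SessionInfo],
        ttl_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._lookup = lookup  # stands in for the forked request to MediaWiki
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, SessionInfo]] = {}

    def get(self, client_key: str) -> SessionInfo:
        now = self._clock()
        cached = self._entries.get(client_key)
        if cached is not None and now - cached[0] < self._ttl:
            return cached[1]
        # Cache miss or expired entry: pay the cost of MediaWiki request setup.
        info = self._lookup(client_key)
        self._entries[client_key] = (now, info)
        return info
```

The TTL is the only knob in this sketch: a shorter value means more calls to MediaWiki, a longer value means the edge may act on an outdated identity for longer.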
Strengths:
- Minimal amount of work (at least on the MediaWiki side), could be done with <100 lines of code.
- Single solution that is completely decoupled from session handling options / changes in MediaWiki.
Weaknesses:
- Slight performance hit for new sessions (Setup.php is 10% of the processing time for the average REST API request)
- Too expensive to call on every request, so only useful to the extent the edge can identify requests as coming from the same client. E.g. for a scraping botnet that uses a very large, nonspecific IP pool and a nonspecific user agent, it wouldn't really help.
- Whatever persistence mechanism is used outside of MediaWiki can get out of sync. E.g. the user logs out and logs back in as a different user, but the old edge-managed session cookie is still present so the edge layer sees no reason to invoke the session info endpoint again, and still sees the previous user identity.
- Probably requires all clients to honor Set-Cookie headers, which might not be a given.
Turn all session tokens into JWTs
OAuth 2 session tokens (the access tokens that the client needs to provide in the Authorization header) are signed JWTs. Any part of the infrastructure that has access to the OAuth public key can verify the signature and read the user information the token contains (expiry, central user ID, rate limits etc). There is no way to fake these tokens without the private key. All the other session mechanisms use random tokens which are opaque to the user; these tokens could be given a similar structure.
Strengths:
- Various parts of the infrastructure could interpret and partially validate session tokens on their own, without any dependency on MediaWiki or a central service. (Partially as in, it couldn't account for invalidation mechanisms for once-valid tokens, such as logout or the revocation of OAuth authorization.)
- Performance hit is minimal (just the cost of verifying a JWT signature).
- Reuses (sort of) an existing mechanism - this is already done by the API gateway for OAuth 2.
- Standardizing session token structure seems like a good evolution for MediaWiki core generally, easy to make use of this capability in non-Wikimedia installations.
- Standardizing session token structure could allow treating non-MediaWiki-centric applications (e.g. Gerrit) the same way at the edge as long as their authentication mechanisms can be configured to output a similar token.
- Conceptually decouples session creation (you need the JWT private key) and session validation (you only need the public key), which is a big step towards T391784: Gradually isolate mediawiki authentication code and infrastructure (ie. limiting the extent of damage an attacker can do with a short-lived remote code execution exploit in MediaWiki).
Weaknesses:
- Significant amount of work. There are nine different kinds of session tokens currently used by MediaWiki; we might introduce new ones in the future. For some of them the change might be nontrivial (e.g. require a DB schema change because the token gets larger).
- Even if all session tokens use the same structure, there is still a lot of diversity in where these tokens might be found. (Authorization header, various cookies, query parameters...) Cookie parsing can be awkward in some layers of the infrastructure (e.g. Varnish).
- If a request contains multiple session tokens (e.g. both an OAuth header and a session cookie), it's hard to ensure that MediaWiki and other parts of the infrastructure agree on how to prioritize them.
- Coupled with MediaWiki session handling logic. Future changes to session handling, new mechanisms etc. will need to take infrastructure requirements into account.
- Session token size increases significantly, which means requests get larger.
- Some disruption to users as old tokens become invalid. Some of these tokens (notably user token cookies and OAuth 1 access tokens) are meant to be long-lived.
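The token location and prioritization problem can be made concrete with a small sketch. Everything specific in it is an assumption: the cookie name `sessionJwt` is hypothetical, the priority order (Authorization header before cookie) is chosen arbitrarily for illustration and would have to match whatever MediaWiki does, and header names are assumed to be already normalized to the capitalization shown.

```python
from http.cookies import CookieError, SimpleCookie
from typing import Mapping, Optional, Tuple

# Hypothetical cookie name; MediaWiki defines no such cookie today.
SESSION_JWT_COOKIE = "sessionJwt"


def find_session_token(headers: Mapping[str, str]) -> Optional[Tuple[str, str]]:
    """Return (source, token) for the highest-priority token, or None.

    Assumed priority: Authorization bearer token first, then the cookie.
    """
    scheme, _, value = headers.get("Authorization", "").partition(" ")
    if scheme.lower() == "bearer" and value.strip():
        return ("authorization", value.strip())

    cookies = SimpleCookie()
    try:
        cookies.load(headers.get("Cookie", ""))
    except CookieError:
        return None  # unparseable Cookie header: treat as no token
    morsel = cookies.get(SESSION_JWT_COOKIE)
    if morsel is not None and morsel.value:
        return ("cookie", morsel.value)
    return None
```

Returning the source alongside the token lets later stages log or act on which mechanism was used, which helps when checking that the edge and MediaWiki pick the same token.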
Turn some session tokens into JWTs, deprecate the rest
Like the previous option, but only for a small number of recommended authentication mechanisms; probably OAuth 2 and one session cookie. (This could be one of the existing cookies, e.g. the CentralAuth user token cookie, but it's probably better to just define a new cookie and to make sure every cookie-based session provider outputs and verifies it. Eventually that cookie might replace other session mechanisms: T354910: Create simple MediaWiki session handler for remote login.) Everything else gets treated as unauthenticated, in terms of rate limiting.
Paired with a migration effort of important tools from non-recommended authentication mechanisms (OAuth 1 and bot passwords).
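In terms of rate limiting, the resulting rule at the edge could be as simple as the sketch below. The mechanism labels, the tier names, and the `highratelimit` claim are hypothetical; the only point taken from the proposal is that anything outside the recommended mechanisms is treated as unauthenticated.

```python
from typing import Any, Mapping, Optional

# Hypothetical labels for the recommended mechanisms (OAuth 2 and one session cookie).
RECOMMENDED_MECHANISMS = frozenset({"oauth2", "session-cookie"})


def throttle_tier(
    mechanism: Optional[str], claims: Optional[Mapping[str, Any]]
) -> str:
    """Map a request's verified token (if any) to a rate-limiting tier."""
    # OAuth 1, bot passwords, or no verifiable token at all: lowest tier.
    if mechanism not in RECOMMENDED_MECHANISMS or claims is None:
        return "unauthenticated"
    # Assumed boolean claim mirroring the highratelimit user right.
    if claims.get("highratelimit") is True:
        return "trusted"
    return "authenticated"
```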
Strengths:
- Preserves some of the benefits of the previous proposal: decentralized, performant.
- Small amount of work - OAuth 2 already uses JWTs, user token cookies are arbitrary and we might want to change them anyway for security reasons (see e.g. T209586: Make the stored session id be a hash of the used session id to isolate them or the introduction of an HMAC step in User::getToken()).
- The migration effort would be somewhat valuable on its own. OAuth 1 is a very outdated protocol, few other sites use it. It doesn't have an equivalent of OIDC so we are maintaining a homegrown OIDC-on-OAuth-1 protocol, which is not ideal.
Weaknesses:
- Still coupled with MediaWiki session logic.
- More confusing rate-limiting behavior from a user point of view.
- Does not really help with decoupling session creation and session validation, which would be the main motivation for the previous option.
- The migration effort might be significant - most of the community tooling uses OAuth 1, and bot passwords were created specifically to avoid having to change the code of old unmaintained bots.