
Passing an invalid centralauthtoken causes delayed MW response
Closed, ResolvedPublicSecurity

Description

Steps to replicate the issue:

  • Make an API request with an invalid centralauthtoken parameter.

What happens?:

  • The API sleeps for $wgCentralAuthTokenSessionTimeout seconds (default 3) before returning an error.

What should have happened instead?:
The API should have immediately served an error response.

Details

Risk Rating
High
Author Affiliation
WMF Technology

Event Timeline

Restricted Application added a subscriber: Aklapper.

The assumption is that it's more likely that a request with a valid centralauthtoken arrives before the token store is updated cross-DC with the token, than a request with a completely invalid centralauthtoken.


Even then, we should at least try to make sure the token is valid before waiting. So at least check its size and maybe even sign it server-side so we can verify at least it's not someone throwing random strings at us.

Right now, if I pass almost any CA token value for that parameter, I get one API PHP worker blocked for 3 seconds.
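For context, the delay comes from a WaitConditionLoop-style retry: the token lookup is repeated until the token appears or a timeout elapses, so an invalid token blocks the worker for the full timeout. A minimal sketch of that pattern in Python (the function names and intervals here are illustrative, not MediaWiki's actual implementation):

```python
import time

def wait_for_token(fetch, timeout=3.0, interval=0.1):
    """Poll fetch() until it returns a value or the timeout elapses.

    An invalid token never appears in the store, so the caller blocks
    for the full timeout before getting None back -- the delay this
    task is about.
    """
    deadline = time.monotonic() + timeout
    while True:
        value = fetch()
        if value is not None:
            return value
        if time.monotonic() >= deadline:
            return None  # token never showed up: serve an error response
        time.sleep(interval)

# A token that will never exist blocks for the whole (shortened) timeout:
start = time.monotonic()
result = wait_for_token(lambda: None, timeout=0.3)
elapsed = time.monotonic() - start
print(result)  # None
```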

akosiaris set Security to Software security bug.Mar 18 2025, 9:28 AM
akosiaris added projects: Security, Security-Team.
akosiaris changed the visibility from "Public (No Login Required)" to "Custom Policy".
akosiaris changed the subtype of this task from "Bug Report" to "Security Issue".
akosiaris subscribed.

Switching this to a security task, because it describes a potential DoS vector.

For what it's worth, all I need to consume all the PHP workers allocated to the external API is to launch at most 3600 instances of the following:

$ time curl -q 'https://en.wikipedia.org/w/api.php?centralauthtoken=foobar'
[snip]
real	0m3.190s

which is probably laughable even on a modest computer today, never mind for someone dedicated. A dedicated person/group/threat will also try to thwart all kinds of limits and safeguards by issuing many permutations of all the parameters passed to the API, and many times the number of concurrent requests mentioned above. They could sustain this attack for an extended period of time, across thousands of IPs, rendering the API almost, if not entirely, inoperable for external users.
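To make the arithmetic above concrete (treating the 3600 concurrent requests as a rough stand-in for the worker pool size, which is an assumption, not a published figure):

```python
# Hypothetical pool of 3600 PHP workers, each blocked ~3s per bad request.
workers = 3600
block_seconds = 3.0

# Sustained request rate needed to keep every worker permanently busy:
rps_needed = workers / block_seconds
print(rps_needed)  # 1200.0 requests per second
```

In other words, a sustained low four-digit request rate would suffice, which is well within reach of even a small botnet.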

Please, let's revisit this assumption.

The assumption is that it's more likely that a request with a valid centralauthtoken arrives before the token store is updated cross-DC with the token, than a request with a completely invalid centralauthtoken.

Is this assumption rooted in reality? The latency between DCs is 36ms. How many false positives would we have if we removed this wait completely? Would that be under the "error budget" or above it? We really should make data-driven decisions instead of picking random values like 3s (why 3? why not 5?).

On top of all of this, what Joe said is also important. This should at least check that the token is cryptographically "correct" before waiting.

Right now there's very roughly one timeout per second (logstash). These are not all API tokens; the mechanism is reused for some other token things like central login tokens. (API tokens are the majority, though.)

There are about 100 hits per second (logstash). I don't think there's any way to tell right now whether those hits involve any waiting for replication.

So at least check its size and maybe even sign it server-side so we can verify at least it's not someone throwing random strings at us.

A size check is trivial and would help with random crap (in a very unscientific sampling of just paging through the last 500 logstash results, 2 of them were random crap). I'm not sure how much signing would help - if someone goes through the effort of generating a fake but correctly formatted token, surely they can just fetch an actual token and use that? We would have to do some sort of time-based checksum that also ensures that the token was generated recently.

Also, you can just generate an unbounded number of valid tokens and use each one twice (they get deleted on first use and then result in a miss the second time - I suspect this is where the overwhelming majority of current misses are coming from), so we'd have to do something about that too for it to have much worth as a DoS mitigation.
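A cheap syntactic pre-check along these lines would reject obviously malformed tokens before any waiting happens. A hypothetical sketch in Python (the 32-hex-character format is an assumption for illustration; the real token shape is defined by CentralAuth):

```python
import re

# Hypothetical token shape: 32 lowercase hex characters.
TOKEN_RE = re.compile(r'^[0-9a-f]{32}$')

def looks_like_token(value: str) -> bool:
    """Cheap format check, done before any store lookup or wait.

    This only filters random garbage ("foobar" etc.); a correctly
    formatted but fake token still passes, which is why the thread
    goes on to discuss signatures.
    """
    return bool(TOKEN_RE.fullmatch(value))

print(looks_like_token('foobar'))                            # False
print(looks_like_token('0123456789abcdef0123456789abcdef'))  # True
```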

Is this assumption rooted in reality?

No idea, it was added in rECAU15be87509bbb: Swap uses of READ_LATEST with new getKeyValueUponExistence() method specifically as a multi-DC preparation, though.

The latency between dcs is 36ms. How many false positives we are going to have in case if we remove this wait completely? Will it be under "error budget" or above it? We really should make data-driven decisions instead of random values like 3s (why 3? why not 5?)

Hm, I wonder if the assumption that we need to wait for replication is valid at all. I think this wasn't the case when the patch above was written, but now CentralAuth tokens use the MicroStash, ie. mcrouter-primary-dc, which I think means all requests for the same token go to the same memcached server (in the primary DC)?

Hm, I wonder if the assumption that we need to wait for replication is valid at all.

I guess technically some race is involved because Memcached writes are non-blocking. I think the slowest scenario is:

  • User makes GET request for centralauthtoken, gets routed to secondary DC, server generates a random token, returns it to the user, sends write to primary-DC memcached.
  • User makes POST request with the centralauthtoken, gets routed to primary, sends read to the same server (without cross-DC communication).

It does seem unlikely that twice the latency between the user and the DC would be smaller than the latency between the two DCs. There are some situations where these tokens are used internally, server-to-server, though (I think Echo and now GlobalContributions do that?).

So at least check its size and maybe even sign it server-side so we can verify at least it's not someone throwing random strings at us.

A size check is trivial and would help with random crap (in a very unscientific sampling of just paging through the last 500 logstash results, 2 of them were random crap). I'm not sure how much signing would help - if someone goes through the effort of generating a fake but correctly formatted token, surely they can just fetch an actual token and use that? We would have to do some sort of time-based checksum that also ensures that the token was generated recently.

Also, you can just generate an unbounded number of valid tokens and use each one twice (they get deleted on first use and then result in a miss the second time - I suspect this is where the overwhelming majority of current misses are coming from), so we'd have to do something about that too for it to have much worth as a DoS mitigation.

Looking at logstash doesn't matter much when we're talking about an attack scenario.

My take on defenses is that it's ok if they're not perfect but just stop the lowest hanging fruit. I also think we should definitely decrease this timeout to less than one second in production, assuming it's possible to use fractions of a second: even half a second is much longer than any reasonable write latency we see in memcached, even cross-DC.

I created https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1128913, where I didn't reference the task, because a security task would probably give away all the information someone might need to find this issue.

The timeout is a float. There are probably a few places where it just uses the default of 3 rather than $wgCentralAuthTokenSessionTimeout, though (that setting is specifically for API tokens).

My take on defenses is that it's ok if they're not perfect but just stop the lowest hanging fruit.

So what's the lowest hanging fruit here? A regex? Some sort of signature scheme?

(Also, do you want us to use proper security patches, or Gerrit?)

(Also, do you want us to use proper security patches, or Gerrit?)

For benign-looking config stuff, like the aforementioned public mediawiki-config patch, Gerrit should be fine. For MW code changes, etc., it's best practice to post and review those on a private Phab task. They can then either be deployed as security patches or pushed to Gerrit and deployed quickly thereafter, during a backport window or manual deploy.

sbassett changed the task status from Open to In Progress.Mar 18 2025, 8:47 PM
sbassett triaged this task as High priority.
sbassett changed Author Affiliation from N/A to WMF Technology.
sbassett changed Risk Rating from N/A to High.
sbassett moved this task from Incoming to Watching on the Security-Team board.

My take on defenses is that it's ok if they're not perfect but just stop the lowest hanging fruit.

So what's the lowest hanging fruit here? A regex? Some sort of signature scheme?

(Also, do you want us to use proper security patches, or Gerrit?)

I was thinking of some form of symmetric cryptographic signature, yes. That should be relatively easy to implement using openssl_encrypt / openssl_decrypt, but maybe there are some good libraries for doing that - I'm not up to date with the PHP landscape on this stuff.

So basically my proposal is:

  • Sign a token with a secret
  • When receiving the signed token back, decrypt it with the same secret
  • If decryption fails, fail immediately; otherwise proceed as before
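The steps above could also be sketched with an HMAC tag rather than encryption, as the later comments suggest. A hedged illustration in Python (not the eventual MediaWiki implementation; the key handling and `token.tag` format are assumptions for this sketch):

```python
import hashlib
import hmac
import secrets
from typing import Optional

# Hypothetical server-side shared secret (in practice something like
# $wgSecretKey would be used).
SECRET = b'server-side shared secret'

def issue_token() -> str:
    """Generate a random token and append an HMAC tag over it."""
    token = secrets.token_hex(16)
    tag = hmac.new(SECRET, token.encode(), hashlib.sha256).hexdigest()
    return f'{token}.{tag}'

def verify_token(signed: str) -> Optional[str]:
    """Return the inner token if the tag checks out, else None.

    Verification is a single local HMAC computation, so a forged
    token can be rejected immediately -- no store lookup, no wait.
    """
    token, sep, tag = signed.partition('.')
    if not sep:
        return None
    expected = hmac.new(SECRET, token.encode(), hashlib.sha256).hexdigest()
    return token if hmac.compare_digest(tag, expected) else None

good = issue_token()
print(verify_token(good) is not None)   # True
print(verify_token('foobar.deadbeef'))  # None
```

Note this only proves the server minted the token; as pointed out above, it does nothing against replaying real tokens, so it is a low-hanging-fruit filter rather than a complete mitigation.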

Longer term, I would love to see some form of signature scheme applied consistently across all tokens emitted by centralauth, including session tokens for web users, but for that I'd have stricter requirements, including being able to verify cheaply the signature at the CDN layer, so that's a much larger scope with much stricter constraints. For now any solution would be ok, given the narrow application of this.

In terms of the patches, I took the approach to add them to gerrit without linking them to this task, as this isn't properly a security issue that would affect other installations, but YMMV.

I was thinking of some form of symmetric cryptographic signature, yes. That should be relatively easy to implement using openssl_encrypt / openssl_decrypt, but maybe there are some good libraries for doing that - I'm not up to date with the PHP landscape on this stuff.

There's MWCryptHash::hmac() (not symmetric cryptography, rather a Whirlpool hash based on a shared secret, but seems appropriate for signing and already used in some places).
PHP supports lots of things via OpenSSL or libsodium, but I'd rather reuse an existing MediaWiki primitive.

Longer term, I would love to see some form of signature scheme applied consistently across all tokens emitted by centralauth, including session tokens for web users, but for that I'd have stricter requirements, including being able to verify cheaply the signature at the CDN layer, so that's a much larger scope with much stricter constraints.

I think it would make a lot of sense to make session cookies, CentralAuth tokens and OAuth 2 tokens more compatible (ie. make session cookies and CentralAuth tokens contain a JWT that's self-proving aside from invalidation and describes some basic facts like allowed rates). I guess that fits into next year's objective?

I was thinking of some form of symmetric cryptographic signature, yes. That should be relatively easy to implement using openssl_encrypt / openssl_decrypt, but maybe there are some good libraries for doing that - I'm not up to date with the PHP landscape on this stuff.

There's MWCryptHash::hmac() (not symmetric cryptography, rather a Whirlpool hash based on a shared secret, but seems appropriate for signing and already used in some places).
PHP supports lots of things via OpenSSL or libsodium, but I'd rather reuse an existing MediaWiki primitive.

Ofc, it makes sense.

Longer term, I would love to see some form of signature scheme applied consistently across all tokens emitted by centralauth, including session tokens for web users, but for that I'd have stricter requirements, including being able to verify cheaply the signature at the CDN layer, so that's a much larger scope with much stricter constraints.

I think it would make a lot of sense to make session cookies, CentralAuth tokens and OAuth 2 tokens more compatible (ie. make session cookies and CentralAuth tokens contain a JWT that's self-proving aside from invalidation and describes some basic facts like allowed rates). I guess that fits into next year's objective?

Yes, it does indeed :)

The assumption is that it's more likely that a request with a valid centralauthtoken arrives before the token store is updated cross-DC with the token, than a request with a completely invalid centralauthtoken.

I believe this is a leftover from the original 2015 draft for Multi-DC. Among the various changes since, two factors in particular turned out differently, and indeed the draft's assumptions were never implemented in WMF production:

  1. Requests with a centralauthtoken parameter are routed to the primary DC.
  2. Nonce tokens and short-lived ChronologyProtector positions are stored only in the primary DC, without replication.

Originally, we thought these requests might route to either DC, but we didn't launch that way. Instead, we route these requests to the primary DC (either based on the centralauthtoken parameter, or in the case of ChronologyProtector, by placing a short-lived UseDC=master cookie for a few seconds after saving an edit). (T91820, wikitech: Multi-DC, puppet: multi-dc.lua)

Originally, it was envisioned that these would be stored in Redis and replicated across data centers. Combined with the previous point, this meant we needed two layers of indirection: the next request would first need to retry for a while, during which we wouldn't know whether the key was gone/invalid or simply hadn't appeared yet (yeah... that's quite the risk; there were HMAC ideas in place to differentiate easily between the two, but we ended up scrapping that instead).

For storage, this data now resides in the primary-DC mcrouter (aka "MicroStash" in MediaWiki), which means that even if such a request does end up in the secondary DC for some reason, it will fetch from the right place and find the token immediately.

After the Multi-DC proposal changed, Aaron and I removed similar WaitConditionLoop code from core in 2018 with change 432541, and simplified and slimmed it down further in T254634: Determine and implement multi-dc strategy for ChronologyProtector and T275713: Misc ChronologyProtector follow-up (Feb 2021).

There should be no reason to wait for memcached data to appear or replicate across DCs.

I brought this up last year as something we can remove, in https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CentralAuth/+/1055622/5..6/includes/CentralAuthTokenManager.php#b131 (which adopted MicroStash in CentralAuth as part of T363699).

It also came up in T379505 a few months ago, but I didn't make the connection.

It sounds like we should just get rid of the entire WaitConditionLoop then. @sbassett is it OK to do that via Gerrit, as long as the performance implications aren't too obvious from the summary? It affects a lot of essential code pathways (cross-wiki API requests, SUL3 login, autologin) so a security patch would be more unwieldy.
(Or maybe we can deploy a PrivateSettings config override first which sets the wait time to 0.001 or something; that's essentially the same as disabling it.)

It sounds like we should just get rid of the entire WaitConditionLoop then. @sbassett is it OK to do that via Gerrit, as long as the performance implications aren't too obvious from the summary? It affects a lot of essential code pathways (cross-wiki API requests, SUL3 login, autologin) so a security patch would be more unwieldy.

Yeah, if CI is essential to help test and ensure safety for a proposed patch, then we really don't have a great way around just pushing it to Gerrit, AFAIK. Like you say, if we can be careful with the code/commit msg so as not to make it extremely obvious what's being fixed, along with a quick, well-timed backport deployment, that's probably the best we can do. I'd hate to security-deploy this with a bunch of temporary ad-hoc config and/or have it blow up in production because we missed some unintended code breakage, etc.

as long as the performance implications aren't too obvious

I think you could even omit the performance concern from the commit message entirely, because it is ultimately just removing unused code. The code was added originally to support the 2015 idea of "Redis as session store with bi-directional replication in a Multi-DC MediaWiki deployment", which never happened. Similar to https://gerrit.wikimedia.org/r/432541.

Or maybe we can deploy a PrivateSettings config override first

@Joe just uploaded this: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1128913

Thanks! Do you need help getting the patch reviewed, or is someone from the team on it?

I just reviewed it, I'll schedule backports for later today.

All done: https://gerrit.wikimedia.org/r/q/Ia1a31b1c4516115f116ed4d9852f3bc940d71216

@sbassett (or anyone else with the powers), could you please make this task public again?

sbassett changed the visibility from "Custom Policy" to "Public (No Login Required)".Apr 3 2025, 3:29 PM