
Consistent HTTP 503 Error on some urls for some logged-in users (CentralAuth Set-Cookie storm)
Closed, ResolvedPublic

Description

Error

  • Request: Any URL, logged-in.
Request from *** via cp1089, Varnish XID 796181295
Error: 503, Backend fetch failed at Fri, 28 Jun 2019 15:48:22 GMT
response headers
HTTP/2 503 
date: Fri, 28 Jun 2019 15:48:22 GMT
content-type: text/html; charset=utf-8
content-length: 1818
server: Varnish
x-varnish: 796181294, 505502054, 673725678
via: 1.1 varnish (Varnish/5.1), 1.1 varnish (Varnish/5.1), 1.1 varnish (Varnish/5.1)
vary: Accept-Encoding
age: 0
x-cache: cp1089 int, cp3033 miss, cp3033 miss
x-cache-status: int-remote
server-timing: cache;desc="int-remote"
…


Impact

At a seemingly random point in time, a logged-in user's browser can get into a state where it is unable to load certain pages on a wiki. Those pages consistently lead to the MediaWiki server responding in a way that causes Varnish to serve the generic 503 error page instead.

These pages are not rare or uncommon. They are popular pages and articles with no apparent special features. They load fine in most cases, but once the logged-in user's browser reaches a certain state, they fail consistently.

Notes

Yesterday I was able to view https://commons.wikimedia.org/wiki/Commons:Video without issue, but today it consistently led to the above error. I can reproduce this from cURL, and have reduced the request headers and cookies to the bare minimum below. I've also copied the page to the Commons sandbox and was able to reproduce it there as well. Something on certain wiki pages triggers it, and I have narrowed it down to pages with a video.

curl -i 'https://commons.wikimedia.org/w/index.php?title=Commons:Sandbox&oldid=356345750' -H 'authority: commons.wikimedia.org'  -H 'cookie: centralauth_User=Krinkle; centralauth_Token=****;'

I have replaced the centralauth_Token with wildcards. Note that it only fails when the token is a valid one, so you'll have to replace the username and token with your own to reproduce the issue.

@BBlack was able to capture the full response before Varnish rejected it, and found the culprit: MediaWiki emitted more than Varnish's limit of 64 HTTP response headers.

The majority of them (about 50) came from CentralAuth, repeatedly setting the same cookies over and over, like so:

Set-Cookie: commonswikiSession=XXX; path=/; secure; httponly
Set-Cookie: centralauth_Session=YYY; path=/; domain=commons.wikimedia.org; secure; httponly
Set-Cookie: commonswikiSession=ZZZ; path=/; secure; httponly
Set-Cookie: centralauth_Session=AAA; path=/; domain=commons.wikimedia.org; secure; httponly
Set-Cookie: commonswikiSession=BBB; path=/; secure; httponly
Set-Cookie: centralauth_Session=CCC; path=/; domain=commons.wikimedia.org; secure; httponly
Set-Cookie: commonswikiSession=DDD; path=/; secure; httponly
Set-Cookie: centralauth_Session=EEE; path=/; domain=commons.wikimedia.org; secure; httponly
Set-Cookie: commonswikiSession=FFF; path=/; secure; httponly
Set-Cookie: centralauth_Session=GGG; path=/; domain=commons.wikimedia.org; secure; httponly
Set-Cookie: commonswikiSession=HHH; path=/; secure; httponly
Set-Cookie: centralauth_Session=JJJ; path=/; domain=commons.wikimedia.org; secure; httponly

Each session ID was different from the previous one.

Even when using the above cURL test case with a page that doesn't produce an error page, the response still contains a large number of these headers (just below the 64 limit, so it doesn't trigger the more visible failure).
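For reference, one way to check the header count programmatically (a sketch in PHP; get_headers() returns each raw header line, the cookie values are placeholders, and as noted above a valid username and token are needed to actually reproduce the problem):

<?php
// Count the Set-Cookie headers on the reproduction URL.
$url = 'https://commons.wikimedia.org/w/index.php?title=Commons:Sandbox&oldid=356345750';
$ctx = stream_context_create( [ 'http' => [
    'header' => "Cookie: centralauth_User=Example; centralauth_Token=****\r\n",
] ] );
$headers = get_headers( $url, false, $ctx );
$setCookies = preg_grep( '/^Set-Cookie:/i', $headers );
echo count( $setCookies ), " Set-Cookie headers\n";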

Event Timeline

Restricted Application added a subscriber: Aklapper.

NB: the default limit in Varnish actually allows for slightly fewer than 64 headers; the first line of an HTTP response is parsed into several pseudo-headers internally, so AIUI there's space for ~59 headers.

Change 519661 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/puppet@production] varnish: temporarily allow more response headers

https://gerrit.wikimedia.org/r/519661

Change 519661 merged by BBlack:
[operations/puppet@production] varnish: temporarily allow more response headers

https://gerrit.wikimedia.org/r/519661

Mentioned in SAL (#wikimedia-operations) [2019-06-28T16:35:10Z] <bblack> Raising varnish max_http_hdr (max allowed applayer response header count) from 64->128 in systemd config and live tuning - https://gerrit.wikimedia.org/r/519661 - T226840

Change 519670 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/puppet@production] Increase nginx limits on http resp hdr block size

https://gerrit.wikimedia.org/r/519670

Change 519670 merged by BBlack:
[operations/puppet@production] Increase nginx limits on http resp hdr block size

https://gerrit.wikimedia.org/r/519670

Change 519675 had a related patch set uploaded (by CDanis; owner: CDanis):
[operations/puppet@production] Increase nginx limits on http resp hdr block size

https://gerrit.wikimedia.org/r/519675

Change 519675 merged by CDanis:
[operations/puppet@production] Increase nginx limits on http resp hdr block size

https://gerrit.wikimedia.org/r/519675

jbond triaged this task as Medium priority.Jul 1 2019, 10:32 AM
BBlack raised the priority of this task from Medium to High.Jul 1 2019, 1:42 PM

Re-setting this to at least High for now, given the criticality of the component involved and the production impacts.

We've pushed patches to the Varnish and Nginx configurations for the edge layer which mitigate this issue by allowing a larger number of outbound response headers.

Both patches are also somewhat risky, in that they allocate more memory per connection/request out at the edge layers, where efficient volumetric handling of traffic is key. Leaving those tunables at unnecessarily high values for the long term may also allow other related cases to crop up unnoticed, eventually making it hard to ever revert the limits, so we'd like to get them reverted as soon as a fix for the underlying CentralAuth problem is deployed successfully.

@Anomie / @Legoktm - Can you take a look at this? We're out of our depth over here trying to figure out this bug. TL;DR is that some logged-in sessions are getting excess (in the example above, ~50) Set-Cookie headers for auth sessions, with many repeats for the same wiki with different session id numbers, to the point where it's causing us real problems.

The cookie is output in CentralAuthSessionProvider::persistSession, which is called from SessionBackend::save, which is called from Session::save, Session::persist and various similar methods. So at least three irregular things must be happening:

  1. some session method is called repeatedly
  2. the session does get saved, despite SessionBackend::save normally being a no-op unless the session is dirty or over 50% of its expiry time (so maybe something is changing a session variable repeatedly; see the sketch below)
  3. the session ID gets renewed, even though an existing session ID should be reused (which would mean no Set-Cookie header as setting a cookie that is already set is a noop)

I'd imagine #3 has to do with low-level details of the session storage layer (maybe the read arrives before the write has been fully processed).

Annoyingly, adding an X-Wikimedia-Debug header to the request results in a 502.
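For illustration, here is a very rough sketch of the save guard described in point 2, with assumed field names and thresholds (this is not MediaWiki's actual SessionBackend code):

<?php
// Sketch only: nothing is written, and no Set-Cookie header queued,
// unless the session data changed or the session has used up more
// than half of its lifetime.
class SessionBackendSketch {
    private $dirty = false;    // set whenever session data changes
    private $persistedAt = 0;  // unix time of the last save
    private $lifetime = 3600;  // assumed session TTL in seconds

    public function save() {
        $age = time() - $this->persistedAt;
        if ( !$this->dirty && $age < $this->lifetime / 2 ) {
            return; // the normal case: saving is a no-op
        }
        $this->persistToStore();    // write the session data
        $this->emitSessionCookie(); // queues a Set-Cookie header
        $this->dirty = false;
        $this->persistedAt = time();
    }

    private function persistToStore() { /* storage write elided */ }
    private function emitSessionCookie() { /* cookie output elided */ }
}

If some code path repeatedly dirties the session, or keeps recreating the backend so that $persistedAt never carries over, every save() ends up emitting cookies, which would match the observed storm.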

Change 520395 had a related patch set uploaded (by Reedy; owner: Reedy):
[mediawiki/extensions/TimedMediaHandler@master] Call RequestContext::getMain() rather than new RequestContext()

https://gerrit.wikimedia.org/r/520395

Change 520414 had a related patch set uploaded (by Reedy; owner: Reedy):
[mediawiki/extensions/TimedMediaHandler@wmf/1.34.0-wmf.11] Call RequestContext::getMain() rather than new RequestContext()

https://gerrit.wikimedia.org/r/520414

Change 520414 merged by jenkins-bot:
[mediawiki/extensions/TimedMediaHandler@wmf/1.34.0-wmf.11] Call RequestContext::getMain() rather than new RequestContext()

https://gerrit.wikimedia.org/r/520414

Mentioned in SAL (#wikimedia-operations) [2019-07-03T11:55:49Z] <reedy@deploy1001> Synchronized php-1.34.0-wmf.11/extensions/TimedMediaHandler/: T226840 (duration: 00m 50s)

Change 520395 merged by jenkins-bot:
[mediawiki/extensions/TimedMediaHandler@master] Call RequestContext::getMain() rather than new RequestContext()

https://gerrit.wikimedia.org/r/520395

@Anomie tracked this down: a TimedMediaHandler utility method that's called several times in the same request uses new RequestContext() instead of RequestContext::getMain(). RequestContext caches the WebRequest object, which caches a reference to the Session object, so every new RequestContext() call created a fresh Session object. Normally that makes no real difference, except when the request contains "partial" session data, that is, a token cookie but no session cookie (this is normal after a browser restart: token cookies expire in a year, session cookies expire when the browser is closed). In that case, a new session is automatically created and the session cookie output whenever the Session is fetched, so every new RequestContext() call resulted in a new set of Set-Cookie headers.

I still don't get how this is causing T226754: Video in MediaWiki:Loginprompt causes login session problems. Presumably it somehow triggers a session ID change, and the CSRF token (or some other information stored in the session that's needed for the login to succeed) gets lost. But at the start of login there is neither a session nor a token cookie, and at the end there's both, and neither of those situations should trigger the output of a new Set-Cookie header. At the start of the login, the RequestContext object used by TimedMediaHandler would have its own session with a different ID (since the session ID wasn't in the incoming request), but it's not used in any way where that would matter, as far as I can see...
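To illustrate the pattern @Anomie found, a hypothetical sketch with simplified names (this is not the actual TimedMediaHandler code; getLocalizedDuration is an invented stand-in for the utility method in question):

<?php
// BUG: each call builds a fresh RequestContext, hence a fresh
// WebRequest and a fresh Session. With a token cookie but no session
// cookie in the request, fetching that Session creates a brand-new
// session and queues another pair of Set-Cookie headers.
function getLocalizedDuration( $seconds ) {
    $context = new RequestContext();
    return $context->getLanguage()->formatTimePeriod( $seconds );
}

// The fix (the approach of the Gerrit changes above): reuse the main
// context, whose WebRequest and Session are shared across the whole
// web request, so no extra sessions are created.
function getLocalizedDurationFixed( $seconds ) {
    $context = RequestContext::getMain();
    return $context->getLanguage()->formatTimePeriod( $seconds );
}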

ema renamed this task from Consistent HTTP 503 Varnish Error on some urls for some logged-in users (CentralAuth Set-Cookie storm) to Consistent HTTP 503 Error on some urls for some logged-in users (CentralAuth Set-Cookie storm).Jul 3 2019, 1:06 PM

I wonder if other uses of new RequestContext() (most of which are intentional) can trigger this bug.

Thanks for chasing this down! After fixing up any further sources of the extra sessions: do we have to do something about clearing out the excess sessions from storage (redis?), or is this mostly an ephemeral sort of problem?

Sessions are created all the time, and they are short-lived (one hour I think?). I'm pretty sure creating a dozen sessions in some fairly rare corner cases is not something we need to be concerned about.

Jdforrester-WMF subscribed.

I wonder if other uses of new RequestContext() (most of which are intentional) can trigger this bug.

Looks to be exclusively in tests?

I wonder if other uses of new RequestContext() (most of which are intentional) can trigger this bug.

Looks to be exclusively in tests?

Seems like most are in unit tests, just looking at the paths, but several extensions appear to use new RequestContext() outside of unit tests, too.

At a glance:

  • The edit API uses it because the way it works is to construct a fake request to the web edit form based on the API parameters. It's all rather horrible; the MCR rewrite was supposed to replace it with something non-hacky but that project seems to have stalled :(
  • The paraminfo API uses it to render messages in a custom language (I think). It could probably use DerivativeContext instead. But it overrides the user so it won't trigger this bug anyway.
  • The parser uses it to simulate query parameters for transcluded special pages. That probably can't be replaced with DerivativeContext, but it also overrides the user and request so should not be able to trigger session creation.
  • Echo and FileImporter use it for internal API requests. That's wrong; DerivativeContext should be used instead (see the sketch after this list).
  • Flow, VisualEditor and LiquidThreads seem to use it to get an OutputPage separate from the main output. Probably should be using new OutputPage( RequestContext::getMain() )? Although the OutputPage constructor does say not to do that.

Did not look at non-WMF examples and maintenance scripts (where session handling matters little).
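For the internal API request case, a sketch of the suggested DerivativeContext pattern (hypothetical API parameters; the point is that the main context's WebRequest and Session are reused rather than recreated):

<?php
// Derive from the main context rather than calling
// `new RequestContext()`, so the original request's Session is
// shared and no stray Set-Cookie headers get queued.
$context = new DerivativeContext( RequestContext::getMain() );
$context->setRequest( new DerivativeRequest(
    RequestContext::getMain()->getRequest(),
    [ 'action' => 'query', 'meta' => 'siteinfo' ] // example parameters
) );
$api = new ApiMain( $context );
$api->execute();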

Maybe an update to the class documentation would make it easier to distinguish between proper and improper uses?

Is this fixed now?

The specific issue of TMH triggering CentralAuth misbehavior is fixed. The more generic issue of CentralAuth misbehavior being easily triggerable via wrong use of Request or RequestContext is not.

Is this fixed now?

The specific issue of TMH triggering CentralAuth misbehavior is fixed. The more generic issue of CentralAuth misbehavior being easily triggerable via wrong use of Request or RequestContext is not.

Borrowing from http://sweng.the-davies.net/Home/rustys-api-design-manifesto, it seems like this is somewhere on the negative end of interface design. Is there some general way we can protect against these misuses? Is there ever a legitimate reason to emit multiple session cookies? Could we fix it up in the code that emits the cookies?

(Also, is the specific TMH fix actually deployed to all groups yet?)

Change 530885 had a related patch set uploaded (by Paladox; owner: Reedy):
[mediawiki/extensions/TimedMediaHandler@REL1_33] Call RequestContext::getMain() rather than new RequestContext()

https://gerrit.wikimedia.org/r/530885

Is there ever a legitimate reason to emit multiple session cookies?

When a session ID is changed (which is done at various times for security), we need to send a new cookie with the new ID even if we already sent one with the old ID.

In general, the SessionManager code just emits the cookie to WebResponse whenever it thinks it should. This simplifies the logic, since it doesn't have to keep track of externalities like whether the cookies are being sent to a real WebResponse or to a FauxResponse, or whether the SessionBackend was destroyed and re-created. WebResponse handles the deduplication in case SessionManager winds up trying to set the same cookie with the same value multiple times, since that's where the actual calls to setcookie() are.
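A minimal sketch of that deduplication, with assumed names (not MediaWiki's actual WebResponse code):

<?php
// Only call setcookie() when the value actually differs from what
// was already sent for that cookie name in this response.
class CookieDedupResponse {
    private $sent = [];

    public function setCookie( $name, $value, $options = [] ) {
        if ( isset( $this->sent[$name] ) && $this->sent[$name] === $value ) {
            return; // same cookie, same value: setting it again is a no-op
        }
        $this->sent[$name] = $value;
        setcookie( $name, $value, $options );
    }
}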

But none of that really has much to do with this task anyway. See below.

Could we fix it up in the code that emits the cookies?

The problem here was that we wound up with multiple cookie-emitters all emitting different values for the cookie, which in turn was because multiple "requests" were being created inside one request. The fix was to clean up the places where those multiple "requests" were created.

We might in theory make RequestContext's constructor private or protected somehow to guard against new code repeating that error, although PHP doesn't make that particularly easy to enforce at the code level.
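One possible shape of such a guard, sketched under the assumption that construction could be funnelled through the existing factory (GuardedRequestContext is an invented name, not the actual class):

<?php
// With a private constructor, a stray `new GuardedRequestContext()`
// fails immediately with an Error instead of silently creating a
// second session.
class GuardedRequestContext {
    private static $main = null;

    private function __construct() {
    }

    public static function getMain(): self {
        if ( self::$main === null ) {
            self::$main = new self();
        }
        return self::$main;
    }
}

// new GuardedRequestContext(); // Error: Call to private
// GuardedRequestContext::__construct()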

(Also, is the specific TMH fix actually deployed to all groups yet?)

It was included in wmf/1.34.0-wmf.13 which was deployed to all groups on 2019-07-11 at 19:29: https://tools.wmflabs.org/sal/log/AWvigyKCOwpQ-3Pk9abE

It sounds like this particular problem is fixed. Was TMH the only main offender? If so, can @BBlack revert the header limit?

Should we resolve this and create a different task for the more general RequestContext issue?

It sounds like this particular problem is fixed. Was TMH the only main offender? If so, can @BBlack revert the header limit?

Yes, if we think TMH was the main offender and it's fixed (sounds like it), we can revert https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/519661/1/hieradata/role/common/cache/text.yaml + https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/519675/4/modules/tlsproxy/templates/localssl.erb .

Should we resolve this

Yes, I think so, once those are reverted

create a different task for the more general RequestContext issue?

Probably, although I don't know how deep that side of it runs.

jcrespo lowered the priority of this task from High to Medium.Dec 24 2019, 9:19 AM
jcrespo subscribed.

Yes, I think so, once those are reverted

I've created reverts at https://gerrit.wikimedia.org/r/c/operations/puppet/+/560515 and https://gerrit.wikimedia.org/r/c/operations/puppet/+/560514

Assigning to Traffic (@BBlack?) to merge/modify/discard and then resolve the ticket.

Reducing priority as it no longer seems active.

Pinging @BBlack to find out whether you'd prefer to make the temporary workaround permanent, or to revert it as per the previous comment so this can be closed.

Change 530885 abandoned by Jforrester:
Call RequestContext::getMain() rather than new RequestContext()

Reason:
REL1_33 is now EOL.

https://gerrit.wikimedia.org/r/c/mediawiki/extensions/TimedMediaHandler/+/530885

The swap of Traffic for Traffic-Icebox in this ticket's set of tags was based on a bulk action for all such tickets that haven't been updated in 6 months or more. This does not imply any human judgement about the validity or importance of the task, and is simply the first step in a larger task cleanup effort. Further manual triage and/or requests for updates will happen this month for all such tickets. For more detail, have a look at the extended explanation on the main page of Traffic-Icebox. Thank you!

brooke claimed this task.
brooke subscribed.

Marking this resolved, as nobody seems to have complained in the last year and a half :D