Page MenuHomePhabricator

Consistent HTTP 503 Error on some urls for some logged-in users (CentralAuth Set-Cookie storm)
Open, HighPublic

Description

Error

  • Request: Any url, logged-in.
Request from *** via cp1089 cp1089, Varnish XID 796181295
Error: 503, Backend fetch failed at Fri, 28 Jun 2019 15:48:22 GMT
response headers
HTTP/2 503 
date: Fri, 28 Jun 2019 15:48:22 GMT
content-type: text/html; charset=utf-8
content-length: 1818
server: Varnish
x-varnish: 796181294, 505502054, 673725678
via: 1.1 varnish (Varnish/5.1), 1.1 varnish (Varnish/5.1), 1.1 varnish (Varnish/5.1)
vary: Accept-Encoding
age: 0
x-cache: cp1089 int, cp3033 miss, cp3033 miss
x-cache-status: int-remote
server-timing: cache;desc="int-remote"
…

Impact

At a seemingly random point in time, our user's browser can get into a state where it is unable to load certain pages on a wiki. These pages consistently lead to the MediaWiki server responding in a way that causes Varnish to serve up the generic 503 Error page instead.

These pages are not rare or uncommon. They are popular pages and articles that don't seem to have any special features. These pages load fine in most cases, but when the logged-in user's browser reached a certain state, they fail consistently.

Notes

Yesterday I was able to view https://commons.wikimedia.org/wiki/Commons:Video without issue. But today it consistently led to the above error. I can reproduce this from cURL, and have reduced the request headers and cookies to the following bare minimum. I've also copied the page to the Commons sandbox and able to reproduce it there as well. It appears there is something on some wiki pages that triggers it, and I have reduced it to pages with a video.

curl -i 'https://commons.wikimedia.org/w/index.php?title=Commons:Sandbox&oldid=356345750' -H 'authority: commons.wikimedia.org'  -H 'cookie: centralauth_User=Krinkle; centralauth_Token=****;'

I have replaced the centralauth_Token with wildcards. Note that it only fails when the token is a valid one, so you'll have to replace the username and token with your own to reproduce the issue.

@BBlack was able to capture the full response before Varnish rejected it, and found the culprit was that MediaWiki emitted more than the limit of 64 HTTP response headers.

The majority of them (about 50) were from CentralAuth, repeatedly setting the same cookie over and over and over again, like so:

Set-Cookie: commonswikiSession=XXX path=/; secure; httponly
Set-Cookie: centralauth_Session=YYY path=/; domain=commons.wikimedia.org; secure; httponly
Set-Cookie: commonswikiSession=ZZZ path=/; secure; httponly
Set-Cookie: centralauth_Session=AAA path=/; domain=commons.wikimedia.org; secure; httponly
Set-Cookie: commonswikiSession=BBB path=/; secure; httponly
Set-Cookie: centralauth_Session=CCC path=/; domain=commons.wikimedia.org; secure; httponly
Set-Cookie: commonswikiSession=DDD path=/; secure; httponly
Set-Cookie: centralauth_Session=EEE path=/; domain=commons.wikimedia.org; secure; httponly
Set-Cookie: commonswikiSession=FFF path=/; secure; httponly
Set-Cookie: centralauth_Session=GGG path=/; domain=commons.wikimedia.org; secure; httponly
Set-Cookie: commonswikiSession=HHH path=/; secure; httponly
Set-Cookie: centralauth_Session=JJJ path=/; domain=commons.wikimedia.org; secure; httponly

Where each session ID was different from the previous one.

Even when using the above cURL test case with a page that doesn't produce an error page, it still has a great number of these headers (but just below the 64 limit, thus not causing the more visible problem).

Event Timeline

Krinkle created this task.Jun 28 2019, 4:01 PM
Restricted Application added a project: Operations. · View Herald TranscriptJun 28 2019, 4:01 PM
Restricted Application added a subscriber: Aklapper. · View Herald Transcript
CDanis added a subscriber: CDanis.Jun 28 2019, 4:04 PM

NB that the default limit in Varnish actually allows for slightly fewer than 64 headers -- the first line of an HTTP response is parsed into several pseudo-headers internally, so AIUI there's space for ~59 headers.

Change 519661 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/puppet@production] varnish: temporarily allow more response headers

https://gerrit.wikimedia.org/r/519661

Change 519661 merged by BBlack:
[operations/puppet@production] varnish: temporarily allow more response headers

https://gerrit.wikimedia.org/r/519661

Mentioned in SAL (#wikimedia-operations) [2019-06-28T16:35:10Z] <bblack> Raising varnish max_http_hdr (max allowed applayer response header count) from 64->128 in systemd config and live tuning - https://gerrit.wikimedia.org/r/519661 - T226840

Change 519670 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/puppet@production] Increase nginx limits on http resp hdr block size

https://gerrit.wikimedia.org/r/519670

Change 519670 merged by BBlack:
[operations/puppet@production] Increase nginx limits on http resp hdr block size

https://gerrit.wikimedia.org/r/519670

Change 519675 had a related patch set uploaded (by CDanis; owner: CDanis):
[operations/puppet@production] Increase nginx limits on http resp hdr block size

https://gerrit.wikimedia.org/r/519675

Change 519675 merged by CDanis:
[operations/puppet@production] Increase nginx limits on http resp hdr block size

https://gerrit.wikimedia.org/r/519675

jbond triaged this task as Normal priority.Jul 1 2019, 10:32 AM
BBlack raised the priority of this task from Normal to High.Jul 1 2019, 1:42 PM

Re-setting this to at least High for now, given the criticality of the component involved and the production impacts.

We've pushed patches to Varnish and Nginx configurations for the edge layer which increase their ability to mitigate this issue by allowing a larger amount of outbound response headers.

Both patches are also somewhat risky in that they allocate more memory per connection/request out at the edge layers where efficient volumetric handling of traffic is key. Leaving those tunables at unnecessarily high values for the long term may also allow other related cases to crop up unnoticed and eventually make it hard to ever revert the limits, so we'd like to get these reverted as soon as a fix for the underlying CA problem is deployed successfully.

jbond added a subscriber: jbond.Jul 2 2019, 5:41 PM

@Anomie / @Legoktm - Can you take a look at this? We're out of our depth over here trying to figure out this bug. TL;DR is that some logged-in sessions are getting excess (in the example above, ~50) Set-Cookie headers for auth sessions, with many repeats for the same wiki with different session id numbers, to the point where it's causing us real problems.

Tgr added a subscriber: Tgr.Jul 2 2019, 7:28 PM

The cookie is output in CentralAuthSessionProvider::persistSession which is called fron SessionBackend::save, which is called from Session::save, Session::persist and various similar things. So at least three irregular things should be happening:

  1. some session method is called repeatedly
  2. the session does get saved, despite SessionBackend::save normally being a noop unless the session is over 50% of its expiry time or is dirty (so maybe something is changing a session variable repeatedly)
  3. the session ID gets renewed, even though an existing session ID should be reused (which would mean no Set-Cookie header as setting a cookie that is already set is a noop)

I'd imagine #3 has to do with low-level details of the session storage layer (maybe the read arrives before the write had been fully processed).

Tgr added a comment.Jul 2 2019, 9:42 PM

Annoyingly, adding an X-Wikimedia-Debug header to the request results in an 502.

jijiki added a subscriber: jijiki.Jul 3 2019, 5:27 AM

Change 520395 had a related patch set uploaded (by Reedy; owner: Reedy):
[mediawiki/extensions/TimedMediaHandler@master] Call RequestContext::getMain() rather than new RequestContext()

https://gerrit.wikimedia.org/r/520395

Change 520414 had a related patch set uploaded (by Reedy; owner: Reedy):
[mediawiki/extensions/TimedMediaHandler@wmf/1.34.0-wmf.11] Call RequestContext::getMain() rather than new RequestContext()

https://gerrit.wikimedia.org/r/520414

Change 520414 merged by jenkins-bot:
[mediawiki/extensions/TimedMediaHandler@wmf/1.34.0-wmf.11] Call RequestContext::getMain() rather than new RequestContext()

https://gerrit.wikimedia.org/r/520414

Mentioned in SAL (#wikimedia-operations) [2019-07-03T11:55:49Z] <reedy@deploy1001> Synchronized php-1.34.0-wmf.11/extensions/TimedMediaHandler/: T226840 (duration: 00m 50s)

Change 520395 merged by jenkins-bot:
[mediawiki/extensions/TimedMediaHandler@master] Call RequestContext::getMain() rather than new RequestContext()

https://gerrit.wikimedia.org/r/520395

Tgr added a comment.Jul 3 2019, 1:05 PM

@Anomie tracked this down: a TimedMediaHandler utility method that's called several times in the same request uses new RequestContext() instead of RequestContext::getMain(). RequestContext caches the WebRequest object which caches a reference to the Session object, so every call new RequestContext() created a fresh Session object. Normally that doesn't really make a difference, except when there is "partial" session data in the request, that is, a token cookie but no session cookie (this is normal on browser restart, token cookies expire in a year, session cookies expire when closing the browser), a new session will automatically be created and the session cookie output when the Session is fetched, so every new RequestContext() call results in a new set of SetCookie headers.

I still don't get how this is causing T226754: Video in MediaWiki:Loginprompt causes login session problems - presumably somehow it triggers a session ID change and the CSRF token or some other information that's necessary for the login to succeed and is stored in the session gets lost; but at the start of login there is neither a session nor a token cookie, and at the end there's both, and neither of those situations should trigger the outputting of a new Set-Cookie header. At the start of the login the RequestContext object used by TimedMediaHandler would have its own session with a different ID (since the session ID wasn't in the incoming request) but it's not used in any way where that would matter, as far as I can see...

ema renamed this task from Consistent HTTP 503 Varnish Error on some urls for some logged-in users (CentralAuth Set-Cookie storm) to Consistent HTTP 503 Error on some urls for some logged-in users (CentralAuth Set-Cookie storm).Jul 3 2019, 1:06 PM
Tgr added a comment.Jul 3 2019, 1:06 PM

I wonder if other uses of new RequestContext() (most of which are intentional) can trigger this bug.

BBlack added a comment.Jul 3 2019, 1:12 PM

Thanks for chasing this down! After fixing up any further sources of the extra sessions: do we have to do something about clearing out the excess sessions from storage (redis?), or is this mostly an ephemeral sort of problem?

Tgr added a comment.Jul 3 2019, 1:15 PM

Sessions are created all the time, and they are short-lived (one hour I think?). I'm pretty sure creating a dozen sessions in some fairly rare corner cases is not something we need to be concerned about.

Jdforrester-WMF added a subscriber: Jdforrester-WMF.

I wonder if other uses of new RequestContext() (most of which are intentional) can trigger this bug.

Looks to be exclusively in tests?

BBlack added a comment.Jul 3 2019, 3:56 PM

I wonder if other uses of new RequestContext() (most of which are intentional) can trigger this bug.

Looks to be exclusively in tests?

Seems like most are in unit tests just looking at the paths, but there are several extensions which appear to use new RequestContext() outside of unit tests, too.

Tgr added a comment.EditedJul 3 2019, 4:24 PM

At a glance:

  • The edit API uses it because the way it works is to construct a fake request to the web edit form based on the API parameters. It's all rather horrible; the MCR rewrite was supposed to replace it with something non-hacky but that project seems to have stalled :(
  • The paraminfo API uses it to render messages in a custom language (I think). It could probably use DerivativeContext instead. But it overrides the user so it won't trigger this bug anyway.
  • The parser uses it to simulate query parameters for transcluded special pages. That probably can't be replaced with DerivativeContext, but it also overrides the user and request so should not be able to trigger session creation.
  • Echo and FileImporter use it for internal API requests. That's wrong, DerivativeContext should be used instead.
  • Flow, VisualEditor and LiquidThreads seem to use it to get an OutputPage separate from the main output. Probably should be using new OutputPage( RequestContext::getMain() )? Although the OutputPage constructor does say not to do that.

Did not look at non-WMF examples and maintenance scripts (where session handling matters little).

ema moved this task from Triage to Watching on the Traffic board.Jul 5 2019, 11:59 AM
TheDJ added a subscriber: TheDJ.Jul 10 2019, 6:33 AM

Maybe an update to the class documentation to make it easier to distinguish between proper use and inproper uses ?

TheDJ added a comment.Thu, Jul 25, 6:33 PM

Is this fixed now ?

Tgr added a comment.Mon, Jul 29, 9:44 AM

Is this fixed now ?

The specific issue of TMH triggering CentralAuth misbehavior is fixed. The more generic issue of CentralAuth misbehavior being easily triggerable via wrong use of Request or RequestContext is not.

Is this fixed now ?

The specific issue of TMH triggering CentralAuth misbehavior is fixed. The more generic issue of CentralAuth misbehavior being easily triggerable via wrong use of Request or RequestContext is not.

Borrowing from http://sweng.the-davies.net/Home/rustys-api-design-manifesto , it seems like this is somewhere on the negative end of interface design. Is there some general way we can protect against these misuses? Is there ever a legitimate reason to emit multiple session cookies? Could we fix it up in the code that emits the cookies?

(Also, is the specific TMH fix actually deployed to all groups yet?)

Change 530885 had a related patch set uploaded (by Paladox; owner: Reedy):
[mediawiki/extensions/TimedMediaHandler@REL1_33] Call RequestContext::getMain() rather than new RequestContext()

https://gerrit.wikimedia.org/r/530885

Anomie moved this task from Inbox to Tracking/Watching on the Core Platform Team board.

Is there ever a legitimate reason to emit multiple session cookies?

When a session ID is changed (which is done at various times for security), we need to send a new cookie with the new ID even if we already sent one with the old ID.

In general, the SessionManager code just emits the cookie to WebResponse whenever it thinks it should. This simplifies the logic since it doesn't have to try to keep track of externalities like whether the cookies are being sent to a real WebResponse or to a FauxResponse or whether the SessionBackend is was destroyed and re-created. WebResponse handles the deduplication in case SessionManager winds up trying to set the same cookie with the same value multiple times, since that's where the actual calls to setcookie() are.

But none of that really has much to do with this task anyway. See below.

Could we fix it up in the code that emits the cookies?

The problem here was that we wound up with multiple cookie-emitters all emitting different values for the cookie. Which in turn was because we wound up with multiple "requests" being created inside of one request. The fix was to clean up the places where the multiple "requests" were being created.

We might in theory make RequestContext's constructor private or protected somehow to guard against new code having that error, although PHP doesn't make that particularly easy to enforce at the code level.

(Also, is the specific TMH fix actually deployed to all groups yet?)

It was included in wmf/1.34.0-wmf.13 which was deployed to all groups on 2019-07-11 at 19:29: https://tools.wmflabs.org/sal/log/AWvigyKCOwpQ-3Pk9abE