Error
- Request: Any URL, logged-in.
Request from *** via cp1089 cp1089, Varnish XID 796181295 Error: 503, Backend fetch failed at Fri, 28 Jun 2019 15:48:22 GMT
HTTP/2 503
date: Fri, 28 Jun 2019 15:48:22 GMT
content-type: text/html; charset=utf-8
content-length: 1818
server: Varnish
x-varnish: 796181294, 505502054, 673725678
via: 1.1 varnish (Varnish/5.1), 1.1 varnish (Varnish/5.1), 1.1 varnish (Varnish/5.1)
vary: Accept-Encoding
age: 0
x-cache: cp1089 int, cp3033 miss, cp3033 miss
x-cache-status: int-remote
server-timing: cache;desc="int-remote"
…
Impact
At a seemingly random point in time, a logged-in user's browser can get into a state in which it is unable to load certain pages on a wiki. Requests for these pages consistently cause the MediaWiki server to respond in a way that makes Varnish serve the generic 503 error page instead.
These pages are not rare or uncommon. They are popular pages and articles with no apparent special features. They load fine in most cases, but once the logged-in user's browser has reached that state, they fail consistently.
Notes
Yesterday I was able to view https://commons.wikimedia.org/wiki/Commons:Video without issue, but today it consistently leads to the above error. I can reproduce this with cURL, and have reduced the request headers and cookies to the bare minimum shown in the command that follows. I have also copied the page to the Commons sandbox and was able to reproduce it there as well. It appears that something on some wiki pages triggers it, and I have narrowed it down to pages with a video.
curl -i 'https://commons.wikimedia.org/w/index.php?title=Commons:Sandbox&oldid=356345750' -H 'authority: commons.wikimedia.org' -H 'cookie: centralauth_User=Krinkle; centralauth_Token=****;'
I have replaced the centralauth_Token value with asterisks. Note that the request only fails when the token is valid, so you will have to substitute your own username and token to reproduce the issue.
@BBlack was able to capture the full response before Varnish rejected it, and found the culprit was that MediaWiki emitted more than the limit of 64 HTTP response headers.
The majority of them (about 50) came from CentralAuth, which set the same two cookies over and over again, like so:
Set-Cookie: commonswikiSession=XXX path=/; secure; httponly
Set-Cookie: centralauth_Session=YYY path=/; domain=commons.wikimedia.org; secure; httponly
Set-Cookie: commonswikiSession=ZZZ path=/; secure; httponly
Set-Cookie: centralauth_Session=AAA path=/; domain=commons.wikimedia.org; secure; httponly
Set-Cookie: commonswikiSession=BBB path=/; secure; httponly
Set-Cookie: centralauth_Session=CCC path=/; domain=commons.wikimedia.org; secure; httponly
Set-Cookie: commonswikiSession=DDD path=/; secure; httponly
Set-Cookie: centralauth_Session=EEE path=/; domain=commons.wikimedia.org; secure; httponly
Set-Cookie: commonswikiSession=FFF path=/; secure; httponly
Set-Cookie: centralauth_Session=GGG path=/; domain=commons.wikimedia.org; secure; httponly
Set-Cookie: commonswikiSession=HHH path=/; secure; httponly
Set-Cookie: centralauth_Session=JJJ path=/; domain=commons.wikimedia.org; secure; httponly
Each session ID was different from the previous one.
Even when the above cURL test case is used with a page that does not produce an error page, the response still carries a great number of these headers (just below the limit of 64, and thus not causing the more visible problem).
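To check how close a given response is to the limit, the output of `curl -i` can be fed through a small counter. The following is only a sketch: `summarize_headers` and `HEADER_LIMIT` are hypothetical names introduced here, the value 64 is simply the limit quoted above, and Varnish's own accounting may differ from a plain count of header lines.

```python
from collections import Counter

# Limit quoted in this report; an assumption for any other setup.
HEADER_LIMIT = 64


def summarize_headers(raw_response: str, limit: int = HEADER_LIMIT) -> dict:
    """Count header lines and repeated Set-Cookie names in a raw HTTP response."""
    # Keep only the head: everything before the first blank line.
    head = raw_response.replace("\r\n", "\n").split("\n\n", 1)[0]
    lines = [ln for ln in head.split("\n") if ln.strip()]

    # Drop the status line (e.g. "HTTP/2 200"); the rest are header lines.
    if lines and lines[0].upper().startswith("HTTP/"):
        lines = lines[1:]
    header_lines = [ln for ln in lines if ":" in ln]

    # Tally Set-Cookie headers by cookie name.
    cookie_names = Counter()
    for ln in header_lines:
        name, _, value = ln.partition(":")
        if name.strip().lower() == "set-cookie":
            cookie_names[value.strip().split("=", 1)[0]] += 1

    return {
        "total_headers": len(header_lines),
        "set_cookie_headers": sum(cookie_names.values()),
        "repeated_cookies": {k: v for k, v in cookie_names.items() if v > 1},
        "over_limit": len(header_lines) > limit,
    }


if __name__ == "__main__":
    import sys

    # Usage: curl -si '<url>' -H 'cookie: ...' | python3 this_script.py
    print(summarize_headers(sys.stdin.read()))
```

A non-empty `repeated_cookies` entry for `commonswikiSession` or `centralauth_Session` would point at the same repetition described above, even on pages that still load.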