beta labs not responding; API shows 503 from varnish
Closed, ResolvedPublic

Description

Neither http://en.wikipedia.beta.wmflabs.org/ nor http://en.wikipedia.beta.wmflabs.org/w/api.php is responding right now.

After a long wait, http://en.wikipedia.beta.wmflabs.org/w/api.php returns with

Request: GET http://en.wikipedia.beta.wmflabs.org/w/api.php, from 67.1.150.67 via deployment-cache-text02 frontend ([10.68.16.16]:80), Varnish XID 760419132
Forwarded for: 67.1.150.67
Error: 503, Service Unavailable at Fri, 25 Jul 2014 14:24:01 GMT


Version: unspecified
Severity: major
See Also:
https://github.com/facebook/hhvm/issues/2531

Details

Reference
bz68574

Event Timeline

bzimport raised the priority of this task to High. Nov 22 2014, 3:34 AM
bzimport set Reference to bz68574.

Bryan kicked HHVM and things are back (just slow for me). But still investigating.

16:57 < bd808> greg-g: [19:23] < ori> OK, I merged the config change for Labs, so we'll probably know within the next hour or so if we have additional bugs on our hands
16:57 < bblack> hah
16:58 < greg-g> 19:23 what time?
16:58 < bd808> MDT
16:59 < bblack> so that puts it about 1:22 before the last event to udplog
17:00 < bd808> so ... in a hour we might be hosed again?
17:00 < bblack> probably :)

17:02 < bd808> ori: hhvm on both beta servers was borked. ps showed hundreds of zombie sh processes with hhvm as the parent.
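For anyone who wants to check for the same symptom, here is a minimal sketch that tallies zombie processes by the command name of their parent. It assumes a Linux procps `ps` that accepts `-eo pid,ppid,stat,comm`; the function names `count_zombie_children` and `snapshot` are made up for illustration and are not part of any existing tooling.

```python
import subprocess
from collections import Counter


def count_zombie_children(ps_output: str) -> Counter:
    """Tally zombie processes by the command name of their parent.

    Expects the text produced by `ps -eo pid,ppid,stat,comm`
    (header row first, then one process per row).
    """
    rows = []
    for line in ps_output.strip().splitlines()[1:]:  # skip the header row
        fields = line.split(None, 3)  # comm may contain spaces, keep it whole
        if len(fields) == 4:
            pid, ppid, stat, comm = fields
            rows.append((pid, ppid, stat, comm.strip()))

    names = {pid: comm for pid, _, _, comm in rows}
    zombies = Counter()
    for _pid, ppid, stat, _comm in rows:
        if stat.startswith("Z"):  # process state Z means zombie (defunct)
            zombies[names.get(ppid, "?")] += 1
    return zombies


def snapshot() -> Counter:
    """Run ps on the local host and tally zombies (assumes procps ps)."""
    out = subprocess.run(
        ["ps", "-eo", "pid,ppid,stat,comm"],
        capture_output=True, text=True, check=True,
    ).stdout
    return count_zombie_children(out)
```

Calling `snapshot().most_common()` on an affected host should list `hhvm` with a large count if the state described above has recurred.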

The last event seen in logstash was at 2014-07-25T14:45:04.835Z. Ori's IRC message would have been around 2014-07-25T01:23Z.
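The timezone arithmetic can be double-checked with a short sketch. It assumes MDT is a fixed UTC-6 offset, that the 19:23 message was sent on the evening of July 24 local time (inferred from the UTC figure above), and that its seconds were zero.

```python
from datetime import datetime, timedelta, timezone

# Assumption: MDT treated as a fixed UTC-6 offset.
MDT = timezone(timedelta(hours=-6), "MDT")

# Assumption: the 19:23 message was sent on July 24 local time, seconds unknown.
merge_local = datetime(2014, 7, 24, 19, 23, tzinfo=MDT)
merge_utc = merge_local.astimezone(timezone.utc)

# Last event seen in logstash, as quoted above.
last_event = datetime(2014, 7, 25, 14, 45, 4, 835000, tzinfo=timezone.utc)

gap = last_event - merge_utc

print(merge_utc.isoformat())  # 2014-07-25T01:23:00+00:00
print(gap)                    # 13:22:04.835000
```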

In apache error logs I see lots and lots and lots of:

[Fri Jul 25 14:45:59.516788 2014] [proxy_fcgi:error] [pid 17215] (70014)End of file found: [client 10.68.16.12:62752] AH01075: Error dispatching request to :
[Fri Jul 25 14:46:01.058387 2014] [proxy_fcgi:error] [pid 17007] [client 10.68.16.12:62750] AH01067: Failed to read FastCGI header
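To put a number on "lots and lots", something like the following sketch could tally the `proxy_fcgi` errors by their `AH` code. The regular expression is only an assumption based on the two sample lines above, and `tally_fcgi_errors` is a hypothetical helper name.

```python
import re
from collections import Counter

# Assumption: each error sits on a single line and carries one AHnnnnn code
# somewhere after the [proxy_fcgi:error] tag, as in the samples above.
AH_CODE = re.compile(r"\[proxy_fcgi:error\].*?\b(AH\d{5})\b")


def tally_fcgi_errors(lines):
    """Count proxy_fcgi error lines per Apache AH code."""
    counts = Counter()
    for line in lines:
        match = AH_CODE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

It accepts any iterable of lines, so an open file handle on the Apache error log can be passed directly.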

/var/log/hhvm/error.log lamely doesn't contain timestamps. I didn't see anything obvious in there however.

*** Bug 68684 has been marked as a duplicate of this bug. ***

So is this an upstream issue resembling https://github.com/facebook/hhvm/issues/2531 ? Or do we (Wikimedia) plan to investigate a workaround/fix ourselves here ("high priority" set)?

We talked about the issue during the RelEng/QA weekly check-in. An engineer from Facebook is in the WMF office for a month, and the HHVM folks are trying to gather as many stack traces and crash reports as possible so they are documented for later investigation. A lot of changes are being made to the HHVM code base and configuration to get it ready for production.

In short: beta cluster is going to be unstable for a few :-/

The long-term plan would be to create a new cluster dedicated to running browser tests for QA, updated only once a day or so. That should be more stable. We track that as Bug 65127 - Setup multiversion on Beta Cluster for nightly build browser testing support.

bsimmers wrote:

I'm working on this. I'm pretty sure it's a bug in hhvm's fastcgi server.

I think this particular bug is long since fixed in our internal HHVM builds (and probably upstream by now). Brett, Ori, Giuseppe, can you confirm?