scap timeout checking index.php/api.php mwdebug1001 / mwdebug1002
Closed, Duplicate (Public)

Description

Spotted while promoting all wikis:

13:05:12 Check 'Check endpoints for mwdebug1001.eqiad.wmnet' failed:
  /wiki/{title} (Main Page) timed out before a response was received;
  /wiki/{title} (Special Version) timed out before a response was received;
  /w/api.php (Main Page pageprops) timed out before a response was received

13:05:12 Check 'Check endpoints for mwdebug1002.eqiad.wmnet' failed:
  /wiki/{title} (Main Page) timed out before a response was received;
  /wiki/{title} (Special Version) timed out before a response was received;
  /w/api.php (Main Page pageprops) timed out before a response was received

My guess is that the check is the very first request to those hosts, so the HHVM bytecode cache has to be primed first. That takes more than 10 seconds, which might be the timeout for that check.
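One way to test the guess above is to time a few consecutive requests to the same host and see whether only the first one exceeds the timeout. The following is a minimal sketch, not scap's actual check code: the 10-second value is the assumed timeout from the guess, and `fetch`, `time_requests`, and `looks_like_cold_start` are hypothetical helper names.

```python
import time
import urllib.request

CHECK_TIMEOUT = 10.0  # seconds; assumed value, taken from the guess above


def fetch(url, timeout=60.0):
    """Fetch url and discard the body.

    The timeout is deliberately generous so that a slow first request
    can finish and be measured instead of being cut off.
    """
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        resp.read()


def time_requests(do_request, count=3, clock=time.monotonic):
    """Call do_request() count times; return each call's duration in seconds."""
    durations = []
    for _ in range(count):
        start = clock()
        do_request()
        durations.append(clock() - start)
    return durations


def looks_like_cold_start(durations, timeout=CHECK_TIMEOUT):
    """True if the first request exceeded the timeout and no later one did."""
    if not durations:
        return False
    first, rest = durations[0], durations[1:]
    return first > timeout and all(d <= timeout for d in rest)
```

Hypothetical usage right after a sync: `looks_like_cold_start(time_requests(lambda: fetch("http://mwdebug1001.eqiad.wmnet/wiki/Main_Page")))`. The URL is only illustrative; the real check may also send a specific Host header, which this sketch does not do.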

Event Timeline

I don't think it would be the HHVM bytecode cache when promoting all wikis, since 1.32.0-wmf.20 has been on wikis since Tuesday.

I do wonder if it's related to T203625: mwdebug1001 and mwdebug1002 are reliably the last two hosts to finish scap-cdb-rebuild.

This briefly happened during a SWAT deploy just now:

<icinga-wm> PROBLEM - HHVM rendering on mwdebug1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds

It recovered after two minutes.

We see similar issues during manual testing: T215368: First request after a MediaWiki sync times out on mwdebug

Also saw this today, 2-3 minutes after a patch was pulled to mwdebug1002:

19:27 <+icinga-wm> PROBLEM - HHVM rendering on mwdebug1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
19:29 <+icinga-wm> RECOVERY - HHVM rendering on mwdebug1002 is OK: HTTP OK: HTTP/1.1 200 OK - 76873 bytes in 0.258 second response time
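To compare how long these alerts last across deploys, the PROBLEM/RECOVERY pairs can be extracted from the IRC log. This is a rough sketch that assumes the `HH:MM <+icinga-wm> ...` line format shown above; `LINE_RE` and `outage_minutes` are hypothetical names, and the resolution is limited to whole minutes because that is all the timestamps provide.

```python
import re

# Matches lines such as:
# 19:27 <+icinga-wm> PROBLEM - HHVM rendering on mwdebug1002 is CRITICAL: ...
LINE_RE = re.compile(
    r"^(?P<hh>\d{2}):(?P<mm>\d{2}) <\+?icinga-wm> "
    r"(?P<kind>PROBLEM|RECOVERY) - (?P<service>.+?) on (?P<host>\S+) is "
)


def outage_minutes(lines):
    """Pair each PROBLEM with the next RECOVERY for the same host and service.

    Returns a list of (host, service, minutes) tuples.
    """
    open_since = {}
    outages = []
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue
        key = (m["host"], m["service"])
        minute = int(m["hh"]) * 60 + int(m["mm"])
        if m["kind"] == "PROBLEM":
            # Keep the earliest PROBLEM if the alert repeats before recovering.
            open_since.setdefault(key, minute)
        elif key in open_since:
            start = open_since.pop(key)
            # Modulo handles an outage that spans midnight.
            outages.append((key[0], key[1], (minute - start) % (24 * 60)))
    return outages
```

Applied to the two lines above, this yields a single 2-minute outage for "HHVM rendering" on mwdebug1002.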

The timeouts we see from the endpoint checks on canary servers don't themselves involve cdb rebuilds, so this presumably isn't meant to depend on T203625. We've also seen the endpoint check failures happen on other canaries, including canaries that aren't mwdebug VMs.

I suspect that the problem causing endpoint checks to fail on canaries isn't specific to mwdebug or to canaries, because we're also seeing T204871, which affects all app servers shortly after a deployment. The endpoint checks applied to canaries are effectively the same as the real-user requests reported to Logstash in T204871.