Maniphest T203664

scap timeout checking index.php/api.php mwdebug1001 / mwdebug1002
Closed, Duplicate (Public)
Assigned To

None

Authored By

	hashar
	Sep 6 2018, 1:07 PM

Description

Spotted while promoting all wikis:

13:05:12 Check 'Check endpoints for mwdebug1001.eqiad.wmnet' failed:
  /wiki/{title} (Main Page) timed out before a response was received;
  /wiki/{title} (Special Version) timed out before a response was received;
  /w/api.php (Main Page pageprops) timed out before a response was received

13:05:12 Check 'Check endpoints for mwdebug1002.eqiad.wmnet' failed:
  /wiki/{title} (Main Page) timed out before a response was received;
  /wiki/{title} (Special Version) timed out before a response was received;
  /w/api.php (Main Page pageprops) timed out before a response was received

My guess is that the check is the very first request to those hosts, so the HHVM bytecode cache has to be primed first. That takes more than 10 seconds, which might be the timeout for that check.
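
If that guess is right, one mitigation would be a throwaway warm-up request with a longer timeout before the real checks run. The following is only a minimal sketch of that idea, not scap's actual check code: `check_endpoints`, `make_fake_host`, both timeout constants, and the concrete paths are hypothetical, and the host is simulated so that only its first request is slow.

```python
from typing import Callable, Dict, List

# Hypothetical values; not taken from scap's real configuration.
CHECK_TIMEOUT = 10.0   # seconds; the timeout guessed at in the description
WARMUP_TIMEOUT = 60.0  # seconds; assumed budget for a cold first request


def check_endpoints(
    fetch: Callable[[str, float], bool],
    paths: List[str],
    warmup: bool = True,
) -> Dict[str, bool]:
    """Return {path: ok} for each path.

    `fetch(path, timeout)` must return True for a timely response and
    False for a timeout or error.
    """
    if warmup and paths:
        # Throwaway request: the result is ignored, it only absorbs the
        # cold-start cost before the real checks run.
        fetch(paths[0], WARMUP_TIMEOUT)
    return {path: fetch(path, CHECK_TIMEOUT) for path in paths}


def make_fake_host(cold_cost: float, warm_cost: float) -> Callable[[str, float], bool]:
    """Simulate a host whose first request is slow and later ones are fast."""
    state = {"cold": True}

    def fetch(path: str, timeout: float) -> bool:
        cost = cold_cost if state["cold"] else warm_cost
        # Simplification: the first request always leaves the host warm,
        # even if the caller gave up waiting for it.
        state["cold"] = False
        return cost <= timeout

    return fetch


if __name__ == "__main__":
    paths = ["/wiki/Main_Page", "/wiki/Special:Version", "/w/api.php"]
    print(check_endpoints(make_fake_host(12.0, 0.3), paths, warmup=False))
    print(check_endpoints(make_fake_host(12.0, 0.3), paths, warmup=True))
```

In this simulation the first check fails without the warm-up and every check passes with it. A real host may stay slow for several requests, so the warm-up would likely need to be repeated or polled.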

Related Objects
Status     Subtype  Assigned   Task
Duplicate           None       T203664 scap timeout checking index.php/api.php mwdebug1001 / mwdebug1002
Resolved            akosiaris  T212955 Increase mwdebugXXXX hosts CPU

Event Timeline

hashar created this task. Sep 6 2018, 1:07 PM

Restricted Application added a subscriber: Aklapper. Sep 6 2018, 1:07 PM

I don't think the HHVM bytecode cache explains the timeouts when promoting all wikis, since 1.32.0-wmf.20 has been on wikis since Tuesday.

I do wonder if it's related to T203625: mwdebug1001 and mwdebug1002 are reliably the last two hosts to finish scap-cdb-rebuild.

greg mentioned this in T203625: mwdebug1001 and mwdebug1002 are reliably the last two hosts to finish scap-cdb-rebuild. Nov 27 2018, 7:10 PM

greg triaged this task as Medium priority. Jan 4 2019, 5:38 PM

greg added a subtask: T203625: mwdebug1001 and mwdebug1002 are reliably the last two hosts to finish scap-cdb-rebuild.

greg edited projects, added Release-Engineering-Team (Watching / External); removed Release-Engineering-Team.

hashar updated the task description. Jan 4 2019, 6:12 PM

This briefly happened during a SWAT deploy just now:

<icinga-wm> PROBLEM - HHVM rendering on mwdebug1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds

It recovered after two minutes.

Jdforrester-WMF subscribed. Feb 6 2019, 12:51 AM

greg mentioned this in T215368: First request after a MediaWiki sync times out on mwdebug. Feb 6 2019, 12:53 AM

We see similar issues during manual testing: T215368: First request after a MediaWiki sync times out on mwdebug.

Also saw this today, 2-3 minutes after a patch was pulled to mwdebug1002:

19:27 <+icinga-wm> PROBLEM - HHVM rendering on mwdebug1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
19:29 <+icinga-wm> RECOVERY - HHVM rendering on mwdebug1002 is OK: HTTP OK: HTTP/1.1 200 OK - 76873 bytes in 0.258 second response time
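
To compare how long these alerts stay open across deployments, the PROBLEM/RECOVERY pairs can be matched up mechanically. This is a rough sketch that assumes the IRC line format shown above; `outage_minutes` is a hypothetical helper, the timestamps only have minute resolution, and alerts that span midnight are not handled.

```python
import re
from datetime import datetime
from typing import Dict, List, Tuple

# Matches lines like:
# 19:27 <+icinga-wm> PROBLEM - HHVM rendering on mwdebug1002 is CRITICAL: ...
LINE = re.compile(
    r"^(\d{2}:\d{2}) <\+?icinga-wm> (PROBLEM|RECOVERY) - (.+?) on (\S+) is "
)


def outage_minutes(lines: List[str]) -> List[Tuple[str, str, int]]:
    """Pair each PROBLEM with the next RECOVERY for the same host and service.

    Returns (host, service, minutes) tuples in recovery order.
    """
    opened: Dict[Tuple[str, str], datetime] = {}
    durations: List[Tuple[str, str, int]] = []
    for line in lines:
        match = LINE.match(line)
        if not match:
            continue  # not an icinga-wm alert line
        stamp, kind, service, host = match.groups()
        when = datetime.strptime(stamp, "%H:%M")
        key = (service, host)
        if kind == "PROBLEM":
            opened[key] = when
        elif key in opened:
            delta = when - opened.pop(key)
            durations.append((host, service, int(delta.total_seconds() // 60)))
    return durations


if __name__ == "__main__":
    log = [
        "19:27 <+icinga-wm> PROBLEM - HHVM rendering on mwdebug1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds",
        "19:29 <+icinga-wm> RECOVERY - HHVM rendering on mwdebug1002 is OK: HTTP OK: HTTP/1.1 200 OK - 76873 bytes in 0.258 second response time",
    ]
    print(outage_minutes(log))
```

For the two lines quoted above this reports a 2-minute outage for "HHVM rendering" on mwdebug1002.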

Krinkle removed a subtask: T203625: mwdebug1001 and mwdebug1002 are reliably the last two hosts to finish scap-cdb-rebuild. Feb 6 2019, 2:52 AM

Krinkle added a subtask: T212955: Increase mwdebugXXXX hosts CPU.

The timeouts we see from the endpoint checks on canary servers don't themselves involve cdb rebuilds, so this presumably isn't meant to depend on T203625. We've also seen the endpoint check failures on other canaries, including canaries that aren't mwdebug VMs.

I suspect that the problem causing endpoint checks to fail on canaries isn't specific to mwdebug or to canaries, because we're also seeing T204871, which affects all app servers shortly after a deployment. The endpoint checks applied to canaries are effectively the same as the real-user requests reported to Logstash in T204871.

Krinkle closed this task as a duplicate of T204871: Investigate the spikes of "web request took longer than 60 seconds and timed out" during deployments on HHVM. Feb 6 2019, 2:57 AM

Krinkle mentioned this in T204871: Investigate the spikes of "web request took longer than 60 seconds and timed out" during deployments on HHVM. Feb 6 2019, 3:03 AM

akosiaris closed subtask T212955: Increase mwdebugXXXX hosts CPU as Resolved. Feb 7 2019, 9:48 AM
