Page MenuHomePhabricator

Investigate the spikes of "web request took longer than 60 seconds and timed out" during deployments on HHVM
Closed, DeclinedPublicPRODUCTION ERROR

Description

After promoting group1 to 1.32.0-wmf.22 I noticed a spike of web request took longer than 60 seconds and timed out. Roughly from 19:24 to 19:28. That happened after the deployment:

19:19:43 <wikibugs> (Merged) jenkins-bot: group1 wikis to 1.32.0-wmf.22 [mediawiki-config] - https://gerrit.wikimedia.org/r/461456 (owner: Hashar)
19:24:50 <logmsgbot> !log hashar@deploy1001 rebuilt and synchronized wikiversions files: group1 wikis to 1.32.0-wmf.22

From logstash query type:mediawiki AND channel:(fatal OR exception OR error) AND "60 seconds":

Screenshot_20180919_213914.png (515×918 px, 51 KB)

Event Timeline

Mentioned in SAL (#wikimedia-operations) [2018-09-19T19:40:59Z] <hashar> web request 60 second timeout when deploying is filled as https://phabricator.wikimedia.org/T204871

The web request times out are enforced since September 10th (see T97192#4561879 and https://lists.wikimedia.org/pipermail/wikitech-l/2018-September/090803.html ). We have not deployed a train since then.

I have not paid attention to those errors when promoting group0 yesterday. Looking at group0 the same issue happened:

19:20	<hashar@deploy1001>	rebuilt and synchronized wikiversions files: group0 to 1.32.0-wmf.22

Screenshot_20180919_215315.png (309×840 px, 30 KB)

Maybe we always had the issue and it is now showing up due to the timeout limit being now respected.

Same happened with group2:

14:38:57 Finished sync-apaches (duration: 00m 38s)

Screenshot_20180920_164349.png (275×1 px, 26 KB)

hashar renamed this task from Promoting group1 to 1.32.0-wmf.22 caused a spam of web request took longer than 60 seconds and timed out to Deployments of MediaWiki with scap cause a spam of "web request took longer than 60 seconds and timed out".Sep 25 2018, 10:17 AM
hashar added a project: Wikimedia-Incident.
hashar moved this task from Active investigation to Active Situation on the Wikimedia-Incident board.

Maybe we always had the issue and it is now showing up due to the timeout limit being now respected.

I believe this to be true.

Logs are full of this error message.

(Starting to add subtasks to this that are either instances of or other teams tracking their portions of the problem.)

greg triaged this task as High priority.Oct 18 2018, 10:15 PM

(Prioritizing this general task as "High" whereas some subtasks might be UBN or Normal.)

greg renamed this task from Deployments of MediaWiki with scap cause a spam of "web request took longer than 60 seconds and timed out" to Investigate the spikes of "web request took longer than 60 seconds and timed out" during deployments.Nov 27 2018, 7:19 PM
greg moved this task from Active Situation to Follow-up prevention on the Wikimedia-Incident board.

Making this a follow-up/actionable to track this work better/more realistically.

Just a note that for today's promotion of group1 to 1.33.0-wmf.13 (T206667), I segmented the group1 error-log dashboard to have a view of just these timeout errors and a view excluding them. It was very helpful in keeping on both the rise in timeouts and side effects or unrelated errors. I plan on saving the dashboards and adding links in the train docs.

group1 - just timeout errors.png (295×1 px, 30 KB)

group1 - excluding timeout errors.png (299×1 px, 29 KB)

The issue about app servers having the first few requests time out after a deploy, naturally, also affects the canaries. As such, this is sometimes causing the endpoint checks, whhich Scap performs against canaries during a deployment, to sometimes fail.

@hashar wrote at T203664:

Spotted while promoting all wikis:

13:05:12 Check 'Check endpoints for mwdebug1002.eqiad.wmnet' failed:
  /wiki/{title} (Main Page) timed out before a response was received;
  /wiki/{title} (Special Version) timed out before a response was received;
  /w/api.php (Main Page pageprops) timed out before a response was received

The above report is about mwdebug (which is slower in general due to being a VM), but I've seen it happen on other canaries as well.

The issue about app servers having the first few requests time out after a deploy, naturally, also affects the canaries. As such, this is sometimes causing the endpoint checks, whhich Scap performs against canaries during a deployment, to sometimes fail.

I've seen that as well, and just as problematic is that it's rather hard to determine the difference between incidental slowness due to the deployment itself vs breakage actually caused by the new code being deployed.

Krinkle renamed this task from Investigate the spikes of "web request took longer than 60 seconds and timed out" during deployments to Investigate the spikes of "web request took longer than 60 seconds and timed out" during deployments on HHVM.May 19 2019, 10:14 AM
greg changed the task status from Open to Stalled.Jul 6 2019, 4:58 AM
greg lowered the priority of this task from High to Medium.

Not sufficient need right now.

mmodell changed the subtype of this task from "Task" to "Production Error".Aug 28 2019, 11:08 PM