Investigate the spikes of "web request took longer than 60 seconds and timed out" during deployments
Open, HighPublic

Description

After promoting group1 to 1.32.0-wmf.22 I noticed a spike of web request took longer than 60 seconds and timed out. Roughly from 19:24 to 19:28. That happened after the deployment:

19:19:43 <wikibugs> (Merged) jenkins-bot: group1 wikis to 1.32.0-wmf.22 [mediawiki-config] - https://gerrit.wikimedia.org/r/461456 (owner: Hashar)
19:24:50 <logmsgbot> !log hashar@deploy1001 rebuilt and synchronized wikiversions files: group1 wikis to 1.32.0-wmf.22

From logstash query type:mediawiki AND channel:(fatal OR exception OR error) AND "60 seconds":

hashar created this task.Sep 19 2018, 7:39 PM
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptSep 19 2018, 7:39 PM

Mentioned in SAL (#wikimedia-operations) [2018-09-19T19:40:59Z] <hashar> web request 60 second timeout when deploying is filled as https://phabricator.wikimedia.org/T204871

The web request times out are enforced since September 10th (see T97192#4561879 and https://lists.wikimedia.org/pipermail/wikitech-l/2018-September/090803.html ). We have not deployed a train since then.

I have not paid attention to those errors when promoting group0 yesterday. Looking at group0 the same issue happened:

19:20	<hashar@deploy1001>	rebuilt and synchronized wikiversions files: group0 to 1.32.0-wmf.22

Maybe we always had the issue and it is now showing up due to the timeout limit being now respected.

Same happened with group2:

14:38:57 Finished sync-apaches (duration: 00m 38s)

hashar moved this task from To Triage to Active Situation on the Wikimedia-Incident board.
hashar renamed this task from Promoting group1 to 1.32.0-wmf.22 caused a spam of web request took longer than 60 seconds and timed out to Deployments of MediaWiki with scap cause a spam of "web request took longer than 60 seconds and timed out".

Maybe we always had the issue and it is now showing up due to the timeout limit being now respected.

I believe this to be true.

zeljkofilipin moved this task from Backlog 🔙 to Watching 👀 on the User-zeljkofilipin board.
zeljkofilipin added a subscriber: zeljkofilipin.

Logs are full of this error message.

greg added a subscriber: greg.Oct 18 2018, 10:10 PM

(Starting to add subtasks to this that are either instances of or other teams tracking their portions of the problem.)

greg triaged this task as High priority.Oct 18 2018, 10:15 PM

(Prioritizing this general task as "High" whereas some subtasks might be UBN or Normal.)

greg renamed this task from Deployments of MediaWiki with scap cause a spam of "web request took longer than 60 seconds and timed out" to Investigate the spikes of "web request took longer than 60 seconds and timed out" during deployments.Nov 27 2018, 7:19 PM
greg moved this task from Active Situation to Follow-up/Actionables on the Wikimedia-Incident board.

Making this a follow-up/actionable to track this work better/more realistically.

zeljkofilipin removed zeljkofilipin as the assignee of this task.Dec 20 2018, 5:43 PM
zeljkofilipin moved this task from In Progress 🔨 to Next 🔜 on the User-zeljkofilipin board.

Just a note that for today's promotion of group1 to 1.33.0-wmf.13 (T206667), I segmented the group1 error-log dashboard to have a view of just these timeout errors and a view excluding them. It was very helpful in keeping on both the rise in timeouts and side effects or unrelated errors. I plan on saving the dashboards and adding links in the train docs.

The issue about app servers having the first few requests time out after a deploy, naturally, also affects the canaries. As such, this is sometimes causing the endpoint checks, whhich Scap performs against canaries during a deployment, to sometimes fail.

@hashar wrote at T203664:

Spotted while promoting all wikis:

13:05:12 Check 'Check endpoints for mwdebug1002.eqiad.wmnet' failed:
  /wiki/{title} (Main Page) timed out before a response was received;
  /wiki/{title} (Special Version) timed out before a response was received;
  /w/api.php (Main Page pageprops) timed out before a response was received

The above report is about mwdebug (which is slower in general due to being a VM), but I've seen it happen on other canaries as well.

The issue about app servers having the first few requests time out after a deploy, naturally, also affects the canaries. As such, this is sometimes causing the endpoint checks, whhich Scap performs against canaries during a deployment, to sometimes fail.

I've seen that as well, and just as problematic is that it's rather hard to determine the difference between incidental slowness due to the deployment itself vs breakage actually caused by the new code being deployed.