Maniphest T204871

Investigate the spikes of "web request took longer than 60 seconds and timed out" during deployments on HHVM
Closed, Declined · Public · PRODUCTION ERROR

Assigned To

None

Authored By

	hashar
	Sep 19 2018, 7:39 PM

Description

After promoting group1 to 1.32.0-wmf.22 I noticed a spike of "web request took longer than 60 seconds and timed out" errors, roughly from 19:24 to 19:28. That happened after the deployment:

19:19:43 <wikibugs> (Merged) jenkins-bot: group1 wikis to 1.32.0-wmf.22 [mediawiki-config] - https://gerrit.wikimedia.org/r/461456 (owner: Hashar)
19:24:50 <logmsgbot> !log hashar@deploy1001 rebuilt and synchronized wikiversions files: group1 wikis to 1.32.0-wmf.22

From logstash query type:mediawiki AND channel:(fatal OR exception OR error) AND "60 seconds":

Screenshot_20180919_213914.png (515×918 px, 51 KB)
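
The logstash query above can be approximated offline against exported log records. The following is only a minimal sketch: the record layout (dicts with `type`, `channel`, `message` and an ISO 8601 `timestamp` string) and both function names are assumptions made for illustration, and a plain substring test stands in for logstash's own phrase matching.

```python
from collections import Counter

# Channels named in the query: channel:(fatal OR exception OR error)
CHANNELS = {"fatal", "exception", "error"}


def matches_query(record):
    """Return True if a (hypothetical) log record would match the query."""
    return (
        record.get("type") == "mediawiki"
        and record.get("channel") in CHANNELS
        # Substring check approximates the quoted phrase "60 seconds".
        and "60 seconds" in record.get("message", "")
    )


def timeouts_per_minute(records):
    """Count matching records per minute, keyed by 'YYYY-MM-DDTHH:MM'."""
    counts = Counter()
    for record in records:
        if matches_query(record):
            # The first 16 characters of an ISO 8601 timestamp
            # identify the minute the record falls in.
            counts[record["timestamp"][:16]] += 1
    return counts
```

Bucketing per minute makes it easy to line the spike up against the deployment timestamps quoted above.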

Related Objects

		Status	Subtype	Assigned	Task
		Declined	PRODUCTION ERROR	None	T204871 Investigate the spikes of "web request took longer than 60 seconds and timed out" during deployments on HHVM
		Resolved	PRODUCTION ERROR	Addshore	T207313 Some administrative and log actions on Wikidata take longer than 60 seconds and time out

Event Timeline

hashar created this task. Sep 19 2018, 7:39 PM

Restricted Application added a subscriber: Aklapper. Sep 19 2018, 7:39 PM

hashar mentioned this in T191068: 1.32.0-wmf.22 deployment blockers. Sep 19 2018, 7:40 PM

Mentioned in SAL (#wikimedia-operations) [2018-09-19T19:40:59Z] <hashar> web request 60 second timeout when deploying is filled as https://phabricator.wikimedia.org/T204871

Krinkle moved this task from Untriaged to Nov2019/1.35.wmf.5+ on the Wikimedia-production-error board. Sep 19 2018, 7:50 PM

Web request timeouts have been enforced since September 10th (see T97192#4561879 and https://lists.wikimedia.org/pipermail/wikitech-l/2018-September/090803.html ). We had not deployed a train since then, until this one.

I did not pay attention to those errors when promoting group0 yesterday. Looking at group0, the same issue happened:

19:20	<hashar@deploy1001>	rebuilt and synchronized wikiversions files: group0 to 1.32.0-wmf.22

Screenshot_20180919_215315.png (309×840 px, 30 KB)

Maybe we always had the issue and it is now showing up due to the timeout limit being now respected.

hashar mentioned this in T97192: HHVM request timeouts not working; support lowering the API request timeout per request. Sep 19 2018, 7:57 PM

Nikerabbit subscribed. Sep 19 2018, 9:21 PM

Krinkle moved this task from Nov2019/1.35.wmf.5+ to Mar 2021 on the Wikimedia-production-error board. Sep 20 2018, 1:46 AM

Same happened with group2:

14:38:57 Finished sync-apaches (duration: 00m 38s)

Screenshot_20180920_164349.png (275×1 px, 26 KB)

hashar renamed this task from Promoting group1 to 1.32.0-wmf.22 caused a spam of web request took longer than 60 seconds and timed out to Deployments of MediaWiki with scap cause a spam of "web request took longer than 60 seconds and timed out". Sep 25 2018, 10:17 AM

hashar added a project: Wikimedia-Incident.

hashar moved this task from Active investigation to Active Situation on the Wikimedia-Incident board.

• Niedzielski mentioned this in T204606: [Spike, 8hrs] Some requests time out after 60 seconds in MobileFrontend transforms. What to do?. Sep 26 2018, 3:39 PM

Maybe we always had the issue and it is now showing up due to the timeout limit being now respected.

I believe this to be true.

zeljkofilipin awarded a token. Oct 10 2018, 12:31 PM

zeljkofilipin added a project: User-zeljkofilipin.

zeljkofilipin moved this task from Backlog 🪒 to Watching 📺 on the User-zeljkofilipin board.

zeljkofilipin subscribed.

Logs are full of this error message.

Mahir256 mentioned this in T207313: Some administrative and log actions on Wikidata take longer than 60 seconds and time out. Oct 17 2018, 7:11 PM

Mahir256 subscribed. Oct 18 2018, 4:54 AM

Paladox subscribed. Oct 18 2018, 10:07 PM

greg added subtasks: T204606: [Spike, 8hrs] Some requests time out after 60 seconds in MobileFrontend transforms. What to do?, T207313: Some administrative and log actions on Wikidata take longer than 60 seconds and time out. Oct 18 2018, 10:08 PM

(Starting to add subtasks to this that are either instances of or other teams tracking their portions of the problem.)

(Prioritizing this general task as "High" whereas some subtasks might be UBN or Normal.)

Addshore closed subtask T207313: Some administrative and log actions on Wikidata take longer than 60 seconds and time out as Resolved. Oct 29 2018, 5:16 PM

greg moved this task from INBOX to Epics (ARCHIVED) on the Release-Engineering-Team board. Nov 21 2018, 12:11 AM

Making this a follow-up/actionable to track this work better/more realistically.

zeljkofilipin moved this task from Watching 📺 to Deep work 🌊 on the User-zeljkofilipin board. Dec 17 2018, 5:27 PM

• mmodell subscribed. Dec 17 2018, 5:29 PM

zeljkofilipin claimed this task. Dec 17 2018, 5:30 PM

Restricted Application edited projects, added Release-Engineering-Team (Kanban); removed Release-Engineering-Team. Dec 17 2018, 5:30 PM

zeljkofilipin moved this task from Backlog to In-progress on the Release-Engineering-Team (Kanban) board. Dec 17 2018, 5:30 PM

zeljkofilipin removed zeljkofilipin as the assignee of this task. Dec 20 2018, 5:43 PM

zeljkofilipin moved this task from In-progress to Backlog on the Release-Engineering-Team (Kanban) board.

zeljkofilipin moved this task from Deep work 🌊 to Q1 👔 on the User-zeljkofilipin board.

dcausse mentioned this in T212455: Spike of fatal timeouts from API search suggestions (prefixsearch). Dec 21 2018, 8:51 AM

Just a note that for today's promotion of group1 to 1.33.0-wmf.13 (T206667), I segmented the group1 error-log dashboard to have a view of just these timeout errors and a view excluding them. It was very helpful in keeping an eye on both the rise in timeouts and any side effects or unrelated errors. I plan on saving the dashboards and adding links in the train docs.

group1 - just timeout errors.png (295×1 px, 30 KB)

group1 - excluding timeout errors.png (299×1 px, 29 KB)
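
The same two-view split can be reproduced on a plain list of error messages. This is a rough sketch only: `split_by_timeout` and `TIMEOUT_PHRASE` are hypothetical names chosen for illustration, and the dashboards themselves were built in the logging UI, not with this code.

```python
# Phrase taken from the error message this task tracks.
TIMEOUT_PHRASE = "took longer than 60 seconds and timed out"


def split_by_timeout(messages):
    """Split messages into (timeout errors, everything else)."""
    timeouts, others = [], []
    for message in messages:
        if TIMEOUT_PHRASE in message:
            timeouts.append(message)
        else:
            others.append(message)
    return timeouts, others
```

Keeping the two lists separate mirrors the two dashboard views: a rise in the first list is the known post-deployment noise, while anything new in the second deserves a closer look.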

Krinkle mentioned this in T203664: scap timeout checking index.php/api.php mwdebug1001 / mwdebug1002. Feb 6 2019, 2:57 AM

Krinkle merged a task: T203664: scap timeout checking index.php/api.php mwdebug1001 / mwdebug1002.

Krinkle added subscribers: Krinkle, Tgr, Jdforrester-WMF and 2 others.

Krinkle mentioned this in T215368: First request after a MediaWiki sync times out on mwdebug. Feb 6 2019, 2:59 AM

The issue of app servers having their first few requests time out after a deploy naturally also affects the canaries. As a result, the endpoint checks that Scap performs against canaries during a deployment sometimes fail.

@hashar wrote at T203664:

Spotted while promoting all wikis:

13:05:12 Check 'Check endpoints for mwdebug1002.eqiad.wmnet' failed:
  /wiki/{title} (Main Page) timed out before a response was received;
  /wiki/{title} (Special Version) timed out before a response was received;
  /w/api.php (Main Page pageprops) timed out before a response was received

The above report is about mwdebug (which is slower in general due to being a VM), but I've seen it happen on other canaries as well.

Krinkle removed a subtask: T204606: [Spike, 8hrs] Some requests time out after 60 seconds in MobileFrontend transforms. What to do?. Feb 6 2019, 3:04 AM

In T204871#4930433, @Krinkle wrote:

The issue of app servers having their first few requests time out after a deploy naturally also affects the canaries. As a result, the endpoint checks that Scap performs against canaries during a deployment sometimes fail.

I've seen that as well. Just as problematic, it is rather hard to tell incidental slowness due to the deployment itself from breakage actually caused by the new code being deployed.

Krinkle merged a task: T208549: HHVM CPU usage when deploying MediaWiki. Feb 15 2019, 6:21 PM

Krinkle added subscribers: Joe, akosiaris, PeterBowman and 3 others.

zeljkofilipin mentioned this in T215380: Content too big! Entity: Q27972199. Feb 28 2019, 12:25 PM

zeljkofilipin moved this task from Q1 👔 to Watching 📺 on the User-zeljkofilipin board. Mar 7 2019, 11:24 AM

zeljkofilipin moved this task from Watching 📺 to Project ♟ on the User-zeljkofilipin board. Mar 14 2019, 1:38 PM

Krinkle merged a task: T215368: First request after a MediaWiki sync times out on mwdebug. May 19 2019, 10:00 AM

Krinkle renamed this task from Investigate the spikes of "web request took longer than 60 seconds and timed out" during deployments to Investigate the spikes of "web request took longer than 60 seconds and timed out" during deployments on HHVM. May 19 2019, 10:14 AM

greg added a project: Release-Engineering-Team-TODO. Jul 1 2019, 9:28 PM

greg moved this task from Should be empty (use Release-Engineering-Team) to Soon-ish on the Release-Engineering-Team-TODO board. Jul 1 2019, 9:30 PM

greg removed a project: Release-Engineering-Team (Kanban). Jul 1 2019, 9:31 PM

greg changed the task status from Open to Stalled. Jul 6 2019, 4:58 AM

greg lowered the priority of this task from High to Medium.

greg moved this task from Soon-ish to Later / Need volunteer on the Release-Engineering-Team-TODO board.

Not sufficient need right now.

• mmodell changed the subtype of this task from "Task" to "Production Error". Aug 28 2019, 11:08 PM

Krinkle edited projects, added Sustainability (Incident Followup); removed Wikimedia-Incident. Apr 28 2020, 9:50 PM

Krinkle moved this task from Mar 2021 to Untriaged on the Wikimedia-production-error board. Feb 10 2021, 7:25 PM

TerraCodes unsubscribed. Feb 10 2021, 7:34 PM

	F27915872: group1 - excluding timeout errors.png
	Jan 16 2019, 11:36 PM

	F27915870: group1 - just timeout errors.png
	Jan 16 2019, 11:36 PM

	F26103910: Screenshot_20180920_164349.png
	Sep 20 2018, 2:44 PM

	F26072794: Screenshot_20180919_215315.png
	Sep 19 2018, 7:56 PM

	F26072226: Screenshot_20180919_213914.png
	Sep 19 2018, 7:39 PM
