Long-running MediaWiki web requests impact service availability, especially databases
Open, Medium, Public

Assigned To

None

Authored By

	• jcrespo
	Oct 28 2016, 1:27 PM

Description

For the background on this task, see private, very specific task T149076.

MySQL servers have a watchdog that kills the webrequest user's queries after 300 seconds. Sometimes the watchdog fails (because of deployment bugs, T148790). Sometimes, even if it works as intended, multiple long-running queries can affect the availability of services such as MySQL. By the time such queries are running, it is too late to do anything about them: MySQL is already saturated. Multiple long requests should be cut short at the application server level, not (or not only) at the lower levels.
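The watchdog rule described above can be sketched as a filter over the server's process list. This is only an illustration, not the production watchdog: the account name `webrequest_user` is a placeholder, and the row keys assume the `ID`, `USER`, `COMMAND` and `TIME` columns of `information_schema.PROCESSLIST`. Only the 300-second limit comes from this task.

```python
MAX_QUERY_SECONDS = 300  # limit stated in the task description


def queries_to_kill(processlist, user="webrequest_user",
                    max_seconds=MAX_QUERY_SECONDS):
    """Return the connection IDs whose current query exceeded the limit.

    `processlist` is a list of dicts keyed like information_schema.PROCESSLIST
    rows; `user` is a placeholder account name, not the real one.
    """
    return [
        row["ID"]
        for row in processlist
        if row["USER"] == user
        and row["COMMAND"] == "Query"  # skip idle (Sleep) connections
        and row["TIME"] > max_seconds
    ]


def kill_statements(connection_ids):
    """Build one KILL QUERY statement per offending connection."""
    return [f"KILL QUERY {int(cid)}" for cid in connection_ids]
```

A rule like this only reacts after each query has already run for the full limit, which is why several such queries at once can still saturate the server.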

In theory, this should be handled by configuration such as https://gerrit.wikimedia.org/r/#/c/206440/. In reality, MySQL queries (when not killed by the MySQL watchdog) continue for hours (e.g. T148822). The suspicion is that either the above commit is not working or has been reverted, or queries are not fully killed when the MediaWiki request thread itself errors out or is killed, leaving orphan queries (a thread-handling bug or a mysqli bug). Investigate where the issue is, and solve it or work around it somehow.
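One possible workaround for the orphan-query case is for the application to kill its own query explicitly when the request times out, instead of relying on the dying thread to clean up. The sketch below illustrates the idea in Python rather than in MediaWiki's PHP. All names (`run_with_cleanup`, `RequestTimeout`, the `execute` and `kill` callables) are hypothetical; in a real setup `kill` would send `KILL QUERY <connection_id>` over a second connection.

```python
import threading


class RequestTimeout(Exception):
    """Raised when the query outlives the (hypothetical) request timeout."""


def run_with_cleanup(execute, kill, connection_id, timeout_s):
    """Run execute() in a worker thread and wait at most timeout_s seconds.

    If the worker is still running after the timeout, call
    kill(connection_id) so the server-side query does not linger as an
    orphan, then raise RequestTimeout.
    """
    result = {}

    def worker():
        try:
            result["value"] = execute()
        except Exception as exc:  # hand the error back to the caller
            result["error"] = exc

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    thread.join(timeout_s)

    if thread.is_alive():
        kill(connection_id)  # explicit server-side cleanup
        raise RequestTimeout(f"query on connection {connection_id} timed out")
    if "error" in result:
        raise result["error"]
    return result["value"]
```

The point of the design is that the kill is issued by code that is still alive, so it does not depend on how the timed-out thread or the database driver behaves.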

Details

	Subject	Repo	Branch	Lines +/-
	Set hhvm.server.request_timeout_seconds to 60s	operations/puppet	production	+1 -0


Related Objects

		Status	Subtype	Assigned	Task
		Open		None	T149421 Long-running MediaWiki web requests impact service availability, especially databases
					Restricted Task

Event Timeline

• jcrespo created this task. Oct 28 2016, 1:27 PM

Restricted Application added a subscriber: Aklapper. Oct 28 2016, 1:27 PM

• jcrespo merged a task: Restricted Task. Oct 28 2016, 1:28 PM

• jcrespo merged a task: Restricted Task.

• jcrespo added a subscriber: Anomie.

elukey subscribed. Oct 31 2016, 4:32 PM

• jcrespo mentioned this in T149633: db1065 paged for NRPE timeout. Nov 3 2016, 11:48 AM

fgiunchedi triaged this task as Medium priority. Nov 29 2016, 11:43 PM

Change 326144 had a related patch set uploaded (by Mark Bergsma):
Set hhvm.server.request_timeout_seconds to 60s

https://gerrit.wikimedia.org/r/326144

gerritbot added a project: Patch-For-Review. Dec 9 2016, 4:35 PM

• GWicke mentioned this in T97192: HHVM request timeouts not working; support lowering the API request timeout per request. Dec 9 2016, 6:29 PM

• jcrespo mentioned this in T160984: Reduce max execution time of interactive queries or a better detection and killing of bad query patterns. Apr 5 2017, 6:33 PM

Change 326144 abandoned by Mark Bergsma:
Set hhvm.server.request_timeout_seconds to 60s

https://gerrit.wikimedia.org/r/326144

Partially mitigated at T160984#3209072 by setting up a query killer on the database side, but that is far from ideal because:

Queries continue running after the application has abandoned hope for them (driver issue)
Queries do not seem to follow application timeouts, which would be better fixed at the app layer
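One way to make queries follow the application timeout is to derive each query's time limit from the time left in the request. The sketch below is an assumption-laden illustration, not existing MediaWiki code: the function names are hypothetical, the 60-second cap simply reuses the value from the abandoned patch above, and the `SET STATEMENT max_statement_time=... FOR ...` prefix assumes the servers run MariaDB 10.1 or later (MySQL offers a `MAX_EXECUTION_TIME` optimizer hint instead).

```python
import time


class DeadlineExceeded(Exception):
    """The request has no time budget left; do not start another query."""


def query_budget(deadline, now=None, cap=60.0):
    """Seconds a new query may run: time left until `deadline`, capped at `cap`.

    `deadline` and `now` are readings of the same clock (time.monotonic()
    by default).
    """
    if now is None:
        now = time.monotonic()
    remaining = deadline - now
    if remaining < 0.001:
        # A limit that rounds to 0 means "no limit" in MariaDB, so refuse
        # to run the query instead of sending it.
        raise DeadlineExceeded("request deadline already passed")
    return min(cap, remaining)


def with_statement_timeout(sql, seconds):
    """Prefix `sql` with MariaDB's per-statement time limit (in seconds)."""
    return f"SET STATEMENT max_statement_time={seconds:.3f} FOR {sql}"
```

With a limit attached to the statement itself, the server stops the query on its own even if the application thread that issued it has already gone away.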

• jcrespo mentioned this in T183983: Re-institute query killer for the analytics WikiReplica. Mar 21 2018, 5:21 PM

• jcrespo added a subtask: Restricted Task. Mar 23 2018, 8:21 AM

Krinkle added a project: Vuln-DoS. May 24 2018, 9:29 PM

• jcrespo mentioned this in T195792: Add support for setting individual query timeout in wikimedia/rdbms. May 28 2018, 7:20 PM

• jcrespo mentioned this in T204346: PHP-timed out requests also emit LoadBalancer::destruct error "you can't run this command now: COMMIT". Oct 1 2018, 8:07 PM

• Phabricator_maintenance moved this task from Backlog to Acknowledged on the SRE board. Jan 26 2019, 8:53 PM

Krinkle removed a project: HHVM. Oct 3 2019, 3:36 AM

Maintenance_bot removed a project: Patch-For-Review. Oct 3 2019, 4:10 AM

• jcrespo mentioned this in T235572: Compose query for minor edit count. Nov 4 2019, 2:28 PM

Anomie mentioned this in T234450: Special:Contributions requests with a high &limit= caused excessive database load. Dec 2 2019, 4:55 PM

CDanis subscribed. Dec 3 2019, 3:29 PM

Aklapper removed a subscriber: Anomie. Oct 16 2020, 5:02 PM