Long-running MediaWiki web requests impact service availability, especially databases
Open, Medium, Public

Description

For background on this task, see the private, very specific task T149076.

MySQL servers have a watchdog that kills queries from the webrequest user after 300 seconds. Sometimes the watchdog fails (because of deployment bugs, T148790). Sometimes, even if it works as intended, multiple long-running queries can affect the availability of services such as MySQL. By the time the queries are running, it is too late to do anything about them: MySQL is saturated. Long-running requests should be cut short at the application server level, instead of (or in addition to) at the lower levels.

In theory, this should be handled by configuration such as https://gerrit.wikimedia.org/r/#/c/206440/. In reality, MySQL queries (when not killed by the MySQL watchdog) continue for hours (e.g. T148822). The suspicion is that either the above commit is not working or has been reverted, or queries are not fully killed when the MediaWiki request thread itself errors out or is killed, leaving orphan queries (a thread-handling bug or a mysqli bug). Investigate where the issue is, and solve it or work around it somehow.

Event Timeline

jcrespo merged a task: Restricted Task.
jcrespo added a subscriber: Anomie.

Change 326144 had a related patch set uploaded (by Mark Bergsma):
Set hhvm.server.request_timeout_seconds to 60s

https://gerrit.wikimedia.org/r/326144

Change 326144 abandoned by Mark Bergsma:
Set hhvm.server.request_timeout_seconds to 60s

https://gerrit.wikimedia.org/r/326144

Partially mitigated at T160984#3209072 by setting up a query killer on the database side, but that is far from ideal because:

  • Queries continue running after the application has abandoned hope for them (driver issue)
  • Queries do not seem to follow application timeouts, and this would be better fixed at the app layer
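
To illustrate what "fixed at the app layer" could look like, here is a minimal sketch of an application-side deadline that actively cancels the query when it expires, instead of abandoning the request and leaving the query orphaned. It is not MediaWiki or mysqli code: it uses Python's built-in sqlite3 module purely as a stand-in so the example runs anywhere, and `run_with_deadline` and `SLOW_QUERY` are hypothetical names. With MySQL the equivalent cancel step would presumably be a `KILL QUERY` issued for the request's connection from a second connection; that mapping is an assumption and is not shown.

```python
import sqlite3
import threading

# Hypothetical never-ending query, used only to simulate a long-running request.
SLOW_QUERY = (
    "WITH RECURSIVE c(x) AS (SELECT 1 UNION ALL SELECT x + 1 FROM c) "
    "SELECT count(*) FROM c"
)


def run_with_deadline(conn, sql, timeout_s):
    """Run sql on conn; if it exceeds timeout_s, cancel it in the database.

    The point is that the timeout does not just give up on the result:
    it tells the database to stop working on the query.
    """
    timed_out = threading.Event()

    def cancel():
        # Runs in the timer thread once the deadline has passed.
        timed_out.set()
        conn.interrupt()  # aborts the statement currently executing on conn

    timer = threading.Timer(timeout_s, cancel)
    timer.start()
    try:
        return conn.execute(sql).fetchall()
    except sqlite3.OperationalError:
        if timed_out.is_set():
            raise TimeoutError(f"query cancelled after {timeout_s}s") from None
        raise  # unrelated database error: do not mask it as a timeout
    finally:
        timer.cancel()  # no-op if the timer has already fired


if __name__ == "__main__":
    # check_same_thread=False because the cancel comes from the timer thread.
    conn = sqlite3.connect(":memory:", check_same_thread=False)
    print(run_with_deadline(conn, "SELECT 1", 1.0))
    try:
        run_with_deadline(conn, SLOW_QUERY, 0.2)
    except TimeoutError as exc:
        print(exc)
```

The design choice being illustrated is that the deadline owner (the application) is also the one that cancels the work, so the database never keeps running a query nobody is waiting for; a database-side query killer then only acts as a backstop.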