Page MenuHomePhabricator

phabricator close to saturate its database connections
Closed, ResolvedPublic

Description

There are right now around 400-500 open queries running (some for over 9 hours) on phabricator.

| 92631719 | phuser               | 10.64.0.198:41574 | phabricator_project      | Query       |   33663 | statistics                                                                  | SELECT `p`.* FROM `project` p JOIN `project_datasourcetoken` `token_0` ON `token_0`.projectID = p.id |    0.000 |
| 92631746 | phuser               | 10.64.0.198:41628 | phabricator_project      | Query       |   33663 | statistics                                                                  | SELECT `p`.* FROM `project` p JOIN `project_datasourcetoken` `token_0` ON `token_0`.projectID = p.id |    0.000 |
| 92631808 | phuser               | 10.64.0.198:41752 | phabricator_project      | Query       |   33661 | statistics                                                                  | SELECT `p`.* FROM `project` p JOIN `project_datasourcetoken` `token_0` ON `token_0`.projectID = p.id |    0.000 |
| 92631879 | phuser               | 10.64.0.198:41894 | phabricator_project      | Query       |   33657 | statistics                                                                  | SELECT `p`.* FROM `project` p JOIN `project_datasourcetoken` `token_0` ON `token_0`.projectID = p.id |    0.000 |
| 92637367 | phuser               | 10.64.0.198:52812 | phabricator_project      | Query       |   33483 | statistics                                                                  | SELECT `p`.* FROM `project` p JOIN `project_datasourcetoken` `token_0` ON `token_0`.projectID = p.id |    0.000 |

This is not an unbreak now right now yet, but it will likely become one once the connections grow a little more, enough for phabricator to stop responding user requests.

This is the full query:
{P4463}

The query has started being run since 7:14 AM UTC today aprox.

Event Timeline

jcrespo created this task.Nov 17 2016, 4:53 PM
Restricted Application added subscribers: TerraCodes, Aklapper. · View Herald TranscriptNov 17 2016, 4:53 PM
jcrespo renamed this task from pbraicator close to saturate its database connections to phabiicator close to saturate its database connections .Nov 17 2016, 4:53 PM
Aklapper renamed this task from phabiicator close to saturate its database connections to phabricator close to saturate its database connections .Nov 17 2016, 5:00 PM
mmodell triaged this task as High priority.Nov 17 2016, 5:03 PM

I am going to kill them now unless someone believes that 400 simultaneous 9 hours queries are indeed vital, but I assume they will reappear.

Mentioned in SAL (#wikimedia-operations) [2016-11-17T17:11:14Z] <twentyafterfour> restart apache2 on iridium to clear lagged queries refs T150965

This seems to be related to fulltext search, I'm still investigating further.

https://grafana.wikimedia.org/dashboard/db/mysql?var-dc=eqiad%20prometheus%2Fops&var-server=db1043&from=now-24h&to=now

This graph is interesting, despite having less QPS (something typical on upgrades, less or more activity is normal), the query latency has become very irregular.

As I may disconnect soon, I've left a screen on db1043 killing queries running for over 600 seconds called 878.kill-long-queries as a mitigation strategy. Any op can kill that process in case this interferes with any fix or it is no longer needed and we need to test the fix.

Mentioned in SAL (#wikimedia-operations) [2016-11-17T17:40:17Z] <twentyafterfour> phabricator: deploying hotfix for T150965

ok so I just deployed a hotfix which should limit those token_* queries to no more than 5 tokens and I filtered short words which could have a lot of matches. That should drastically reduce the size of the joins for that specific query.

@jcrespo: The hotfix should be live, assuming it works (I'm pretty sure it will) then your mitigation script may not be needed.

@jcrespo: Thanks for catching this, I wouldn't have noticed until everything died.

Paladox added a comment.EditedNov 17 2016, 5:50 PM

@mmodell I wonder what changed since the last phabricator update to cause this problem?

jcrespo lowered the priority of this task from High to Low.Nov 17 2016, 5:54 PM

I have changed the script to print rather than to kill, will close this tomorrow and stop the script if I get no other instance of a long running query.

@mmodell you may want to have a look at https://tendril.wikimedia.org/report/slow_queries?host=db1043&user=&schema=&qmode=eq&query=&hours=12 for other slow queries that may be new (with lower priority).

jcrespo moved this task from Triage to In progress on the DBA board.Nov 17 2016, 6:29 PM
mmodell closed this task as Resolved.Nov 17 2016, 10:23 PM
mmodell claimed this task.