
Phabricator creates MySQL connection spikes: Attempt to connect to phuser@m3-master.eqiad.wmnet failed with error #1040: Too many connections.
Closed, Resolved (Public)

Assigned To
Authored By
jcrespo
Aug 17 2015, 11:57 AM
Referenced Files
F3068650: phab_again.png
Dec 11 2015, 12:30 PM
F2716292: Screenshot from 2015-10-13 08:10:01.png
Oct 13 2015, 12:16 PM
F2568130: phabricator6.png
Sep 11 2015, 9:27 AM
F2530686: phabricator4.png
Sep 2 2015, 2:16 PM
F2486928: foo.txt
Aug 29 2015, 3:20 PM
F2486374: phabricator.png
Aug 29 2015, 1:49 PM
F2463083: production.png
Aug 27 2015, 2:04 PM
F2463305: phabricator2.png
Aug 27 2015, 2:04 PM
Tokens
"Manufacturing Defect?" token, awarded by Joe.

Description

db1043:

phabricator.png (310×1 px, 22 KB)

The first spike happened at the same time that the server ran out of space, so I didn't give it any importance. But it happened a second time. According to a user, it only affects one HTTP request and it works again after reloading.

This is not a problem for MySQL itself (Phabricator has a dedicated server, so other services are not affected), and it is probably not high priority because it only happens during spikes and Phabricator, I think, is not a critical service needing 100% uptime.

But it would be nice to understand the origin (connections reaching max_connections, which is 5000) and whether there is a problem at the MySQL level, whether we need to tweak something on the connection side, or whether the proxy monitoring the database should be configured to not perform a failover unless the problem continues for over X seconds. I do not have detailed enough monitoring to measure this at the DB level, but I can activate it.

It is now happening regularly.
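(For reference, a minimal way to watch how close the server gets to the limit during a spike, assuming access to the m3-master MariaDB prompt, would be something like the following; the 5000 figure above is max_connections:)

-- current vs. maximum allowed connections
SHOW GLOBAL STATUS LIKE 'Threads_connected';
SHOW GLOBAL STATUS LIKE 'Max_used_connections';
SHOW GLOBAL VARIABLES LIKE 'max_connections';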

Related Objects

Event Timeline


So I've spent a bit of time digging into this issue. I was thinking that it's kind of odd that iridium (phabricator) is consistently receiving 4x the network traffic that it's sending; however, maybe that's explained by traffic from mysql to phabricator exceeding the traffic from phabricator to web clients? Combine that with the traffic from gerrit to phabricator (repeatedly fetching git repos) and it might actually make sense.

T110710 should help with search engine indexers hitting expensive URLs but I still suspect there is more to it than just regular traffic overloading phabricator - the spikes happened suddenly and unexpectedly, I just don't think that the use of phabricator has gone up that much. In fact, phabricator just isn't at all overloaded or stressed by the current traffic levels. It's barely doing anything, compared to what the machine is capable of. The fact that it's stressing mysql is really troubling.

One possible cause for spikes, that's very much worth mentioning here, is the sprint extension (and to a lesser extent, phragile) ... Hitting the burndown page for a large project like StructuredDiscussions will churn away at mysql queries for the full duration of the web request timeout (T107197). Search indexers hitting those URLs are a likely culprit for some of the spikes.

For all of the gory details, see T107197: Sprint extension doesn't scale to thousands of tasks in a single sprint: burndown page exceeds max execution timeout on visual editor project

This is the process list from one of those peaks:

Most connections are idling. I've set the timeout to 60 seconds (in a non-persistent way) as a patch until a real solution is found.
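(For context, a non-persistent change of this kind is typically done with SET GLOBAL, which only affects new connections and is lost on a server restart; the exact variable used here is an assumption:)

-- assumption: the 60-second idle timeout was most likely applied like this
SET GLOBAL wait_timeout = 60;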

The Phabricator daemon log (viewable by running /srv/phab/phabricator/bin/phd log on iridium) is full of errors. Is there any causal relationship between these errors and the database issue? Maybe not directly. But they are not irrelevant. Having a system spew errors all over the place makes it unreasonably hard to debug and the source of bad behavior difficult to isolate. Please fix whatever it is that is causing the daemons to fail constantly.

Possibly interesting datapoint: each of phabricator_policy, phabricator_project, phabricator_config, and phabricator_spaces has exactly 213 sleeping connections. phabricator_search has 421, which is five connections shy of being twice 213.
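(A sketch of how such a per-schema count of idle connections can be obtained from the standard processlist table; the schema names are the ones quoted above:)

-- count sleeping (idle) connections per schema
SELECT db, COUNT(*) AS sleeping
  FROM information_schema.processlist
 WHERE command = 'Sleep'
 GROUP BY db
 ORDER BY sleeping DESC;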

This new setting has probably aborted a new connection spike by killing 3000 idle connections at 6:23. Any bad consequences for the application are unknown yet, so I am waiting to see those before making the change persistent or lowering the timeout further.

From the process list that @jcrespo posted, it looks to me like several processes are running somewhat slow queries, with each process holding 6 idle connections to mysql in the meantime. Specifically, this query appears multiple times, each time with 6 null connections preceding it:

| 3813343808 | phuser          | 10.64.32.150:46677 | phabricator_maniphest  | Query       |        5 | Sorting result                                                        | SELECT `task`.*  FROM `maniphest_task` task  WHERE ((`task`.`priority` < 50) OR ((`task`.`priority`

This query, which seems to be selecting and sorting a relatively large result set, might be the culprit.
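(A hedged sketch of how to spot such long-running queries without dumping the whole processlist; the 5-second threshold is arbitrary and the query text is truncated to avoid leaking data into a public ticket:)

-- list queries that have been running for more than 5 seconds
SELECT id, user, db, time, state, LEFT(info, 100) AS query_head
  FROM information_schema.processlist
 WHERE command = 'Query' AND time > 5
 ORDER BY time DESC;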

this query appears multiple times

I deliberately did not include the full query, as this is a public ticket. I can do that on request (for newer queries).

@jcrespo: Can you email me the full query or create a private task to post it? I'd like to try to figure out where that query is coming from and hopefully determine if it is indeed the cause of the spikes.

My email is mmodell at wikimedia

jcrespo closed subtask Restricted Task as Resolved.Sep 2 2015, 10:35 AM

closed blocking task Restricted Task as "Resolved".

If it's resolved, please make it public. Thanks.

The content of the ticket is considered private data. The status of the ticket does not impact whether the data is public or private.

John is right. @Nemo_bis, that ticket does not provide any more "full disclosure" information about the issue. I created a separate ticket with just the queries and no other extra information (everything else should be shared here). As a DBA, most of the (few) non-public tickets I create are private only for privacy reasons (they contain data about user activity, which I swore to protect). The only thing I added, for context, is the following picture, which I am happy to share here too :-) :

phabricator4.png (313×834 px, 28 KB)

BTW, I do not think this issue is fully resolved; it is only getting better because of the configuration change, and that is merely a temporary patch. Or maybe someone else has done more configuration tweaking? For some reason, Phabricator only under-performs during weekends.

For future reference, if it's just data (like queries, or logs, etc) and not an actual task that needs to be private, you can make a new pastebin and adjust its view permissions. That avoids the uncomfortable position of having to leave a task private forever just because someone pasted an unredacted IP or somesuch :)

I think the configuration change is a sensible one, at least for now, until we find a better solution.

Let me do 2 things: solidify the configuration change in puppet, and leave pt-table-checksum running over the weekend analyzing slow queries, since the weekend seems to be when those patterns appear.

Change 235685 had a related patch set uploaded (by Jcrespo):
Set max idle connection timeout to 60 seconds on Phabricator

https://gerrit.wikimedia.org/r/235685

Change 235685 merged by Jcrespo:
Set max idle connection timeout to 60 seconds on Phabricator

https://gerrit.wikimedia.org/r/235685

Since 2-3 September, the connection spikes are gone. So I had to ask: who touched something? :-P Or was it an externally-created problem?

My commit should have minimized the effects of the spikes, but now it is not even needed, because the query patterns have gone back to normal activity. Similarly, I do not expect anything unusual from the query profiling.

We have a ticket at T112135 asking about phabricator dumps being disabled since Aug 28. Looking at that, I saw the cron job for them has been disabled. This happened in https://gerrit.wikimedia.org/r/#/c/234563/

I have 2 questions:

  • how likely is this related to the spikes being gone (Aug 28 != Sep 2-3, but close?)
  • can we re-enable that cron job to get the dumps going again?

When we had issues before, I noticed heavy search engine indexer activity. I committed some changes to the robots.txt to exclude a lot of URL patterns that really shouldn't be indexed by search engines. Other than that, I haven't touched phabricator code and I haven't pulled upstream changes since 3 weeks ago.

@Dzahn, very unlikely, but still, dumps should be done on the slave (they create more overhead than statistics). We can talk about that on that commit.

The spikes are actually back since the 8th, but they are not resurfacing for users because of the patch I mentioned. I would wait for this weekend again to see if it gets worse, then close the ticket.

Hash searches are not something to worry about by themselves, but they are an indicator of how many reads are happening inside the engine:

phabricator6.png (313×1 px, 58 KB)

I re-enabled the dump script for T112135 (follow-up to make it use the slave at T112193).

I did _not_ enable 2 other disabled crons called "bz_header_update" and "bz_comments_update".

also see T112135#1628464

If we also want to re-enable those that would be https://gerrit.wikimedia.org/r/#/c/237513/

I've restarted profiling on the host.

jcrespo reopened subtask Restricted Task as Open.Sep 14 2015, 3:37 PM

I've reopened T110913 to include the profiling I did on phabricator during the weekend. Some scary things there (in terms of performance).

Doesn't look too scary to me, can you elaborate?

Maybe inserting all the commits into the 16GB blob table is one of the reasons for the slowdowns (pure speculation).
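(To help quantify that speculation, the blob tables can be sized from information_schema; the assumption here is that the file blobs live in the phabricator_file schema, as on a stock Phabricator install:)

-- approximate on-disk size of the file storage tables, in GB
SELECT table_name,
       ROUND((data_length + index_length) / 1024 / 1024 / 1024, 1) AS size_gb,
       table_rows
  FROM information_schema.tables
 WHERE table_schema = 'phabricator_file'
 ORDER BY size_gb DESC;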

@jcrespo: I see. I don't have a very good understanding of MySQL blob performance. I would have assumed that it handles large blobs fairly well.

I believe that Phabricator has some other storage backends which we are not taking advantage of, mainly because upstream's primary backend is Amazon S3.

I would have assumed that it handles large blobs fairly well.

It generally does. I wonder if large data movement can cause stalls because of the size. Didn't you (or someone) change the number of repos used recently?

There is quite a lot of activity on repositories but I didn't think the volume had changed very much in the past several weeks.

There has been a lot of new temp file activity, it seems phabricator has started storing temporary data in the MySQL file storage.

We should probably look into using an alternative storage back-end. Do we have a general file storage service that we could utilize by writing a custom back-end (or modifying the s3 back-end, perhaps?)...

Just for historical perspective: when we first implemented this, we knew that MySQL was a stopgap store (for how long we didn't know), and decided to follow in the steps of Bugzilla and RT and "see how it goes".

But from upstream:

https://secure.phabricator.com/book/phabricator/article/configuring_file_storage/

Engine      Setup      Cost    Notes
MySQL       Automatic  Free    May not scale well.
Local Disk  Easy       Free    Does not scale well.
Amazon S3   Easy       Cheap   Scales well.
Custom      Hard       Varies  Implement a custom storage engine.

So we use MySQL and set:

modules/phabricator/data/fixed_settings.yaml:storage.mysql-engine.max-size: 10000000

Not meaningful commentary, other than it may be worth asking upstream why they consider MySQL to be not very scalable for this in general.

Well, we have the closest thing to amazon s3, which is Swift...

I do not think MySQL is a great place to store large files. Nor any relational database in general.

So we could potentially have Phabricator store its files in Swift?

  1. Yes, but not sure how much work
  2. There are some security concerns I would think as we don't store anything sensitive in Swift now

Sure seems worth exploring, though, if we are reaching sanity limits on shoving blobs into MySQL. The way Phabricator seems to use the file application makes me think that the excess of repos has some relevance here and we may be in level-up territory.

@chasemp: upstream phabricator is almost certainly storing things in s3.

  1. Swift should be similar enough to s3 to make it an easy integration.
  2. This seems like a problem. The file storage back-end of phabricator certainly needs to be trustworthy for sensitive information.

I think there is some difference in API between Swift and S3, as in Swift has a subset. Paging @fgiunchedi, who should know offhand.

jcrespo closed subtask Restricted Task as Resolved.Sep 15 2015, 7:14 AM

My original complaint, which is the service loss due to the spikes, is solved by the MySQL configuration and other changes done on the application side. I would close this as resolved, as there is no more user impact and there are no more MySQL connection spikes.

The origin of this may not yet be 100% clear, and it may cause future scalability issues, suspected to be related to repo importing/file handling. The actionables for that should probably go in separate ticket(s), once properly discussed. Of course, we can discuss it here; I just want to separate the issues and their priorities.

Re-enabled the 2 Bugzilla-related crons:

Notice: /Stage[main]/Phabricator::Tools/Cron[bz_comment_update]/ensure: created
Notice: /Stage[main]/Phabricator::Tools/Cron[bz_header_update]/ensure: created

Closing per @jcrespo's comment above.

jcrespo renamed this task from Phabricator creates MySQL connection spikes to Phabricator creates MySQL connection spikes: Attempt to connect to phuser@m3-master.eqiad.wmnet failed with error #1040: Too many connections..Sep 29 2015, 6:51 AM

I've reduced several MySQL connection and InnoDB timeouts, hoping to avoid large numbers of idle connections:

MariaDB MISC m3 localhost (none) > SET GLOBAL innodb_lock_wait_timeout = 30;
MariaDB MISC m3 localhost (none) > SET GLOBAL interactive_timeout = 30;
MariaDB MISC m3 localhost (none) > SET GLOBAL wait_timeout = 30;
MariaDB MISC m3 localhost (none) > SET GLOBAL thread_pool_idle_timeout = 30;
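(Note that these SET GLOBAL values apply only to new connections and do not survive a server restart, which is why the earlier change was also solidified in puppet; they can be double-checked with something like:)

-- verify the reduced timeouts are in effect
SHOW GLOBAL VARIABLES LIKE '%timeout%';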

I have also disabled all cron jobs, as their timing matches the reports (and they produced the only high activity at that time). I will block any attempt to re-enable them on the master while we have an idle slave.

Having said that, I repeat what I said before: our Phabricator setup is in a bad state (probably not Phabricator in general, just our configuration, plugins, etc.) and MySQL changes can only hide that to an extent, not fix it.

I still don't have any clue to the cause of the idle connections. But I definitely agree that cron jobs shouldn't be running against the master.

I'm going to change the dump cron to the slave and leave the other crons disabled.

Has anyone still seen this lately?

I rewrote the dump to use only the slave, and we fixed up some robots.txt settings that had gone missing. We basically have not seen this issue since then.

The underlying issue might still be present, but I don't have any way to verify that. It seems that something is leaking open db connections rather than closing them.

Or perhaps it's just a matter of slow queries piling up, each holding a ton of open connections which add up to shitloads.
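(A hedged way to tell a connection leak apart from a pile-up of slow queries would be to group the open connections by client host and user; this is just a sketch:)

-- open connections per client host and user
SELECT user, SUBSTRING_INDEX(host, ':', 1) AS client, COUNT(*) AS conns
  FROM information_schema.processlist
 GROUP BY user, client
 ORDER BY conns DESC;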

@jcrespo: Where are you obtaining the graphs you've posted above? I can't find the metrics in graphite, is it something only DBAs have access to?

@mmodell The graphs are taken from MySQL monitoring "tendril", in particular the m3-master:

https://tendril.wikimedia.org/host/view/db1043.eqiad.wmnet/3306

All LDAP users should have access to it.
(We may integrate this with graphite in the future, but there are not enough resources right now. The reason it is not public is that it contains sensitive information, like the queries executed.)

Screenshot from 2015-10-13 08:10:01.png (637×656 px, 55 KB)

As you can see, there are no longer large spikes, mostly because idle connections are killed sooner now. However, we still have small spikes of idle connections (around 400 per hour, versus 4500 in 5 minutes in the original report).

All LDAP users in the ops/nda/wmf groups, yeah.

Thanks. The graphs look pretty reasonable. Although I think there may still be some less-than-ideal behavior in Phabricator, I am not at all equipped to track down the cause. The symptoms are under control, so I'm closing this task yet again.

I made T112776: Implement phabricator database clustering support for tracking the higher-level issues with regard to phabricator scalability.

This is happening again today, according to IRC comments by Quim and Nemo.

jcrespo reopened this task as Open.EditedDec 11 2015, 12:30 PM

Something is creating again 1500 simultaneous connections to mysql:

phab_again.png (468×2 px, 60 KB)

Seems to be a process happening at around the same time for the last 3 days.

I've set SET GLOBAL wait_timeout = 10; to kill idle connections faster, but again, that is a patch, not a fix for a potential underlying issue.

This has not happened during the weekend; further issues should be centralized on T112776.