
Intermittent DB connectivity problem on phabricator, needs investigation
Closed, Declined · Public

Description

Today @Aklapper reported that Phabricator was intermittently unavailable.

I confirmed the problem.

I am getting all sorts of errors, ranging from:

Unable to establish a connection to any database host (while trying "phabricator_user"). All masters and replicas are completely unreachable.

to:

Unable to Reach Any Database

Event Timeline

Paladox triaged this task as Unbreak Now! priority. Apr 20 2017, 9:14 PM
Paladox created this task.
Paladox added projects: SRE, DBA, Phabricator.
Aklapper lowered the priority of this task from Unbreak Now! to High. Apr 20 2017, 9:49 PM

This is intermittent, so I don't see why this should be Unbreak Now.

Aklapper renamed this task from Intermitten outage on phabricator, needs investigation to Intermittent DB connectivity problem on phabricator, needs investigation. Apr 20 2017, 9:49 PM

I believe this is caused by persistent and rapid crawling by a robot with IP 90.231.10.86.

I've enabled rate limiting in Phabricator and @Joe enabled tw_reuse on the SQL proxy. These measures seem to have improved the situation, but there are still spikes in the graphs.

I also submitted a patch to add the offending IP to Phabricator's ban list (rOPUP4595ab446789: Add 90.231.10.86 to phabbanlist, this crawler is causing outages.); however, it will need an opsen to merge it.

I don't think this needs DBA attention.

@mmodell we should try to improve how Phabricator connects to MySQL by reducing how many connections it opens. We should also try to make improvements to Phabricator so that it takes more to DDoS it.

I've enabled rate limiting in Phabricator and @Joe enabled tw_reuse on the SQL proxy.

Pretty sure that was @faidon, right? :)


duh, you're right! Apologies, I don't really think they are the same person 😁

@mmodell we should try to improve how Phabricator connects to MySQL by reducing how many connections it opens. We should also try to make improvements to Phabricator so that it takes more to DDoS it.

I enabled rate limiting; that should help with being DoSed by aggressive spiders. I'm not sure what else can be done, really.
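
For illustration only (this is not Phabricator's actual implementation), per-client throttling of the kind described here usually amounts to counting recent requests per IP and refusing clients that exceed a threshold. A minimal sketch, with made-up limits:

# Minimal fixed-window per-IP throttle, for illustration only.
# The window length and limit are hypothetical example values,
# not Phabricator's real settings.
import time
from collections import defaultdict

WINDOW_SECONDS = 60
MAX_REQUESTS_PER_WINDOW = 300  # hypothetical threshold

_counters = defaultdict(lambda: [0, 0.0])  # ip -> [count, window_start]

def should_throttle(ip, now=None):
    """Return True if this client has exceeded the per-window limit."""
    now = time.time() if now is None else now
    count, window_start = _counters[ip]
    if now - window_start >= WINDOW_SECONDS:
        _counters[ip] = [1, now]  # start a fresh window for this client
        return False
    _counters[ip][0] = count + 1
    return count + 1 > MAX_REQUESTS_PER_WINDOW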

mmodell claimed this task.

Conclusion:

  • Phabricator can't handle the load imposed by aggressive crawlers:
    • The errors coincide with spikes in network load, and the increased traffic was all coming from one IP.
    • I believe that the db connection errors are a symptom of exhausting the connection limit.
    • Phabricator's primitive rate limiting was somewhat effective at throttling the misbehaving crawler and (at least mostly) eliminated connection errors during subsequent spikes.

I don't disagree that these two events correlate and there is probably a causative link between the two. However, this resolution is not enough. This is not a root-cause analysis; it is a very superficial way to address an issue (and one that's been causing outages, no less).

Questions that remain unanswered: why do we get "Can't connect to MySQL" errors and not "too many connections"? Are we hitting a connection limit on the dbproxy (HAProxy) or the database (MariaDB)? Are we hitting an intended limit (e.g. haproxy's maxconn) or an accidental one somewhere in the stack (e.g. running out of ephemeral ports)? If it's the former, are those limits sensible, or could we bump them further, and at what cost? If it's the latter, what can we do to fix that? In general, is there something we can do on either the infrastructure side or the Phabricator side to alleviate those issues?
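
One way to start answering the ephemeral-port question is to watch how many sockets sit in TIME_WAIT on the proxy and database hosts. A rough sketch, assuming a Linux host and the standard /proc/net/tcp format (not a substitute for proper monitoring):

# Count sockets in TIME_WAIT (state 06 in /proc/net/tcp) as a quick
# indicator of ephemeral-port pressure. IPv6 sockets live in
# /proc/net/tcp6 and would need the same treatment.
TIME_WAIT_STATE = "06"

def count_time_wait(path="/proc/net/tcp"):
    count = 0
    with open(path) as f:
        next(f)  # skip the header line
        for line in f:
            fields = line.split()
            if len(fields) > 3 and fields[3] == TIME_WAIT_STATE:
                count += 1
    return count

if __name__ == "__main__":
    print("sockets in TIME_WAIT:", count_time_wait())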

@faidon I will do whatever I can to help debug it on the Phabricator side; however, I will need help from ops and/or DBAs in order to do much more than I have already.

Indeed it may be a haproxy issue, or something else entirely. I believe that in the past we were able to have a lot more open database connections than the number that caused issues this time (~250? that number seems really low to me). It might be worth looking into what has changed in the haproxy / mysql configuration recently.

why do we get "Can't connect to MySQL" errors and not "too many connections"?

One theory: Phabricator has built-in database health monitoring; when it detects a database outage, it will refrain from making connections to MySQL for ~5 seconds in order to avoid hammering the server with connection attempts during an outage. Maybe during that time it just outputs a generic "Can't connect to MySQL" error?

I will have to dig a little to see whether this is actually the explanation; I may be missing something.
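
To illustrate the theory above (this is not Phabricator's actual code), a health check of this kind typically behaves like a small circuit breaker: after one observed connection failure, further attempts are skipped for a short hold-off window and every caller gets the same generic error:

# Illustrative circuit-breaker-style hold-off, assuming the ~5 second
# window described above. Not Phabricator's actual implementation.
import time

HOLD_OFF_SECONDS = 5.0
_last_failure = None  # timestamp of the most recent connection failure

def connect_with_holdoff(connect):
    """Call connect(); if a connection recently failed, fail fast instead."""
    global _last_failure
    now = time.time()
    if _last_failure is not None and now - _last_failure < HOLD_OFF_SECONDS:
        # During the hold-off window callers see a generic message,
        # regardless of what the original failure actually was.
        raise RuntimeError("Can't connect to MySQL")
    try:
        return connect()
    except OSError:
        _last_failure = now
        raise RuntimeError("Can't connect to MySQL")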

I don't remember if we set this up previously, but Phabricator supports a persistent flag to enable persistent connections. It is documented in this secret corner:

https://secure.phabricator.com/book/phabricator/article/cluster_partitioning/#advanced-configuration

You do not need to enable "real" database clustering to enable persistent connections: you can configure cluster.databases with just one host.
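
As a rough sketch of what that single-host configuration might look like (the host below is a placeholder, and the exact key names, in particular the persistent flag, should be checked against the documentation linked above):

# Sketch of a one-host cluster.databases value, rendered as JSON.
# Key names (especially "persistent") are assumptions to verify against
# the cluster_partitioning documentation; values are placeholders.
import json

cluster_databases = [
    {
        "host": "db.example.internal",  # placeholder hostname
        "port": 3306,
        "role": "master",
        "persistent": True,             # the flag discussed above
    }
]

print(json.dumps(cluster_databases, indent=2))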

This is off by default because it tends to make all installs use "a medium amount" of connections: small installs go up from "a small amount" to "a medium amount", while large installs drop from "a large amount" to "a medium amount". This was confusing for small installs and hard to guide them through, so it's currently relegated to secret advanced options.

The primary impact of this option is to eliminate exhaustion of inbound ports on the database host. After a connection is closed, the port normally can't be reused for something like ~10-15 seconds depending on system configuration, and active installs can sometimes create and close 65K connections within that window with default configuration.
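
A quick back-of-the-envelope calculation with the numbers above shows the connection churn needed to hit that wall:

# Back-of-the-envelope: at what sustained rate of new connections does
# a host run out of source ports, given the ~10-15 second reuse window
# mentioned above? (~65K is the theoretical port ceiling; the default
# Linux ephemeral range is smaller, so pressure starts earlier.)
PORTS = 65_000
for reuse_window in (10, 15):
    rate = PORTS / reuse_window
    print(f"~{rate:,.0f} new connections/second exhausts ports "
          f"with a {reuse_window}s reuse window")
# Persistent connections remove most of this churn, since connections
# are reused rather than reopened for every request.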


This upstream task discusses overlaying connections:

https://secure.phabricator.com/T11908

We currently establish one connection per logical database, but do not need to unless the logical databases are physically partitioned. That change is ready to move forward; it's just a lot of very manual gruntwork for which we haven't really hit a compelling use case yet. Were that completed, the number of connections we hold open simultaneously should drop significantly (maybe ~10x).

I wouldn't expect this to help with anything unless you're hitting "too many connections".


Phabricator currently supports vertical partitioning (put Maniphest's database on one physical database machine, Diffusion on a different one, etc.), but I think it's huge overkill here, probably not the best approach, and likely to cause more headaches than it fixes.

This upstream task also discusses sending read traffic to replicas during normal conditions:

https://secure.phabricator.com/T11056

The path there is probably not exceptionally long but I suspect this is also overkill.


I think Phabricator does not currently distinguish between "too many connections" and other types of service failure, other than credential failure, configuration failure, and setup failure. Although this is probably correct from a traffic routing perspective, it might be helpful for us to better surface the specific connection errors we encounter.
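
As an illustration of that last point (not Phabricator's actual error handling), the distinction is visible in the standard MySQL error numbers: 1040 is ER_CON_COUNT_ERROR ("Too many connections"), while 2002 and 2003 are the client-side "can't connect" errors. Surfacing them separately might look roughly like this:

# Classify MySQL connection failures by errno, so "too many connections"
# is surfaced separately from "can't reach the server at all". The error
# numbers are standard MySQL codes; the rest is an illustrative sketch.
TOO_MANY_CONNECTIONS = {1040}        # ER_CON_COUNT_ERROR
CANNOT_REACH_SERVER = {2002, 2003}   # CR_CONNECTION_ERROR, CR_CONN_HOST_ERROR

def classify_mysql_error(errno):
    if errno in TOO_MANY_CONNECTIONS:
        return "connection limit exhausted (raise limits or reduce churn)"
    if errno in CANNOT_REACH_SERVER:
        return "server unreachable (network, proxy, or port exhaustion)"
    return "other failure"

assert classify_mysql_error(1040).startswith("connection limit")
assert classify_mysql_error(2003).startswith("server unreachable")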


Finally, my general expectation is that Phabricator usually runs out of web CPU looooooong before it runs out of database resources if nothing is set to weird values. Recently, we had a production situation where an ambitious user completely exhausted CPU on four web hosts but only drove about 20% CPU load on the backing database shard. Obviously, this isn't true of all workloads (like the search workloads we saw earlier) but I'd expect it to normally be true of "scraper / penetration test tool / crawler" workloads which just crawl around loading pages.

@epriestley:

Thanks for the very helpful and detailed response. I'd like to hear what @faidon thinks about all of that before I chime in too much, however, I will respond to one point:

Finally, my general expectation is that Phabricator usually runs out of web CPU looooooong before it runs out of database resources if nothing is set to weird values. Recently, we had a production situation where an ambitious user completely exhausted CPU on four web hosts but only drove about 20% CPU load on the backing database shard. Obviously, this isn't true of all workloads (like the search workloads we saw earlier) but I'd expect it to normally be true of "scraper / penetration test tool / crawler" workloads which just crawl around loading pages.

We have a fairly powerful web server here (16 cores, 64 GB of RAM, physical hardware, not virtualized), but in this case the connections were exhausted by one user on a home connection. Granted, it is a home connection in Sweden, and I hear those are a lot faster than the average in the US... but still, something is obviously a little strange. The port exhaustion sounds like it might be a likely culprit, but I don't have enough visibility into the database layer to really be able to say.

(Has this happened lately? I'm not aware, so maybe this is lower priority now for us?)

From a DB server point of view, we suffered a small issue with the slave's (db1048) BBU again a few days ago (T160731#3246659), but that shouldn't have affected anything, as it is only used for some reports. It was basically some minor lag for a couple of minutes.

I haven't seen any signs of this lately but it's possible that we just haven't been hit by the perfect storm of simultaneous crawlers. From memory it seems to me that this happens about 2-3 times every year when we get an especially large spike in connections.

mmodell lowered the priority of this task from High to Medium. May 24 2017, 7:39 PM
mmodell moved this task from To Triage to Misc on the Phabricator board.

This is ancient history.