Page MenuHomePhabricator

Request to increase active connection quota for user s51187 on analytics.db.svc.eqiad.wmflabs
Closed, ResolvedPublic

Description

User: s51187

Today we briefly switched the old Edit Counter to redirect to the new one, and almost immediately got the error:

An exception occured in driver: SQLSTATE[42000] [1226] User 's51187' has exceeded the 'max_user_connections' resource (current value: 10)

This is using enwiki.analytics.db.svc.eqiad.wmflabs.

What is the limit on the old replicas? The old Edit Counter is still using those, and it consumes around six concurrent connections per request, and while it did go down from time to time, it was mostly infrequent. We are also using six concurrent connections with the new Edit Counter (minus the temporary tables that caused problems like T172882). Our traffic hasn't really changed, so I'm led to believe the old replicas had a higher limit.

That being said, the new replicas are so fast we don't really need to do as many concurrent connections :) So I can and will trim it down some, but given the traffic we receive, it'd be super awesome if we could somehow get a little more quota. Is this possible? I am also giving our API server a different database user, which will help some.

Event Timeline

Or perhaps that's just it... the connection limit was higher on tools.labsdb where the temporary tables were stored. With the new Edit Counter, all queries are directly on the replicas. We definitely don't want to do temporary tables, and furthermore from my understanding this system of JOINing private database tables against core tables won't be future-proof anyway.

This user has no connection limits on the old servers (I guess for historic reasons?).
Right now, it is indeed limited to 10 connections like any other one.

And you are right, in the new replicas, it is not possible to join private databases against core tables...

chasemp triaged this task as Medium priority.EditedOct 6 2017, 2:12 PM

We did profile things and put in per user limits on the new setup for sanity as one of the common damaging patterns we see is a rogue Tool spawning huge numbers of connections and flooding things. 10 was based on the patterns seen at the time. We talked then about the need for appeal per Tool, but I'm not sure what the status of that mechanism is. @madhuvishy, is it possible to specify a per Tool override of the default connection limit ceiling atm?

Note *.analytics.db.svc.eqiad.wmflabs hosts are supposed to be for very long-running connections (1h), and 10 concurrent connections of those would be even excessive (the plan is, whenever there starts to be a shortage on resources, to limit the number of concurrent connections on analytics and decrease the max query time on web.

Connecting to .*.web, and getting your concurrency limit increased, if they come from web requests and showing willingness to create only short lived queries (by setting a max_statement_time would be ok. Creating 6 concurrent connections per request, however, is a bit crazy- 7 concurrent http requests and you are doing more queries than the server can handle at a time-and you are sharing it with thousands of other users. You should anyway control the maximum concurrency on app side, as a simple DOS attack would make your tool unavailable.

Note *.analytics.db.svc.eqiad.wmflabs hosts are supposed to be for very long-running connections (1h)

Yes let's just go ahead and change this request to be about *.web. I wasn't sure what qualified as a "long" running query. We certainly don't have anything taking anywhere close to an hour.

Connecting to .*.web, and getting your concurrency limit increased, if they come from web requests and showing willingness to create only short lived queries (by setting a max_statement_time would be ok.

We actually have a query killer but I don't think it's working. Instead we could (should) use the max_statement_time option, as you say. What would a reasonable timeout be? If possible, we'd love to give people the data they want, even if it takes forever. The longest queries I've seen that actually complete last around 15-20 minutes. This would generally only happen either with the Edit Counter or the Pages Created tool. For the latter, I think you hinted there is room for improvement at T177096#3657823, and indeed if you have any insight on how to improve that query please feel free to enlighten me :)

I also had the idea of doing EXPLAINs before running any query, and if it's crazy high it just won't even try. Right now we're merely checking the user's edit count and if it's greater than the limit (currently 350,000), it will abort. I've found this to only be somewhat effective. In some cases the Edit Counter might finish, but for the same user the Pages Created tool does not. Perhaps adding a timeout is the better solution, and we'd just let people try the tools and see if they finish?

Creating 6 concurrent connections per request, however, is a bit crazy- 7 concurrent http requests and you are doing more queries than the server can handle at a time-and you are sharing it with thousands of other users. You should anyway control the maximum concurrency on app side, as a simple DOS attack would make your tool unavailable.

I was modeling after the old XTools, which I should know by now is not a good example of anything =P We have some means to safeguard against DDoS, but let's not test it! ;) Anyway, as I said the new replicas are so freakin' fast we don't need to do concurrent requests to attain the speeds we had with the old tool. In fact from recent tests, async doesn't even help that much. I've removed it entirely for now, so we're back to 1 connection per request. However I'm still occasionally seeing the same error about exceeding max_user_connections, both on the app server and API server, which are now using different database users.

Note *.analytics.db.svc.eqiad.wmflabs hosts are supposed to be for very long-running connections (1h)

Yes let's just go ahead and change this request to be about *.web. I wasn't sure what qualified as a "long" running query. We certainly don't have anything taking anywhere close to an hour.

How many you think you'd need on *.web?

How many you think you'd need on *.web?

I'm not sure :/ If you are concerned about giving out too many, maybe we could start with say, 30, and see if the errors go away?

Since I changed the Edit Counter back to being synchronous (one connection per request), the main app server hasn't encountered the max_user_connections error. The API server however is regularly getting it, evidently due to the popularity of the ArticleInfo gadget. It is connecting with the user s53003. I'm not sure if you're willing to give it and s51187 extra bandwith, but if not just stick with s51187 as requested and I'll have the API server use that.

I gave 30 to s51187 on .web.
Let's see how that goes...

Your comment was at 06:22 UTC on 9 October. Here's what I've got when grepping the logs (stacktraces removed):

musikanimal@xtools-prod02:/var/www$ tail -n 1000 var/logs/prod.log | grep max_user_connections

[2017-10-09 19:24:20] request.CRITICAL: Uncaught PHP Exception Doctrine\DBAL\Exception\DriverException: "An exception occured in driver: SQLSTATE[42000] [1226] User 's53003' has exceeded the 'max_user_connections' resource (current value: 10)"
[2017-10-09 19:29:03] request.CRITICAL: Uncaught PHP Exception Doctrine\DBAL\Exception\DriverException: "An exception occured in driver: SQLSTATE[42000] [1226] User 's53003' has exceeded the 'max_user_connections' resource (current value: 10)"
[2017-10-09 19:29:13] request.CRITICAL: Uncaught PHP Exception Doctrine\DBAL\Exception\DriverException: "An exception occured in driver: SQLSTATE[42000] [1226] User 's53003' has exceeded the 'max_user_connections' resource (current value: 10)"
[2017-10-09 19:29:13] request.CRITICAL: Uncaught PHP Exception Doctrine\DBAL\Exception\DriverException: "An exception occured in driver: SQLSTATE[42000] [1226] User 's53003' has exceeded the 'max_user_connections' resource (current value: 10)"
[2017-10-09 19:29:13] request.CRITICAL: Uncaught PHP Exception Doctrine\DBAL\Exception\DriverException: "An exception occured in driver: SQLSTATE[42000] [1226] User 's53003' has exceeded the 'max_user_connections' resource (current value: 10)"
[2017-10-09 19:29:13] request.CRITICAL: Uncaught PHP Exception Doctrine\DBAL\Exception\DriverException: "An exception occured in driver: SQLSTATE[42000] [1226] User 's53003' has exceeded the 'max_user_connections' resource (current value: 10)"
[2017-10-09 19:29:13] request.CRITICAL: Uncaught PHP Exception Doctrine\DBAL\Exception\DriverException: "An exception occured in driver: SQLSTATE[42000] [1226] User 's53003' has exceeded the 'max_user_connections' resource (current value: 10)"
[2017-10-09 19:29:13] request.CRITICAL: Uncaught PHP Exception Doctrine\DBAL\Exception\DriverException: "An exception occured in driver: SQLSTATE[42000] [1226] User 's53003' has exceeded the 'max_user_connections' resource (current value: 10)"
[2017-10-09 19:29:13] request.CRITICAL: Uncaught PHP Exception Doctrine\DBAL\Exception\DriverException: "An exception occured in driver: SQLSTATE[42000] [1226] User 's53003' has exceeded the 'max_user_connections' resource (current value: 10)"
[2017-10-09 19:29:14] request.CRITICAL: Uncaught PHP Exception Doctrine\DBAL\Exception\DriverException: "An exception occured in driver: SQLSTATE[42000] [1226] User 's53003' has exceeded the 'max_user_connections' resource (current value: 10)"
[2017-10-09 19:29:15] request.CRITICAL: Uncaught PHP Exception Doctrine\DBAL\Exception\DriverException: "An exception occured in driver: SQLSTATE[42000] [1226] User 's53003' has exceeded the 'max_user_connections' resource (current value: 10)"

musikanimal@xtools-prod02:/var/www$ cat app/config/parameters.yml | grep database_replica
    database_replica_host: enwiki.web.db.svc.eqiad.wmflabs
    database_replica_user: s51187

It still says "current value: 10". Are we sure the change went through?

I wanted to also mention we're going to look into T175763, which will probably help.

Scratch that, the logs clearly say s53003 and not s51187 =P The cache has been cleared so it should be using what's in parameters.yml now... I'll just wait and see if it happens again. Sorry for the noise!

Cool - let me know how it goes.
I didn't want to add exceptions to two users, if we can avoid it.

This seems to have done the trick. The last max connections error on the API server was on 2017-10-09, as reported above. Thank you! :)

However... our main app server is now getting this error :( It is using user s53003. Here are the logs with stack traces removed:

musikanimal@xtools-prod01:/var/www$ tail -n 10000 var/logs/prod.log | grep max_user_connections
[2017-10-10 00:03:51] request.CRITICAL: Uncaught PHP Exception Doctrine\DBAL\Exception\DriverException: "An exception occured in driver: SQLSTATE[42000] [1226] User 's53003' has exceeded the 'max_user_connections' resource (current value: 10)"
[2017-10-11 15:18:29] request.CRITICAL: Uncaught PHP Exception Doctrine\DBAL\Exception\DriverException: "An exception occured in driver: SQLSTATE[42000] [1226] User 's53003' has exceeded the 'max_user_connections' resource (current value: 10)"
[2017-10-11 15:19:01] request.CRITICAL: Uncaught PHP Exception Doctrine\DBAL\Exception\DriverException: "An exception occured in driver: SQLSTATE[42000] [1226] User 's53003' has exceeded the 'max_user_connections' resource (current value: 10)"
[2017-10-11 15:19:37] request.CRITICAL: Uncaught PHP Exception Doctrine\DBAL\Exception\DriverException: "An exception occured in driver: SQLSTATE[42000] [1226] User 's53003' has exceeded the 'max_user_connections' resource (current value: 10)"
[2017-10-11 15:19:44] request.CRITICAL: Uncaught PHP Exception Doctrine\DBAL\Exception\DriverException: "An exception occured in driver: SQLSTATE[42000] [1226] User 's53003' has exceeded the 'max_user_connections' resource (current value: 10)"
[2017-10-11 15:19:45] request.CRITICAL: Uncaught PHP Exception Doctrine\DBAL\Exception\DriverException: "An exception occured in driver: SQLSTATE[42000] [1226] User 's53003' has exceeded the 'max_user_connections' resource (current value: 10)"
[2017-10-11 15:19:57] request.CRITICAL: Uncaught PHP Exception Doctrine\DBAL\Exception\DriverException: "An exception occured in driver: SQLSTATE[42000] [1226] User 's53003' has exceeded the 'max_user_connections' resource (current value: 10)"
[2017-10-11 15:20:18] request.CRITICAL: Uncaught PHP Exception Doctrine\DBAL\Exception\DriverException: "An exception occured in driver: SQLSTATE[42000] [1226] User 's53003' has exceeded the 'max_user_connections' resource (current value: 10)"
[2017-10-11 15:20:33] request.CRITICAL: Uncaught PHP Exception Doctrine\DBAL\Exception\DriverException: "An exception occured in driver: SQLSTATE[42000] [1226] User 's53003' has exceeded the 'max_user_connections' resource (current value: 10)"
[2017-10-11 15:20:46] request.CRITICAL: Uncaught PHP Exception Doctrine\DBAL\Exception\DriverException: "An exception occured in driver: SQLSTATE[42000] [1226] User 's53003' has exceeded the 'max_user_connections' resource (current value: 10)"
[2017-10-11 15:20:47] request.CRITICAL: Uncaught PHP Exception Doctrine\DBAL\Exception\DriverException: "An exception occured in driver: SQLSTATE[42000] [1226] User 's53003' has exceeded the 'max_user_connections' resource (current value: 10)"
[2017-10-11 15:24:39] request.CRITICAL: Uncaught PHP Exception Doctrine\DBAL\Exception\DriverException: "An exception occured in driver: SQLSTATE[42000] [1226] User 's53003' has exceeded the 'max_user_connections' resource (current value: 10)"
[2017-10-11 15:24:49] request.CRITICAL: Uncaught PHP Exception Doctrine\DBAL\Exception\DriverException: "An exception occured in driver: SQLSTATE[42000] [1226] User 's53003' has exceeded the 'max_user_connections' resource (current value: 10)"
[2017-10-11 15:24:54] request.CRITICAL: Uncaught PHP Exception Doctrine\DBAL\Exception\DriverException: "An exception occured in driver: SQLSTATE[42000] [1226] User 's53003' has exceeded the 'max_user_connections' resource (current value: 10)"

The first log entry was the only time it happened on the October 10. The ones for October 11 as you can see are all within a brief time periods, and we haven't had any errors yet today (Oct 12).

So fortunately this seems rare (comparatively), but obviously not ideal. As of October 8, everything is running synchronously, so each tool should consume no more than one connection per request. This is no different than the old XTools, except that the old one connected to enwiki.labsdb, so maybe it had a higher connection limit?

The user s53003 had no connection limit on the old servers indeed.
I have increased it from 10 to 20 on .web.
Let me know how it goes.

Thanks! I should also mention our query killer apparently wasn't working... pretty sure that was partly at fault for this. It's working now, though! https://xtools.wmflabs.org/killed_slow_queries.txt E.g. someone was trying to run the Edit Counter on User:ClueBot NG, who has 4.6 million edits... Denied! =P

Let's leave this open for the weekend and we can close it on Monday or Tuesday if all is fine, sounds good?

Let's leave this open for the weekend and we can close it on Monday or Tuesday if all is fine, sounds good?

Works for me!

No errors since October 12. I think we can close this. Thank you again for your assistance! :)

jcrespo assigned this task to Marostegui.