
toolsdb - Per-user connection limits
Closed, Resolved · Public

Description

We have had an increase in the number of outages caused by tools opening too many connections.

The replicas have a per-user connection limit, but apparently toolsdb doesn't.

This task is to determine whether toolsdb has such limits, check that any existing limits are reasonable, and add them if they don't exist.

Event Timeline

GTirloni created this task. Feb 14 2019, 7:22 PM
Restricted Application added a subscriber: Aklapper. Feb 14 2019, 7:22 PM
Wurgl added a subscriber: Wurgl. Feb 14 2019, 8:31 PM

Generally a user limit is okay.

But consider web pages that access a user database. The author of the page does not know how many users actually use it at the same time. In contrast, batch jobs can be limited.

Thanks for your feedback.

bd808 added a subscriber: bd808. Feb 14 2019, 10:03 PM

The author of the page does not know how many users actually use it at the same time.

This is probably going to vary a bit from tool to tool, but all tools running on the Kubernetes cluster in Toolforge should have concurrency limits placed on them by the webservice-generated config. The PHP containers, for example, only allow 4 concurrent fcgi workers. This is annoying for some tool maintainers, but it is something that we really need to have in place to prevent a single tool from overwhelming the shared resources we have, such as the Wiki Replica databases, the ToolsDB database, and the exec nodes of the various task running systems.

Connection policy is documented at:
https://wikitech.wikimedia.org/wiki/Help:Toolforge/Database#Connection_handling_policy

Tools whose traffic requires a high number of open connections should probably ask for their own dedicated MySQL instance.

Adding per-user limits is a relatively easy task, technically speaking.

There is also a query killer that is easy to set up in puppet; however, it is difficult to tune for write queries.

Marostegui added a subscriber: Marostegui. Edited · Feb 15 2019, 6:40 AM

Cross-posting from the main tracking task, as an emergency mitigation: T216208#4956634

Marostegui added a comment. Fri, Feb 15, 07:39
I have restarted the server with max_user_connections = 20 to try to mitigate this; the server was unusable anyway.

Mentioned in SAL (#wikimedia-operations) [2019-02-15T06:40:58Z] <marostegui> Stop puppet on labsdb1005 to leave "max_user_connections" on my.cnf - T216170 T216208

Change 490806 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] tools.my.cnf: Temporary max_user_connections

https://gerrit.wikimedia.org/r/490806

Change 490806 merged by Marostegui:
[operations/puppet@production] tools.my.cnf: Temporary max_user_connections

https://gerrit.wikimedia.org/r/490806

For what it's worth, the server has looked stable for one hour now, since I enabled the global max_user_connections. It might be preventing some tools from working if they require more than 20 connections, but at least the rest of the tools/users do not suffer the outage.
As per my conversation with @Bstorm, this is a temporary mitigation to get the server under control again. If we finally want to go for a per-user limit, we should look at individual cases where we will need to increase the connection limit, as we do with the wikireplicas.
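
For tool maintainers who want to stay under the limit from the client side, one option is to cap how many connections a process holds at once. The following is only a minimal sketch, not something from this task: `ConnectionBudget`, `MAX_TOOL_CONNECTIONS` and the `connect` factory are hypothetical names, and the cap of 10 is an arbitrary value chosen to sit below the server-side 20.

```python
import threading
from contextlib import contextmanager

# Assumption: an arbitrary client-side cap, kept below max_user_connections = 20.
MAX_TOOL_CONNECTIONS = 10


class ConnectionBudget:
    """Caps how many database connections this process holds at once."""

    def __init__(self, connect, limit=MAX_TOOL_CONNECTIONS):
        # `connect` is a hypothetical zero-argument factory returning an
        # object with a close() method (for example a driver connection).
        self._connect = connect
        self._slots = threading.BoundedSemaphore(limit)

    @contextmanager
    def connection(self, timeout=30.0):
        # Wait for a free slot instead of opening one more connection.
        if not self._slots.acquire(timeout=timeout):
            raise TimeoutError("no free connection slot")
        try:
            conn = self._connect()
            try:
                yield conn
            finally:
                # Always close, so the server-side count drops again.
                conn.close()
        finally:
            self._slots.release()
```

Callers would wrap each unit of work in `with budget.connection() as conn:` so that the connection is closed and the slot is released even when the query fails. Note that this only bounds a single process; a tool running several processes would need to divide the budget between them.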

Marostegui triaged this task as High priority. Feb 15 2019, 7:55 AM

Thank you!!!!

bd808 moved this task from Backlog to ToolsDB on the Data-Services board. Feb 19 2019, 1:09 AM
Bstorm lowered the priority of this task from High to Normal. Feb 21 2019, 6:45 PM

Now that we are out of the "outage" condition on the service and the service is running on clouddb1001, I'd like to look at puppetizing the configuration and setting it there as well (if that didn't end up done somewhere already; checking soon).

Wurgl added a comment. Feb 21 2019, 6:51 PM

It is already set.

max_connections         1024
max_user_connections    20

Thank you for checking before me :) I'll make sure it is also in puppet in case of rebuilds and reboots.

I think I'll remove the word temporary from the comments (but leave this ticket number) and close this. So far, the limit doesn't seem problematic and does seem good to have, but it can be revisited in time as well.

Change 492024 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] toolsdb: Remove the word temporary from comments

https://gerrit.wikimedia.org/r/492024

Bstorm added a comment. Edited · Feb 21 2019, 7:09 PM

@Marostegui I see no subscribes or triggers on a quick pass in puppet, so if I'm not wrong I can change the config with puppet without auto-reloading or puppet restarting the server, right?

Change 492024 merged by Marostegui:
[operations/puppet@production] toolsdb: Remove the word temporary from comments

https://gerrit.wikimedia.org/r/492024

@Marostegui I see no subscribes or triggers on a quick pass in puppet, so if I'm not wrong I can change the config with puppet without auto-reloading or puppet restarting the server, right?

Yes, we do not reload the server (or even change the live configuration) on a puppet change. It will just change the my.cnf but it won't change the live config.
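
Because the file and the running server can drift apart until the next restart, it can help to compare the two. The following is a rough sketch under stated assumptions: `configured_value` and `pending_changes` are hypothetical helpers, the my.cnf text is passed in as a string, and the live values are assumed to have been fetched separately (for example with `SHOW GLOBAL VARIABLES`). Real my.cnf files may contain directives such as `!includedir` that this simple parser does not handle.

```python
import configparser


def configured_value(my_cnf_text, name, section="mysqld"):
    """Return the value of `name` in a my.cnf fragment, or None if absent."""
    parser = configparser.ConfigParser(
        allow_no_value=True,   # my.cnf allows bare flags such as skip-name-resolve
        strict=False,          # tolerate repeated sections or options
        interpolation=None,    # treat '%' in values literally
    )
    parser.read_string(my_cnf_text)
    return parser.get(section, name, fallback=None)


def pending_changes(my_cnf_text, live_values, section="mysqld"):
    """Return {name: (live, configured)} for settings that differ.

    `live_values` is a hypothetical dict of variable name to the value
    currently in effect on the running server.
    """
    pending = {}
    for name, live in live_values.items():
        configured = configured_value(my_cnf_text, name, section)
        if configured is not None and str(configured) != str(live):
            pending[name] = (str(live), configured)
    return pending
```

Any entry returned by `pending_changes` is a setting that puppet has written to my.cnf but that will only take effect after a restart or a manual change on the live server.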

Bstorm closed this task as Resolved. Feb 22 2019, 5:47 PM
Bstorm claimed this task.

Thanks! Closing this up.

Hi, if the lower limit is here to stay, would it make sense to make a quick announcement in labs-l, or perhaps even a note to tools that have offended the limit in the past N days? Apologies if this was done and I missed it, but I can't seem to find it. Per-user limits make perfect sense, and I'm adapting my batch jobs to it, but an email would have saved a bit of head-scratching as I saw my jobs failing.
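
One way a batch job could adapt is to retry the connection attempt with backoff when the server rejects it for exceeding the per-user limit. This is only a sketch under assumptions: `connect_with_backoff` and its `connect` argument are hypothetical, and it assumes the driver raises an exception whose first argument is the numeric server error code (1203 being the MySQL/MariaDB code for too many connections by one user).

```python
import random
import time

# MySQL/MariaDB error 1203: the user already has more than
# 'max_user_connections' active connections.
ER_TOO_MANY_USER_CONNECTIONS = 1203


def connect_with_backoff(connect, attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call connect(), retrying with exponential backoff on error 1203.

    Assumption: `connect` is a zero-argument factory, and the driver's
    exception carries the server error code as its first argument.
    """
    for attempt in range(attempts):
        try:
            return connect()
        except Exception as exc:
            code = exc.args[0] if exc.args else None
            last_try = attempt == attempts - 1
            # Re-raise anything that is not the per-user limit, and give up
            # after the final attempt.
            if code != ER_TOO_MANY_USER_CONNECTIONS or last_try:
                raise
            # Delay doubles each attempt; jitter avoids retrying in lockstep.
            sleep(base_delay * 2 ** attempt + random.uniform(0, base_delay))
```

Retrying only helps if other connections held by the same user are eventually closed, so it works best combined with closing connections promptly after each unit of work.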

bd808 added a comment. Feb 25 2019, 9:15 PM

Hi, if the lower limit is here to stay, would it make sense to make a quick announcement in labs-l, or perhaps even a note to tools that have offended the limit in the past N days? Apologies if this was done and I missed it, but I can't seem to find it. Per-user limits make perfect sense, and I'm adapting my batch jobs to it, but an email would have saved a bit of head-scratching as I saw my jobs failing.

Great point @Surlycyborg. I wrote this up for the cloud-announce list: https://lists.wikimedia.org/pipermail/cloud-announce/2019-February/000138.html

Did you check how many connections are typically used by Magnus tools? He needed specific configuration in the past.

bd808 added a comment. Feb 26 2019, 7:07 PM

Did you check how many connections are typically used by Magnus tools? He needed specific configuration in the past.

No, in part because we do not track concurrent connections by a given user as a long-term metric, but also because at the time we decided to put this change in place everyone had been kicked off of the server by a reboot. If a tool from any maintainer needs a higher limit, we can try to figure out how to accommodate that. It's not going to be a simple matter of asking, however, since that is nearly the same as having no limits. There will need to be a strong technical reason that the tool must be allowed to operate with more concurrency.
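
If per-user concurrency were ever sampled, a snapshot of the process list would be enough to count connections per user. The following is a hypothetical sketch, not an existing tool: it assumes each row is a dict with a "User" key, the shape a dict-style cursor would return for `SHOW PROCESSLIST`, and the `limit` and `margin` defaults are illustrative.

```python
from collections import Counter


def connections_per_user(processlist_rows):
    """Count open connections per user from one processlist snapshot.

    Assumption: each row is a dict with a "User" key.
    """
    return Counter(row["User"] for row in processlist_rows)


def users_near_limit(processlist_rows, limit=20, margin=2):
    """Return, sorted, the users holding at least (limit - margin) connections."""
    counts = connections_per_user(processlist_rows)
    return sorted(user for user, n in counts.items() if n >= limit - margin)
```

Sampling this periodically would give a rough picture of which tools routinely run close to the limit, which is the kind of evidence a request for a higher limit could be judged against.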

Wurgl added a comment. Feb 26 2019, 7:13 PM

Maybe a second account in the database would be a simpler way to double the number of connections? (just my two cents)