Establish a process for increasing a toolforge tool's connections to the wiki replicas
Closed, Resolved, Public

Description

We need to set some kind of process or standard for increasing a tool's database connection pool when it is warranted and necessary. This was done for the Quarry application on T180141: Raise concurrent mysql connection limit for Quarry (or throttle application concurrency), and right now, petscan is a candidate for that.

This is just to build some agreement and encourage discussion on that.

Event Timeline

Bstorm created this task.

It seems clear that a Phabricator task requesting the increase and describing the need with a review would be a sensible part of things. I imagine the review should include WMCS and the DBA team, perhaps on a work board like what we have for https://phabricator.wikimedia.org/project/view/2880/

Agreed that a well-documented process and a transparent workflow for triaging and responding to requests are needed. Following the model of the project milestones used by Cloud-VPS (Project-requests) and Cloud-VPS (Quota-requests) for tracking the requests and their outcomes seems like a reasonable approach. Let the bike shedding commence on naming! Data-Services (Quota-requests) seems reasonable to me, but other ideas are welcome.

To add another use case (and to ping the issue):

In addition to the API scripts, Mix'n'match uses a lot of background scripts, many of which can be triggered by users. This can easily lead to situations where, temporarily, more DB connections are required than the default 10. Most of these scripts run only a few minutes, so increasing the number of DB connections does not mean "permanent saturation".

There could also be a mechanism that, for all tools, allows more than 10 connections for a limited time (say, 10 minutes), if such a thing is technically possible.
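
For tools that only occasionally brush against the limit, throttling on the application side (the alternative named in the Quarry task title) may be enough. A minimal sketch, assuming a threaded Python tool; `ConnectionGate` and `slot` are hypothetical names invented for illustration and are not part of any Toolforge library. The only value taken from this thread is the default of 10 connections.

```python
import threading
from contextlib import contextmanager

DEFAULT_LIMIT = 10  # default per-account connection limit mentioned in this thread


class ConnectionGate:
    """Hypothetical helper: lets at most `limit` callers hold a slot at once."""

    def __init__(self, limit=DEFAULT_LIMIT):
        self._slots = threading.BoundedSemaphore(limit)

    @contextmanager
    def slot(self, timeout=None):
        # Wait for a free slot; give up after `timeout` seconds if one is given.
        if not self._slots.acquire(timeout=timeout):
            raise TimeoutError("no free database connection slot")
        try:
            yield
        finally:
            # Always hand the slot back, even if the caller's query failed.
            self._slots.release()
```

A background script would open its database connection inside `with gate.slot():`, so bursts of user-triggered jobs queue up instead of exceeding the account's limit.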

Bursting the connection limit like that isn't directly possible within the database system, as I understand it. I can poke around to see where that might be possible.
Overall, I think the Data-Services (Quota-requests) name sounds good to me.

I suggest that we record any approved limit increase, along with its task number, in the same way the Quarry increase is recorded in Puppet, so that we can find it later. That way, if the account must be reconstructed (because it was deleted) or has an issue, there is an easy way to verify what the approved quota was. This is especially important if we need to recycle the auth credentials for an account on the replicas, because the script will set the connection limit back to 10 for the account.
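
One way to picture such a record is a small lookup table keyed by account, with the approving task stored next to the limit. This is only a sketch under stated assumptions: the table layout, the helper names, and the placeholder account and task number are all hypothetical, and the real Puppet record and admin procedure may look quite different. The generated statement uses MariaDB's `GRANT ... WITH MAX_USER_CONNECTIONS` syntax purely as an illustration of how a recorded limit could be reapplied.

```python
import re

DEFAULT_LIMIT = 10  # limit the credential script resets accounts to, per this thread

# Hypothetical record: account name -> (approved limit, Phabricator task).
# The entry below is a placeholder, not a real account or task.
APPROVED_LIMITS = {
    "example_account": (20, "T000000"),
}


def limit_for(account, approved=APPROVED_LIMITS):
    """Return the approved limit for an account, or the default if none is recorded."""
    limit, _task = approved.get(account, (DEFAULT_LIMIT, None))
    return limit


def grant_statement(account, approved=APPROVED_LIMITS):
    """Build the SQL that would reapply the recorded limit to an account."""
    # Refuse anything that is not a plain identifier, to avoid building unsafe SQL.
    if not re.fullmatch(r"[A-Za-z0-9_]+", account):
        raise ValueError(f"unexpected account name: {account!r}")
    return (
        f"GRANT USAGE ON *.* TO '{account}'@'%' "
        f"WITH MAX_USER_CONNECTIONS {limit_for(account, approved)};"
    )
```

After credentials are recycled, running `grant_statement` for each recorded account would restore the approved values, and the stored task number shows where each increase was reviewed.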

@Bstorm Data-Services (Quota-requests) exists and is ready for the next steps of documenting the process there and integrating it into the WMCS weekly triage process.

Ok, so I've added the mechanical HowTo doc here https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Wiki_Replicas#User_connection_limits

I need to be added to the project @bd808. I cannot manipulate anything there and would like to set up a form :)

@Marostegui would you want your group involved in the approval process, or is what I suggested at https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Wiki_Replicas#User_connection_limits sufficient? For example, should we require at least a +1 on the ticket from a DBA?

Hmmm... Data-Services (Quota-requests) is a milestone of Data-Services so it does not have a local ACL. The parent project's edit ACL is set to all users. @Bstorm what do you see at https://phabricator.wikimedia.org/project/edit/4481/?

I see that I can edit it! Huh. I tried to join the project, but that was greyed out. Apparently that's a different bit of Phab trivia:

Screen Shot 2020-03-31 at 4.44.11 PM.png (492×1 px, 64 KB)

Ok, I'll start editing that thing tomorrow.

Thank you Brooke!
Yes, I would like to be included, basically so that I am aware of the requests. If we have problems and I know which users have had their limits increased, that can ring a bell whilst troubleshooting.

Thank you again!

Ok, I took a stab at largely copying and adapting the procedure for Cloud-VPS (Quota-requests) and set up a draft for Data-Services (Quota-requests). Apropos of the description I've scribbled in there, I added a DBA Approval column to the work board.

@Marostegui, @bd808 and everybody else, what do you think?

Looks like a solid process to start from @Bstorm. Thanks for keeping this idea alive.

Thanks for working on this @Bstorm!
The description looks good.
There's one thing I would like to get added if you guys agree. I believe we should state there that any approved request can be reverted at any time if the service requires it in order to maintain availability, e.g. a sudden general load increase, service degradation, etc.
Also, we might want to add a clause saying that any misbehaviour from the given tool will result in either blocking it or rolling it back to the default number of allowed connections.

Finally circled back and added that information! What do folks think now?

It looks like we have a customer. I'll see if I can get a request started on the process for petscan as well.

I'll start with a document on wikitech of the steps, actually.

I didn't scroll up, and it has been long enough that I forgot I already did that https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Wiki_Replicas#User_connection_limits

Ok, so let's see about trying this out.

@bd808 has added the item to our weekly meeting to review these, so I think we are up and moving now.

Bstorm claimed this task.

Process now officially in use! Closing this issue.