
Unthrottle faebot
Closed, Resolved, Public

Description

faebot was throttled due to T119604: Faebot is crashing labsdb1002

In T119604, @Fae wrote:

Back in November I added '-once' to the BLP report that was behind the multiple jobs (it runs every 5 minutes, but with the 'once' command can only create one job at any time).

The backlog of jobs was an unexpected problem as the report had been running in a stable state for a year before the job-glitching, no doubt itself caused by a problem elsewhere. The report did seem to be taking a ridiculously long time to complete (like 40 times longer than it used to). Possibly this is down to a database change I am unaware of, but that can be sorted out with a separate discussion on an email list.

Could the throttle please be removed or changed? As a result of the throttle, the BLP report, which was being used by several administrators as an important vandal monitoring report, has not been updated for months. If anyone has further tips to avoid a future potential jobs backlog problem, I would be happy to make further changes to the crontab etc.

As an example of the impact of the throttling, the "thanks report" https://meta.wikimedia.org/wiki/User:Faebot/thanks has not been updated since its last run in August 2015. There was going to be a Signpost article about the thanks feature based on this information, but that's a non-starter now.

Event Timeline

valhallasw raised the priority of this task from to Needs Triage.
valhallasw updated the task description.
valhallasw added projects: DBA, Toolforge.
valhallasw added subscribers: valhallasw, Fae, jcrespo, yuvipanda.
Restricted Application added subscribers: StudiesWorld, Aklapper.

@valhallasw I told @Fae to reopen that ticket, so if it is anyone's fault that it was reopened, it is mine. I apologize (I was actually monitoring that).

BTW, this will have to wait, because there is another process crashing labsdb hosts, and I first have to investigate that.

As we discussed by email, I am still waiting for a reason to unthrottle the account, and what measures have been taken to avoid another OOM.

To clarify, I am not asking why you need more than one connection. I know and can understand that. I am asking what has been done to limit the concurrency (e.g. I have added a limit to no more than 5 concurrent queries at the same time/I have limited the query execution to 300 seconds).
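For illustration only, the kind of client-side concurrency limit described above could look roughly like the following. This is a minimal sketch, not Faebot's actual code: `run_limited` and the `run_query` callable are hypothetical names, and the limit of 5 is simply the figure used as an example above. The execution time limit is not shown, since how to enforce it depends on the database server and driver.

```python
import threading

# Assumption: 5 is the example figure from the comment above, not a measured value.
MAX_CONCURRENT_QUERIES = 5

# A bounded semaphore hands out at most MAX_CONCURRENT_QUERIES slots at once.
_query_slots = threading.BoundedSemaphore(MAX_CONCURRENT_QUERIES)


def run_limited(run_query, sql):
    """Run run_query(sql) while holding one of the limited slots.

    run_query is a hypothetical callable that sends the SQL to the
    database and returns the result. Callers beyond the limit block
    here until a slot is released.
    """
    with _query_slots:
        return run_query(sql)
```

Every report would then send its queries through `run_limited`, so that no more than five are in flight from one process regardless of how many threads it uses. It does not limit queries issued by separate processes.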

I thought that was covered by explaining on Phab how use of '-once' makes multiple jobs impossible for the BLP report. As multiple jobs are impossible, there is no reason to expect that this type of job could create a job backlog. All the other Faebot jobs run just a couple of times a day and cannot pose any sort of risk for a future outage, while the BLP report was running every 5 minutes as it was a popular vandalism prevention tool.
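As a rough illustration of the "only one job at any time" idea, the same guarantee can also be approximated inside the script itself with a lock file. This is a sketch under assumptions, not how the '-once' option is implemented: the function name and the lock path are hypothetical, and it relies on `fcntl.flock`, which is available on Unix-like systems only.

```python
import fcntl


def try_acquire_single_instance(lock_path):
    """Try to become the only running instance of a report.

    Returns an open file object holding an exclusive lock, or None if
    another run already holds it. The lock is released when the file is
    closed or the process exits.
    """
    handle = open(lock_path, "w")
    try:
        # LOCK_NB makes the call fail immediately instead of waiting.
        fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        handle.close()
        return None
    return handle
```

A report started from crontab every 5 minutes would call this first and exit straight away when it returns `None`, so a slow run cannot pile up behind itself.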

No alternative has been suggested as a helpful change from my side apart from this one that I made in November 2015, though presumably work has gone on in the background better to identify and manage unexpectedly large job queue backlogs.

So, how many concurrent connections do you need?

jcrespo triaged this task as Medium priority. Jan 18 2016, 5:53 PM
jcrespo moved this task from Backlog to In progress on the DBA board.

Probably 5 connections is sufficient. Some of my reports rely on more than one database link at the same time to make the SQL, though most are one at a time even when multiple queries are made for the report.

I'll keep an eye on error logs in case some reports drop out due to clashes with each other, though I could probably fix that by jiggling with crontab.

jcrespo claimed this task.

I've granted you 5 concurrent connections. I will be monitoring memory usage per user; if it grows close to crashing a server, I will throttle it again and kill your long-running queries. If it keeps being stable, I will remove any limit. Thank you.