Page MenuHomePhabricator

s52481__stats_global running CREATE DATABASE IF NOT EXISTS on too many queries causing locking issues
Closed, ResolvedPublic

Description

These queries were coming in at high frequency recently. Because this appears to be causing trouble with locking on the server (and even might be related to service outages over the past couple days) the account has been locked for now on toolsdb.

Please review code for why this might be happening and fix it.

Event Timeline

Bstorm created this task.Feb 15 2019, 12:54 AM
Bstorm triaged this task as High priority.

That can be a consecuence and not really the cause. If the server is too overloaded, it might not be able to create that and the code might be retrying, and as we don't have a per user limit...
Both things should be probably fixed 1) code 2) establish a per user limit.

Hi, thanks for filing this. Here's some background I have:

  • This is for my tool Citation Hunt (https://github.com/eggpi/citationhunt), which is basically a set of cron jobs populating a couple MySQL database and a server that serves off that database.
  • One of the databases is s52481__stats_global and both the scripts and the cron jobs may do the CREATE DATABASE request; the server does it at every request if I recall correctly, and the scripts should do it once per execution.
  • I moved the script to run in the Stretch job grid last week. Ever since I did that, one of the cron jobs has been failing repeatedly for enwiki and I launched it a few times last night to troubleshoot. This may-or-may-not be related, and I'll file a separate ticket if not, but I suppose it's possible that my retries caused some damage.
  • You are correct that there is a code path that retries the CREATE DATABASE statement if it fails (https://github.com/eggpi/citationhunt/blob/master/chdb.py#L147), up to 5 times. I should definitely have some backoff in there and log the MySQL errors.

That said, I have a couple of questions:

  • "These queries were coming in at high frequency recently." do you have more specific information? What frequency did you see and when was that? Do you know by any chance what was the grid job that was performing the repeated queries?
  • Other than the issues I mentioned above, anything else I should know or do in the code?
  • Since the tool has been deactivated for a couple of hours now: are you confident that it was the cause for the outages?

I don't know what was the situation yesterday night, as I wasn't present during the troubleshooting - however, we had pretty much the same issue at around 6AM UTC today, and from what I could see, your user wasn't among the ones creating issues (see T216208#4956626)

CREATE DATABASE IF NOT EXISTS cannot go thru if there is a session with LOCK TABLES (https://dev.mysql.com/doc/refman/5.7/en/create-database.html) - I don't know if this was the case yesterday as I wasn't present. But it could be the case that all your re-tries were just piling up because of the general degradation of the server. Certainly, doing retries doesn't really help.

@Surlycyborg The database account for the tool was locked last night pending contact with you (and in case this was actually the cause of the problems with table locking). At this point, the same problem is back that leads to the overload (where no queries ever finish), so apparently our problem is entirely based on the server and not any user activity.

We are struggling to restore services (at this point by standing up a new server).

It is probably best to avoid doing a create database statement regularly and instead do it differently (checking in other code and then running the create database if the db is missing, perhaps).

I can restore your account shortly.

I believe I correctly enabled the DB account @Surlycyborg. Please change the code so that it does not retry creating the database. If that doesn't work for some reason, put in a Phabricator task and get us moving to fix it instead. We want to know if it's messed up :)

Also, unfortunately, because your code was not the root cause of the outage in any way (though repeated locks were not helping matters), the problem is back, and I suspect just about anything you do against the database will fail (as it is for everyone while we try to get it working).

It might be best to disable runs of anything that tries DDL commands (creating databases or altering tables) until the dust settles. It can only make things worse.

OK, so this morning I'd already disabled the batch jobs that make the heaviest use of the database, and we can definitely survive without them until Toolforge is fully recovered. However, the tool itself is still down (https://tools.wmflabs.org/citationhunt) and I still don't seem to have access to the database:

$ mosh login-stretch.tools.wmflabs.org
ggp@tools-sgebastion-07:~$ become citationhunt
tools.citationhunt@tools-sgebastion-07:~$ sql tools
ERROR 1045 (28000): Access denied for user 's52481'@'172.16.7.167' (using password: YES)

Is there anything I should do to restore it?

As for changes to the tool, I've just filed two issues to track not repeating database setup commands, and retrying more nicely when an exception happens:

My understanding is that, while they're definitely nice to have, they are not blockers for restoring the tool, but please correct me if I'm wrong.

Ah if the password doesn't work I can reset the password again properly.

Additionally, I will say that the database might not work for you anyway today. It's in a terrible state. It is also likely that after some recovery activities, we may need to reset it again because some replication could be lost.

Efforts to fix things are still proceeding.

Thanks! Yes, sql tools does work now. Good luck with the rest of the recovery!

bd808 moved this task from Backlog to ToolsDB on the Data-Services board.Feb 19 2019, 1:08 AM
Bstorm lowered the priority of this task from High to Normal.Feb 21 2019, 6:40 PM

@Surlycyborg were you able to remove automated database creation commands from the mix? I think for the overall health of things, it would be a good idea to avoid that even if it wasn't the likely root cause here in the end.

Sorry, I haven't got around to it yet. Hope to get to it in the next few days though, it should be tracked in the GitHub issues I mentioned above.

@Bstorm We should consider removing CREATE grant from the user until the issue is fixed.

If there is still impact to Toolforge and reasonable evidence that this tool is responsible, please feel free to disable it. Otherwise, that seems unwarranted. As I mentioned above, it makes a lot of sense to remove the rendundant CREATEs and I hope to do it over the weekend, but it doesn't sound like we should be treating this as an urgent issue at this stage (fwiw the tool has had these extra statements for the past 4 years).

OK, I've just deployed the fixes for the two issues I mentioned above: nicer retries and removing the extra CREATE commands from the serving path.

Surlycyborg closed this task as Resolved.Feb 25 2019, 8:09 PM