Description

We should have some kind of watchdog script that kills queries that take longer than a certain amount of time (since we're running on our own limited VM rather than on Tool Labs).

Event Timeline
There's a query killer bash script for the old XTools that we might be able to use:
timeout=120
logfile="query-guard.log"
dbhosts=(s1 s2 s3 s4 s5 s6 s7)
#dbhosts=(s1 s3 s6 s7)
querymarker="%%"

sqlexec() {
    mysql --defaults-file="${HOME}"/replica.my.cnf -A -n -s -h "$1" -D '' -e "$2"
}

for host in "${dbhosts[@]}"; do
    queryfind="SELECT id FROM INFORMATION_SCHEMA.PROCESSLIST
        WHERE user = SUBSTRING_INDEX((SELECT CURRENT_USER), '@', 1)
        AND info IS NOT NULL
        AND info LIKE '${querymarker}'
        AND time > '${timeout}';"
    IDS=$(sqlexec "${host}.labsdb" "${queryfind}")
    echo "$host"
    for ID in $IDS; do
        sqlexec "${host}.labsdb" "SELECT * FROM INFORMATION_SCHEMA.PROCESSLIST WHERE id = '${ID}';" >> "${logfile}"
        sqlexec "${host}.labsdb" "kill ${ID};"
        printf '%s killed: %s \n\n' "$(date)" "${ID}" >> "${logfile}"
    done
done
> (since we're running on our own limited VM rather than on Tool Labs)
You probably have more resources on your VM than you would have had in Toolforge, but +1 for being nice to the database regardless.
I'm pretty sure that Quarry has something for doing this. Maybe take a look in its source to see if it is obvious and portable to your PHP project? If you get stuck or want a review I would suggest pinging our fine DBA team for ideas.
It would be neat actually if we could offer a general purpose solution to this common problem for tools and other projects that want a lower threshold than the limit that is enforced by the database servers themselves. That would probably take the form of some sort of script that we provision on all Toolforge and VPS hosts that knows how to read the user credentials from $HOME/replica.my.cnf and do something similar to the script shared by @MusikAnimal. It could probably be written in python and reuse a single db connection per-host to be a bit lighter weight than that bash script.
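A general-purpose version might look roughly like the following Python sketch. It is only an illustration of the idea, not an existing tool: the function names, host list, and timeout are assumptions, and it presumes the pymysql library is available and credentials live in ~/replica.my.cnf.

```python
# Hypothetical sketch of a per-host query guard. Assumptions: pymysql is
# installed, credentials are in ~/replica.my.cnf under a [client] section,
# and the host/timeout values below are illustrative, not a real config.
import configparser
import os

TIMEOUT = 120  # seconds a query may run before being killed
HOSTS = ["s1", "s2", "s3", "s4", "s5", "s6", "s7"]


def read_credentials(path=os.path.expanduser("~/replica.my.cnf")):
    """Read user/password from a MySQL-style credentials file."""
    cfg = configparser.ConfigParser()
    cfg.read(path)
    return cfg["client"]["user"], cfg["client"]["password"]


def slow_query_sql(timeout=TIMEOUT):
    """Same check as the bash script: our own queries older than the timeout."""
    return (
        "SELECT id FROM INFORMATION_SCHEMA.PROCESSLIST "
        "WHERE user = SUBSTRING_INDEX(CURRENT_USER(), '@', 1) "
        "AND info IS NOT NULL AND time > %d" % timeout
    )


def guard_host(host, user, password):
    """Find and kill slow queries on one host, reusing a single connection."""
    import pymysql  # imported here so the helpers above work without it

    conn = pymysql.connect(host=host + ".labsdb", user=user, password=password)
    try:
        with conn.cursor() as cur:
            cur.execute(slow_query_sql())
            for (qid,) in cur.fetchall():
                cur.execute("KILL %d" % qid)
    finally:
        conn.close()
```

Reusing one connection per host for both the lookup and the kills avoids the repeated `mysql` invocations the bash script makes.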
pt-kill is now active on the prod and staging servers. It's set up to run every minute and kill queries running longer than 120 seconds. It's logging what it kills.
The prod one is running under the prod xtools DB user, which means it'll also kill long-running queries for old XTools. That should be okay, shouldn't it (or even desirable)?
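For reference, a setup like the one described (run every minute, kill after 120 seconds, log the kills) could be expressed as a crontab entry along these lines. This is a guess at the shape of it, not the actual deployed config; the log path is assumed.

```
# Hypothetical crontab entry; paths are assumptions, flags are standard pt-kill options.
* * * * * pt-kill --defaults-file=$HOME/replica.my.cnf --host s1.labsdb \
    --busy-time 120 --kill --print --run-time 55s \
    >> $HOME/killed_slow_queries.txt 2>&1
```

`--busy-time` sets the threshold, `--kill` actually terminates the query, and `--print` records a line for each kill, which matches the logging behaviour mentioned above.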
Documentation updated: https://wikitech.wikimedia.org/wiki/Tool:XTools
https://xtools.wmflabs.org/killed_slow_queries.txt and https://xtools-dev.wmflabs.org/killed_slow_queries.txt (they're both empty at the moment).
There's no log rotation on them at the moment. I don't suppose they'll fill up that fast, but there's nothing emptying them. I'll make a note of this...
You probably have a program or know how to write a log rotation script, but in case you don't, this is what I use (good ole bash):
for logfile in *.out *.log *.whatever; do
    tail -c 100000 "$logfile" > temp.$$
    mv temp.$$ "$logfile"
done
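Alternatively, the system logrotate could handle it. A config fragment along these lines would cap the files; the path is an assumption, and `copytruncate` is used so the writer can keep the file open across rotations.

```
# Hypothetical /etc/logrotate.d/xtools entry; the log path is assumed.
/home/xtools/killed_slow_queries.txt {
    size 100k
    rotate 3
    missingok
    copytruncate
}
```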