
Add a slow query killer to new XTools
Closed, Resolved · Public · 3 Estimated Story Points

Description

We should have some kind of watchdog script that kills queries that take longer than a certain amount of time (since we're running on our own limited VM rather than on Tool Labs).

Event Timeline

There's a query killer bash script for the old XTools that we might be able to use:

#!/bin/bash
# Kill this user's replica queries that have been running longer than ${timeout} seconds.
timeout=120
logfile="query-guard.log"
dbhosts=(s1 s2 s3 s4 s5 s6 s7)
#dbhosts=(s1 s3 s6 s7)
# '%%' matches any query text; set this to a distinctive marker to only kill tagged queries.
querymarker="%%"

sqlexec(){
  mysql --defaults-file="${HOME}"/replica.my.cnf -A -n -s -h "$1" -e "$2"
}

for host in "${dbhosts[@]}"; do

  queryfind="
        SELECT id FROM INFORMATION_SCHEMA.PROCESSLIST
        WHERE user = SUBSTRING_INDEX(CURRENT_USER(), '@', 1)
          AND info IS NOT NULL
          AND info LIKE '${querymarker}'
          AND time > '${timeout}'
    ;"
  IDS=$(sqlexec "${host}.labsdb" "${queryfind}")
  echo "${host}"
  for ID in $IDS; do
    # Log the offending query before killing it.
    sqlexec "${host}.labsdb" "SELECT * FROM INFORMATION_SCHEMA.PROCESSLIST WHERE id = '${ID}';" >> "${logfile}"
    sqlexec "${host}.labsdb" "KILL ${ID};"
    printf '%s  killed: %s\n\n' "$(date)" "${ID}" >> "${logfile}"
  done

done

(since we're running on our own limited VM rather than on Tool Labs)

You probably have more resources on your VM than you would have had in Toolforge, but +1 for being nice to the database regardless.

@bd808: Any suggestions?

I'm pretty sure that Quarry has something for doing this. Maybe take a look in its source to see if it is obvious and portable to your PHP project? If you get stuck or want a review I would suggest pinging our fine DBA team for ideas.

It would be neat actually if we could offer a general purpose solution to this common problem for tools and other projects that want a lower threshold than the limit that is enforced by the database servers themselves. That would probably take the form of some sort of script that we provision on all Toolforge and VPS hosts that knows how to read the user credentials from $HOME/replica.my.cnf and do something similar to the script shared by @MusikAnimal. It could probably be written in python and reuse a single db connection per-host to be a bit lighter weight than that bash script.
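For illustration only, here is a minimal Python sketch of what such a shared script could look like, assuming the pymysql driver is available; the host list, timeout, log path and SQL are placeholders rather than a reviewed design:

#!/usr/bin/env python3
"""Sketch of a generic slow-query killer that reads $HOME/replica.my.cnf."""
import datetime
import os

import pymysql  # assumption: any MySQL driver with a similar API would work

TIMEOUT = 120  # seconds a query may run before being killed
HOSTS = ['s1', 's2', 's3', 's4', 's5', 's6', 's7']
CREDENTIALS = os.path.expanduser('~/replica.my.cnf')
LOGFILE = os.path.expanduser('~/query-guard.log')

def kill_slow_queries(host):
    # One connection per host, reused for both the lookup and the kills.
    conn = pymysql.connect(host=host + '.labsdb',
                           read_default_file=CREDENTIALS)
    try:
        with conn.cursor() as cursor:
            cursor.execute(
                """SELECT id, info, time FROM INFORMATION_SCHEMA.PROCESSLIST
                   WHERE user = SUBSTRING_INDEX(CURRENT_USER(), '@', 1)
                     AND info IS NOT NULL
                     AND time > %s""",
                (TIMEOUT,))
            victims = cursor.fetchall()
            with open(LOGFILE, 'a') as log:
                for thread_id, info, seconds in victims:
                    cursor.execute('KILL %s' % int(thread_id))
                    log.write('%s killed %s on %s after %ss: %s\n'
                              % (datetime.datetime.now(), thread_id, host,
                                 seconds, info))
    finally:
        conn.close()

if __name__ == '__main__':
    for host in HOSTS:
        kill_slow_queries(host)

Holding one connection per host and issuing the KILLs over it avoids starting a fresh mysql client for every statement, which is the "lighter weight" point made above.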

kaldari edited projects, added Community-Tech-Sprint; removed Community-Tech.
kaldari set the point value for this task to 3.
Samwilson subscribed.

pt-kill is now active on the prod and staging servers. It's set up to run every minute and kill queries running longer than 120 seconds. It's logging what it kills.

The prod one is running under the prod xtools DB user, which means it'll also kill long-running queries for old xtools. That should be okay, shouldn't it (or even desirable)?
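For reference, a pt-kill setup along the lines described (check every minute, 120-second threshold, log what gets killed) can be expressed as a single daemonized invocation. The flags below are standard pt-kill options from Percona Toolkit, but the host, DSN and log path are placeholders, not the actual XTools configuration:

# Hypothetical invocation; the real XTools setup may differ.
pt-kill --daemonize --interval 60 --busy-time 120 \
  --kill --print --victims all \
  --log "$HOME/killed_slow_queries.txt" \
  F="$HOME/replica.my.cnf",h=s1.labsdb

pt-kill keeps a single connection open and re-checks at each --interval, so it is lighter than spawning a new client for every check.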

Documentation updated: https://wikitech.wikimedia.org/wiki/Tool:XTools

@Samwilson: Where is the query killer log?

https://xtools.wmflabs.org/killed_slow_queries.txt and https://xtools-dev.wmflabs.org/killed_slow_queries.txt (they're both empty at the moment).

There's no log rotation on them at the moment. I don't suppose they'll fill up that fast, but there's nothing emptying them. I'll make a note of this...

You probably have a program or know how to write a log rotation script, but in case you don't, this is what I use (good ole bash):

for logfile in *.out *.log *.whatever
do
   # Keep only the last 100 KB of each log file.
   tail -c 100000 "$logfile" > temp.$$ && mv temp.$$ "$logfile"
done