fetchSuggestions opens connection to depooled database after nine hours
Closed, Declined · Public

Description

Today, I had to kill this script in mwmaint:

ladsgroup@mwmaint1002:~$ ps aux | grep -i commonswiki
ladsgro+   redacted  0.0  0.0   6048   888 pts/0    S+   06:23   0:00 grep -i commonswiki
cparle   redacted  0.0  0.0   6776  3416 pts/10   S+   Nov24   0:00 /bin/bash /usr/local/bin/mwscript extensions/MachineVision/maintenance/fetchSuggestions.php --wiki=commonswiki --filelist=/home/cparle/mvlist.2M.txt --priority=0
root     redacted  0.0  0.0  10196  3968 pts/10   S+   Nov24   0:00 sudo -u www-data php /srv/mediawiki-staging/multiversion/MWScript.php extensions/MachineVision/maintenance/fetchSuggestions.php --wiki=commonswiki --filelist=/home/cparle/mvlist.2M.txt --priority=0
www-data redacted  1.9  0.1 277824 126012 pts/10  S+   Nov24  50:40 php /srv/mediawiki-staging/multiversion/MWScript.php extensions/MachineVision/maintenance/fetchSuggestions.php --wiki=commonswiki --filelist=/home/cparle/mvlist.2M.txt --priority=0

because it was making connections to db1160, which had been depooled nine hours earlier (and it was the only script doing this). Usually, killing the connection works and the script connects to another db, but here it kept creating new connections to the depooled db.

This basically blocks us from doing maintenance (sometimes urgent maintenance) on databases. Please fix this.

Event Timeline

Curious: what is blocking this ticket?

> Curious: what is blocking this ticket?

My understanding is that this was caused by T277301 which is currently on hold while we wait for feedback from the users who requested it. Since the scripts for T277301 aren't being run right now, this isn't an issue until they are again. @Cparle can confirm.

CBogen moved this task from Backlog to Ready for dev on the MachineVision board.

This is no longer blocked. T277301 has been closed but we still want to fix the script in case we need to run it again. Moving back to ready for development.

Change 780742 had a related patch set uploaded (by Matthias Mullie; author: Matthias Mullie):

[mediawiki/extensions/MachineVision@master] Wait for replication after committing suggestions

https://gerrit.wikimedia.org/r/780742

AFAICT, nothing too much out of the ordinary is happening in this script, so I'm not sure exactly why this is happening, when it's not happening for other scripts.
I did notice this script is not waiting for replication though; maybe that's where some relevant DB connection cleanup happens?

> AFAICT, nothing too much out of the ordinary is happening in this script, so I'm not sure exactly why this is happening, when it's not happening for other scripts.

Well, sadly this is an issue in all scripts; see T298485: MW scripts should reload the database config. A maint script doesn't update its list of replicas (to remove the depooled one) for as long as it's running. A web request lasts no longer than a minute, so that's fine there, but the longer a maint script runs, the more problematic it becomes. We are rethinking how this works in general. See T305016: Think about rdbms reconnection logic.
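
To make the failure mode described above concrete, here is a minimal toy model in Python. It is only a sketch under stated assumptions, not MediaWiki's rdbms code: `ReplicaPicker`, `load_config`, `max_age` and the host name `db1161` are hypothetical names invented for illustration. It shows a process that reads the pooled replica list once at startup and keeps choosing a depooled host, next to one that re-reads the list after a maximum age.

```python
import time
from typing import Callable, List, Optional


class ReplicaPicker:
    """Toy model of a long-running process choosing a replica to connect to.

    Hypothetical names throughout; this is not MediaWiki's rdbms code.
    """

    def __init__(
        self,
        load_config: Callable[[], List[str]],
        max_age: Optional[float] = None,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._load_config = load_config
        self._max_age = max_age  # None models "never reload while running"
        self._clock = clock
        self._replicas = list(load_config())  # snapshot taken at startup
        self._loaded_at = clock()

    def pick(self) -> str:
        """Return the replica to use for the next (re)connection."""
        expired = (
            self._max_age is not None
            and self._clock() - self._loaded_at >= self._max_age
        )
        if expired:
            # Re-read the pooled list so depooled hosts drop out.
            self._replicas = list(self._load_config())
            self._loaded_at = self._clock()
        return self._replicas[0]


# Simulated pooled-replica config and clock, so the example runs instantly.
pooled = ["db1160", "db1161"]  # db1161 is a made-up second replica
now = [0.0]

stale = ReplicaPicker(lambda: pooled, max_age=None, clock=lambda: now[0])
fresh = ReplicaPicker(lambda: pooled, max_age=60.0, clock=lambda: now[0])

pooled.remove("db1160")  # db1160 gets depooled
now[0] = 9 * 3600.0      # nine hours later

stale_choice = stale.pick()  # still the depooled host
fresh_choice = fresh.pick()  # reloads the list and avoids it
```

In this model, killing the connection does not help the first process: every reconnect consults the same startup snapshot, which matches the behaviour reported in the description.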

> I did notice this script is not waiting for replication though; maybe that's where some relevant DB connection cleanup happens?

Thanks for fixing that. It would prevent potential outages and is useful to do regardless, but it wouldn't fix this particular issue. The only possible solution atm is to reduce the time it takes to run the maint script (smaller batches for example). I don't know if that's possible in this one though.

> Well, sadly this is an issue in all scripts; see T298485: MW scripts should reload the database config. A maint script doesn't update its list of replicas (to remove the depooled one) for as long as it's running.

Ah too bad, I was afraid this would be the case!
I'll keep an eye on T305016 in case relevant solutions appear.

> The only possible solution atm is to reduce the time it takes to run the maint script (smaller batches for example). I don't know if that's possible in this one though.

Sadly, it isn't. Most of the running time is spent waiting for external requests (to the point where it sleeps when rate limited).
All we can do for now is keep this particular issue in mind and run it with smaller batches.

> Sadly, it isn't. Most of the running time is spent waiting for external requests (to the point where it sleeps when rate limited).
> All we can do for now is keep this particular issue in mind and run it with smaller batches.

Yeah, just running in smaller batches SGTM.

Closing based on the above; we'll make sure to run the script in smaller batches. Thanks.
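
For anyone running the script again, the "smaller batches" approach could look roughly like the Python sketch below. It assumes the `--filelist` input holds one entry per line (the task does not say so), and `split_filelist` and `batch_command` are hypothetical helper names. The command line mirrors the invocation shown in the description. Each batch becomes its own short-lived process, so each run starts with a freshly loaded database config.

```python
import tempfile
from pathlib import Path
from typing import List


def split_filelist(filelist: Path, out_dir: Path, batch_size: int) -> List[Path]:
    """Split a one-entry-per-line file list into numbered batch files."""
    out_dir.mkdir(parents=True, exist_ok=True)
    batches: List[Path] = []
    chunk: List[str] = []

    def flush() -> None:
        # Write the pending lines to the next batch file, if there are any.
        if chunk:
            path = out_dir / f"{filelist.stem}.batch{len(batches):04d}.txt"
            path.write_text("".join(chunk), encoding="utf-8")
            batches.append(path)
            chunk.clear()

    with filelist.open(encoding="utf-8") as handle:
        for line in handle:
            if not line.strip():
                continue  # skip blank lines
            chunk.append(line if line.endswith("\n") else line + "\n")
            if len(chunk) >= batch_size:
                flush()
    flush()  # remaining partial batch
    return batches


def batch_command(batch: Path) -> List[str]:
    """Build the mwscript invocation from the task description for one batch."""
    return [
        "mwscript",
        "extensions/MachineVision/maintenance/fetchSuggestions.php",
        "--wiki=commonswiki",
        f"--filelist={batch}",
        "--priority=0",
    ]


# Small self-contained demonstration with a throwaway file list.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "mvlist.txt"
    src.write_text("A.jpg\nB.jpg\nC.jpg\nD.jpg\nE.jpg\n", encoding="utf-8")
    batches = split_filelist(src, Path(tmp) / "batches", batch_size=2)
    sizes = [len(b.read_text(encoding="utf-8").splitlines()) for b in batches]
    commands = [batch_command(b) for b in batches]
```

The sketch only builds the commands; actually running them (for example one after another via `subprocess.run`) is left out on purpose. A suitable batch size depends on how long a single run may safely last.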

Change 780742 merged by jenkins-bot:

[mediawiki/extensions/MachineVision@master] Wait for replication after committing suggestions

https://gerrit.wikimedia.org/r/780742