fetchSuggestions opens connection to depooled database after nine hours
Closed, Declined · Public

Assigned To

Authored By

	Ladsgroup
	Nov 26 2021, 6:34 AM

Description

Today, I had to kill this script in mwmaint:

ladsgroup@mwmaint1002:~$ ps aux | grep -i commonswiki
ladsgro+   redacted  0.0  0.0   6048   888 pts/0    S+   06:23   0:00 grep -i commonswiki
cparle   redacted  0.0  0.0   6776  3416 pts/10   S+   Nov24   0:00 /bin/bash /usr/local/bin/mwscript extensions/MachineVision/maintenance/fetchSuggestions.php --wiki=commonswiki --filelist=/home/cparle/mvlist.2M.txt --priority=0
root     redacted  0.0  0.0  10196  3968 pts/10   S+   Nov24   0:00 sudo -u www-data php /srv/mediawiki-staging/multiversion/MWScript.php extensions/MachineVision/maintenance/fetchSuggestions.php --wiki=commonswiki --filelist=/home/cparle/mvlist.2M.txt --priority=0
www-data redacted  1.9  0.1 277824 126012 pts/10  S+   Nov24  50:40 php /srv/mediawiki-staging/multiversion/MWScript.php extensions/MachineVision/maintenance/fetchSuggestions.php --wiki=commonswiki --filelist=/home/cparle/mvlist.2M.txt --priority=0

because it was connecting to db1160, which had been depooled nine hours earlier (and it was the only script doing this). Usually killing the connection works and the script connects to another database, but here it kept creating new connections to the depooled one.

This basically blocks us from doing maintenance (sometimes urgent maintenance) on databases. Please fix this.

Details

	Subject	Repo	Branch	Lines +/-
	Wait for replication after committing suggestions	mediawiki/extensions/MachineVision	master	+1 -0


Related Objects

		Status	Subtype	Assigned	Task
		Resolved		Cparle	T277301 [L] Create script to add existing images on Commons from specific categories to the popular CAT queue
		Declined		matthiasmullie	T296507 fetchSuggestions opens connection to depooled database after nine hours

Event Timeline

Ladsgroup created this task. Nov 26 2021, 6:34 AM

Restricted Application added a project: Structured-Data-Backlog. Nov 26 2021, 6:34 AM

Restricted Application added a subscriber: Aklapper.

Kormat subscribed. Nov 26 2021, 9:58 AM

CBogen added a parent task: T277301: [L] Create script to add existing images on Commons from specific categories to the popular CAT queue. Nov 29 2021, 3:16 PM

CBogen edited projects, added Structured-Data-Backlog (Current Work); removed Structured-Data-Backlog. Nov 29 2021, 3:28 PM

CBogen moved this task from Incoming to Blocked on the Structured-Data-Backlog (Current Work) board.

Curious: what is blocking this ticket?

In T296507#7674718, @matthiasmullie wrote:

Curious: what is blocking this ticket?

My understanding is that this was caused by T277301 which is currently on hold while we wait for feedback from the users who requested it. Since the scripts for T277301 aren't being run right now, this isn't an issue until they are again. @Cparle can confirm.

@CBogen speaks the truth!

This is no longer blocked. T277301 has been closed but we still want to fix the script in case we need to run it again. Moving back to ready for development.

CBogen moved this task from Blocked to Ready for Development on the Structured-Data-Backlog (Current Work) board. Feb 8 2022, 4:04 PM

Cparle removed Cparle as the assignee of this task. Mar 21 2022, 10:17 AM

Change 780742 had a related patch set uploaded (by Matthias Mullie; author: Matthias Mullie):

[mediawiki/extensions/MachineVision@master] Wait for replication after committing suggestions

https://gerrit.wikimedia.org/r/780742

gerritbot added a project: Patch-For-Review. Apr 14 2022, 12:12 PM

AFAICT, nothing too much out of the ordinary is happening in this script, so I'm not sure exactly why this is happening, when it's not happening for other scripts.
I did notice this script is not waiting for replication though; maybe that's where some relevant DB connection cleanup happens?

matthiasmullie claimed this task. Apr 14 2022, 12:16 PM

matthiasmullie moved this task from Ready for Development to Code Review on the Structured-Data-Backlog (Current Work) board.

In T296507#7855052, @matthiasmullie wrote:

AFAICT, nothing too much out of the ordinary is happening in this script, so I'm not sure exactly why this is happening, when it's not happening for other scripts.

Well, sadly this is an issue in all scripts; see T298485: MW scripts should reload the database config. A maint script doesn't update the list of replicas (to remove the depooled one) as long as it's running. In web requests, that's no longer than a minute, so it's fine, but the longer a maint script takes to run, the more problematic it becomes. We are rethinking how this works in general; see T305016: Think about rdbms reconnection logic.

I did notice this script is not waiting for replication though; maybe that's where some relevant DB connection cleanup happens?

Thanks for fixing that. It would prevent potential outages, so it's useful to do nonetheless, but it wouldn't fix this particular issue. The only possible solution atm is to reduce the time it takes to run the maint script (smaller batches for example). I don't know if that's possible in this one though.

In T296507#7855083, @Ladsgroup wrote:

Well, sadly this is an issue in all scripts; see T298485: MW scripts should reload the database config. A maint script doesn't update the list of replicas (to remove the depooled one) as long as it's running.

Ah too bad, I was afraid this would be the case!
I'll keep an eye on T305016 in case relevant solutions appear.

The only possible solution atm is to reduce the time it takes to run the maint script (smaller batches for example). I don't know if that's possible in this one though.

Sadly, it isn't. Most of the running time is spent waiting for external requests (to the point where it sleeps when rate limited).
All we can do for now is keep this particular issue in mind and run it with smaller batches.

In T296507#7863818, @matthiasmullie wrote:

Sadly, it isn't. Most of the running time is spent waiting for external requests (to the point where it sleeps when rate limited).
All we can do for now is keep this particular issue in mind and run it with smaller batches.

Yeah, just running in smaller batches SGTM.

Closing based on the above; we'll make sure to run the script in smaller batches. Thanks.
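
A minimal sketch of what "smaller batches" could look like in practice, assuming the file list is a plain text file with one entry per line: split it into chunk files and build one `mwscript` invocation per chunk, so that each run is a separate, shorter-lived process. The helper names `split_filelist` and `build_commands`, the chunk naming scheme, and the batch size are assumptions for illustration; the command-line flags are copied from the process listing in the description. That each new process starts with a current replica list is inferred from the discussion above, not verified here.

```python
from pathlib import Path


def split_filelist(filelist: Path, out_dir: Path, batch_size: int) -> list[Path]:
    """Split a one-entry-per-line file list into chunk files of at most batch_size lines."""
    lines = [ln for ln in filelist.read_text().splitlines() if ln.strip()]
    out_dir.mkdir(parents=True, exist_ok=True)
    chunks = []
    for i in range(0, len(lines), batch_size):
        # Hypothetical naming scheme: <original stem>.part0000.txt, .part0001.txt, ...
        chunk = out_dir / f"{filelist.stem}.part{i // batch_size:04d}.txt"
        chunk.write_text("\n".join(lines[i:i + batch_size]) + "\n")
        chunks.append(chunk)
    return chunks


def build_commands(chunks: list[Path]) -> list[list[str]]:
    """Build one argv list per chunk, mirroring the invocation shown in the description."""
    return [
        [
            "mwscript",
            "extensions/MachineVision/maintenance/fetchSuggestions.php",
            "--wiki=commonswiki",
            f"--filelist={chunk}",
            "--priority=0",
        ]
        for chunk in chunks
    ]
```

The commands are only built, not executed; running them one at a time (for example with `subprocess.run(cmd, check=True)`) is left to the operator. The right batch size depends on how long each entry takes, which the thread notes is dominated by external requests and rate limiting.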

CBogen mentioned this in T277301: [L] Create script to add existing images on Commons from specific categories to the popular CAT queue. May 9 2022, 4:46 PM

Change 780742 merged by jenkins-bot:

[mediawiki/extensions/MachineVision@master] Wait for replication after committing suggestions

https://gerrit.wikimedia.org/r/780742

Maintenance_bot removed a project: Patch-For-Review. Jun 24 2022, 12:30 PM

ReleaseTaggerBot added a project: MW-1.39-notes (1.39.0-wmf.18; 2022-06-27). Jun 24 2022, 1:00 PM
