While looking at the logs of the maintenance scripts on the terbium replacement machine, I found that our logs are filled with the errors shown in P2293.
I traced the problem to the cronjob for the Wikibase dispatchChanges script:
php /srv/mediawiki-staging/multiversion/MWScript.php extensions/Wikidata/extensions/Wikibase/repo/maintenance/dispatchChanges.php --wiki wikidatawiki --max-time 1600 --batch-size 275 --dispatch-interval 25 --lock-grace-interval 200
which is supposed to run every 3 minutes. In practice the script takes far longer than that to complete, so executions pile up: at any given time there are at least 8 concurrent instances of the script, all acting on the database extensively.
To make things worse, by the time an instance approaches its timeout it typically has more than 500 open connections to the databases.
I don't think this is acceptable by any standard, so I propose the following:
- Make the cronjob run with a lock file, so that no more than one instance can run in parallel
- Make the cronjob run a bit less often
- The logic behind this script should be revisited; perhaps these changes should be dispatched via our jobqueue rather than via a maintenance script
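For the first point, a minimal sketch using flock(1) from util-linux, which is what we'd wrap the cron command in (the lock path here is hypothetical, and the `sleep` stands in for the long-running dispatchChanges invocation):

```shell
#!/bin/sh
# Demonstrates serializing cron runs with flock(1).
# /tmp/dispatchChanges.lock is a placeholder lock path; in the crontab
# the wrapped command would be the MWScript.php invocation instead of sleep.
LOCK=/tmp/dispatchChanges.lock

# First invocation acquires the lock and holds it while it runs.
flock -n "$LOCK" sleep 2 &

sleep 0.2
# A concurrent second invocation fails immediately (-n = non-blocking)
# instead of piling up behind the first one.
if flock -n "$LOCK" true; then
    echo "second run acquired lock"
else
    echo "second run skipped: lock held"
fi
wait
```

With `-n`, an overlapping cron fire simply exits instead of queueing, so at most one instance touches the database at a time.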
In general I don't think it is sensible to allow this many connections to be opened at once; I suspect this is a bug in the code.