Over the past 2 weeks I discovered that two background tools of mine were broken as of November 16.
- krinklebot (Commons:Auto-protected files)
Runs every 15 minutes from a crontab:
0,15,30,45 * * * * /usr/bin/jsub -once -quiet -l release=trusty -mem 500m -N fileprotectionsync $HOME/pywiki/bin/python $HOME/src/pywiki-fileprotectionsync/fileprotectionsync.py
Sleeping for 9.1 seconds, 2016-11-16 22:18:38 Page [[Commons:Auto-protected files/wikipedia/bn]] saved Unable to initialize environment because of error: SGE_ROOT directory "/var/lib/gridengine" doesn't exist Exiting. Logging in to commons:commons as KrinkleBot@Autoprotect Sleeping for 6.9 seconds, 2016-11-16 22:45:51 Page [[Commons:Auto-protected files/wikipedia/de]] saved Sleeping for 9.4 seconds, 2016-11-16 22:45:58 Page [[Commons:Auto-protected files/wikipedia/en]] saved Sleeping for 9.5 seconds, 2016-11-16 22:46:08 Page [[Commons:Auto-protected files/wikipedia/bn]] saved [Wed Nov 16 23:00:24 2016] there is a job named 'fileprotectionsync' already active [Wed Nov 16 23:15:12 2016] there is a job named 'fileprotectionsync' already active [Wed Nov 16 23:30:14 2016] there is a job named 'fileprotectionsync' already active [Wed Nov 16 23:45:09 2016] there is a job named 'fileprotectionsync' already active [Thu Nov 17 00:00:42 2016] there is a job named 'fileprotectionsync' already active ...
When I logged in a few days ago, it hadn't been running for 3 days. There was a ghost process listed in qstat that wasn't doing anything. I suspect it somehow got stuck when there was some kind of maintenance going on.
https://wikitech.wikimedia.org/w/index.php?title=Nova_Resource:Tools/SAL&oldid=1008364#2016-11-16
Then today I notice snapshots (https://tools.wmflabs.org/snapshots/) hasn't run since November 16th, either.
[Wed Nov 16 21:00:48 2016] there is a job named 'snapshots-updateSnaphots' already active [Wed Nov 16 22:00:45 2016] there is a job named 'snapshots-updateSnaphots' already active [Wed Nov 16 23:00:22 2016] there is a job named 'snapshots-updateSnaphots' already active [Thu Nov 17 00:00:37 2016] there is a job named 'snapshots-updateSnaphots' already active ... [Thu Nov 24 20:00:24 2016] there is a job named 'snapshots-updateSnaphots' already active [Thu Nov 24 21:00:25 2016] there is a job named 'snapshots-updateSnaphots' already active [Thu Nov 24 22:00:25 2016] there is a job named 'snapshots-updateSnaphots' already active [Thu Nov 24 23:00:25 2016] there is a job named 'snapshots-updateSnaphots' already active [Fri Nov 25 00:00:43 2016] there is a job named 'snapshots-updateSnaphots' already active [Fri Nov 25 01:00:24 2016] there is a job named 'snapshots-updateSnaphots' already active [Fri Nov 25 02:00:25 2016] there is a job named 'snapshots-updateSnaphots' already active
Same symptom. qstat shows that job 50919 has been stuck since November 16. After killing it, it works fine again.
I don't mind a little maintenance every now and then (when it's announced), but this isn't the first time actually (just the first time I'm reporting it). This has been happening for the past year about once a month. Would be good to know if there's something we can do to defend against this.