Page MenuHomePhabricator

Grid jobs often stuck after Tool Labs maintenance
Open, NormalPublic

Description

Over the past 2 weeks I discovered that two background tools of mine were broken as of November 16.

Runs every 15 minutes from a crontab:

0,15,30,45 * * * * /usr/bin/jsub -once -quiet -l release=trusty -mem 500m -N fileprotectionsync $HOME/pywiki/bin/python $HOME/src/pywiki-fileprotectionsync/fileprotectionsync.py
fileprotectionsync.err
Sleeping for 9.1 seconds, 2016-11-16 22:18:38
Page [[Commons:Auto-protected files/wikipedia/bn]] saved

Unable to initialize environment because of error: SGE_ROOT directory "/var/lib/gridengine" doesn't exist
Exiting.

Logging in to commons:commons as KrinkleBot@Autoprotect
Sleeping for 6.9 seconds, 2016-11-16 22:45:51
Page [[Commons:Auto-protected files/wikipedia/de]] saved
Sleeping for 9.4 seconds, 2016-11-16 22:45:58
Page [[Commons:Auto-protected files/wikipedia/en]] saved
Sleeping for 9.5 seconds, 2016-11-16 22:46:08
Page [[Commons:Auto-protected files/wikipedia/bn]] saved

[Wed Nov 16 23:00:24 2016] there is a job named 'fileprotectionsync' already active
[Wed Nov 16 23:15:12 2016] there is a job named 'fileprotectionsync' already active
[Wed Nov 16 23:30:14 2016] there is a job named 'fileprotectionsync' already active
[Wed Nov 16 23:45:09 2016] there is a job named 'fileprotectionsync' already active
[Thu Nov 17 00:00:42 2016] there is a job named 'fileprotectionsync' already active
...

When I logged in a few days ago, it hadn't been running for 3 days. There was a ghost process listed in qstat that wasn't doing anything. I suspect it somehow got stuck when there was some kind of maintenance going on.

https://wikitech.wikimedia.org/w/index.php?title=Nova_Resource:Tools/SAL&oldid=1008364#2016-11-16

Then today I notice snapshots (https://tools.wmflabs.org/snapshots/) hasn't run since November 16th, either.

snapshots-updateSnaphots.err
[Wed Nov 16 21:00:48 2016] there is a job named 'snapshots-updateSnaphots' already active
[Wed Nov 16 22:00:45 2016] there is a job named 'snapshots-updateSnaphots' already active
[Wed Nov 16 23:00:22 2016] there is a job named 'snapshots-updateSnaphots' already active
[Thu Nov 17 00:00:37 2016] there is a job named 'snapshots-updateSnaphots' already active
...
[Thu Nov 24 20:00:24 2016] there is a job named 'snapshots-updateSnaphots' already active
[Thu Nov 24 21:00:25 2016] there is a job named 'snapshots-updateSnaphots' already active
[Thu Nov 24 22:00:25 2016] there is a job named 'snapshots-updateSnaphots' already active
[Thu Nov 24 23:00:25 2016] there is a job named 'snapshots-updateSnaphots' already active
[Fri Nov 25 00:00:43 2016] there is a job named 'snapshots-updateSnaphots' already active
[Fri Nov 25 01:00:24 2016] there is a job named 'snapshots-updateSnaphots' already active
[Fri Nov 25 02:00:25 2016] there is a job named 'snapshots-updateSnaphots' already active

Same symptom. qstat shows that job 50919 has been stuck since November 16. After killing it, it works fine again.

I don't mind a little maintenance every now and then (when it's announced), but this isn't the first time actually (just the first time I'm reporting it). This has been happening for the past year about once a month. Would be good to know if there's something we can do to defend against this.

Event Timeline

Krinkle created this task.Nov 25 2016, 2:12 AM
Restricted Application added a project: Cloud-Services. · View Herald TranscriptNov 25 2016, 2:12 AM
Restricted Application added a subscriber: Aklapper. · View Herald Transcript
Krinkle renamed this task from Grid job stuck from November 16 to Grid jobs often stuck after Tool Labs maintenance.Nov 25 2016, 2:13 AM

Is there a way to identify a job as "ghost process" in a generic way?

scfc triaged this task as Normal priority.Feb 16 2017, 10:40 PM
scfc moved this task from Triage to Backlog on the Toolforge board.
valerio.bozzolan added a comment.EditedJun 4 2018, 5:23 PM

Here to report the same experience, happened yesterday.

I've a job called itwiki-deletionbot that it's sent every ~10 minutes to the grid -once. This task always reports a timestamp just before exiting successfully. Yesterday it exited successfully at 2018-06-03 07:48:18 but then the grid wrongly started to report the job as active for the following ~4 hours, until we run a qdel to unlock the situation.