Grid jobs often stuck after Tool Labs maintenance
Closed, Declined · Public

Assigned To

None

Authored By

	Krinkle
	Nov 25 2016, 2:12 AM

Description

Over the past two weeks I discovered that two background tools of mine have been broken since November 16.

krinklebot (Commons:Auto-protected files)

Runs every 15 minutes from a crontab:

0,15,30,45 * * * * /usr/bin/jsub -once -quiet -l release=trusty -mem 500m -N fileprotectionsync $HOME/pywiki/bin/python $HOME/src/pywiki-fileprotectionsync/fileprotectionsync.py

fileprotectionsync.err

Sleeping for 9.1 seconds, 2016-11-16 22:18:38
Page [[Commons:Auto-protected files/wikipedia/bn]] saved

Unable to initialize environment because of error: SGE_ROOT directory "/var/lib/gridengine" doesn't exist
Exiting.

Logging in to commons:commons as KrinkleBot@Autoprotect
Sleeping for 6.9 seconds, 2016-11-16 22:45:51
Page [[Commons:Auto-protected files/wikipedia/de]] saved
Sleeping for 9.4 seconds, 2016-11-16 22:45:58
Page [[Commons:Auto-protected files/wikipedia/en]] saved
Sleeping for 9.5 seconds, 2016-11-16 22:46:08
Page [[Commons:Auto-protected files/wikipedia/bn]] saved

[Wed Nov 16 23:00:24 2016] there is a job named 'fileprotectionsync' already active
[Wed Nov 16 23:15:12 2016] there is a job named 'fileprotectionsync' already active
[Wed Nov 16 23:30:14 2016] there is a job named 'fileprotectionsync' already active
[Wed Nov 16 23:45:09 2016] there is a job named 'fileprotectionsync' already active
[Thu Nov 17 00:00:42 2016] there is a job named 'fileprotectionsync' already active
...

When I logged in a few days ago, it hadn't been running for 3 days. There was a ghost process listed in qstat that wasn't doing anything. I suspect it somehow got stuck when there was some kind of maintenance going on.

https://wikitech.wikimedia.org/w/index.php?title=Nova_Resource:Tools/SAL&oldid=1008364#2016-11-16

Then today I noticed that snapshots (https://tools.wmflabs.org/snapshots/) hasn't run since November 16 either.

snapshots-updateSnaphots.err

[Wed Nov 16 21:00:48 2016] there is a job named 'snapshots-updateSnaphots' already active
[Wed Nov 16 22:00:45 2016] there is a job named 'snapshots-updateSnaphots' already active
[Wed Nov 16 23:00:22 2016] there is a job named 'snapshots-updateSnaphots' already active
[Thu Nov 17 00:00:37 2016] there is a job named 'snapshots-updateSnaphots' already active
...
[Thu Nov 24 20:00:24 2016] there is a job named 'snapshots-updateSnaphots' already active
[Thu Nov 24 21:00:25 2016] there is a job named 'snapshots-updateSnaphots' already active
[Thu Nov 24 22:00:25 2016] there is a job named 'snapshots-updateSnaphots' already active
[Thu Nov 24 23:00:25 2016] there is a job named 'snapshots-updateSnaphots' already active
[Fri Nov 25 00:00:43 2016] there is a job named 'snapshots-updateSnaphots' already active
[Fri Nov 25 01:00:24 2016] there is a job named 'snapshots-updateSnaphots' already active
[Fri Nov 25 02:00:25 2016] there is a job named 'snapshots-updateSnaphots' already active

Same symptom. qstat shows that job 50919 has been stuck since November 16. After killing it, it works fine again.

I don't mind a little maintenance every now and then (when it's announced), but this isn't the first time this has happened, just the first time I'm reporting it. It has been happening about once a month for the past year. It would be good to know if there's something we can do to defend against this.
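As a possible defence on the tool side, the .err files quoted above already carry enough information to notice a stuck job early. The following is a minimal sketch, assuming the log line format shown in this task. The function names `trailing_blocked_streak` and `is_stuck` and the one-hour default threshold are hypothetical choices for illustration, not part of jsub or the grid tooling.

```python
import re
from datetime import datetime, timedelta

# Matches lines of the form quoted in this task, e.g.
# [Wed Nov 16 23:00:24 2016] there is a job named 'fileprotectionsync' already active
ACTIVE_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\] there is a job named '(?P<name>[^']+)' already active$"
)


def trailing_blocked_streak(lines):
    """Return (job_name, first_ts, last_ts, count) for the run of
    'already active' messages at the end of the log, or None."""
    name = first = last = None
    count = 0
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # blank lines do not interrupt a streak
        match = ACTIVE_RE.match(line)
        if match is None:
            # Any other output means the job actually ran: reset the streak.
            name = first = last = None
            count = 0
            continue
        ts = datetime.strptime(match.group("ts"), "%a %b %d %H:%M:%S %Y")
        if count == 0 or match.group("name") != name:
            name, first, count = match.group("name"), ts, 0
        last = ts
        count += 1
    if count == 0:
        return None
    return name, first, last, count


def is_stuck(streak, max_blocked=timedelta(hours=1)):
    """Heuristic: the job has been refused for at least max_blocked."""
    if streak is None:
        return False
    _, first, last, _ = streak
    return last - first >= max_blocked
```

A cron job could run this over the tail of the .err file and send a notification when `is_stuck` returns true. The threshold should be well above the job's normal runtime, otherwise a legitimately long run would be flagged.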

Related Objects

Status	Subtype	Assigned	Task
Resolved		Bstorm	T199271 Upgrade the tools gridengine system
Declined		None	T151603 Grid jobs often stuck after Tool Labs maintenance

Event Timeline

Krinkle created this task. Nov 25 2016, 2:12 AM

Restricted Application added a project: Cloud-Services. Nov 25 2016, 2:12 AM

Restricted Application added a subscriber: Aklapper.

Krinkle renamed this task from "Grid job stuck from November 16" to "Grid jobs often stuck after Tool Labs maintenance". Nov 25 2016, 2:13 AM

Paladox subscribed. Nov 25 2016, 8:35 AM

Is there a way to identify a job as "ghost process" in a generic way?
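One generic heuristic, sketched here under assumptions: treat any job that has been running far longer than its expected runtime as a ghost candidate. The sketch assumes the output of `qstat -xml` contains `job_list` elements with `JB_job_number`, `JB_name` and `JAT_start_time` children. Those element names should be checked against the real output before relying on this. The function name `overdue_jobs` and the `max_runtime` parameter are hypothetical.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timedelta


def overdue_jobs(qstat_xml, now, max_runtime):
    """Return (job_number, job_name, started) for every job that has been
    running longer than max_runtime.

    qstat_xml is assumed to be the text printed by `qstat -xml`.
    """
    root = ET.fromstring(qstat_xml)
    overdue = []
    for job in root.iter("job_list"):
        start = job.findtext("JAT_start_time")
        if not start:
            continue  # no start time: the job is not running yet
        # Keep only the "YYYY-MM-DDTHH:MM:SS" part in case a suffix follows.
        started = datetime.strptime(start[:19], "%Y-%m-%dT%H:%M:%S")
        if now - started > max_runtime:
            overdue.append(
                (job.findtext("JB_job_number"), job.findtext("JB_name"), started)
            )
    return overdue
```

This cannot tell a ghost from a job that is legitimately slow, so it only produces candidates. For a job submitted every 15 minutes, a `max_runtime` of a few hours would be a conservative starting point.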

scfc triaged this task as Medium priority. Feb 16 2017, 10:40 PM

scfc moved this task from Backlog to Ready to be worked on on the Toolforge board.

valerio.bozzolan subscribed. Jun 4 2018, 5:02 PM

Here to report the same experience, which happened yesterday.

I have a job called itwiki-deletionbot that is submitted to the grid with -once every ~10 minutes. The job always logs a timestamp just before exiting successfully. Yesterday it exited successfully at 2018-06-03 07:48:18, but the grid then wrongly kept reporting the job as active for the following ~4 hours, until we ran qdel to unblock it.

bd808 added a parent task: T199271: Upgrade the tools gridengine system. Mar 26 2019, 12:49 AM

Not actionable.

This happened again in the itwiki tool, on another, unrelated script. It was running with the job name itwiki-orphanizerbot-gridnamefix2.

The job stayed in active status for two days instead of a few minutes:

[Tue Oct 20 22:00:16 2020] there is a job named 'itwiki-orphanizerbot-gridnamefix2' already active
[... lot of entries ...]
[Thu Oct 22 23:36:03 2020] there is a job named 'itwiki-orphanizerbot-gridnamefix2' already active
[Thu Oct 22 23:38:02 2020] there is a job named 'itwiki-orphanizerbot-gridnamefix2' already active
[Thu Oct 22 23:42:02 2020] there is a job named 'itwiki-orphanizerbot-gridnamefix2' already active
[Thu Oct 22 23:44:02 2020] there is a job named 'itwiki-orphanizerbot-gridnamefix2' already active
[Thu Oct 22 23:46:02 2020] there is a job named 'itwiki-orphanizerbot-gridnamefix2' already active
[Thu Oct 22 23:48:03 2020] there is a job named 'itwiki-orphanizerbot-gridnamefix2' already active
[Thu Oct 22 23:44:02 2020] there is a job named 'itwiki-orphanizerbot-gridnamefix2' already active

I don't have time now to find out whether it happened during another maintenance window.

valerio.bozzolan mentioned this in T279452: itWiki's orfanizzabot interrupted after Mon Mar 15 12:20:11 2021 for no reason. Apr 6 2021, 4:52 PM
