
Job for tools.urbanecmbot stuck in dt state
Closed, Resolved, Public


Hi, I just noticed that my job for tools.urbanecmbot is stuck in the dt state and does not execute:

tools.urbanecmbot@tools-sgebastion-07 ~/11bots
$ qstat
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
1143140 0.35234 lighttpd-u tools.urbane r     10/14/2020 21:57:42 webgrid-lighttpd@tools-sgewebg     1
1962508 0.26716 zopsAnnoun tools.urbane dt    02/21/2021 16:17:07     1
3150285 0.25043 patrolTrus tools.urbane r     02/21/2021 03:10:48 continuous@tools-sgeexec-0947.     1
3150286 0.25043 patrolAfte tools.urbane r     02/21/2021 03:10:48 continuous@tools-sgeexec-0942.     1
3159570 0.25030 patrolSand tools.urbane r     02/21/2021 07:05:18 continuous@tools-sgeexec-0935.     1
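
For longer listings, a minimal Python sketch of how one might pick out such jobs. It assumes the default whitespace-separated `qstat` layout shown above and flags any state containing `d` (deletion requested) or `t` (transferring). That heuristic and the name `stuck_jobs` are illustrative only, not part of any Toolforge tooling.

```python
# Heuristic sketch: flag Grid Engine jobs whose state letters suggest they may
# be stuck. Assumes the default, whitespace-separated `qstat` listing.

# Subset of Grid Engine state letters:
#   r = running, t = transferring, d = deletion requested
SUSPICIOUS_LETTERS = {"d", "t"}


def stuck_jobs(qstat_text: str) -> list[tuple[str, str, str]]:
    """Return (job_id, name, state) for each job whose state contains d or t."""
    flagged = []
    for line in qstat_text.splitlines():
        fields = line.split()
        # Real job rows start with a numeric job-ID; this skips the header,
        # any dashed separator row, and blank lines. The queue column can be
        # empty (as for job 1962508), so only the first five fields are used.
        if len(fields) < 5 or not fields[0].isdigit():
            continue
        job_id, _priority, name, _user, state = fields[:5]
        if SUSPICIOUS_LETTERS & set(state):
            flagged.append((job_id, name, state))
    return flagged


SAMPLE = """\
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
1143140 0.35234 lighttpd-u tools.urbane r     10/14/2020 21:57:42 webgrid-lighttpd@tools-sgewebg     1
1962508 0.26716 zopsAnnoun tools.urbane dt    02/21/2021 16:17:07     1
3150285 0.25043 patrolTrus tools.urbane r     02/21/2021 03:10:48 continuous@tools-sgeexec-0947.     1
"""

print(stuck_jobs(SAMPLE))  # [('1962508', 'zopsAnnoun', 'dt')]
```

A bare `t` is usually brief while a job is being dispatched, so a flagged job mostly matters when it stays that way, as the `dt` job here did.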

Could you help me unstick it, please?

Something like this was previously reported in T136508, but I'm unsure whether the information provided there is still applicable.

Event Timeline

Bstorm added a subscriber: Bstorm.

Apparently something went wrong with the queueing. Several queues were in an error state this morning (none of them related to this job), and I've cleared them. Interestingly, this job was supposed to run on the host affected by T275411 (probably unrelated, but worth noting).

It's been doing this for a couple of days:
02/22/2021 18:27:08| timer|tools-sgegrid-master|W|failed to deliver job 1962508.1 to queue ""
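
To see how often a warning like this repeats, one could scan the grid master's log line by line. The sketch below assumes every line follows the pipe-delimited layout of the single message quoted above (timestamp, component, host, severity, message); that layout is inferred from one line only, and the helper names `parse_line` and `delivery_failures` are hypothetical.

```python
import re
from collections import Counter
from datetime import datetime

# Assumed layout, based on the single line quoted above:
#   timestamp | component | host | severity | message
DELIVERY_FAILURE = re.compile(r"failed to deliver job (\d+)\.(\d+)")
TIMESTAMP_FORMAT = "%m/%d/%Y %H:%M:%S"


def parse_line(line: str):
    """Split one log line into (time, component, host, severity, message), or None."""
    parts = [part.strip() for part in line.split("|", 4)]
    if len(parts) != 5:
        return None
    try:
        when = datetime.strptime(parts[0], TIMESTAMP_FORMAT)
    except ValueError:
        return None
    return (when, *parts[1:])


def delivery_failures(lines):
    """Count 'failed to deliver job' warnings per job ID and keep the latest time."""
    counts = Counter()
    last_seen = {}
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue
        when, _component, _host, _severity, message = parsed
        match = DELIVERY_FAILURE.search(message)
        if match:
            job_id = match.group(1)  # group(2) is the task number after the dot
            counts[job_id] += 1
            last_seen[job_id] = max(when, last_seen.get(job_id, when))
    return counts, last_seen


log = [
    '02/22/2021 18:27:08| timer|tools-sgegrid-master|W|failed to deliver job 1962508.1 to queue ""',
]
counts, last_seen = delivery_failures(log)
print(counts["1962508"], last_seen["1962508"])  # 1 2021-02-22 18:27:08
```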

Mentioned in SAL (#wikimedia-cloud) [2021-02-22T18:56:21Z] <bstorm> deleted job 1962508 from the grid to clear it up T275301

I don't know why the job got stuck, but it may be related to the exec node being out of sync with OpenStack. I'm going to depool it.

I'm basing that suspicion partly on the host's wtmp log, which reports "wtmp begins Mon Feb 22 15:55:22 2021".
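
One way to read that clue: the host's login records only start after the stuck job was submitted, which would fit the node having been reset underneath the grid. A small sketch of that comparison, assuming both timestamps are in the same timezone (the task does not say which); the helper `predates_wtmp` is hypothetical.

```python
from datetime import datetime

# Both timestamps are copied from this task. Parsing "%a %b" names assumes an
# English (C) locale, which is Python's default unless setlocale() is called.
WTMP_BEGINS = datetime.strptime("Mon Feb 22 15:55:22 2021", "%a %b %d %H:%M:%S %Y")
JOB_SUBMITTED = datetime.strptime("02/21/2021 16:17:07", "%m/%d/%Y %H:%M:%S")


def predates_wtmp(event: datetime, wtmp_begins: datetime) -> bool:
    """True if the event happened before the host's wtmp records begin."""
    return event < wtmp_begins


print(predates_wtmp(JOB_SUBMITTED, WTMP_BEGINS))  # True
print(WTMP_BEGINS - JOB_SUBMITTED)  # 23:38:15
```

This only shows ordering; as noted below, it does not prove the two events are connected.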

Meh, there may be no correlation at all between these things, but depooling doesn't hurt. The job should be OK either way.