
Toolforge jobs for milhistbot not running
Closed, Resolved · Public · BUG REPORT

Description

Steps to replicate the issue (include links if applicable):

Toolforge jobs started by the scheduler are listed as running, but are not.

What happens?:

$ toolforge jobs show aclass
+---------------+-----------------------------------------------------------------------------+
| Job name:     | aclass                                                                      |
+---------------+-----------------------------------------------------------------------------+
| Command:      | perl -I /data/project/milhistbot/bin /data/project/milhistbot/bin/aclass.pl |
+---------------+-----------------------------------------------------------------------------+
| Job type:     | schedule: 20 * * * *                                                        |
+---------------+-----------------------------------------------------------------------------+
| Image:        | perl5.32                                                                    |
+---------------+-----------------------------------------------------------------------------+
| Port:         | none                                                                        |
+---------------+-----------------------------------------------------------------------------+
| File log:     | yes                                                                         |
+---------------+-----------------------------------------------------------------------------+
| Output log:   | /data/project/milhistbot/aclass.out                                         |
+---------------+-----------------------------------------------------------------------------+
| Error log:    | /data/project/milhistbot/aclass.err                                         |
+---------------+-----------------------------------------------------------------------------+
| Emails:       | onfailure                                                                   |
+---------------+-----------------------------------------------------------------------------+
| Resources:    | mem: 0.5Gi, cpu: 0.5                                                        |
+---------------+-----------------------------------------------------------------------------+
| Replicas:     |                                                                             |
+---------------+-----------------------------------------------------------------------------+
| Mounts:       | all                                                                         |
+---------------+-----------------------------------------------------------------------------+
| Retry:        | no                                                                          |
+---------------+-----------------------------------------------------------------------------+
| Timeout:      | no                                                                          |
+---------------+-----------------------------------------------------------------------------+
| Health check: | none                                                                        |
+---------------+-----------------------------------------------------------------------------+
| Status:       | Running for 30m57s                                                          |
+---------------+-----------------------------------------------------------------------------+
| Hints:        | Run not attempted yet. Pod in 'Pending' phase.                              |
+---------------+-----------------------------------------------------------------------------+

Run in "pending" phase. (Whatever that means.)
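The mismatch above (Status says "Running" while Hints says the pod is still in the 'Pending' phase) can be checked mechanically. The following is a minimal sketch, assuming the two-column table layout shown in this report; `parse_job_show` and `looks_stuck` are hypothetical helper names, not part of the Toolforge CLI, and `SAMPLE` is a trimmed copy of the output above.

```python
# Hypothetical helpers: they only parse text shaped like the table pasted above.
SAMPLE = """
+---------------+------------------------------------------------+
| Job name:     | aclass                                         |
+---------------+------------------------------------------------+
| Job type:     | schedule: 20 * * * *                           |
+---------------+------------------------------------------------+
| Replicas:     |                                                |
+---------------+------------------------------------------------+
| Status:       | Running for 30m57s                             |
+---------------+------------------------------------------------+
| Hints:        | Run not attempted yet. Pod in 'Pending' phase. |
+---------------+------------------------------------------------+
"""


def parse_job_show(text):
    """Turn the two-column table into a dict of field name -> value."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip the +---+ borders and blank lines
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if len(cells) != 2:
            continue  # not a key/value row
        fields[cells[0].rstrip(":")] = cells[1]
    return fields


def looks_stuck(fields):
    """True when the job is listed as running but its pod is still pending."""
    status = fields.get("Status", "")
    hints = fields.get("Hints", "")
    return status.startswith("Running") and "'Pending' phase" in hints


if __name__ == "__main__":
    print(looks_stuck(parse_job_show(SAMPLE)))
```

The check relies on the exact wording of the Hints line in this report, so it would need adjusting if the CLI phrases the hint differently.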

What should have happened instead?:

Job should have started.

Software version (on Special:Version page; skip for WMF-hosted wikis like Wikipedia):

Other information (browser name/version, screenshots, etc.):

Event Timeline

Restricted Application added a subscriber: Aklapper.
Reedy renamed this task from "Toolforge not running" to "Toolforge job for milhistbot not running". · Sep 23 2025, 8:26 AM

Just FYI (not a solution, just a quick answer to the side question; I will look into the task soon): "pending" usually means the job is waiting for the pod to be created, the container image to be pulled, and the actual process to start. It covers any time between the pod being assigned a node and the node running the process.

Hawkeye7 renamed this task from "Toolforge job for milhistbot not running" to "Toolforge jobs for milhistbot not running". · Sep 23 2025, 8:50 PM

This was not just one job, it was all the MilHist jobs, and they were in the pending state for a very long time before they were killed by the admin or some automated process. This particular job normally takes less than one minute to run. Here it was waiting over half an hour just to start! Something is very wrong.

Since then I am able to start jobs manually again, but it still takes several minutes in the pending state before they run. So it looks like the machine is overloaded?

This is what it looks like:
tools.milhistbot@tools-bastion-15:~$ toolforge jobs list

+---------------+-------------------------+------------------------------------------+
|   Job name:   |        Job type:        |                 Status:                  |
+---------------+-------------------------+------------------------------------------+
|    aclass3    |         one-off         |           Running for 1h20m42s           |
|    aclass     |  schedule: 20 * * * *   |           Running for 1h3m34s            |
|    aclass2    |  schedule: 35 0 * * *   | Last schedule time: 2025-09-24T00:35:00Z |
| announcements |  schedule: 50 0 * * *   |            Running for 33m30s            |
|    archive    |  schedule: 10 15 1 * *  |        Waiting for scheduled time        |
|  autocheck2   |  schedule: 45 12 * * *  | Last schedule time: 2025-09-23T12:45:00Z |
|   autoclass   |  schedule: 45 2 * * *   | Last schedule time: 2025-09-23T02:45:00Z |
|  autoreport   |  schedule: 10 0 1 * *   | Last schedule time: 2025-09-01T00:10:00Z |
|    awards     |  schedule: 30 0 * * *   |            Running for 53m19s            |
|   conflicts   |  schedule: 45 4 * * *   |        Waiting for scheduled time        |
|      fac      | schedule: 05 0,12 * * * |           Running for 1h18m34s           |
|     fanmp     |  schedule: 35 0 * * *   |            Running for 48m34s            |
|      far      | schedule: 15 0,12 * * * |           Running for 1h8m34s            |
|      flc      | schedule: 25 0,12 * * * |            Running for 58m33s            |
|  membership   |  schedule: 10 1 16 * *  |        Waiting for scheduled time        |
|    reviews    | schedule: 30 0 2 */3 *  |        Waiting for scheduled time        |
|     stats     |  schedule: 10 2 7 * *   |        Waiting for scheduled time        |
|     stubs     |  schedule: 45 3 * * *   |        Waiting for scheduled time        |
|      tfa      |  schedule: 45 0 * * *   |            Running for 38m35s            |
+---------------+-------------------------+------------------------------------------+

None of these jobs takes more than a few minutes to run! All of them listed as "Running" are stuck in "pending" phase.
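To spot such jobs without reading the table by eye, the listing can be filtered for anything reported as "Running" longer than expected. This is a rough sketch, assuming the three-column layout and the `1h20m42s` duration style seen above (other units, such as days, are not handled); `running_seconds`, `long_running_jobs`, and the 10-minute threshold are illustrative assumptions, and `SAMPLE` is a trimmed copy of the listing.

```python
import re

# Trimmed copy of the listing above, used only as sample input.
SAMPLE = """
+---------------+-------------------------+----------------------------+
|   Job name:   |        Job type:        |          Status:           |
+---------------+-------------------------+----------------------------+
|    aclass3    |         one-off         |    Running for 1h20m42s    |
|    aclass     |  schedule: 20 * * * *   |    Running for 1h3m34s     |
| announcements |  schedule: 50 0 * * *   |     Running for 33m30s     |
|    archive    |  schedule: 10 15 1 * *  | Waiting for scheduled time |
+---------------+-------------------------+----------------------------+
"""

# Matches statuses such as "Running for 1h20m42s" or "Running for 33m30s".
DURATION_RE = re.compile(r"^Running for (?:(\d+)h)?(?:(\d+)m)?(?:(\d+)s)?$")


def running_seconds(status):
    """Return the reported running time in seconds, or None if not running."""
    match = DURATION_RE.match(status)
    if not match or not any(match.groups()):
        return None
    hours, minutes, seconds = (int(g) if g else 0 for g in match.groups())
    return hours * 3600 + minutes * 60 + seconds


def long_running_jobs(table_text, threshold_seconds):
    """List (job name, seconds) for jobs reported as running past the threshold."""
    flagged = []
    for line in table_text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip borders and blank lines
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if len(cells) != 3:
            continue
        name, _job_type, status = cells
        seconds = running_seconds(status)
        if seconds is not None and seconds > threshold_seconds:
            flagged.append((name, seconds))
    return flagged


if __name__ == "__main__":
    # Assumed threshold: 10 minutes, since these jobs normally finish in a few.
    for name, seconds in long_running_jobs(SAMPLE, 600):
        print(f"{name}: reported running for {seconds}s")
```

A long "Running" time alone does not prove the pod is pending; it only narrows down which jobs are worth inspecting with `toolforge jobs show`.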

@Hawkeye7 The changes to the default resources were applied today; from tonight, you should see the jobs back to the usual triggering delay.

I tried running the one-off job (aclass3) today, and it went from "Pending" to "Pod in 'Pending' phase. State 'waiting'. Reason 'ContainerCreating'" in less than 20 seconds and ran in 14 seconds. So well done!

taavi assigned this task to dcaro.