
Jobs failing with no logs
Open, Medium | Public | BUG REPORT

Description

Steps to replicate the issue (include links if applicable):
My MilHistBot and AussieBit Dotnet bots are failing.

What happens?:
I get an email that reads:

We wanted to notify you about the activity of some jobs in the 'milhistbot' Toolforge tool.

  • Job 'conflicts' (cronjob) (emails: onfailure) had 1 events:
    • Pod 'conflicts-29287005-mhg47'. Phase: 'failed'. Container state: 'terminated'. Start timestamp 2025-09-07T04:46:03Z. Finish timestamp 2025-09-07T04:46:03Z. Exit code was '2'. With reason 'Error'.

If you requested 'filelog' for any of the jobs mentioned above, you may find additional information about what happened in the associated log files. Check them from Toolforge bastions as usual.
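A note for reading the notification above: the numeric part of the pod name appears to be the scheduled run time in minutes since the Unix epoch, which is how the Kubernetes CronJob controller names the jobs it creates, as far as I know. A minimal Python sketch, assuming that naming scheme holds (the function name is hypothetical):

```python
from datetime import datetime, timezone


def scheduled_time_from_pod_name(pod_name: str) -> datetime:
    """Recover the scheduled run time from a CronJob pod name.

    Assumes the name looks like '<job>-<minutes since epoch>-<random suffix>'.
    """
    # 'conflicts-29287005-mhg47' -> ['conflicts', '29287005', 'mhg47']
    _job_name, minutes, _suffix = pod_name.rsplit("-", 2)
    return datetime.fromtimestamp(int(minutes) * 60, tz=timezone.utc)


print(scheduled_time_from_pod_name("conflicts-29287005-mhg47").isoformat())
# → 2025-09-07T04:45:00+00:00
```

For the pod in the email this gives 04:45 UTC, which matches the `45 4 * * *` schedule, so the failed pod belongs to that day's scheduled run.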

What should have happened instead?:
I do not get any automatically generated log files. The configuration is like this:

tools.milhistbot@tools-bastion-13:~$ toolforge jobs show conflicts
+---------------+------------------------------------------+
| Job name:     | conflicts                                |
+---------------+------------------------------------------+
| Command:      | web -f -n=500                            |
+---------------+------------------------------------------+
| Job type:     | schedule: 45 4 * * *                     |
+---------------+------------------------------------------+
| Image:        | tool-milhistbot/conflicts:latest         |
+---------------+------------------------------------------+
| Port:         | none                                     |
+---------------+------------------------------------------+
| File log:     | yes                                      |
+---------------+------------------------------------------+
| Output log:   | logs/conflicts.stdout.log                |
+---------------+------------------------------------------+
| Error log:    | logs/conflicts.stderr.log                |
+---------------+------------------------------------------+
| Emails:       | onfailure                                |
+---------------+------------------------------------------+
| Resources:    | mem: 0.5Gi, cpu: 0.5                     |
+---------------+------------------------------------------+
| Replicas:     |                                          |
+---------------+------------------------------------------+
| Mounts:       | all                                      |
+---------------+------------------------------------------+
| Retry:        | no                                       |
+---------------+------------------------------------------+
| Timeout:      | no                                       |
+---------------+------------------------------------------+
| Health check: | none                                     |
+---------------+------------------------------------------+
| Status:       | Last schedule time: 2025-09-07T04:45:00Z |
+---------------+------------------------------------------+
| Hints:        | No pods were created for this job.       |
+---------------+------------------------------------------+

The jobs are configured internally to write to a log file. This does not happen.
They are also configured internally to email me on errors. This does not occur either.

If I run the job manually with

toolforge jobs run --image tool-milhistbot/conflicts:latest --mount=all --command "web -f -n=100" conflicts2

it runs okay.

Any advice would be greatly appreciated.
Software version (on Special:Version page; skip for WMF-hosted wikis like Wikipedia):

Other information (browser name/version, screenshots, etc.):

Event Timeline

Part of the problem is this setting:

Output log:    logs/conflicts.stdout.log

It does not work, and likely causes the job to fail.

Hmm, given the command you wrote there, it should not configure the filelogs at all. Testing it in one of my tools, it does not:

tools.wm-lol@tools-bastion-15:~$ toolforge jobs run --image tool-milhistbot/conflicts:latest --mount=all --command "web -f -n=100" conflicts2

...

tools.wm-lol@tools-bastion-15:~$ toolforge jobs show conflicts2
+---------------+------------------------------------------------------------------+
| Job name:     | conflicts2                                                       |
+---------------+------------------------------------------------------------------+
| Command:      | web -f -n=100                                                    |
+---------------+------------------------------------------------------------------+
| Job type:     | one-off                                                          |
+---------------+------------------------------------------------------------------+
| Image:        | tool-milhistbot/conflicts:latest                                 |
+---------------+------------------------------------------------------------------+
| Port:         | none                                                             |
+---------------+------------------------------------------------------------------+
| File log:     | no                                                               |
+---------------+------------------------------------------------------------------+
| Output log:   |                                                                  |
+---------------+------------------------------------------------------------------+
| Error log:    |                                                                  |
+---------------+------------------------------------------------------------------+
| Emails:       | none                                                             |
+---------------+------------------------------------------------------------------+
| Resources:    | default                                                          |
+---------------+------------------------------------------------------------------+
| Replicas:     |                                                                  |
+---------------+------------------------------------------------------------------+
| Mounts:       | all                                                              |
+---------------+------------------------------------------------------------------+
| Retry:        | no                                                               |
+---------------+------------------------------------------------------------------+
| Timeout:      | no                                                               |
+---------------+------------------------------------------------------------------+
| Health check: | none                                                             |
+---------------+------------------------------------------------------------------+
| Status:       | Failed                                                           |
+---------------+------------------------------------------------------------------+
| Hints:        | Last run at 2025-09-29T08:14:36Z. Pod in 'Failed' phase. State   |
|               | 'terminated'. Reason 'Error'. Started at '2025-09-29T08:14:38Z'. |
|               | Finished at '2025-09-29T08:14:39Z'. Exit code '134'.             |
+---------------+------------------------------------------------------------------+

Logging into your tool, I don't see the filelogs configured in the cronjob either:

|   conflicts   |                                   web -f -n=500                                    |  schedule: 45 4 * * *   | tool-milhistbot/conflicts:latest  | none  |    no     |                                            |                                            | onfailure |  default   |           |   all   |   no   |    no    |     none      | Last schedule time: 2025-09-29T04:45:00Z |
...

tools.milhistbot@tools-bastion-15:~$ toolforge jobs show conflicts
+---------------+------------------------------------------+
| Job name:     | conflicts                                |
+---------------+------------------------------------------+
| Command:      | web -f -n=500                            |
+---------------+------------------------------------------+
| Job type:     | schedule: 45 4 * * *                     |
+---------------+------------------------------------------+
| Image:        | tool-milhistbot/conflicts:latest         |
+---------------+------------------------------------------+
| Port:         | none                                     |
+---------------+------------------------------------------+
| File log:     | no                                       |
+---------------+------------------------------------------+
| Output log:   |                                          |
+---------------+------------------------------------------+
| Error log:    |                                          |
+---------------+------------------------------------------+
| Emails:       | onfailure                                |
+---------------+------------------------------------------+
| Resources:    | default                                  |
+---------------+------------------------------------------+
| Replicas:     |                                          |
+---------------+------------------------------------------+
| Mounts:       | all                                      |
+---------------+------------------------------------------+
| Retry:        | no                                       |
+---------------+------------------------------------------+
| Timeout:      | no                                       |
+---------------+------------------------------------------+
| Health check: | none                                     |
+---------------+------------------------------------------+
| Status:       | Last schedule time: 2025-09-29T04:45:00Z |
+---------------+------------------------------------------+
| Hints:        | No pods were created for this job.       |
+---------------+------------------------------------------+

Did you change that?

Is it working as you expect now?

When running on my tool wm-lol, the error I get is:

tools.wm-lol@tools-bastion-15:~$ toolforge jobs logs conflicts2
2025-09-29T08:14:39+00:00 [conflicts2-rdhbc] [job] Password for user AussieBot not found in /workspace/heroku_output/credx.xml
2025-09-29T08:14:39+00:00 [conflicts2-rdhbc] [job] Unhandled exception. CredException: Password for user AussieBot not found in /workspace/heroku_output/credx.xml
2025-09-29T08:14:39+00:00 [conflicts2-rdhbc] [job]    at Cred..ctor() in /Users/ram900/milhistbot-wikimedia/Cred.cs:line 152
2025-09-29T08:14:39+00:00 [conflicts2-rdhbc] [job]    at Cred.<>c.<.cctor>b__30_0() in /Users/ram900/milhistbot-wikimedia/Cred.cs:line 88
2025-09-29T08:14:39+00:00 [conflicts2-rdhbc] [job]    at System.Lazy`1.ViaFactory(LazyThreadSafetyMode mode)
2025-09-29T08:14:39+00:00 [conflicts2-rdhbc] [job]    at System.Lazy`1.ExecutionAndPublication(LazyHelper executionAndPublication, Boolean useDefaultConstructor)
2025-09-29T08:14:39+00:00 [conflicts2-rdhbc] [job]    at System.Lazy`1.CreateValue()
2025-09-29T08:14:39+00:00 [conflicts2-rdhbc] [job]    at Cred.get_Instance() in /Users/ram900/milhistbot-wikimedia/Cred.cs:line 90
2025-09-29T08:14:39+00:00 [conflicts2-rdhbc] [job]    at Wikimedia.LogPage..ctor() in /Users/ram900/milhistbot-wikimedia/LogPage.cs:line 36
2025-09-29T08:14:39+00:00 [conflicts2-rdhbc] [job]    at Wikimedia.Conflicts..ctor(String[] args) in /workspace/Program.cs:line 406
2025-09-29T08:14:39+00:00 [conflicts2-rdhbc] [job]    at Wikimedia.Conflicts.Main(String[] args) in /workspace/Program.cs:line 413
2025-09-29T08:14:39+00:00 [conflicts2-rdhbc] [job] Aborted (core dumped)

It seems to expect a credentials file (maybe I'm missing some config).
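On the exit code itself: 134 in the job status is consistent with the "Aborted (core dumped)" line, since exit codes above 128 conventionally mean "terminated by signal (code - 128)", and signal 6 is SIGABRT. A small Python sketch of that convention (the helper name is hypothetical, and the signal names come from the machine it runs on):

```python
import signal


def describe_exit_code(code: int) -> str:
    """Give a rough, human-readable reading of a container exit code."""
    if code == 0:
        return "success"
    if code > 128:
        # Shell convention: 128 + N means the process was killed by signal N.
        number = code - 128
        try:
            name = signal.Signals(number).name
        except ValueError:
            name = "unknown signal"
        return f"terminated by {name} (128 + {number})"
    # Anything else is an exit status chosen by the program itself.
    return f"program exited with status {code}"


print(describe_exit_code(134))
print(describe_exit_code(2))
```

Exit code 2 from the cron run falls into the last branch: it is a status returned by the program itself, so its meaning depends on the program.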

Side note: don't put any credentials inside the images you build, as they are public. Use envvars for that (in case it's not doing that already).
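To illustrate the envvars approach, here is a minimal sketch in Python (the bot itself is C#, so this only shows the pattern). The variable names BOT_USERNAME and BOT_PASSWORD are hypothetical; use whatever names the tool actually defines:

```python
import os
from typing import Mapping, Tuple


def load_credentials(
    environ: Mapping[str, str] = os.environ,
    user_var: str = "BOT_USERNAME",      # hypothetical variable name
    password_var: str = "BOT_PASSWORD",  # hypothetical variable name
) -> Tuple[str, str]:
    """Read bot credentials from environment variables, not from a file in the image."""
    missing = [name for name in (user_var, password_var) if not environ.get(name)]
    if missing:
        # Fail early with a clear message that names only the variables, never their values.
        raise RuntimeError("Missing environment variable(s): " + ", ".join(missing))
    return environ[user_var], environ[password_var]
```

Failing early with the name of the missing variable makes this kind of error easier to spot in the job logs than a crash deep inside a constructor.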

dcaro triaged this task as Medium priority. Sep 29 2025, 8:20 AM

(1) Yes, I removed the output and error logging to get the jobs working again. It does indeed seem to have been the problem. Possibly the path is not defined? Does it work for you? It worked when I ran the job manually, but not from the cron run. Error messages like the one you got would be extremely useful to me. I can run the bots from containers on my own machines, but that is not the same as running them on Toolforge.

(2) The userid/password credentials are indeed kept in the envvars service. I believe this is now the preferred method.

dcaro renamed this task from Dotnet bots failing with no logs to Jobs failing with no logs. Sep 30 2025, 7:56 AM

(1) Yes, I removed the output and error logging to get the jobs working again. It does indeed seem to have been the problem. Possibly the path is not defined?

It was probably an issue with the worker it was running on getting stuck on NFS; we are working on solving those :/ Stopping and starting the job might make it run on a different worker, but that is not guaranteed. More details on the progress are in T404584: [tools,nfs,infra] Address tools NFS getting stuck with processes in D state

Things seem to be working a lot better now. Thanks for your help in getting my C# bots to run.

toolforge jobs logs

seems to be working satisfactorily too.