Page MenuHomePhabricator

[jobs-api] loki logs take really long to appear
Closed, ResolvedPublic

Description

We missed this as the tests were not correct (got patch https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/961 for the tests).

Manually testing, I'm getting ~10min delay so far for some runs, and some logs were lost:

tools.automated-toolforge-tests@tools-bastion-13:~$ source ~/venv/bin/activate

(venv) tools.automated-toolforge-tests@tools-bastion-13:~$ bats_core_pkg --verbose-run --pretty --timing --recursive --setup-suite-file "/data/project/automated-toolforge-tests/toolforge-deploy/functional-tests/setup_suite.bash" "/data/project/automated-toolforge-tests/toolforge-deploy/functional-tests/tools/" --print-output-on-failure --filter  "run a simple scheduled job without filelog, and check the logs"

....
1 test, 1 failure in 198 seconds


# then manually waiting
tools.automated-toolforge-tests@tools-bastion-13:~$ while ! toolforge jobs logs test-17406; do date; sleep 1; done                                                                                                                                                                                                                                     
ERROR: Job 'test-17406' does not have any logs available
Wed Sep 10 09:48:59 UTC 2025                            
ERROR: Job 'test-17406' does not have any logs available
Wed Sep 10 09:49:01 UTC 2025    
...
ERROR: Job 'test-17406' does not have any logs available
Wed Sep 10 09:55:07 UTC 2025
2025-09-10T09:55:01+00:00 [test-17406-29291635-r2qn6] [job] extraword-test-17406

tools.automated-toolforge-tests@tools-bastion-13:~$ toolforge jobs logs test-17406
2025-09-10T09:55:01+00:00 [test-17406-29291635-r2qn6] [job] extraword-test-17406
2025-09-10T09:57:01+00:00 [test-17406-29291637-pkhbv] [job] extraword-test-17406

Details

Related Changes in GitLab:
TitleReferenceAuthorSource BranchDest Branch
jobs-api: bump to 0.0.414-20250915172125-3b82d2c2repos/cloud/toolforge/toolforge-deploy!967group_203_bot_f4d95069bb2675e4ce1fff090c1c1620bump_jobs-apimain
loki.alloy: decrease frequency for fetching logsrepos/cloud/toolforge/toolforge-deploy!962dcaroincrease_log_fetch_frequencymain
scheduledjobs: increase the history to allow log retrievalrepos/cloud/toolforge/jobs-api!214dcaroextend_history_limitmain
functional-tests: fix log checking testsrepos/cloud/toolforge/toolforge-deploy!961dcarofix_log_testsmain
Customize query in GitLab

Event Timeline

It's actually kinda choppy, my guess is that some workers are failing to send logs:

tools.automated-toolforge-tests@tools-bastion-13:~$ toolforge jobs logs test-17406
2025-09-10T09:55:01+00:00 [test-17406-29291635-r2qn6] [job] extraword-test-17406
2025-09-10T09:57:01+00:00 [test-17406-29291637-pkhbv] [job] extraword-test-17406
2025-09-10T10:00:11+00:00 [test-17406-29291640-f8zbj] [job] extraword-test-17406
2025-09-10T10:08:01+00:00 [test-17406-29291648-cvc74] [job] extraword-test-17406

So, at least one issue I think might be happening is:

  • alloy only checks every ~10s for new files in the kubelet /var/log/containers/ directory
  • if a job takes <10s to run
    • Job starts, creates log entry in /var/log/containers/
    • Job ends, and the pod gets deleted, clearing also the logfile
  • alloy comes around, and no new files changed

Now, we have configured the ttlSecondsAfterFinish for jobs to 30s, so those should be staying for 20s more than loki would need in the worst case :/

https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/blob/0a493ae55ee96bb74f79732ddbbdf26b18c331d0/tjf/runtimes/k8s/jobs.py#L37

We can let alloy use a 1s interval for new files (we don't have that many pods on any worker, so should be ok), but there's till the issue of the ttlSecondsAfterFinish thingie

I think I found the issue, it's the successfulJobsHistoryLimit set to 0 in the cronjob itself, as jobs linger for ~30s.

Manually setting that to 1 makes the pods linger also for ~30s extra, allowing the logs to be collected:

(venv) tools.wm-lol@tools-bastion-13:~/toolforge-deploy$ toolforge jobs logs -f test-5312
2025-09-10T14:36:02+00:00 [test-5312-29291916-bnmpn] [job] test-5312
2025-09-10T14:40:04+00:00 [test-5312-29291920-6b58r] [job] test-5312
2025-09-10T14:52:02+00:00 [test-5312-29291932-xgnvb] [job] test-5312
2025-09-10T15:02:21+00:00 [test-5312-29291941-6wwdz] [job] test-5312
2025-09-10T15:02:32+00:00 [test-5312-29291942-hfl8j] [job] test-5312
2025-09-10T15:03:11+00:00 [test-5312-29291943-k2f5k] [job] test-5312
2025-09-10T15:04:01+00:00 [test-5312-29291944-4gqn7] [job] test-5312
2025-09-10T15:08:59+00:00 [test-5312-1757516937-lwz9j] [job] test-5312
2025-09-10T15:09:11+00:00 [test-5312-1757516949-zgm5n] [job] test-5312
2025-09-10T15:10:01+00:00 [test-5312-1757516999-vskkt] [job] test-5312
2025-09-10T15:20:02+00:00 [test-5312-29291960-57qls] [job] test-5312
#### changed from 0 -> 1, now every minute we get logs
2025-09-10T15:23:02+00:00 [test-5312-29291963-ntq2v] [job] test-5312
2025-09-10T15:24:03+00:00 [test-5312-29291964-z992c] [job] test-5312
2025-09-10T15:25:07+00:00 [test-5312-29291965-ng5fm] [job] test-5312
2025-09-10T15:26:03+00:00 [test-5312-29291966-bwvnf] [job] test-5312
2025-09-10T15:27:02+00:00 [test-5312-29291967-2vzgl] [job] test-5312
dcaro changed the task status from Open to In Progress.Sep 15 2025, 1:08 PM
dcaro claimed this task.
dcaro triaged this task as Medium priority.
dcaro moved this task from Next Up to In Review on the Toolforge (Toolforge iteration 24) board.

group_203_bot_f4d95069bb2675e4ce1fff090c1c1620 opened https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/967

jobs-api: bump to 0.0.414-20250915172125-3b82d2c2

This (10s) may also (at least partly) explain T402736

dcaro moved this task from In Review to Done on the Toolforge (Toolforge iteration 24) board.