Persist important toolforge k8s components logs
Closed, Resolved, Public

Description

Since we moved maintain-harbor to plain Kubernetes cronjobs and no longer use toolforge-jobs to deploy the maintain-harbor jobs, we need to persist the logs generated each time the cronjobs run. The same applies to any other current or future k8s component whose logs we want to keep.

The best solution would be to feed everything into logstash or a similar system, but since Toolforge doesn't yet have a centralized logging solution, a good middle ground is to write the logs to the worker's journald or to a file under /var/log. Persisting the logs on the worker means they end up scattered across workers (the cronjob pods can be scheduled on a different worker each time they are created), but that is better than losing the logs completely.

Event Timeline

dcaro triaged this task as High priority. Jan 7 2025, 10:47 AM
Raymond_Ndibe changed the task status from Open to In Progress. Jan 8 2025, 4:32 PM

Did you consider using successfulJobsHistoryLimit and failedJobsHistoryLimit to persist pod objects and the logs they include for some amount of time?

This looks interesting, @taavi. The only problem I can see is that things can sometimes go wrong for months without us noticing (in a recent task about Harbor behaving oddly, we had to go back months through the maintain-harbor logs to make sure maintain-harbor was not the cause). Even if we keep the last 10-100 failed and successful jobs, at some point logs will be cut off and lost as the old pods get removed. But maybe we can set successfulJobsHistoryLimit and failedJobsHistoryLimit in addition to persisting the logs on the local nodes?
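For reference, both fields live under the CronJob `spec` (Kubernetes defaults to keeping 3 successful and 1 failed job). A minimal Python sketch of setting them on a manifest follows; the manifest contents, the names `maintain-harbor-example` and `with_history_limits`, and the limit values are illustrative assumptions, not the actual maintain-harbor configuration.

```python
import copy
import json


def with_history_limits(cronjob, successful, failed):
    """Return a copy of a CronJob manifest with both history limits set.

    The limits control how many finished Job objects (and therefore their
    pods and pod logs) Kubernetes keeps before garbage-collecting them.
    """
    updated = copy.deepcopy(cronjob)
    spec = updated.setdefault("spec", {})
    spec["successfulJobsHistoryLimit"] = successful
    spec["failedJobsHistoryLimit"] = failed
    return updated


# Hypothetical manifest, for illustration only.
example_cronjob = {
    "apiVersion": "batch/v1",
    "kind": "CronJob",
    "metadata": {"name": "maintain-harbor-example"},
    "spec": {
        "schedule": "*/30 * * * *",
        "jobTemplate": {
            "spec": {
                "template": {
                    "spec": {
                        "restartPolicy": "Never",
                        "containers": [
                            {"name": "main", "image": "example/image:latest"}
                        ],
                    }
                }
            }
        },
    },
}

patched = with_history_limits(example_cronjob, successful=10, failed=10)
print(json.dumps(patched["spec"], indent=2))
```

The limits only bound how many finished jobs are retained, so they complement node-level persistence rather than replace it.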

Change #1113412 had a related patch set uploaded (by Raymond Ndibe; author: Raymond Ndibe):

[operations/puppet@production] [toolforge] persist target logs in /var/log/pods in journald

https://gerrit.wikimedia.org/r/1113412
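To illustrate what selecting "target logs in /var/log/pods" involves, here is a rough Python sketch. It is not the code from the patch above. It relies on the kubelet's usual layout, `/var/log/pods/<namespace>_<pod>_<uid>/<container>/<restart>.log`, and on the CRI log line format `<timestamp> <stream> <P|F> <message>`. The `TARGET_NAMESPACES` set and all function names are assumptions made for the example.

```python
import re
from typing import NamedTuple, Optional, Tuple


class PodLogPath(NamedTuple):
    namespace: str
    pod: str
    uid: str
    container: str
    restart: int


# Namespace, pod name and UID cannot contain underscores, so "_" is a safe separator.
_POD_LOG_RE = re.compile(
    r"^/var/log/pods/"
    r"(?P<namespace>[^_/]+)_(?P<pod>[^_/]+)_(?P<uid>[^_/]+)/"
    r"(?P<container>[^/]+)/(?P<restart>\d+)\.log$"
)

# Hypothetical list of namespaces whose logs should be persisted.
TARGET_NAMESPACES = {"maintain-harbor"}


def parse_pod_log_path(path: str) -> Optional[PodLogPath]:
    """Split a kubelet pod log path into its components, or return None."""
    match = _POD_LOG_RE.match(path)
    if match is None:
        return None
    return PodLogPath(
        namespace=match["namespace"],
        pod=match["pod"],
        uid=match["uid"],
        container=match["container"],
        restart=int(match["restart"]),
    )


def is_target(path: str) -> bool:
    """True if the log file belongs to a pod in one of the target namespaces."""
    parsed = parse_pod_log_path(path)
    return parsed is not None and parsed.namespace in TARGET_NAMESPACES


def parse_cri_line(line: str) -> Optional[Tuple[str, str, str, str]]:
    """Split a CRI log line into (timestamp, stream, tag, message)."""
    parts = line.rstrip("\n").split(" ", 3)
    if len(parts) != 4:
        return None
    return (parts[0], parts[1], parts[2], parts[3])
```

Because the namespace and pod name are part of the path, entries forwarded to journald can keep enough context to find a given cronjob run later, even though the logs remain spread across workers.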

Raymond_Ndibe renamed this task from Persist maintain-harbor logs to Persist important toolforge k8s components logs. Jan 22 2025, 7:59 AM
Raymond_Ndibe updated the task description.

Change #1113412 merged by David Caro:

[operations/puppet@production] [toolforge] persist target logs in /var/log/pods in journald

https://gerrit.wikimedia.org/r/1113412

dcaro moved this task from In Progress to Done on the Toolforge (Toolforge iteration 23) board.
dcaro subscribed.

Will revisit when we decide on T97861: [toolforge,infra] Centralized logging for Toolforge infrastructure logs. This was simple enough and prevents losing relevant logs, though it does not centralize them.