
Allow TJF job logs to go to Kubernetes output buffer rather than disk
Closed, Resolved · Public · Feature

Description

As a tool maintainer
I want stderr/stdout from some jobs to be captured by Kubernetes
So that I can view the most recent log output using kubectl logs <pod-name> and avoid worrying about any file rotation and truncation issues for logs that are not highly useful as filesystem artifacts.

It appears that using the --no-filelog flag redirects both stdout and stderr to /dev/null. Based on the flag name I was instead expecting stdout and stderr to go to the default streams which are captured by Kubernetes. I would like to be able to specify this behavior with some combination of flag(s).
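The requested behavior can be sketched roughly as follows. This is only an illustration, not the actual jobs-framework-api code: the function name `wrap_job_command`, the `<job_name>.out` / `<job_name>.err` file names, and the use of `exec` redirections are all assumptions made for the example. The point is that disabling file logging leaves stdout and stderr attached to the container's default streams instead of sending them to /dev/null.

```python
import shlex


def wrap_job_command(user_command: str, job_name: str, filelog: bool) -> str:
    """Return the shell command line a job container would run.

    Hypothetical helper: the real framework may build its command differently.
    """
    if not filelog:
        # No redirection at all: output stays on the container's stdout and
        # stderr, which Kubernetes captures for `kubectl logs`.
        return user_command

    # File logging enabled: append both streams to files (assumed names).
    out_file = shlex.quote(f"{job_name}.out")
    err_file = shlex.quote(f"{job_name}.err")
    return f"exec 1>>{out_file}; exec 2>>{err_file}; {user_command}"


if __name__ == "__main__":
    print(wrap_job_command("./run.sh", "myjob", filelog=False))
    print(wrap_job_command("./run.sh", "myjob", filelog=True))
```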

Event Timeline

bd808 changed the subtype of this task from "Task" to "Feature Request". Feb 27 2023, 11:43 PM

Originally there was a concern about those logs filling up the file systems of the worker nodes; however, modern Kubernetes versions might support limiting the size of those logs.

Ephemeral logs with size + time constrained duration are exactly what I was hoping to end up with. I guess I'm surprised if we don't already have OS level config that is rotating the log output from pods. A number of tools such as Stashbot are already using this "cloud native" logging setup and just assuming that the platform protects itself from too much log output.

Change 915740 had a related patch set uploaded (by Majavah; author: Majavah):

[cloud/toolforge/jobs-framework-api@main] command: Do not void log output if filelog is disabled

https://gerrit.wikimedia.org/r/915740

aborrero triaged this task as Medium priority. May 5 2023, 10:18 AM
aborrero subscribed.

I would like to acknowledge my own ignorance, and mention that in the very early days of me working on this project, I thought that container logs had something to do with etcd, and I wanted to protect etcd from them.

I'm 100% in favor of the changes proposed in this ticket.

Change 915740 merged by jenkins-bot:

[cloud/toolforge/jobs-framework-api@main] command: Do not void log output if filelog is disabled

https://gerrit.wikimedia.org/r/915740

Jobs created after the next jobs-api deployment will no longer discard logs when file logging is disabled.

Next question is what would be the right interface to expose to users to query such logs.

I don't want to write in the docs "use kubectl" so perhaps we can wrap that from the jobs-framework as well.

This is a good point. I feel like the current file-logging mode should be the default for non-buildservice images until we have a proper logging system available. Until we have that, we need some way for buildservice users to view logs. It could be just a simple bash/Python wrapper around kubectl logs; I can write one. Not sure if it should be called toolforge jobs logs or just toolforge logs.
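A minimal sketch of what such a Python wrapper might look like, assuming pods can be found with a label selector. The function names, the example namespace `tool-example`, and the selector `job-name=myjob` are placeholders for illustration, not the labels or names the jobs framework necessarily uses. Only standard `kubectl logs` options (`--namespace`, `--selector`, `--tail`, `--follow`) are used.

```python
import subprocess
from typing import List, Optional


def build_kubectl_logs_command(
    namespace: str,
    selector: str,
    tail: Optional[int] = None,
    follow: bool = False,
) -> List[str]:
    """Build the argv for `kubectl logs` without running it."""
    cmd = ["kubectl", "logs", "--namespace", namespace, "--selector", selector]
    if tail is not None:
        # Only show the most recent N lines.
        cmd.append(f"--tail={tail}")
    if follow:
        # Keep streaming new output until interrupted.
        cmd.append("--follow")
    return cmd


def show_logs(namespace: str, selector: str, tail: Optional[int] = None,
              follow: bool = False) -> int:
    """Run kubectl and return its exit code; output goes to the terminal."""
    cmd = build_kubectl_logs_command(namespace, selector, tail, follow)
    return subprocess.run(cmd, check=False).returncode


if __name__ == "__main__":
    # Placeholder values; a real wrapper would derive these from the tool
    # account and the job name given on the command line.
    print(build_kubectl_logs_command("tool-example", "job-name=myjob", tail=50))
```

Keeping the argv construction separate from the subprocess call makes the wrapper easy to test without a cluster.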

mmm I like toolforge logs which fits well with the whole platform idea. And can be extended in the future to show whatever logs, not only jobs logs.

And it should be "trivial" to even query existing filelogs too (from both webservices and jobs) since the paths for them are well known.

This is to say, toolforge logs could work for:

  • filelogs (on home NFS)
  • pod logs (from worker filesystem)
  • log service (future stack)
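As a rough illustration of how a single command could choose between the first two sources in that list, here is a sketch that prefers filelogs when they exist and otherwise falls back to pod logs. Everything in it is hypothetical: the function names, the `<job_name>.out` / `<job_name>.err` file layout in the tool's home directory, and the "filelog" / "pod" source labels. The future log service is deliberately left out.

```python
from pathlib import Path
from typing import List


def filelog_paths(tool_home: Path, job_name: str) -> List[Path]:
    """Assumed locations of a job's filelogs on the home NFS share."""
    return [tool_home / f"{job_name}.out", tool_home / f"{job_name}.err"]


def pick_log_source(tool_home: Path, job_name: str) -> str:
    """Return "filelog" if any filelog exists for the job, else "pod"."""
    if any(path.exists() for path in filelog_paths(tool_home, job_name)):
        return "filelog"
    # No files on disk: the job presumably logs to stdout/stderr, so the
    # caller would need to fetch the output from Kubernetes instead.
    return "pod"


def read_filelogs(tool_home: Path, job_name: str, tail: int = 10) -> List[str]:
    """Return the last `tail` lines of each existing filelog, in order."""
    lines: List[str] = []
    for path in filelog_paths(tool_home, job_name):
        if path.exists():
            lines.extend(path.read_text().splitlines()[-tail:])
    return lines
```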