Page MenuHomePhabricator

Allow specifying the path for log files for jobs executed on the new toolforge Jobs framework
Closed, ResolvedPublicFeature

Description

Feature summary (what you would like to be able to do and where):
Add a --log-dir parameter to the toolforge-jobs run command which would specify the path of a directory in which the user wants the log files associated with the invoked job to be stored in.

Use case(s) (list the steps that you performed to discover that problem, and describe the actual underlying problem which you want to solve. Do not describe only a solution):
With the grid engine, we have this option; a user can submit a job like jsub -N "somename" -o ./err/somename.out -e ./err/somename.err ./jobs/somename.sh.

If someone is really confident, they can even redirect the job output to /dev/null so as to not waste any disk space and need to do no cleanup later on.

Benefits (why should this be implemented?):
This helps users keep their tool's home directory from getting cluttered by tons of log files.

This also helps the new toolforge job framework to more closely resemble the grid engine, which will help with the transition.

Related Objects

Event Timeline

Huji renamed this task from Allow specifying the location of log fiels for toolforge job framework to Allow specifying the location of log files for toolforge job framework.Feb 17 2022, 9:22 PM

@bd808 how difficult is this to implement? Because I am increasingly feeling the need for this.

Huji renamed this task from Allow specifying the location of log files for toolforge job framework to Allow specifying the path for log files for jobs executed on the new toolforge Jobs framework.May 26 2022, 1:18 PM

@bd808 how difficult is this to implement? Because I am increasingly feeling the need for this.

Not impossible, but also not trivial. The existing logging is implemented by appending redirects for stdout and stderr to files after the job's command spec (basically ${job} 1>>${jobname}.out 2>>${jobname}.err). The new option to set a different output file/directory would need to be added to the cli script, and then the larger work of adding the option to the api service and using it to control the output redirection would follow.

A decision needs to be made on the level of control given to the user: do we want to mirror the direct and individual control of the stdout and stderr log files exposed by jsub, or do we want some other model on the cli frontend? By that I mean do we want something like --out full/path/to/file/including/extension or something like --log-dir full/path/to/dir. When not provided explicitly, what are the defaults for the new setting(s)? Would attempting to use --no-filelog and the new output file control at the same time be an error in the cli? Would it be an error in the backend if it made it past the cli?

@aborrero has been acting as both the tech lead and product manager for making these sorts of decisions about toolforge-jobs to this point in the implementation. He is currently out on extended leave which has left this project without a decider and also without a coder in the immediate future. I'm confident that we can find hands to fix critical bugs in the service if they come up, but right now we don't have extra folks to hop over and work on feature requests like this one.

If someone would like to step up and propose concrete requirements and design constraints as well as implement them following discussion that would be welcome as well. I can't guarantee the speed of engagement with code review and deployment because again we are short staffed at the moment (not always I promise!).

I think I can help with coding myself; I was able to find the two places (in the api and the cli Python codes) where changes need to be made.

The simplest solution, based on current code, is this, IMHO:

  • User can only specify path relative to their home dir (so --log-dir foo/bar will result in the error and output files to be stored at ~/foo/bar/jobname.err and ~/foo/bar/jobname.out, respectively).
  • Not specifying a --log-dir will result in the two files being saved within tool's home directory (current behavior).
  • Python script for API can be easily modified to first check the existence of the relative path provided by the user.
  • Update on-wiki documentation to state that --no-filelog supercedes --logdir meaning if you specify both, you still don't get any log files stored.
  • All other issues (such as not having write permission in the proposed path) are no different than current state, so we won't address it as part of this task. That means, I could also chmod my tool's home to make it unwritable and use the cli as is, and presumably it'll cause some error message somewhere, and so will be the case after we make the changes proposed above.

All of the above could translate to a change as simple as ${job} 1>>${logdir}${jobname}.out 2>>${logdir}${jobname}.err where the default value of logdir is blank. (Obviously, we will remind users that their --log-dir value must include the trailing / which is why we are not hard-coding it in this changed version.)

Any more nuanced solutions (like allowing a different path for the err versus out file) could be discussed separately, preferably after @aborrero is back.

Change 800864 had a related patch set uploaded (by Huji; author: Huji):

[cloud/toolforge/jobs-framework-cli@master] Allow specifying the path for log files

https://gerrit.wikimedia.org/r/800864

Change 800865 had a related patch set uploaded (by Huji; author: Huji):

[cloud/toolforge/jobs-framework-api@main] Allow specifying the path for log files

https://gerrit.wikimedia.org/r/800865

Those two patches should provide a general sense of what I mean.

Obviously, I cannot test these (I have no idea how!) so they would need a bit of love from someone who knows how to.

All of the above could translate to a change as simple as ${job} 1>>${logdir}${jobname}.out 2>>${logdir}${jobname}.err where the default value of logdir is blank. (Obviously, we will remind users that their --log-dir value must include the trailing / which is why we are not hard-coding it in this changed version.)

Any more nuanced solutions (like allowing a different path for the err versus out file) could be discussed separately, preferably after @aborrero is back.

These all seem like reasonable assumptions and trade offs.

Those two patches should provide a general sense of what I mean.

Obviously, I cannot test these (I have no idea how!) so they would need a bit of love from someone who knows how to.

Unfortunately I don't really know how to test this either, but I have added @dcaro and @Majavah to the patches in the hope that one or both of them already know or have time to learn how to test these changes.

Change 802589 had a related patch set uploaded (by Huji; author: Huji):

[cloud/toolforge/jobs-framework-api@main] Improve the way in which commands from k8s objects are sliced

https://gerrit.wikimedia.org/r/802589

Change 802589 merged by jenkins-bot:

[cloud/toolforge/jobs-framework-api@main] Improve the way in which commands from k8s objects are sliced

https://gerrit.wikimedia.org/r/802589

Change 827592 had a related patch set uploaded (by Raymond Ndibe; author: Raymond Ndibe):

[cloud/toolforge/jobs-framework-cli@master] jobs-framework-cli: add --stdout and --stderr args to cli

https://gerrit.wikimedia.org/r/827592

Change 827594 had a related patch set uploaded (by Raymond Ndibe; author: Raymond Ndibe):

[cloud/toolforge/jobs-framework-api@main] jobs-framework-api: add --stdout and --stderr args to cli

https://gerrit.wikimedia.org/r/827594

My proposal is that we leave the current filelog option as is. I think the investment that will really benefit us is trying to work on the root problem: T127367: [toolforge,jobs-api,webservice,storage] Provide modern, non-NFS log solution for Toolforge webservices and bots

There are a few things to consider here:

  1. how to deal with relative vs absolute paths
  2. how to deal with user specifying a directory vs a file
  3. how to store & parse the log information in the kubernetes resource

Regarding #1 I kind of like the idea that @dcaro wrote: resolve filelog path relative to the CWD of the command. And reject absolute paths that doesn't start with /data/project/<user>/
This is {command_CWD}/{logfile}

Example:

command: ./mycommand.sh
filelog by user: log/myfile.log
filelog in k8s: ./log/myfile.log

command: mydir/command.sh
filelog by user: myfile.log
filelog in k8s: mydir/myfile.log

command: mydir/../command.sh
filelog by user: ./otherdir/../logs/mylog.log
filelog in k8s: mydir/../otherdir/../logs/mylog.log

Regarding #2 I think we should not make any assumptions, go the easy way and always expect a file. Let the shell report any problems the user config may create. And still, leverage the setup in #1 regarding paths.

Example:

command:  ./mycommand.sh
filelog by user: logs
filelog in k8s: ./logs
[ if this is a directory, the shell will complain and the job will fail]

command: ./mycommand.sh
filelog by user: logs
filelog in k8s: ./logs
[ if this is a file, then the logs will be stored there ]

Regarding #3, As of this writing, we are calling the user-defined job command with a sh wrapper, like this:

sh
-c
--
./mycommand.sh --arguments 1>/dev/null 2>/dev/null

I think we can have the setup 'a bit' more elegant by placing the redirections outside the job call, example:

sh
-c
1>>/dev/null
2>>/dev/null
--
./mycommand.sh --arguments

Changing this is challenging because there are plenty of jobs defined in the cluster and we may need to remain compatible until old ones are gone. Or bump the API version :-P which doesn't magically solve the compatibility either...

Regarding #1 I kind of like the idea that @dcaro wrote: resolve filelog path relative to the CWD of the command. And reject absolute paths that doesn't start with /data/project/<user>/
This is {command_CWD}/{logfile}

On second thoughts, I think resolving relative paths relative to the $HOME is more robust for future usages. In the future we aim to have some way to interact with the API via a web interface where there may or may not be a concept of different work directories.

Regarding #2 I think we should not make any assumptions, go the easy way and always expect a file. Let the shell report any problems the user config may create. And still, leverage the setup in #1 regarding paths.

Example:

command:  ./mycommand.sh
filelog by user: logs
filelog in k8s: ./logs
[ if this is a directory, the shell will complain and the job will fail]

command: ./mycommand.sh
filelog by user: logs
filelog in k8s: ./logs
[ if this is a file, then the logs will be stored there ]

@aborrero , thanks for making out time to talk about this. I have a little question, if we always expect a file, then we might have to decide on what to do with this task. closing it is probably the best given the plan to always expect a file.

@aborrero , thanks for making out time to talk about this. I have a little question, if we always expect a file, then we might have to decide on what to do with this task. closing it is probably the best given the plan to always expect a file.

My idea so far has been that these 3 feature requests are pretty similar and can be solved with the same changes to the API:

If we do it right, users can do all 3 with the new semantic we're introducing.

@aborrero It's still a but confusing, recommendation #2 is in direct conflict with the idea behind this ticket. I sent an invite for monday 12pm - 12:30pm utc time, so we can talk about this a bit more if you are interested? will you be available at that time or do we need to reschedule another time maybe?

$ toolforge-jobs run refreshlinks--k8s --command "php ./php/refreshlinks.php > ./logs/refreshlinks--k8s.log" --image tf-php74 --no-filelog

This command created the new file refreshlinks--k8s.log in my /logs directory but left it just a null file rather than writing STDOUT to it.

$ toolforge-jobs run refreshlinks--k8s --command "php ./php/refreshlinks.php > ./logs/refreshlinks-k8s.log" --image tf-php74

Likewise this command created the new file refreshlinks-k8s.log in my /logs directory but left it just a null file and wrote STDOUT to refreshlinks--k8s.out in my tool home directory.

Why is toolforge-jobs run not respecting standard shell command output redirection other than creating a null file at the 1> target of the redirection?

Why is toolforge-jobs run not respecting standard shell command output redirection other than creating a null file at the 1> target of the redirection?

toolforge-jobs does its own output redirection by appending to the end of the specified command. (ref)

Per the documentation:

Subsequent same-name job runs will append to the same files. NOTE: as of this writing there is no automatic way to prune log files, so tool users must take care of such files growing too large.

Supporting standard shell command output redirection will avoid this issue.
>> syntax says to append to the named log file, while
> says to supersede the contents of the previous file.

Using > allows me to fully automate my bot by avoiding the need for babysitting to "take care" of my log file growing too large.

I see >> in the code that JJMC89 linked to in the previous post in this thread, which is forcing this need for tending to our logs.

Change 871171 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[cloud/toolforge/jobs-framework-api@main] command: introduce filelog_std{out,err} parameters

https://gerrit.wikimedia.org/r/871171

aborrero changed the task status from Open to In Progress.Dec 23 2022, 5:41 PM
aborrero triaged this task as Medium priority.

Some of us (specially @Raymond_Ndibe and I ) we've been working on this feature. But we found several blockers and challenges, let me try capturing here what happened in the last couple of weeks.

The original feature patch was written by Raymond https://gerrit.wikimedia.org/r/c/cloud/toolforge/jobs-framework-api/+/827594 but it never got merged.

I saw the need for a previous refactor to better accommodate for the feature in a more elegant and robust way, basically abstracting the whole job command into a separated file/class.

Raymond detected problems with the part 2, reported in T325755: Bug: jobs-framework-api job run fails silently on local development environment. The problems were not trivial to identify or reproduce, mainly because:

  • we don't have a deterministic and reproducible way of creating a local kubernetes Toolforge deployment.
  • the way we deploy jobs-api is a bit fragile (raw kubernetes YAMLs) and can create cruft and config drift in the different environments (local, tools, toolsbeta, ...)
  • because of the above the bug can be hard to trigger in a way that it is clear the bug is in effect and not other misconfiguration (for example, permissions, PSPs and such)

To try mitigating these problems, I did:

Finally, it was detected the root of the problem to be on the way we were redirecting the stdout/stderr, and we had to revert the pre-refactor https://gerrit.wikimedia.org/r/c/cloud/toolforge/jobs-framework-api/+/871202 to don't leave the repository in a known bad shape. Some bits of the patch have been rescued in https://gerrit.wikimedia.org/r/c/cloud/toolforge/jobs-framework-api/+/871171 anyway

We invested half an afternoon researching how to better structure the file redirections and we reached a couple of valid solutions:

array = ["sh", "-c", "1>>stdout 2>>stderr ./cmd.sh" ]
array = ["sh", "-c", "./cmd.sh 1>>stdout 2>>stderr" ]
array = ["sh", "-c", "exec 1>>stdout; exec 2>>stderr; {user_cmd}" ]

But as of this writing it is undecided which layout to use. We will continue with this after the end-of-year break.

Change 871171 merged by jenkins-bot:

[cloud/toolforge/jobs-framework-api@main] command: introduce filelog_std{out,err} parameters

https://gerrit.wikimedia.org/r/871171

Change 881409 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[cloud/toolforge/jobs-framework-cli@master] loaded: make it understand filelog, filelog_stdout, filelog_stderr

https://gerrit.wikimedia.org/r/881409

Change 800864 abandoned by Arturo Borrero Gonzalez:

[cloud/toolforge/jobs-framework-cli@master] Allow specifying the path for log files

Reason:

included in https://gerrit.wikimedia.org/r/c/cloud/toolforge/jobs-framework-cli/+/881409

https://gerrit.wikimedia.org/r/800864

Change 881409 merged by Arturo Borrero Gonzalez:

[cloud/toolforge/jobs-framework-cli@master] loader: make it understand filelog, filelog_stdout, filelog_stderr

https://gerrit.wikimedia.org/r/881409

Change 800865 abandoned by Arturo Borrero Gonzalez:

[cloud/toolforge/jobs-framework-api@main] Allow specifying the path for log files

Reason:

a different approach (but equivalent) is already in place.

https://gerrit.wikimedia.org/r/800865