Maniphest T301901

Allow specifying the path for log files for jobs executed on the new toolforge Jobs framework
Closed, ResolvedPublicFeature
Actions

Assigned To

Authored By

	Huji
	Feb 16 2022, 6:33 PM

Description

Feature summary (what you would like to be able to do and where):
Add a --log-dir parameter to the toolforge-jobs run command which would specify the path of a directory in which the user wants the log files associated with the invoked job to be stored in.

Use case(s) (list the steps that you performed to discover that problem, and describe the actual underlying problem which you want to solve. Do not describe only a solution):
With the grid engine, we have this option; a user can submit a job like jsub -N "somename" -o ./err/somename.out -e ./err/somename.err ./jobs/somename.sh.

If someone is really confident, they can even redirect the job output to /dev/null so as to not waste any disk space and need to do no cleanup later on.

Benefits (why should this be implemented?):
This helps users keep their tool's home directory from getting cluttered by tons of log files.

This also helps the new toolforge job framework to more closely resemble the grid engine, which will help with the transition.

Details

Subject	Repo	Branch	Lines +/-
Allow specifying the path for log files	cloud/toolforge/jobs-framework-api	main	+21 -4
loader: make it understand filelog, filelog_stdout, filelog_stderr	cloud/toolforge/jobs-framework-cli	master	+30 -4
Allow specifying the path for log files	cloud/toolforge/jobs-framework-cli	master	+13 -0
command: introduce filelog_std{out,err} parameters	cloud/toolforge/jobs-framework-api	main	+374 -37
Improve the way in which commands from k8s objects are sliced	cloud/toolforge/jobs-framework-api	main	+3 -1

Customize query in gerrit

Related Objects
Search...

Status	Subtype	Assigned	Task
Resolved		• JHedden	T251027 "signatures" tool has failed job pods on Kubernetes cluster
Resolved		aborrero	T251917 Design the Jobs service in k8s
Resolved		aborrero	T283238 Toolforge: develop jobs-framework-api
Resolved		aborrero	T285944 Toolforge: beta phase for the new jobs framework
Resolved		• Bstorm	T126083 overhaul labstore setup [tracking]
Resolved		• GTirloni	T216988 labstore1004 - DISK CRITICAL - free space: /srv/tools 115904 MB (1% inode=79%):
Resolved		• Bstorm	T217993 2019-03-10: tools and NFS share cleanup (high usage)
Resolved		• Bstorm	T122508 Prevent overly-large log files
Declined		None	T286847 Add webservice flag to mount project directory read-only
Open		None	T127367 [toolforge,jobs-api,webservice,storage] Provide modern, non-NFS log solution for Toolforge webservices and bots
Resolved		aborrero	T327254 WMCS FY22/23 Q3: next steps in grid engine deprecation
Resolved	Feature	Raymond_Ndibe	T301901 Allow specifying the path for log files for jobs executed on the new toolforge Jobs framework

Event Timeline

Huji created this task.Feb 16 2022, 6:33 PM

Restricted Application added a subscriber: Aklapper. · View Herald TranscriptFeb 16 2022, 6:33 PM

Huji added a parent task: T285944: Toolforge: beta phase for the new jobs framework.Feb 16 2022, 6:34 PM

PeterBowman subscribed.Feb 16 2022, 6:36 PM

taavi edited projects, added cloud-services-team (Kanban); removed Cloud-Services-Origin-Team.Feb 16 2022, 7:35 PM

Huji renamed this task from Allow specifying the location of log fiels for toolforge job framework to Allow specifying the location of log files for toolforge job framework.Feb 17 2022, 9:22 PM

PeterBowman mentioned this in T304421: Allow customizing the out/err files with toolforge-jobs.Mar 22 2022, 12:45 PM

Sakretsu subscribed.Apr 6 2022, 4:42 PM

@bd808 how difficult is this to implement? Because I am increasingly feeling the need for this.

Huji renamed this task from Allow specifying the location of log files for toolforge job framework to Allow specifying the path for log files for jobs executed on the new toolforge Jobs framework.May 26 2022, 1:18 PM

In T301901#7960399, @Huji wrote:

@bd808 how difficult is this to implement? Because I am increasingly feeling the need for this.

Not impossible, but also not trivial. The existing logging is implemented by appending redirects for stdout and stderr to files after the job's command spec (basically ${job} 1>>${jobname}.out 2>>${jobname}.err). The new option to set a different output file/directory would need to be added to the cli script, and then the larger work of adding the option to the api service and using it to control the output redirection would follow.

A decision needs to be made on the level of control given to the user: do we want to mirror the direct and individual control of the stdout and stderr log files exposed by jsub, or do we want some other model on the cli frontend? By that I mean do we want something like --out full/path/to/file/including/extension or something like --log-dir full/path/to/dir. When not provided explicitly, what are the defaults for the new setting(s)? Would attempting to use --no-filelog and the new output file control at the same time be an error in the cli? Would it be an error in the backend if it made it past the cli?

@aborrero has been acting as both the tech lead and product manager for making these sorts of decisions about toolforge-jobs to this point in the implementation. He is currently out on extended leave which has left this project without a decider and also without a coder in the immediate future. I'm confident that we can find hands to fix critical bugs in the service if they come up, but right now we don't have extra folks to hop over and work on feature requests like this one.

If someone would like to step up and propose concrete requirements and design constraints as well as implement them following discussion that would be welcome as well. I can't guarantee the speed of engagement with code review and deployment because again we are short staffed at the moment (not always I promise!).

I think I can help with coding myself; I was able to find the two places (in the api and the cli Python codes) where changes need to be made.

The simplest solution, based on current code, is this, IMHO:

User can only specify path relative to their home dir (so --log-dir foo/bar will result in the error and output files to be stored at ~/foo/bar/jobname.err and ~/foo/bar/jobname.out, respectively).
Not specifying a --log-dir will result in the two files being saved within tool's home directory (current behavior).
Python script for API can be easily modified to first check the existence of the relative path provided by the user.
Update on-wiki documentation to state that --no-filelog supercedes --logdir meaning if you specify both, you still don't get any log files stored.
All other issues (such as not having write permission in the proposed path) are no different than current state, so we won't address it as part of this task. That means, I could also chmod my tool's home to make it unwritable and use the cli as is, and presumably it'll cause some error message somewhere, and so will be the case after we make the changes proposed above.

All of the above could translate to a change as simple as ${job} 1>>${logdir}${jobname}.out 2>>${logdir}${jobname}.err where the default value of logdir is blank. (Obviously, we will remind users that their --log-dir value must include the trailing / which is why we are not hard-coding it in this changed version.)

Any more nuanced solutions (like allowing a different path for the err versus out file) could be discussed separately, preferably after @aborrero is back.

Change 800864 had a related patch set uploaded (by Huji; author: Huji):

[cloud/toolforge/jobs-framework-cli@master] Allow specifying the path for log files

https://gerrit.wikimedia.org/r/800864

Change 800865 had a related patch set uploaded (by Huji; author: Huji):

[cloud/toolforge/jobs-framework-api@main] Allow specifying the path for log files

https://gerrit.wikimedia.org/r/800865

Those two patches should provide a general sense of what I mean.

Obviously, I cannot test these (I have no idea how!) so they would need a bit of love from someone who knows how to.

In T301901#7965352, @Huji wrote:

All of the above could translate to a change as simple as ${job} 1>>${logdir}${jobname}.out 2>>${logdir}${jobname}.err where the default value of logdir is blank. (Obviously, we will remind users that their --log-dir value must include the trailing / which is why we are not hard-coding it in this changed version.)

Any more nuanced solutions (like allowing a different path for the err versus out file) could be discussed separately, preferably after @aborrero is back.

These all seem like reasonable assumptions and trade offs.

In T301901#7965359, @Huji wrote:

Those two patches should provide a general sense of what I mean.

Obviously, I cannot test these (I have no idea how!) so they would need a bit of love from someone who knows how to.

Unfortunately I don't really know how to test this either, but I have added @dcaro and @Majavah to the patches in the hope that one or both of them already know or have time to learn how to test these changes.

bd808 edited projects, added Toolforge Jobs framework; removed Toolforge.May 31 2022, 8:02 PM

Change 802589 had a related patch set uploaded (by Huji; author: Huji):

[cloud/toolforge/jobs-framework-api@main] Improve the way in which commands from k8s objects are sliced

https://gerrit.wikimedia.org/r/802589

Change 802589 merged by jenkins-bot:

[cloud/toolforge/jobs-framework-api@main] Improve the way in which commands from k8s objects are sliced

https://gerrit.wikimedia.org/r/802589

Raymond_Ndibe mentioned this in T302211: toolforge-jobs: merge stdout/stderr output.Aug 12 2022, 6:01 PM

Raymond_Ndibe claimed this task.Aug 29 2022, 9:09 PM

Change 827592 had a related patch set uploaded (by Raymond Ndibe; author: Raymond Ndibe):

[cloud/toolforge/jobs-framework-cli@master] jobs-framework-cli: add --stdout and --stderr args to cli

https://gerrit.wikimedia.org/r/827592

Change 827594 had a related patch set uploaded (by Raymond Ndibe; author: Raymond Ndibe):

[cloud/toolforge/jobs-framework-api@main] jobs-framework-api: add --stdout and --stderr args to cli

https://gerrit.wikimedia.org/r/827594

Raymond_Ndibe added a project: User-Raymond_Ndibe.Aug 29 2022, 10:28 PM

Raymond_Ndibe moved this task from Backlog to In Review on the User-Raymond_Ndibe board.Sep 1 2022, 2:34 AM

My proposal is that we leave the current filelog option as is. I think the investment that will really benefit us is trying to work on the root problem: T127367: [toolforge,jobs-api,webservice,storage] Provide modern, non-NFS log solution for Toolforge webservices and bots

aborrero added a parent task: T127367: [toolforge,jobs-api,webservice,storage] Provide modern, non-NFS log solution for Toolforge webservices and bots.Sep 26 2022, 12:00 PM

bd808 moved this task from Inbox to Doing on the cloud-services-team (Kanban) board.Sep 27 2022, 9:30 PM

PeterBowman mentioned this in T319958: Migrate pbbot from Toolforge GridEngine to Toolforge Kubernetes.Oct 7 2022, 10:38 AM

aborrero merged a task: T304421: Allow customizing the out/err files with toolforge-jobs.Oct 28 2022, 11:37 AM

aborrero added subscribers: Daimona, Raymond_Ndibe, Shubinator, Wurgl.

There are a few things to consider here:

how to deal with relative vs absolute paths
how to deal with user specifying a directory vs a file
how to store & parse the log information in the kubernetes resource

Regarding #1 I kind of like the idea that @dcaro wrote: resolve filelog path relative to the CWD of the command. And reject absolute paths that doesn't start with /data/project/<user>/
This is {command_CWD}/{logfile}

Example:

command: ./mycommand.sh
filelog by user: log/myfile.log
filelog in k8s: ./log/myfile.log

command: mydir/command.sh
filelog by user: myfile.log
filelog in k8s: mydir/myfile.log

command: mydir/../command.sh
filelog by user: ./otherdir/../logs/mylog.log
filelog in k8s: mydir/../otherdir/../logs/mylog.log

Regarding #2 I think we should not make any assumptions, go the easy way and always expect a file. Let the shell report any problems the user config may create. And still, leverage the setup in #1 regarding paths.

Example:

command:  ./mycommand.sh
filelog by user: logs
filelog in k8s: ./logs
[ if this is a directory, the shell will complain and the job will fail]

command: ./mycommand.sh
filelog by user: logs
filelog in k8s: ./logs
[ if this is a file, then the logs will be stored there ]

Regarding #3, As of this writing, we are calling the user-defined job command with a sh wrapper, like this:

sh
-c
--
./mycommand.sh --arguments 1>/dev/null 2>/dev/null

I think we can have the setup 'a bit' more elegant by placing the redirections outside the job call, example:

sh
-c
1>>/dev/null
2>>/dev/null
--
./mycommand.sh --arguments

Changing this is challenging because there are plenty of jobs defined in the cluster and we may need to remain compatible until old ones are gone. Or bump the API version :-P which doesn't magically solve the compatibility either...

In T301901#8352382, @aborrero wrote:

Regarding #1 I kind of like the idea that @dcaro wrote: resolve filelog path relative to the CWD of the command. And reject absolute paths that doesn't start with /data/project/<user>/
This is {command_CWD}/{logfile}

On second thoughts, I think resolving relative paths relative to the $HOME is more robust for future usages. In the future we aim to have some way to interact with the API via a web interface where there may or may not be a concept of different work directories.

Regarding #2 I think we should not make any assumptions, go the easy way and always expect a file. Let the shell report any problems the user config may create. And still, leverage the setup in #1 regarding paths.

Example:
command:  ./mycommand.sh
filelog by user: logs
filelog in k8s: ./logs
[ if this is a directory, the shell will complain and the job will fail]

command: ./mycommand.sh
filelog by user: logs
filelog in k8s: ./logs
[ if this is a file, then the logs will be stored there ]

@aborrero , thanks for making out time to talk about this. I have a little question, if we always expect a file, then we might have to decide on what to do with this task. closing it is probably the best given the plan to always expect a file.

In T301901#8367879, @Raymond_Ndibe wrote:

@aborrero , thanks for making out time to talk about this. I have a little question, if we always expect a file, then we might have to decide on what to do with this task. closing it is probably the best given the plan to always expect a file.

My idea so far has been that these 3 feature requests are pretty similar and can be solved with the same changes to the API:

If we do it right, users can do all 3 with the new semantic we're introducing.

@aborrero It's still a but confusing, recommendation #2 is in direct conflict with the idea behind this ticket. I sent an invite for monday 12pm - 12:30pm utc time, so we can talk about this a bit more if you are interested? will you be available at that time or do we need to reschedule another time maybe?

$ toolforge-jobs run refreshlinks--k8s --command "php ./php/refreshlinks.php > ./logs/refreshlinks--k8s.log" --image tf-php74 --no-filelog

This command created the new file refreshlinks--k8s.log in my /logs directory but left it just a null file rather than writing STDOUT to it.

$ toolforge-jobs run refreshlinks--k8s --command "php ./php/refreshlinks.php > ./logs/refreshlinks-k8s.log" --image tf-php74

Likewise this command created the new file refreshlinks-k8s.log in my /logs directory but left it just a null file and wrote STDOUT to refreshlinks--k8s.out in my tool home directory.

Why is toolforge-jobs run not respecting standard shell command output redirection other than creating a null file at the 1> target of the redirection?

In T301901#8390757, @Wbm1058 wrote:

Why is toolforge-jobs run not respecting standard shell command output redirection other than creating a null file at the 1> target of the redirection?

toolforge-jobs does its own output redirection by appending to the end of the specified command. (ref)

Per the documentation:

Subsequent same-name job runs will append to the same files. NOTE: as of this writing there is no automatic way to prune log files, so tool users must take care of such files growing too large.

Supporting standard shell command output redirection will avoid this issue.
>> syntax says to append to the named log file, while
> says to supersede the contents of the previous file.

Using > allows me to fully automate my bot by avoiding the need for babysitting to "take care" of my log file growing too large.

I see >> in the code that JJMC89 linked to in the previous post in this thread, which is forcing this need for tending to our logs.

Wbm1058 mentioned this in T319590: Migrate billsbots from Toolforge GridEngine to Toolforge Kubernetes.Nov 26 2022, 11:55 AM

Change 871171 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[cloud/toolforge/jobs-framework-api@main] command: introduce filelog_std{out,err} parameters

https://gerrit.wikimedia.org/r/871171

Some of us (specially @Raymond_Ndibe and I ) we've been working on this feature. But we found several blockers and challenges, let me try capturing here what happened in the last couple of weeks.

The original feature patch was written by Raymond https://gerrit.wikimedia.org/r/c/cloud/toolforge/jobs-framework-api/+/827594 but it never got merged.

I saw the need for a previous refactor to better accommodate for the feature in a more elegant and robust way, basically abstracting the whole job command into a separated file/class.

part 1 https://gerrit.wikimedia.org/r/c/cloud/toolforge/jobs-framework-api/+/867591
part 2 https://gerrit.wikimedia.org/r/c/cloud/toolforge/jobs-framework-api/+/868076 <-- at this point we encountered problems

Raymond detected problems with the part 2, reported in T325755: Bug: jobs-framework-api job run fails silently on local development environment. The problems were not trivial to identify or reproduce, mainly because:

we don't have a deterministic and reproducible way of creating a local kubernetes Toolforge deployment.
the way we deploy jobs-api is a bit fragile (raw kubernetes YAMLs) and can create cruft and config drift in the different environments (local, tools, toolsbeta, ...)
because of the above the bug can be hard to trigger in a way that it is clear the bug is in effect and not other misconfiguration (for example, permissions, PSPs and such)

To try mitigating these problems, I did:

a PoC to maintain a local kubernetes deployment, based in ansible, called lima-kilo, that can be found here https://gitlab.wikimedia.org/aborrero/cloud-toolforge-lima-kilo Is in the early stages, but already solved part of the problem (at least for me in my devel machine)
try moving jobs-api to helm, patch here: https://gerrit.wikimedia.org/r/c/cloud/toolforge/jobs-framework-api/+/870887/

Finally, it was detected the root of the problem to be on the way we were redirecting the stdout/stderr, and we had to revert the pre-refactor https://gerrit.wikimedia.org/r/c/cloud/toolforge/jobs-framework-api/+/871202 to don't leave the repository in a known bad shape. Some bits of the patch have been rescued in https://gerrit.wikimedia.org/r/c/cloud/toolforge/jobs-framework-api/+/871171 anyway

We invested half an afternoon researching how to better structure the file redirections and we reached a couple of valid solutions:

array = ["sh", "-c", "1>>stdout 2>>stderr ./cmd.sh" ]
array = ["sh", "-c", "./cmd.sh 1>>stdout 2>>stderr" ]
array = ["sh", "-c", "exec 1>>stdout; exec 2>>stderr; {user_cmd}" ]

But as of this writing it is undecided which layout to use. We will continue with this after the end-of-year break.

Change 871171 merged by jenkins-bot:

[cloud/toolforge/jobs-framework-api@main] command: introduce filelog_std{out,err} parameters

https://gerrit.wikimedia.org/r/871171

Raymond_Ndibe closed this task as Resolved.Jan 17 2023, 2:10 PM

Raymond_Ndibe reopened this task as In Progress.Jan 17 2023, 2:35 PM

aborrero added a parent task: T327254: WMCS FY22/23 Q3: next steps in grid engine deprecation.Jan 18 2023, 11:44 AM

Change 881409 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[cloud/toolforge/jobs-framework-cli@master] loaded: make it understand filelog, filelog_stdout, filelog_stderr

https://gerrit.wikimedia.org/r/881409

fnegri edited projects, added cloud-services-team; removed cloud-services-team (Kanban).Jan 18 2023, 7:25 PM

fnegri moved this task from Kanban to Doing? (legacy column) on the cloud-services-team board.

fnegri moved this task from Doing? (legacy column) to FY2022/2023-Q3 on the cloud-services-team board.Jan 19 2023, 12:58 PM

fnegri edited projects, added cloud-services-team (FY2022/2023-Q3); removed cloud-services-team.

fnegri moved this task from Backlog to In progress on the cloud-services-team (FY2022/2023-Q3) board.Jan 19 2023, 2:40 PM

whym mentioned this in T319568: Migrate archiving from Toolforge GridEngine to Toolforge Kubernetes.Jan 21 2023, 2:01 AM

Change 800864 abandoned by Arturo Borrero Gonzalez:

[cloud/toolforge/jobs-framework-cli@master] Allow specifying the path for log files

Reason:

included in https://gerrit.wikimedia.org/r/c/cloud/toolforge/jobs-framework-cli/+/881409

https://gerrit.wikimedia.org/r/800864

Change 881409 merged by Arturo Borrero Gonzalez:

[cloud/toolforge/jobs-framework-cli@master] loader: make it understand filelog, filelog_stdout, filelog_stderr

https://gerrit.wikimedia.org/r/881409

aborrero mentioned this in rCTKFab42014cdf60: loader: make it understand filelog, filelog_stdout, filelog_stderr.Jan 23 2023, 4:59 PM

aborrero closed this task as Resolved.Jan 24 2023, 4:32 PM

dcaro moved this task from In progress to Done on the cloud-services-team (FY2022/2023-Q3) board.Feb 14 2023, 1:33 PM

Change 800865 abandoned by Arturo Borrero Gonzalez: