
scap should check if it is running within a tmux/screen
Open, High, Public

Description

If a deployer's local connection goes down while they are running scap, there is a chance of helm being left in a limbo state (see T361720). Deployers should always run scap within a server-side tmux/screen session.

We should:

  • update relevant documentation to stress how important it is to run within a tmux/screen session
  • add a warning in scap itself, so a deployer will be reminded/encouraged to use tmux/screen
    • possibly add a switch to override this warning
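A minimal sketch of the warning plus override switch, assuming detection via the $TMUX and $STY environment variables (tmux and GNU screen set these in the shells they spawn). The --ignore-multiplexer flag and the helper names below are hypothetical and are not part of scap's actual CLI or API.

```python
import argparse
import logging
import os

log = logging.getLogger("scap")

# Hypothetical helper and flag for illustration; not scap's actual API.
def in_multiplexer(environ):
    # tmux exports $TMUX and GNU screen exports $STY in the shells they spawn
    return "TMUX" in environ or "STY" in environ

def check_multiplexer(args, environ=None):
    """Warn unless overridden or already under tmux/screen.

    Returns True when the warning was emitted."""
    environ = os.environ if environ is None else environ
    if args.ignore_multiplexer or in_multiplexer(environ):
        return False
    log.warning(
        "Not running inside tmux or screen: a dropped connection may leave "
        "helm in a limbo state. Consider restarting scap inside a tmux/screen "
        "session, or pass --ignore-multiplexer to skip this warning."
    )
    return True

parser = argparse.ArgumentParser(prog="scap")
parser.add_argument(
    "--ignore-multiplexer",
    action="store_true",
    help="hypothetical switch that silences the tmux/screen warning",
)
```

Keeping it a warning rather than a hard failure means the override only silences a message and never blocks an urgent deployment.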

Details

Title | Reference | Author | Source Branch | Dest Branch
utils.py: add method to check if scap is running in a screen tmux | repos/releng/scap!271 | jiji | effie-screen | master
Prevent scap from running outside of a screen or a tmux | repos/releng/scap!266 | jiji | effie-screen | master

Event Timeline

jijiki triaged this task as High priority (Apr 3 2024, 4:53 PM).
jijiki created this task.
jijiki updated the task description.

> Deployers should always run scap within a server side tmux/screen

I think that if this is actually the best solution, then a wrapper should be created for scap that starts the tmux session automagically, instead of making it a nag workflow that folks have to remember to follow manually.

There is a tmux/screen check for scap stage-train, but nothing else. This could be factored out to cover other scap subcommands.
Suggestions:

  • scap backport
  • scap deploy
  • scap deploy-promote
  • scap sync-*
  • scap train
  • scap stage-train
  • scap lock

Applying the tmux/screen check for all scap subcommands (especially those unrelated to deployment) is definitely undesirable.
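The suggested scoping can be sketched as a name-based filter, assuming detection via $TMUX (tmux) and $STY (GNU screen). The pattern tuple and function names are hypothetical and simply mirror the suggestions above; fnmatch handles the sync-* wildcard, while exact names such as deploy only match themselves, so non-deployment subcommands are never nagged.

```python
import fnmatch
import os

# Hypothetical list mirroring the suggested subcommands; "sync-*" is a glob.
DURABLE_SHELL_SUBCOMMANDS = (
    "backport",
    "deploy",
    "deploy-promote",
    "sync-*",
    "train",
    "stage-train",
    "lock",
)

def needs_durable_shell(subcommand):
    """True if the subcommand is deployment-related and should be guarded."""
    return any(
        fnmatch.fnmatchcase(subcommand, pattern)
        for pattern in DURABLE_SHELL_SUBCOMMANDS
    )

def should_warn(subcommand, environ=None):
    """True if a guarded subcommand runs outside tmux ($TMUX) and screen ($STY)."""
    environ = os.environ if environ is None else environ
    # Unguarded subcommands never warn, whatever the shell.
    if not needs_durable_shell(subcommand):
        return False
    return "TMUX" not in environ and "STY" not in environ
```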

I might have used screen back in the old days (around 2005 or so), and might have tried screen/tmux at some point in the early 2010s when I started working for the WMF. Overall, my sole use case would have been running deployment tools (Trebuchet, copy-pasting commands, then scap). In 20+ years I have never felt I should invest time in learning the magic keyboard shortcuts to navigate them. I don't think I have had more than a couple of network interruptions while doing a deployment, so using them never became a habit.

OK, that is not an excuse :)

Would it be possible for scap to replace itself with the same invocation but wrapped in screen/tmux? Maybe something such as:

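import logging, os, shlex, shutil, sys
log = logging.getLogger(__name__)
# Re-exec only when neither tmux ($TMUX) nor screen ($STY) is active
if 'TMUX' not in os.environ and 'STY' not in os.environ:
    multiplexer = shutil.which('tmux') or shutil.which('screen')
    if multiplexer is not None:
        log.info("Restarting with %s", multiplexer)
        # tmux new-session takes a single shell command; screen takes argv directly
        wrapped = ['new-session', shlex.join(sys.argv)] if multiplexer.endswith('tmux') else sys.argv
        os.execvpe(multiplexer, [multiplexer] + wrapped, os.environ)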

As a side note, scap might well be interrupted with Ctrl+C, which might leave the ongoing helm deployment in a weird state, but it seems to be up to scap to learn how to handle such unexpected states, which is T361720.

For this use case, the only keybinding you would need to know is how to exit once your run is done, which is the same as exiting a shell: with exit or Ctrl+D.

I agree that switching to the multiplexer should happen transparently for the deployer, though they would still need to learn how to reattach a session, which is still way better than learning how to unbreak helm for something as avoidable as a connection loss.

jiji opened https://gitlab.wikimedia.org/repos/releng/scap/-/merge_requests/271

utils.py: add method to check if scap is running in a screen tmux

> Applying the tmux/screen check for all scap subcommands (especially those unrelated to deployment) is definitely undesirable.

I see your point; I rushed into it a bit. I updated the MR and limited it to introducing utils.is_shell_durable(). It would be great (and faster) if someone with better knowledge of the scap code than me could wire it into the appropriate subcommands.
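One way to wire such a helper into the relevant subcommands is a decorator on their entry points. This is a minimal sketch: is_shell_durable() below is a hypothetical stand-in (the actual implementation lives in MR !271 and is not reproduced here), warn_unless_durable and deploy are illustrative names, and detection assumes $TMUX (set by tmux) and $STY (set by GNU screen). Because OpenSSH does not forward these variables by default, a multiplexer running on the deployer's own machine would not satisfy the check, which matches the server-side requirement.

```python
import functools
import logging
import os

log = logging.getLogger("scap")

def is_shell_durable(environ=None):
    """Hypothetical stand-in for utils.is_shell_durable() from MR !271.

    tmux sets $TMUX and GNU screen sets $STY in the shells they spawn."""
    environ = os.environ if environ is None else environ
    return "TMUX" in environ or "STY" in environ

def warn_unless_durable(func):
    """Hypothetical decorator for deployment subcommand entry points."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        if not is_shell_durable():
            log.warning(
                "%s: not running inside tmux/screen; a dropped connection "
                "could leave helm in a limbo state",
                func.__name__,
            )
        return func(*args, **kwargs)
    return wrapper

@warn_unless_durable
def deploy(target):
    # Placeholder body standing in for a real deployment subcommand.
    return f"deploying {target}"
```

A decorator keeps the opt-in explicit: only the subcommands that carry it are checked, so unrelated ones are unaffected.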

> I agree that switching to the multiplexer should happen transparently for the deployer, though they would still need to learn how to reattach a session, which is still way better than learning how to unbreak helm for something as avoidable as a connection loss.

While it would be convenient, I think it would be confusing overall, so I reckon we should just warn users to restart what they were doing inside a screen or tmux session. The learning curve is quite small, and picking up the habit of starting a screen or tmux session can't be a terrible idea :)

@dancy it would be great if someone could finish this soon. While scap now has an option to mitigate potential helm hiccups, I think we should add this check to the mix nevertheless.