Page MenuHomePhabricator

Running maintenance scripts in screen on `deploy1003` appears to fail, but is still running
Closed, ResolvedPublicBUG REPORT

Description

Steps to replicate the issue (include links if applicable):

  • Run a time-consuming script in a screen on deploy1003 using mwscript-k8s. (In this case I was running the following:
mwscript-k8s --comment=T398900 --follow --verbose --file=/srv/mediawiki-staging/php-1.45.0-wmf.12/extensions/SecurePoll/maintenance/wm-scripts/makeGlobalVoterList.php -- extensions/SecurePoll/maintenance/wm-scripts/makeGlobalVoterList.php --wiki=mediawikiwiki --edit-count-table=securepoll_bv2025_edits --list-name=board-vote-2025 --short-min-edits=20 --long-min-edits=300

(This usually takes about 8 hours)

What happens?:

After a while the logging seems to suggest the script suddenly failed:

[...]
42,920,000 of 79,013,591 ; 54.32% ; estimated time remaining: 3 hours, 20 minutes and 39 seconds
42,955,000 of 79,013,591 ; 54.36% ; estimated time remaining: 3 hours, 20 minutes and 26 seconds
42,988,000 of 79,013,591 ; 54.41% ; estimated time remaining: 3 hours, 20 minutes and 14 seconds
43,021,000 of 79,013,591 ; 54.45% ; estimated time remaining: 3 hours, 20 minutes and 3 seconds
43,050,000 of 79,013,591 ; 54.48% ; estimated time remaining: 3 hours, 19 minutes and 53 seconds
43,080,000 of 79,013,591 ; 54.52% ; estimated time remaining: 3 hours, 19 minutes and 43 seconds
43,110,000 of 79,013,591 ; 54.56% ; estimated time remaining: 3 hours, 19 minutes and 33 seconds
43,139,000 of 79,013,591 ; 54.60% ; estimated time remaining: 3 hours, 19 minutes and 24 seconds
43,169,000 of 79,013,591 ; 54.63% ; estimated time remaining: 3 hours, 19 minutes and 14 seconds
43,191,000 of 79,013,591 ; 54.66% ; estimated time remaining: 3 hours, 19 minutes and 9 seconds
foks@deploy1003:~$

In fact, the script is still running right now.

What should have happened instead?:

Script should have continued to output logs.

Event Timeline

A couple of related things that are not debugging the issue per se.
mwscript-k8s does not need to be run in a screen or tmux to persist for long running scripts as it is only a wrapper that launches the script to run on the kubernetes cluster. As such, the script will always run to completion, or until something like an OOM or a host drain forces the Pod it runs in to be killed.
You can find your job and rejoin the log stream at any point in the script run.

If you have timestamps for when your log output stopped, we can check what happened, but my guess would be that, as mwscript-k8s only wraps kubectl logs -f for log tailing, something happened that caused kubectl logs -f to exit, probably a kubernetes API server restart.

Tagging @RLazarus because we can probably catch this sort of exit and output a reassuring message and the proper command to rejoin the log stream.

Tagging @RLazarus because we can probably catch this sort of exit and output a reassuring message and the proper command to rejoin the log stream.

Hmm, good idea. I hope we can tell the difference from kubectl's exit status -- it's not documented AFAICT but I'll see if I can reproduce and/or dig around in the code. (It's also possible we get a nonzero exit status when the API server restarts, but not if the stream just times out or something.)

We could also just wait until kubectl logs terminates, then unconditionally send an API request to see if the job's done. That feels like a bunch of extra moving parts at exit time, but I actually don't hate it -- it means if the job is done but exited with an error, or got OOM killed, etc., we can also let the user know.

If the job is still running, we could maybe even restart kubectl logs automatically, but we probably shouldn't: there's some risk of lines being skipped or repeated. That wouldn't matter here where it's just progress updates, but for some scripts the user would really want to know.


Separately, @jrbs just fyi:

[...] --file=/srv/mediawiki-staging/php-1.45.0-wmf.12/extensions/SecurePoll/maintenance/wm-scripts/makeGlobalVoterList.php -- extensions/SecurePoll/maintenance/wm-scripts/makeGlobalVoterList.php [...]

This might not do what you want. --file makes a copy from the deployment host into /data in the container (that is, a new copy separate from the complete MediaWiki installation that's already in there) but you aren't using that new copy.

If you've made local changes to makeGlobalVoterList.php and you're trying to run your modified version, you should say

[...] --file=/srv/mediawiki-staging/php-1.45.0-wmf.12/extensions/SecurePoll/maintenance/wm-scripts/makeGlobalVoterList.php -- /data/makeGlobalVoterList.php [...]

If you haven't modified it and you're just running the normal script, then what you're doing will work fine but the --file flag isn't doing anything for you, you can remove it:

[...] -- extensions/SecurePoll/maintenance/wm-scripts/makeGlobalVoterList.php [...]

mwscript-k8s does not need to be run in a screen or tmux to persist for long running scripts as it is only a wrapper that launches the script to run on the kubernetes cluster. As such, the script will always run to completion, or until something like an OOM or a host drain forces the Pod it runs in to be killed.
You can find your job and rejoin the log stream at any point in the script run.

Whoa! TIL. I was just following my usual process for the maintenance server (documented here) but if mwscript-k8s essentially serves as a screen in and of itself perhaps I should adjust. (Though we do need to run them on all eight sections concurrently otherwise it would take much longer)

If you have timestamps for when your log output stopped, we can check what happened, but my guess would be that, as mwscript-k8s only wraps kubectl logs -f for log tailing, something happened that caused kubectl logs -f to exit, probably a kubernetes API server restart.

Unfortunately I don't have timestamps in my terminal.

Separately, @jrbs just fyi:

[...] --file=/srv/mediawiki-staging/php-1.45.0-wmf.12/extensions/SecurePoll/maintenance/wm-scripts/makeGlobalVoterList.php -- extensions/SecurePoll/maintenance/wm-scripts/makeGlobalVoterList.php [...]

This might not do what you want. --file makes a copy from the deployment host into /data in the container (that is, a new copy separate from the complete MediaWiki installation that's already in there) but you aren't using that new copy.

If you've made local changes to makeGlobalVoterList.php and you're trying to run your modified version, you should say

[...] --file=/srv/mediawiki-staging/php-1.45.0-wmf.12/extensions/SecurePoll/maintenance/wm-scripts/makeGlobalVoterList.php -- /data/makeGlobalVoterList.php [...]

If you haven't modified it and you're just running the normal script, then what you're doing will work fine but the --file flag isn't doing anything for you, you can remove it:

[...] -- extensions/SecurePoll/maintenance/wm-scripts/makeGlobalVoterList.php [...]

Thanks for the pointers, this is very helpful and makes a lot of sense!

mwscript-k8s does not need to be run in a screen or tmux to persist for long running scripts as it is only a wrapper that launches the script to run on the kubernetes cluster. As such, the script will always run to completion, or until something like an OOM or a host drain forces the Pod it runs in to be killed.
You can find your job and rejoin the log stream at any point in the script run.

Whoa! TIL. I was just following my usual process for the maintenance server (documented here) but if mwscript-k8s essentially serves as a screen in and of itself perhaps I should adjust. (Though we do need to run them on all eight sections concurrently otherwise it would take much longer)

There's no issue with running mwscript-k8s inside a terminal multiplexer if you need multiple terminals so you can launch multiple scripts, or just want that persistent output. The biggest user experience difference for mwscript-k8s is that if you ctrl-c, or didn't launch in a multiplexer and your connection to the deployment server fails, the script will continue running as if nothing happened.

I see from that wikitech page that most documented calls use the legacy mwscript wrapper, which is deprecated and should not be used except in very particular cases. Don't hesitate to tell us if you need help updating the documentation to mwscript-k8s calls.

Thanks for the help folks! So for this task, I don't think anything is needed - there is no bug here. But perhaps the failure to see logs within screen is a problem? Maybe it is silently exiting the k8s log view or something.

I think this happens because the logs follow is a kubelet-backed read, not a kube-apiserver-backed one, and as such it has a 4 hours read timeout backstop independent of the connection being idle or not.