
scap broken on deploy1002 / deploy2002 (buster)
Closed, Resolved (Public)

Description

lucaswerkmeister-wmde@deploy1002 ~ $ scap help
Traceback (most recent call last):
  File "/usr/bin/scap", line 32, in <module>
    from scap import cli  # noqa: E402
ImportError: cannot import name 'cli' from 'scap' (unknown location)

lucaswerkmeister-wmde@deploy2002 ~ $ scap help
Traceback (most recent call last):
  File "/usr/bin/scap", line 32, in <module>
    from scap import cli  # noqa: E402
ImportError: cannot import name 'cli' from 'scap' (unknown location)

Seems to happen since the deployment of scap 4.94.0 for T371255. I don’t know if this is important or if these hosts are about to be completely decommissioned anyway.

Details

Related Changes in GitLab:
Title: install_world: use wheels to install scap on secondary deployment servers
Reference: repos/releng/scap!393
Author: jnuche
Source Branch: T371261
Dest Branch: master

Event Timeline

Note that this currently shows up in deployments as errors in the sync-masters step, see T371255#10023148.

lucaswerkmeister-wmde@deploy1002 ~ $ dpkg -L scap
dpkg-query: package 'scap' is not installed
Use dpkg --contents (= dpkg-deb --contents) to list archive files contents.

wat

Edit: Never mind, it’s this way on deploy1003 too. I guess we just deploy scap outside of dpkg/apt? Idk.

Aha, scap is now trying to use Python 3.9 (the Python 3.7 dir in the venv is just a little leftover stub apparently):

lucaswerkmeister-wmde@deploy1002 ~ $ ls /var/lib/scap/scap/lib/python3.?/site-packages/scap/
/var/lib/scap/scap/lib/python3.7/site-packages/scap/:
plugins  __pycache__

/var/lib/scap/scap/lib/python3.9/site-packages/scap/:
ansi.py     cdblib.py  cmd.py	   deploy_promote.py  history.py	interaction.py	lock.py		     main.py		php_fpm.py  script.py	    targets.py	 train.py
arg.py	    checks.py  config.py   deploy.py	      __init__.py	kubernetes.py	log.py		     nrpe.py		plugins     ssh.py	    tasks.py	 utils.py
browser.py  cli.py     context.py  git.py	      install_world.py	lint.py		logstash_checker.py  phorge_conduit.py	runcmd.py   stage_train.py  template.py  version.py

But deploy1002 only has Python 3.7 installed AFAICT.
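The mismatch above can be checked mechanically. This is a minimal sketch, not part of scap: it assumes the venv layout shown in the `ls` output (`lib/python3.X/site-packages/scap/`), and the function names are made up for illustration. It reports which per-version directories hold an importable package (one with an `__init__.py`) and whether the running interpreter has one.

```python
import sys
from pathlib import Path


def venv_python_dirs(venv_root, package="scap"):
    """Map each lib/python3.X dir in the venv to whether it holds a full package."""
    result = {}
    for site in sorted(Path(venv_root).glob("lib/python3.*/site-packages")):
        # An importable package has an __init__.py; a leftover stub does not.
        result[site.parent.name] = (site / package / "__init__.py").is_file()
    return result


def running_python_dir():
    """Name of the lib directory the current interpreter uses, e.g. 'python3.7'."""
    return "python{}.{}".format(*sys.version_info[:2])


if __name__ == "__main__":
    # Path taken from the listing above; adjust for other hosts.
    dirs = venv_python_dirs("/var/lib/scap/scap")
    current = running_python_dir()
    for name, complete in dirs.items():
        print(f"{name}: {'complete' if complete else 'stub'}")
    if not dirs.get(current, False):
        print(f"no usable scap package for {current}")
```

On a host in the state described here, the 3.7 directory would be reported as a stub, which matches the ImportError in the description.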

> Note that this currently shows up in deployments as errors in the sync-masters step, see T371255#10023148.

It also makes scap exit with an error:

14:41:38 backport failed: <CalledProcessError> Command '['/usr/bin/scap', 'sync-world', '--pause-after-testserver-sync', '--notify-user=lucaswerkmeister-wmde', 'Backport for [[gerrit:1057893|Revert "TranslatablePage: Split translatable page id cache into multiple shards" (T366455)]]']' returned non-zero exit status 1.

Yes, this is Python virtualenv related. I've tried some simple fixes already, but they didn't work.

However:

  • I am gonna be decommissioning deploy1002 this week
  • I'll be reimaging deploy2002 as bullseye tomorrow (deploy1003 already is)

So, if we can wait this out until tomorrow, we'll have a solution without having to solve this more nicely (and we revisit it in 2 years, for the bookworm migration).

That works for me; deployers should just be aware that these messages will show up until then (two yellow errors in sync-masters, and a red nonzero exit at the very end).

The problem is caused by the scap installer. It erroneously assumes all deployment servers are on the same distro.

I think I should be able to change the installer to take the actual distro (and the Python version) into account. Looking into it.

> The problem is caused by the scap installer. It erroneously assumes all deployment servers are on the same distro.

Yeah, that's not a bad assumption to make. It's false for like 2-3 days every 5 years?

> I think I should be able to change the installer to take the actual distro (and the Python version) into account. Looking into it.

Thanks! I'd suggest not wasting too much time. If you can get it working quickly, yay! Otherwise, it's probably not worth it.

> Thanks! I'd suggest not wasting too much time. If you can get it working quickly, yay! Otherwise, it's probably not worth it.

That's a good point. The installer already checks the distro of regular targets when deciding which wheels to install, so I should be able to use that for the deployment servers too. But if it gets too messy I'll probably drop it.
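For illustration only, the idea of picking wheels per distro could look roughly like the sketch below. It is not the scap installer's actual code: the `wheels/<codename>` layout and the function names are hypothetical. The only facts relied on are the `VERSION_CODENAME` field of `/etc/os-release` and the default Python 3 versions shipped by Debian buster (3.7), bullseye (3.9) and bookworm (3.11).

```python
# Default Python 3 minor version shipped by each Debian release.
DEBIAN_PYTHON = {"buster": "3.7", "bullseye": "3.9", "bookworm": "3.11"}


def parse_os_release(text):
    """Parse os-release style KEY=value lines into a dict."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Values may be wrapped in double or single quotes.
        fields[key] = value.strip().strip('"').strip("'")
    return fields


def wheels_dir_for(os_release_text, wheels_root="wheels"):
    """Return a (hypothetical) per-distro wheels path for the given host."""
    codename = parse_os_release(os_release_text).get("VERSION_CODENAME")
    if codename not in DEBIAN_PYTHON:
        raise ValueError(f"unsupported distro: {codename!r}")
    return f"{wheels_root}/{codename}"
```

Given the contents of `/etc/os-release` from a buster host, `wheels_dir_for` would return `wheels/buster`, so a secondary deployment server on an older distro gets wheels built for its own Python version instead of the primary's.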

As a heads up

I've already re-imaged deploy2002 to bullseye (T371282), to avoid ending up in a situation where we can't deploy if something happens and deploy1003 is unavailable.

I have filed T371283 for decom of deploy1002, but I'll stall it so you have something to test against.

> I have filed T371283 for decom of deploy1002, but I'll stall it so you have something to test against.

Thanks, appreciate it :) I'll keep you posted.

jnuche opened https://gitlab.wikimedia.org/repos/releng/scap/-/merge_requests/393

install_world: use wheels to install scap on secondary deployment servers

jnuche merged https://gitlab.wikimedia.org/repos/releng/scap/-/merge_requests/393

install_world: use wheels to install scap on secondary deployment servers

Got scap working on deploy1002 again:

[jnuche@deploy1002 ~]$ scap version
4.95.0-1

That's made the sync-masters errors disappear and backports are back to normal.

The new scap version affects only the installer and no other functionality. I'll wait for a window today and roll it out to all hosts. Once I verify everything is looking good, I'll close this task.

@akosiaris there is some kind of issue with the ssh config for user scap on deploy2002.

When the scap installer tried to update scap there I got:

14:21:00 ['/usr/bin/rsync', '--archive', '--delay-updates', '--delete', '--delete-delay', '--compress', '--new-compress', '--exclude=*.swp', '--exclude=**/__pycache__', 'deploy1003.eqiad.wmnet::scap-install-staging', '/var/lib/scap'] (ran as scap@deploy2002.codfw.wmnet) returned [1]: This account is currently not available.

And sure enough, it reproduces when trying to ssh manually:

[jnuche@deploy1003 ~]$ ssh [...] -l scap deploy2002.codfw.wmnet
Linux deploy2002 5.10.0-31-amd64 #1 SMP Debian 5.10.221-1 (2024-07-14) x86_64
Debian GNU/Linux 11 (bullseye)
[...]
This account is currently not available.
Connection to deploy2002.codfw.wmnet closed.
[jnuche@deploy1003 ~]$

I can still use user scap to ssh to deploy1002 though.

So the problem with deploy2002 was that the scap user didn't have a login shell specified in Puppet, so it defaulted to /usr/sbin/nologin, which disables login for the account.

Fixed with https://gerrit.wikimedia.org/r/c/operations/puppet/+/1058194 (thanks a lot to @Joe for the fix).
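A quick way to spot this kind of problem on a host is to look at the account's login shell. The sketch below is an illustration using Python's standard `pwd` module, not anything from scap or Puppet; the set of "disabled" shells is an assumption covering the common cases.

```python
import pwd

# Shells commonly used to disable interactive login (assumed list, not exhaustive).
NOLOGIN_SHELLS = {"/usr/sbin/nologin", "/sbin/nologin", "/bin/false"}


def login_disabled(shell):
    """True if the given shell path is one that refuses logins."""
    return shell in NOLOGIN_SHELLS


def account_shell(username):
    """Login shell recorded for the user in the passwd database."""
    return pwd.getpwnam(username).pw_shell
```

Running `login_disabled(account_shell("scap"))` on the affected host would have returned `True`, consistent with the "This account is currently not available." message.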

I have now successfully updated scap to the latest version everywhere, including deploy2002.

jnuche claimed this task.