
Scap fails on debian bullseye targets
Closed, ResolvedPublic

Description

On scap targets running debian bullseye:

netbox-dev2002:~$ /usr/bin/scap
Traceback (most recent call last):
  File "/usr/bin/scap", line 39, in <module>
    from scap import cli
ModuleNotFoundError: No module named 'scap'

/usr/bin/scap is a symlink to /var/lib/scap/scap/bin/scap indicating that this host has been properly configured and bootstrapped by puppet to be a scap target. (xref T303559)

/var/lib/scap/scap/bin/scap inserts /var/lib/scap/scap at the front of the Python module search path. /var/lib/scap/scap/lib/python3.7/site-packages/scap/ contains the module we're attempting to import. However, bullseye targets run Python 3.9. The tree ends up in the python3.7 directory because it is prepared on deploy1002, which runs buster (Python 3.7).

We'll need a way to produce a tree with deps for 3.7 and 3.9 (or find a way to use one set for both). Hacking and testing are in order.
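For context: a venv-style layout resolves its site-packages directory from the running interpreter's major.minor version, so the failing lookup is roughly the following (an illustrative sketch, not the actual bin/scap code):

import sys

# Illustrative sketch (not the actual bin/scap): the site-packages path in a
# venv-style layout embeds the interpreter's major.minor version.
prefix = "/var/lib/scap/scap"
site_packages = "%s/lib/python%d.%d/site-packages" % (prefix, *sys.version_info[:2])
sys.path.insert(0, site_packages)

# On buster (3.7) this resolves to .../lib/python3.7/site-packages and works;
# on bullseye (3.9) it resolves to .../lib/python3.9/site-packages, which does
# not exist, so the import below raises ModuleNotFoundError.
from scap import cli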

Details

Title: Add binary image support
Reference: repos/releng/scap!60
Author: dancy
Source Branch: review/dancy/binary-dist
Dest Branch: master

Event Timeline

Restricted Application added a subscriber: Aklapper. · Jan 10 2023, 7:06 PM
netbox-dev2002$ ls -l /var/lib/scap/scap/lib/
4.0K drwxr-xr-x 3 scap scap 4.0K Jan  4 19:07 python3.7

scap install-world already has hacks in place to deal with different Python versions (https://gitlab.wikimedia.org/repos/releng/scap/-/blob/master/scap/install_world.py#L271), so it's unclear why the symlink wasn't set up. I ran scap install-world -v --limit-hosts netbox-dev2002.codfw.wmnet and scap started working again.
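Roughly, the shape of that workaround is the following (an illustrative sketch only; see install_world.py for the real logic):

import os
import sys

# Illustrative sketch: if the deployed tree was built for a different Python
# minor version, point the running version's lib directory at the existing
# tree via a symlink.
lib_dir = "/var/lib/scap/scap/lib"
built_for = "python3.7"  # version the tree was prepared with on deploy1002
running = "python%d.%d" % sys.version_info[:2]

link = os.path.join(lib_dir, running)
if running != built_for and not os.path.exists(link):
    os.symlink(os.path.join(lib_dir, built_for), link)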

And the lib directory afterward:

$ ls -l /var/lib/scap/scap/lib
total 4
drwxr-xr-x 3 scap scap 4096 Jan 10 19:37 python3.7
lrwxrwxrwx 1 scap scap   32 Jan 10 19:38 python3.9 -> /var/lib/scap/scap/lib/python3.7

I've run scap install-world at least once in the past week.

webperf1003.eqiad.wmnet has this problem. Here's how to test and investigate:

From deploy1002:

$ SSH_AUTH_SOCK=/run/keyholder/proxy.sock  ssh -i/etc/keyholder.d/scap scap@webperf1003.eqiad.wmnet scap version
Traceback (most recent call last):
  File "/usr/bin/scap", line 39, in <module>
    from scap import cli
ModuleNotFoundError: No module named 'scap'

$ SSH_AUTH_SOCK=/run/keyholder/proxy.sock  ssh -i/etc/keyholder.d/scap scap@webperf1003.eqiad.wmnet ls -l /var/lib/scap/scap/lib
total 4
drwxr-xr-x 3 scap scap 4096 Jan  4 19:07 python3.7

Python virtual environments should not be copied; they are inherently non-portable, even on the same system.
See for example the warning box in the venv documentation: https://docs.python.org/3/library/venv.html#how-venvs-work (you need to scroll down a bit).
In addition, any compiled file (.pyc, .so, .pyd) is bound to the version of the binaries and ABIs that generated it.
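The version/ABI coupling is visible from any interpreter with just the stdlib; a quick illustration (not Scap-specific):

import importlib.machinery
import sys

# The tags embedded in compiled artifacts come from the running interpreter,
# so files built under 3.7 carry tags that a 3.9 interpreter never looks for.
print(sys.implementation.cache_tag)               # e.g. 'cpython-39'
print(importlib.machinery.EXTENSION_SUFFIXES[0])  # e.g. '.cpython-39-x86_64-linux-gnu.so'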

> And the lib directory afterward:
>
> $ ls -l /var/lib/scap/scap/lib
> total 4
> drwxr-xr-x 3 scap scap 4096 Jan 10 19:37 python3.7
> lrwxrwxrwx 1 scap scap   32 Jan 10 19:38 python3.9 -> /var/lib/scap/scap/lib/python3.7

The above can't be the desired outcome for many reasons:

  • It doesn't work for any C extension:
$ tree /var/lib/scap/scap/lib/python3.9/site-packages/markupsafe
/var/lib/scap/scap/lib/python3.9/site-packages/markupsafe
├── _compat.py
├── _constants.py
├── __init__.py
├── _native.py
├── __pycache__
│   ├── _compat.cpython-39.pyc
│   ├── __init__.cpython-39.pyc
│   └── _native.cpython-39.pyc
├── _speedups.c
└── _speedups.cpython-37m-x86_64-linux-gnu.so

1 directory, 9 files
  • Nor does it work for any Python wheel in /var/lib/scap/scap/share/python-wheels that depends on the Python version (luckily there are currently none)
  • The venv binaries are internally inconsistent: scripts named for one Python version actually invoke a different one:
$ tree /var/lib/scap/scap/bin/
/var/lib/scap/scap/bin/
├── activate
├── activate.csh
├── activate.fish
├── chardetect
├── easy_install
├── easy_install-3.7
├── install_local_version.sh
├── normalizer
├── pip
├── pip3
├── pip3.7
├── pygmentize
├── python -> python3
├── python3 -> /usr/bin/python3
├── scap
└── wheel

0 directories, 16 files

$ head -n1 /var/lib/scap/scap/bin/pip3.7
#!/var/lib/scap/scap/bin/python3

$ ls -la /var/lib/scap/scap/bin/python3
lrwxrwxrwx 1 scap scap 16 Jan 10 19:37 /var/lib/scap/scap/bin/python3 -> /usr/bin/python3

$ /usr/bin/python3 -V
Python 3.9.2
  • It also has metadata inconsistencies:
$ cat /var/lib/scap/scap/pyvenv.cfg
home = /usr/bin
include-system-site-packages = false
version = 3.7.3
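This kind of mismatch is easy to detect mechanically; for example (a hypothetical check, not part of Scap; the path is the one from this task):

import subprocess

VENV = "/var/lib/scap/scap"

def venv_recorded_version(venv):
    # Version recorded by the tool that created the venv, e.g. '3.7.3'.
    with open(venv + "/pyvenv.cfg") as f:
        for line in f:
            key, _, value = line.partition("=")
            if key.strip() == "version":
                return value.strip()
    return None

def venv_actual_version(venv):
    # Version of the interpreter bin/python3 actually resolves to, e.g. '3.9.2'.
    out = subprocess.check_output(
        [venv + "/bin/python3", "-c",
         "import platform; print(platform.python_version())"]
    )
    return out.decode().strip()

recorded = venv_recorded_version(VENV)
actual = venv_actual_version(VENV)
if recorded and recorded.split(".")[:2] != actual.split(".")[:2]:
    print("venv mismatch: created for %s but runs %s" % (recorded, actual))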

I'm trying to determine what caused Scap to break this week, given that deploying to the same webperf1003 server (which has been running Bullseye for almost a year) from deploy1002 worked fine last week.

Looking at uptime and dpkg logs:

  • webperf1003 uptime: 78 days
  • webperf1003 dpkg: Python 3.9.2 since at least 2022-04-20
  • deploy1002 uptime: 261 days
  • deploy1002 dpkg: Python 3.7.3 since at least 2022-05-12

krinkle@webperf1003:
$ cat /var/log/dpkg.log*.gz | gunzip | ack 'python.* 3\.'
2022-04-20 12:43:22 install libpython3.9-minimal:amd64 <none> 3.9.2-1
..
2022-04-20 12:54:41 status installed libpython3.9:amd64 3.9.2-1

$ cat /var/log/dpkg.log{,.1} | ack python
$

[21:25 UTC] krinkle at deploy1002.eqiad.wmnet
$ cat /var/log/dpkg.log*.gz | gunzip | ack 'python.* 3\.'
2022-05-12 09:14:01 configure python3-venv:amd64 3.7.3-1 <none>
..
2022-05-12 09:14:01 status unpacked python3-venv:amd64 3.7.3-1

$ cat /var/log/dpkg.log{,.1} | ack python
2022-12-07 12:36:31 status half-configured libpython3.7:amd64 3.7.3-2+deb10u3
..
2022-12-07 12:36:58 status installed python3.7:amd64 3.7.3-2+deb10u4
$

> I ran scap install-world -v --limit-hosts netbox-dev2002.codfw.wmnet and scap started working again.

Looks like this works for our purposes as well. Running it now on the webperf servers so that we're unblocked for immediate navtiming deployments.

Mentioned in SAL (#wikimedia-operations) [2023-01-12T21:59:34Z] <Krinkle> krinkle@deploy1002$ scap install-world -v --limit-hosts for webperf1003.eqiad and webperf2003.codfw, ref T326668

IMHO we should not add yet another way of deploying Python code in production using PyInstaller, in addition to the existing ones outlined in T180023.
One of the problems I see with PyInstaller is that it will not use the OS-provided Python, bypassing our standard ways to track and promptly deploy security updates when needed.

> IMHO we should not add yet another way of deploying Python code in production using PyInstaller, in addition to the existing ones outlined in T180023.
> One of the problems I see with PyInstaller is that it will not use the OS-provided Python, bypassing our standard ways to track and promptly deploy security updates when needed.

+1. This should be fixed in Scap, not worked around by bundling a full Python stack for each distro.

This seems to be affecting all of the stat, logstash, and net{mon,box} hosts listed on https://puppetboard.wikimedia.org/nodes?status=failed. We see the following in the puppet output:

Error: Execution of '/usr/bin/scap deploy-local --repo phabricator/deployment -D log_json:False' returned 70: 
Error: /Stage[main]/Profile::Phabricator::Aphlict/Scap::Target[phabricator/deployment]/Package[phabricator/deployment]/ensure: change from 'absent' to 'present' failed: Execution of '/usr/bin/scap deploy-local --repo phabricator/deployment -D log_json:False' returned 70:

or on logstash:

Error: Execution of '/usr/bin/scap deploy-local --repo releng/phatality -D log_json:False' returned 1: 
Error: /Stage[main]/Opensearch_dashboards::Phatality/Scap::Target[releng/phatality]/Package[releng/phatality]/ensure: change from 'absent' to 'present' failed: Execution of '/usr/bin/scap deploy-local --repo releng/phatality -D log_json:False' returned 1:  (corrective)

and on stat (possibly a slightly different issue):

Error: Execution of '/usr/bin/scap deploy-local --repo analytics/hdfs-tools/deploy -D log_json:False' returned 70: 
Error: /Stage[main]/Profile::Analytics::Hdfs_tools/Scap::Target[analytics/hdfs-tools/deploy]/Package[analytics/hdfs-tools/deploy]/ensure: change from 'absent' to 'present' failed: Execution of '/usr/bin/scap deploy-local --repo analytics/hdfs-tools/deploy -D log_json:False' returned 70:  (corrective)
Notice: /Stage[main]/Profile::Analytics::Refinery::Repository/Scap::Target[analytics/refinery]/Package[analytics/refinery]/ensure: created (corrective)
Error: Execution of '/usr/bin/scap deploy-local --repo wikimedia/discovery/analytics -D log_json:False' returned 70: 
Error: /Stage[main]/Profile::Analytics::Cluster::Elasticsearch/Scap::Target[wikimedia/discovery/analytics]/Package[wikimedia/discovery/analytics]/ensure: change from 'absent' to 'present' failed: Execution of '/usr/bin/scap deploy-local --repo wikimedia/discovery/analytics -D log_json:False' returned 70:  (corrective)

@jbond, that seems to be a different issue, occurring while installing Scap3 services. This ticket is about a problem installing Scap itself via the self-installer.

I don't have access to the link you shared. Were those hosts being freshly installed with the services? If puppet recovered on the second run and scap succeeded in installing the service at that point, then this might be something I saw once before while provisioning a new Jenkins host with a Scap3 deployment. I couldn't reproduce it, though.

Looks like @Dzahn already filed a separate ticket for this: https://phabricator.wikimedia.org/T330361

I didn't have a closer look at the failure, but noticed a correlation: logstash, netmon, and stat all have the git security update, which added an ownership check that prevents git operations by a user other than the one that owns the repository's .git directory. We may be missing some safe.directory directives here?
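If that correlation holds, the usual remedy is a safe.directory entry per deploy repo; an illustrative sketch (the path is an example, not necessarily where these repos live):

import subprocess

# Illustrative only: mark a deploy repo as safe when its .git directory is
# owned by a different user than the one running git.
subprocess.check_call([
    "git", "config", "--system", "--add",
    "safe.directory", "/srv/deployment/releng/phatality",
])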

For the logstash and arclamp hosts, the scap module doesn't seem to be found/loadable:

deploy-service@logstash1032:~$ scap  deploy-local --repo releng/phatality -D log_json:False
Traceback (most recent call last):
  File "/usr/bin/scap", line 44, in <module>
    from scap import cli
ModuleNotFoundError: No module named 'scap'

@fgiunchedi, as a quick fix I manually installed Scap's latest version on those hosts (arclamp and logstash):

scap@logstash1032:~$ scap version
4.38.0-1

I'm currently working on the actual fix for the scap self-install problems.

> @jbond, that seems to be a different issue, occurring while installing Scap3 services. This ticket is about a problem installing Scap itself via the self-installer.

Thanks @jnuche. I think there could be a few different issues; from what I can tell we have:

logstash[1023-1025,1030-1031], arclamp1001, webperf1003, and netmon1003 are all bullseye servers and show the same error posted in the original description and above. There are also two other unrelated scap issues, so I raised them separately as T330394 and T330393.

> @fgiunchedi, as a quick fix I manually installed Scap's latest version on those hosts (arclamp and logstash):
>
> scap@logstash1032:~$ scap version
> 4.38.0-1
>
> I'm currently working on the actual fix for the scap self-install problems.

Thank you @jnuche, appreciate it! I can confirm that scap runs fine now. The provider is causing a trigger on each puppet run, though I think that's a different issue (likely the same one that https://gerrit.wikimedia.org/r/c/operations/puppet/+/891555/ addresses):

arclamp1001:~$ sudo run-puppet-agent
Info: Using configured environment 'production'
Info: Retrieving pluginfacts
Info: Retrieving plugin
Info: Retrieving locales
Info: Loading facts
Info: Caching catalog for arclamp1001.eqiad.wmnet
Info: Applying configuration version '(cb9d9be2dc) Muehlenhoff - Switch puppetdb to profile::java'
Notice: /Stage[main]/Arclamp/Scap::Target[performance/arc-lamp]/Package[performance/arc-lamp]/ensure: created (corrective)
Info: /Stage[main]/Arclamp/Scap::Target[performance/arc-lamp]/Package[performance/arc-lamp]: Scheduling refresh of Systemd::Service[excimer-log]
Info: /Stage[main]/Arclamp/Scap::Target[performance/arc-lamp]/Package[performance/arc-lamp]: Scheduling refresh of Systemd::Service[excimer-wall-log]
Info: /Stage[main]/Arclamp/Scap::Target[performance/arc-lamp]/Package[performance/arc-lamp]: Scheduling refresh of Systemd::Service[excimer-k8s-log]
Info: /Stage[main]/Arclamp/Scap::Target[performance/arc-lamp]/Package[performance/arc-lamp]: Scheduling refresh of Systemd::Service[excimer-k8s-wall-log]
Info: Systemd::Service[excimer-log]: Scheduling refresh of Service[excimer-log]
Info: Systemd::Service[excimer-log]: Scheduling refresh of Systemd::Unit[excimer-log]
Info: Systemd::Service[excimer-wall-log]: Scheduling refresh of Service[excimer-wall-log]
Info: Systemd::Service[excimer-wall-log]: Scheduling refresh of Systemd::Unit[excimer-wall-log]
Info: Systemd::Service[excimer-k8s-log]: Scheduling refresh of Service[excimer-k8s-log]
Info: Systemd::Service[excimer-k8s-log]: Scheduling refresh of Systemd::Unit[excimer-k8s-log]
Info: Systemd::Service[excimer-k8s-wall-log]: Scheduling refresh of Service[excimer-k8s-wall-log]
Info: Systemd::Service[excimer-k8s-wall-log]: Scheduling refresh of Systemd::Unit[excimer-k8s-wall-log]
Info: Systemd::Unit[excimer-log]: Scheduling refresh of Exec[systemd daemon-reload for excimer-log.service (excimer-log)]
Info: Systemd::Unit[excimer-wall-log]: Scheduling refresh of Exec[systemd daemon-reload for excimer-wall-log.service (excimer-wall-log)]
Info: Systemd::Unit[excimer-k8s-log]: Scheduling refresh of Exec[systemd daemon-reload for excimer-k8s-log.service (excimer-k8s-log)]
Info: Systemd::Unit[excimer-k8s-wall-log]: Scheduling refresh of Exec[systemd daemon-reload for excimer-k8s-wall-log.service (excimer-k8s-wall-log)]
Notice: /Stage[main]/Arclamp/Arclamp::Profiler[excimer]/Systemd::Service[excimer-log]/Systemd::Unit[excimer-log]/Exec[systemd daemon-reload for excimer-log.service (excimer-log)]: Triggered 'refresh' from 1 event
Notice: /Stage[main]/Arclamp/Arclamp::Profiler[excimer]/Systemd::Service[excimer-log]/Service[excimer-log]: Triggered 'refresh' from 1 event
Notice: /Stage[main]/Arclamp/Arclamp::Profiler[excimer-wall]/Systemd::Service[excimer-wall-log]/Systemd::Unit[excimer-wall-log]/Exec[systemd daemon-reload for excimer-wall-log.service (excimer-wall-log)]: Triggered 'refresh' from 1 event
Notice: /Stage[main]/Arclamp/Arclamp::Profiler[excimer-wall]/Systemd::Service[excimer-wall-log]/Service[excimer-wall-log]: Triggered 'refresh' from 1 event
Notice: /Stage[main]/Arclamp/Arclamp::Profiler[excimer-k8s]/Systemd::Service[excimer-k8s-log]/Systemd::Unit[excimer-k8s-log]/Exec[systemd daemon-reload for excimer-k8s-log.service (excimer-k8s-log)]: Triggered 'refresh' from 1 event
Notice: /Stage[main]/Arclamp/Arclamp::Profiler[excimer-k8s]/Systemd::Service[excimer-k8s-log]/Service[excimer-k8s-log]: Triggered 'refresh' from 1 event
Notice: /Stage[main]/Arclamp/Arclamp::Profiler[excimer-k8s-wall]/Systemd::Service[excimer-k8s-wall-log]/Systemd::Unit[excimer-k8s-wall-log]/Exec[systemd daemon-reload for excimer-k8s-wall-log.service (excimer-k8s-wall-log)]: Triggered 'refresh' from 1 event
Notice: /Stage[main]/Arclamp/Arclamp::Profiler[excimer-k8s-wall]/Systemd::Service[excimer-k8s-wall-log]/Service[excimer-k8s-wall-log]: Triggered 'refresh' from 1 event
Notice: Applied catalog in 21.92 seconds

Bullseye hosts now get Bullseye-specific Python wheels, created on our base bullseye image.
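In practice that means building the dependency wheels inside an environment that matches the target distro. A sketch of the idea (assumed workflow; the image name and paths are placeholders, not the actual release tooling):

import subprocess

# Build wheels in a container matching the target distro so compiled
# artifacts are produced by that distro's Python (3.9 on bullseye).
# Image name and paths are placeholders.
subprocess.check_call([
    "docker", "run", "--rm",
    "--volume", "/srv/scap:/src",
    "example-registry/bullseye-build:latest",
    "pip3", "wheel",
    "--wheel-dir", "/src/wheels/bullseye",
    "--requirement", "/src/requirements.txt",
])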