
Scap issues with stat hosts
Closed, Resolved · Public

Description

stat[1004-1008] and an-test-client1001 all have failing puppet runs due to a scap issue. Below is the output of a puppet run on stat1006:

$ sudo puppet agent -t                                                                        
Info: Using configured environment 'production'
Info: Retrieving pluginfacts
Info: Retrieving plugin
Info: Retrieving locales
Info: Loading facts
Info: Caching catalog for stat1006.eqiad.wmnet
Info: Unable to serialize catalog to json, retrying with pson
Info: Applying configuration version '(0a95f7c2e5) Moritz Mühlenhoff - Deal with variant/custom mismatches in more places'
Error: Execution of '/usr/bin/scap deploy-local --repo analytics/hdfs-tools/deploy -D log_json:False' returned 70: 
Error: /Stage[main]/Profile::Analytics::Hdfs_tools/Scap::Target[analytics/hdfs-tools/deploy]/Package[analytics/hdfs-tools/deploy]/ensure: change from 'absent' to 'present' failed: Execution of '/usr/bin/scap deploy-local --repo analytics/hdfs-tools/deploy -D log_json:False' returned 70:  (corrective)
Notice: /Stage[main]/Profile::Analytics::Refinery::Repository/Scap::Target[analytics/refinery]/Package[analytics/refinery]/ensure: created (corrective)
Error: Execution of '/usr/bin/scap deploy-local --repo wikimedia/discovery/analytics -D log_json:False' returned 70: 
Error: /Stage[main]/Profile::Analytics::Cluster::Elasticsearch/Scap::Target[wikimedia/discovery/analytics]/Package[wikimedia/discovery/analytics]/ensure: change from 'absent' to 'present' failed: Execution of '/usr/bin/scap deploy-local --repo wikimedia/discovery/analytics -D log_json:False' returned 70:  (corrective)
Notice: /Stage[main]/Profile::Analytics::Hdfs_tools/File[/usr/local/bin/hdfs-rsync]: Dependency Package[analytics/hdfs-tools/deploy] has failures: true
Warning: /Stage[main]/Profile::Analytics::Hdfs_tools/File[/usr/local/bin/hdfs-rsync]: Skipping because of failed dependencies
Info: Stage[main]: Unscheduling all events on Stage[main]

Notice: Applied catalog in 58.46 seconds

Running the scap command manually, I get the following results:

$ sudo -u analytics-deploy /usr/bin/scap deploy-local --repo analytics/hdfs-tools/deploy -D log_json:False
12:26:45 Fetch from: http://deploy1001.eqiad.wmnet/analytics/hdfs-tools/deploy/.git
12:26:45 Unhandled error:
Traceback (most recent call last):                                                                               
  File "/var/lib/scap/scap/lib/python3.7/site-packages/scap/cli.py", line 532, in run
    exit_status = app.main(app.extra_arguments)
  File "/var/lib/scap/scap/lib/python3.7/site-packages/scap/deploy.py", line 161, in main
    getattr(self, stage)()
  File "/var/lib/scap/scap/lib/python3.7/site-packages/scap/deploy.py", line 314, in fetch
    git.fetch(self.context.cache_dir, git_remote)
  File "/var/lib/scap/scap/lib/python3.7/site-packages/scap/git.py", line 348, in fetch
    gitcmd("fetch", *cmd, cwd=location)
  File "/var/lib/scap/scap/lib/python3.7/site-packages/scap/runcmd.py", line 91, in gitcmd
    return _runcmd(["git", subcommand] + list(args), **kwargs)
  File "/var/lib/scap/scap/lib/python3.7/site-packages/scap/runcmd.py", line 78, in _runcmd
    raise FailedCommand(argv, p.returncode, stdout, stderr)
scap.runcmd.FailedCommand: Command 'git fetch --tags --jobs 38 --no-recurse-submodules' failed with exit code 128;
stdout:

stderr:
fatal: unable to access 'http://deploy1001.eqiad.wmnet/analytics/hdfs-tools/deploy/.git/': The requested URL returned error: 503

12:26:45 deploy-local failed: <FailedCommand> Command 'git fetch --tags --jobs 38 --no-recurse-submodules' failed with exit code 128;
stdout:

stderr:
fatal: unable to access 'http://deploy1001.eqiad.wmnet/analytics/hdfs-tools/deploy/.git/': The requested URL returned error: 503
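The 503 points at the fetch URL on the deployment server rather than at the target itself. One way to double-check from the target host (an illustrative diagnostic, not part of the original report) would be:

$ curl -s -o /dev/null -w '%{http_code}\n' 'http://deploy1001.eqiad.wmnet/analytics/hdfs-tools/deploy/.git/info/refs?service=git-upload-pack'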

Event Timeline

Updated the .config file in /srv/deployment/analytics/hdfs-tools/deploy-cache to set git_server to deploy1002.eqiad.wmnet instead of deploy1001.eqiad.wmnet.
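For example, something along these lines (the exact layout of the generated .config file is an assumption here; the point is only to switch the git_server value):

$ sudo -u analytics-deploy sed -i 's/deploy1001.eqiad.wmnet/deploy1002.eqiad.wmnet/' /srv/deployment/analytics/hdfs-tools/deploy-cache/.config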

This makes the scap command run fine (sudo -u analytics-deploy /usr/bin/scap deploy-local --repo analytics/hdfs-tools/deploy -D log_json:False).

But the puppet run still tries to reapply the scap target for those repositories, even though the state should now be fine.

Here is the debug puppet log output for one of those scap targets:

Debug: Executing: '/usr/bin/git -C /srv/deployment/analytics/hdfs-tools/deploy tag --points-at HEAD'
Debug: scap pkg [analytics/hdfs-tools/deploy] root=/srv/deployment/analytics/hdfs-tools, user=analytics-deploy
Debug: Executing with uid=494: '/usr/bin/scap deploy-local --repo analytics/hdfs-tools/deploy -D log_json:False'
Notice: /Stage[main]/Profile::Analytics::Hdfs_tools/Scap::Target[analytics/hdfs-tools/deploy]/Package[analytics/hdfs-tools/deploy]/ensure: created (corrective)
Debug: /Package[analytics/hdfs-tools/deploy]: The container Scap::Target[analytics/hdfs-tools/deploy] will propagate my refresh event

It seems that the /usr/bin/git -C /srv/deployment/analytics/hdfs-tools/deploy tag --points-at HEAD command, used to identify whether the target is absent or present, doesn't return the expected value.
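The provider presumably compares that command's output against the expected sync tag to decide between 'present' and 'absent'; an error or empty output is read as 'absent', which is why puppet keeps trying to redeploy. A minimal sketch of the two cases (the real check lives in the scap3.rb puppet provider):

$ sudo -u analytics-deploy /usr/bin/git -C /srv/deployment/analytics/hdfs-tools/deploy tag --points-at HEAD   # returns the sync tag -> 'present'
$ sudo /usr/bin/git -C /srv/deployment/analytics/hdfs-tools/deploy tag --points-at HEAD                       # fails when run as root -> read as 'absent'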

Running the /usr/bin/git -C /srv/deployment/analytics/hdfs-tools/deploy tag --points-at HEAD manually as the user owning the folder (analytics-deploy) works:
scap/sync/2020-02-28/0001

While running it as root fails:

nfraison@stat1008:/srv/deployment/analytics/hdfs-tools/deploy$ sudo /usr/bin/git -C /srv/deployment/analytics/hdfs-tools/deploy tag --points-at HEAD
fatal: detected dubious ownership in repository at '/srv/deployment/analytics/hdfs-tools/deploy-cache/revs/0c6e3ca61c094338d821ae7c73e244f1abb5b8bc'
To add an exception for this directory, call:

	git config --global --add safe.directory /srv/deployment/analytics/hdfs-tools/deploy-cache/revs/0c6e3ca61c094338d821ae7c73e244f1abb5b8bc

From the puppet log it seems that the first command is run as root and the second one as the owning user.
@hashar is that expected?

This is 100% T325128, which is caused by a recent git security update. I am guessing git has been upgraded on the server, which triggers the issue. I have a patch for the deployment server to set safe.directory='*' in the global git config: https://gerrit.wikimedia.org/r/c/operations/puppet/+/868002/1/modules/scap/manifests/master.pp . I guess we will need something similar for the targets.
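For reference, the manual equivalent of that workaround on a target would be one of the following standard git config invocations (run via sudo so the exception lands in root's global git config, since the failing query runs as root):

$ sudo git config --global --add safe.directory '*'
$ sudo git config --global --add safe.directory /srv/deployment/analytics/hdfs-tools/deploy-cache/revs/0c6e3ca61c094338d821ae7c73e244f1abb5b8bc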

Or we could ensure that the first call to get state is also run as the user owning the folder?

The git security update for safe.directory is intended exactly for that use case: a deployer could inject some hook into the git repository (as the deployment user), and then, when Puppet runs git as root on the repo, it might execute some config or hook as root, resulting in a privilege escalation.

I don't think scap runs anything with root privileges; it should do all operations as the deployment user. As for Puppet, those git commands should certainly not be run as root.

Agreed, the best way to fix that for such a deployment target is to run all git commands as the user owning the folder. This fixes the underlying issue and avoids spurious safe.directory overrides.
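In practice that means the provider's state query would effectively become (illustrative sketch, not the actual patch):

$ sudo -u analytics-deploy /usr/bin/git -C /srv/deployment/analytics/hdfs-tools/deploy tag --points-at HEAD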

Change 891555 had a related patch set uploaded (by Nicolas Fraison; author: Nicolas Fraison):

[operations/puppet@production] provider_scap3: update the query to execute as the deploy_user

https://gerrit.wikimedia.org/r/891555

Change 891557 had a related patch set uploaded (by Jbond; author: jbond):

[operations/puppet@production] scap - provider: update scap provider to run git with correct user

https://gerrit.wikimedia.org/r/891557

In theory this will make puppet run the query with the correct user, but it needs some testing. Is there some host I can manually test on?

Seems we mostly ended up with the same patch, @jbond: https://gerrit.wikimedia.org/r/891555 / https://gerrit.wikimedia.org/r/891557

I would be interested in reading/seeing how you do a manual test of it. We could use an-test-client1001 for this test.

Change 891557 abandoned by Jbond:

[operations/puppet@production] scap - provider: update scap provider to run git with correct user

Reason:

Abandoned in favour of 891555, which also has the tests.

https://gerrit.wikimedia.org/r/891557

Great minds ;), I have abandoned mine in favour of yours. As to testing, I would (consolidated commands after this list):

  • disable puppet
  • manually copy the new scap3.rb file to an-test-client1001:/var/lib/puppet/lib/puppet/provider/package/scap3.rb
  • make the above file immutable with chattr +i /var/lib/puppet/lib/puppet/provider/package/scap3.rb (this prevents pluginsync from copying back the current file before execution)
  • enable and run puppet; you will get an error that puppet doesn't have permission to overwrite scap3.rb, but it can be ignored for the purpose of the test
  • once testing has completed, remove the immutable flag with chattr -i /var/lib/puppet/lib/puppet/provider/package/scap3.rb and re-run puppet to restore the old file
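Consolidated, the test would look roughly like this (the source path for the new provider file is hypothetical; the permission error on the re-enabled run is expected, as noted above):

$ sudo puppet agent --disable "testing scap3.rb provider change"
$ sudo cp /path/to/new/scap3.rb /var/lib/puppet/lib/puppet/provider/package/scap3.rb   # hypothetical source path
$ sudo chattr +i /var/lib/puppet/lib/puppet/provider/package/scap3.rb
$ sudo puppet agent --enable && sudo puppet agent -t
$ sudo chattr -i /var/lib/puppet/lib/puppet/provider/package/scap3.rb
$ sudo puppet agent -t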

Seems to work fine.
Before, when trying to redeploy analytics/hdfs-tools/deploy:

Info: Unable to serialize catalog to json, retrying with pson
Info: Applying configuration version '(cb9d9be2dc) Muehlenhoff - Switch puppetdb to profile::java'
Error: Execution of '/usr/bin/scap deploy-local --repo analytics/hdfs-tools/deploy -D log_json:False' returned 70: 
Error: /Stage[main]/Profile::Analytics::Hdfs_tools/Scap::Target[analytics/hdfs-tools/deploy]/Package[analytics/hdfs-tools/deploy]/ensure: change from 'absent' to 'present' failed: Execution of '/usr/bin/scap deploy-local --repo analytics/hdfs-tools/deploy -D log_json:False' returned 70:  (corrective)
Notice: /Stage[main]/Profile::Analytics::Refinery::Repository/Scap::Target[analytics/refinery]/Package[analytics/refinery]/ensure: created (corrective)
Notice: /Stage[main]/Profile::Analytics::Hdfs_tools/File[/usr/local/bin/hdfs-rsync]: Dependency Package[analytics/hdfs-tools/deploy] has failures: true
Warning: /Stage[main]/Profile::Analytics::Hdfs_tools/File[/usr/local/bin/hdfs-rsync]: Skipping because of failed dependencies
Notice: /Stage[main]/Profile::Airflow/Airflow::Instance[analytics_test]/Scap::Target[airflow-dags/analytics_test]/Package[airflow-dags/analytics_test]/ensure: created (corrective)
Info: Stage[main]: Unscheduling all events on Stage[main]
Notice: Applied catalog in 60.36 seconds

With the scap3 provider updated:

Info: Retrieving locales
Info: Loading facts
Info: Caching catalog for an-test-client1001.eqiad.wmnet
Info: Unable to serialize catalog to json, retrying with pson
Info: Applying configuration version '(cb9d9be2dc) Muehlenhoff - Switch puppetdb to profile::java'
Notice: Applied catalog in 55.28 seconds

Change 891555 merged by Nicolas Fraison:

[operations/puppet@production] provider_scap3: update the query to execute as the deploy_user

https://gerrit.wikimedia.org/r/891555