deployment-((sca|aqs)01|ores-web) puppet failures due to scap3 errors
Closed, Resolved · Public

Authored By

	Krenair
	Apr 9 2016, 7:56 PM

Description

Error: Execution of '/usr/bin/deploy-local --repo ores/deploy -D log_json:False' returned 70: http://deployment-tin.deployment-prep.eqiad.wmflabs/ores/deploy/.git
Error: Execution of '/usr/bin/deploy-local --repo analytics/aqs/deploy -D log_json:False' returned 70

Details

	Subject	Repo	Branch	Lines +/-
	Fix up deployment-prep scap config	analytics/aqs/deploy	master	+4 -4
	scap: A basic workaround for the git clone issue	operations/puppet	production	+2 -0

Related Objects

Status	Assigned	Task
Resolved	EddieGP	T132259 Deployment-prep hosts with puppet errors (tracking)
Resolved	None	T116206 Set up AQS in Beta
Resolved	Ladsgroup	T132267 deployment-((sca|aqs)01|ores-web) puppet failures due to scap3 errors

Event Timeline

Krenair created this task. Apr 9 2016, 7:56 PM

Restricted Application added a subscriber: Aklapper. Apr 9 2016, 7:56 PM

Krenair edited subscribers, added: mmodell; removed: 20after4. Apr 9 2016, 8:28 PM

Krenair moved this task from To Triage to Next: Maintenance on the Beta-Cluster-Infrastructure board. Apr 9 2016, 11:45 PM

This is likely because .git/DEPLOY_HEAD does not exist for these repos on deployment-tin.

aqs deployment failures

It looks like the scap puppet provider is attempting to deploy analytics/aqs/deploy from deployment-tin; however there is no /srv/deployment/analytics/aqs/deploy on deployment-tin.

That said, I believe some recently merged puppet changes should be creating this repo on deployment-tin. This needs more investigation.

ores deployment failures

I'm not clear on why the ores deploy is failing. I've been working on deployment-sca01. One problem is that the puppet provider assumes that the deployed repository on the target box will be under /srv/deployment: https://github.com/wikimedia/operations-puppet/blob/production/modules/scap/lib/puppet/provider/package/scap3.rb#L35

Since ores is deploying to /srv/ores, the deploy may be succeeding when run as the deploy-service user:

deploy-service@deployment-sca01:~$ /usr/bin/deploy-local --repo ores/deploy -D log_json:False
14:55:44 INFO     - Starting new HTTP connection (1): deployment-tin.deployment-prep.eqiad.wmflabs
http://deployment-tin.deployment-prep.eqiad.wmflabs/ores/deploy/.git
From http://deployment-tin.deployment-prep.eqiad.wmflabs/ores/deploy/
 * [new branch]      master     -> origin/master
 * [new branch]      prod       -> origin/prod
 * [new branch]      scap_again -> origin/scap_again
14:55:44 Revision directory already exists (use --force to override)
14:55:44 Starting new HTTP connection (1): deployment-tin.deployment-prep.eqiad.wmflabs
14:55:44 /srv/ores/deploy-cache/revs/19beee5882382ed2ea92492e1db77b94a5bf2751 is already live (use --force to override)
deploy-service@deployment-sca01:~$ echo $?
0
deploy-service@deployment-sca01:~$ ls -l /srv/ores/
total 8
lrwxrwxrwx 1 deploy-service deploy-service   58 Apr  3 03:06 deploy -> deploy-cache/revs/19beee5882382ed2ea92492e1db77b94a5bf2751
-rwxrwxr-x 1 deploy-service deploy-service    0 Apr  1 03:16 deploy.2016-04-01T03:23:00.175110
drwxrwxr-x 4 deploy-service deploy-service 4096 Apr  3 03:07 deploy-cache
drwxrwxr-x 6 deploy-service deploy-service 4096 Apr  3 03:06 venv

But puppet's query to see whether ores/deploy is installed will fail: https://github.com/wikimedia/operations-puppet/blob/production/modules/scap/lib/puppet/provider/package/scap3.rb#L94-L105

deploy-service@deployment-sca01:~$ git -C /srv/ores/deploy tag --points-at HEAD
scap/sync/2016-04-03/0001
deploy-service@deployment-sca01:~$ git -C /srv/deployment/ores/deploy tag --points-at HEAD
fatal: Cannot change to '/srv/deployment/ores/deploy': No such file or directory
deploy-service@deployment-sca01:~$ echo $?
128
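The comparison above can be sketched as a small shell helper. This only illustrates why the lookup fails; it is not the provider's actual Ruby code. `repo_sync_tag` is a hypothetical name, and the two paths are the ones from the transcript.

```shell
# Hypothetical helper: print the tag(s) pointing at HEAD for a deployed repo.
# Returns non-zero when the directory does not exist, which is the case
# for /srv/deployment/ores/deploy on deployment-sca01.
repo_sync_tag() {
    repo_dir="$1"
    if [ ! -d "$repo_dir" ]; then
        echo "not deployed at $repo_dir" >&2
        return 1
    fi
    git -C "$repo_dir" tag --points-at HEAD
}

# Compare the actual deploy location with the one the provider expects.
for dir in /srv/ores/deploy /srv/deployment/ores/deploy; do
    if tags=$(repo_sync_tag "$dir" 2>/dev/null) && [ -n "$tags" ]; then
        echo "$dir: $tags"
    else
        echo "$dir: no sync tag found"
    fi
done
```

On a host where only /srv/ores/deploy exists, the second path reports no sync tag, so the package would look uninstalled to anything that only checks under /srv/deployment.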

Given that puppet fails with an exit code of 70, it seems likely that scap itself is failing (70 is the exception exit code). However, since I'm able to run the command that puppet is nominally running, as the user that is nominally running it, I'm not sure what could be wrong here. @mmodell @dduvall: do you have any thoughts on this one?

The first thing that should likely be done for ORES is to change git_deploy_dir in the scap.cfg to /srv/deployment/, since that's what the puppet provider expects. It's unclear whether that is causing this particular error, but the mismatch will certainly cause an error.
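A quick way to spot this mismatch is to read the setting out of the config and compare it with the expected location. The sketch below is only an illustration: `git_deploy_dir_of` is a hypothetical helper, and the `[global]` section name and the sample value written to the throwaway file are assumptions, not the real ORES config.

```shell
# Hypothetical helper: print the first git_deploy_dir value found in a scap.cfg.
# Accepts both "key: value" and "key = value" forms.
git_deploy_dir_of() {
    sed -n 's/^[[:space:]]*git_deploy_dir[[:space:]]*[:=][[:space:]]*//p' "$1" | head -n 1
}

# Self-contained demo with a throwaway file. The section name and the value
# are sample data chosen for illustration only.
cfg=$(mktemp)
printf '[global]\ngit_deploy_dir: /srv\n' > "$cfg"

expected=/srv/deployment
actual=$(git_deploy_dir_of "$cfg")
if [ "$actual" != "$expected" ]; then
    echo "git_deploy_dir is $actual, but the puppet provider expects $expected"
fi
rm -f "$cfg"
```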

@thcipriani I was dealing with this issue while working on using scap, and I thought I had solved it. Here's what I've got so far: exit code 70 means an unhandled error. Running the command as the deploy-service user won't reveal the problem: it works that way, but not under puppet (even though puppet actually runs it as the deploy-service user). With some modifications to scap itself (the ones I told you about) I was able to get more signal from puppet runs. It's a known issue with git clone, and the failure happens when scap tries to do the git clone. It happens because puppet (whether driven by the puppet master or run by hand with puppet agent) can't change directory, so if you run the puppet agent from another directory, say "/srv" or even simply "/", it should work. I just did that and it worked like a charm. Long story short, you should run puppet from somewhere other than the user's home directory:

ladsgroup@deployment-sca01:/srv$ sudo puppet agent -tv
Info: Retrieving pluginfacts
Info: Retrieving plugin
Info: Loading facts
Info: Caching catalog for deployment-sca01.deployment-prep.eqiad.wmflabs
Info: Applying configuration version '1460479826'
Error: Could not set home on user[citoid]: Execution of '/usr/sbin/usermod -d /nonexistent citoid' returned 8: usermod: user citoid is currently used by process 455
Error: /Stage[main]/Citoid/Service::Node[citoid]/User[citoid]/home: change from /home/citoid to /nonexistent failed: Could not set home on user[citoid]: Execution of '/usr/sbin/usermod -d /nonexistent citoid' returned 8: usermod: user citoid is currently used by process 455
Notice: /Stage[main]/Citoid/Service::Node[citoid]/File[/var/log/citoid]: Dependency User[citoid] has failures: true
Warning: /Stage[main]/Citoid/Service::Node[citoid]/File[/var/log/citoid]: Skipping because of failed dependencies
Error: Could not set home on user[graphoid]: Execution of '/usr/sbin/usermod -d /nonexistent graphoid' returned 8: usermod: user graphoid is currently used by process 457
Error: /Stage[main]/Graphoid/Service::Node[graphoid]/User[graphoid]/home: change from /home/graphoid to /nonexistent failed: Could not set home on user[graphoid]: Execution of '/usr/sbin/usermod -d /nonexistent graphoid' returned 8: usermod: user graphoid is currently used by process 457
Notice: /Stage[main]/Graphoid/Service::Node[graphoid]/File[/var/log/graphoid]: Dependency User[graphoid] has failures: true
Warning: /Stage[main]/Graphoid/Service::Node[graphoid]/File[/var/log/graphoid]: Skipping because of failed dependencies
Notice: /Stage[main]/Citoid/Service::Node[citoid]/Base::Service_unit[citoid]/Service[citoid]: Dependency User[citoid] has failures: true
Warning: /Stage[main]/Citoid/Service::Node[citoid]/Base::Service_unit[citoid]/Service[citoid]: Skipping because of failed dependencies
Notice: /Stage[main]/Graphoid/Service::Node[graphoid]/Base::Service_unit[graphoid]/Service[graphoid]: Dependency User[graphoid] has failures: true
Warning: /Stage[main]/Graphoid/Service::Node[graphoid]/Base::Service_unit[graphoid]/Service[graphoid]: Skipping because of failed dependencies
Notice: /Stage[main]/Ores::Scapdeploy/Scap::Target[ores/deploy]/Package[ores/deploy]/ensure: created
Notice: Finished catalog run in 65.30 seconds
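The same idea, starting the command from a directory every user can enter, can be sketched as a tiny wrapper. This is a minimal illustration of the manual workaround described above, not the content of any puppet-side fix; `run_from_root` is a hypothetical name.

```shell
# Hypothetical wrapper: run a command from "/" instead of the caller's
# current directory. The subshell keeps the caller's own directory unchanged.
run_from_root() {
    ( cd / && "$@" )
}

# Example: the wrapped command sees "/" as its working directory.
run_from_root pwd

# On the host this would be used as (not run here):
#   run_from_root sudo puppet agent -tv
```

Because the directory change happens inside a subshell, the caller's shell stays wherever it was.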

I hope this is helpful.

I'll be around on IRC if you want to discuss this or ask questions.

Ladsgroup added a project: Machine-Learning-Team (Active Tasks). Apr 12 2016, 5:02 PM

Krenair renamed this task from deployment-((sca|aqs)01|ores-web) fails due to scap3 errors to Puppet on deployment-((sca|aqs)01|ores-web) fails due to scap3 errors. Apr 12 2016, 5:21 PM

Change 282992 had a related patch set uploaded (by Ladsgroup):
scap: A basic workaround for the git clone issue

https://gerrit.wikimedia.org/r/282992

gerritbot added a project: Patch-For-Review. Apr 12 2016, 6:58 PM

I cherry-picked this patch onto the beta puppetmaster, so the issue on ORES should be resolved now.

Change 282992 merged by Alexandros Kosiaris:
scap: A basic workaround for the git clone issue

https://gerrit.wikimedia.org/r/282992

Ladsgroup moved this task from Parked to Backlog on the Machine-Learning-Team (Active Tasks) board. Apr 13 2016, 5:11 PM

Ladsgroup moved this task from Backlog to Completed on the Machine-Learning-Team (Active Tasks) board. Apr 13 2016, 7:17 PM

Yep, now AQS

Krenair renamed this task from Puppet on deployment-((sca|aqs)01|ores-web) fails due to scap3 errors to deployment-((sca|aqs)01|ores-web) puppet failures due to scap3 errors. Apr 22 2016, 4:41 AM

Ladsgroup closed this task as Resolved. Apr 26 2016, 3:05 PM

Ladsgroup claimed this task.

Ladsgroup reopened this task as Open. Apr 26 2016, 3:07 PM

Ladsgroup removed a project: Machine-Learning-Team (Active Tasks).

Puppet status:

status	host	detail
	deployment-sca01	Success
	deployment-aqs01	deployment repo missing from deployment-tin: `/srv/deployment/analytics/aqs/deploy` does not exist.

I tried setting up the AQS deploy repository to fix aqs01 but it's missing .git/DEPLOY_HEAD?

In T132267#2241296, @Krenair wrote:

I tried setting up the AQS deploy repository to fix aqs01 but it's missing .git/DEPLOY_HEAD?

That file is created by deploy --init (or by actually running a full deploy). It's still a manual step at the moment.
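Whether that manual step has been done for a repo can be checked directly. The sketch below is a minimal illustration; `has_deploy_head` is a hypothetical name, and the path is the AQS repo location mentioned earlier in this task.

```shell
# Hypothetical helper: succeed only if the repo has a .git/DEPLOY_HEAD file.
has_deploy_head() {
    [ -f "$1/.git/DEPLOY_HEAD" ]
}

# Path taken from the puppet status detail above.
repo=/srv/deployment/analytics/aqs/deploy
if has_deploy_head "$repo"; then
    echo "$repo: DEPLOY_HEAD present"
else
    echo "$repo: DEPLOY_HEAD missing; run deploy --init (or a full deploy) in the repo"
fi
```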

Change 285535 had a related patch set uploaded (by Alex Monk):
Fix up deployment-prep scap config

https://gerrit.wikimedia.org/r/285535

gerritbot added a project: Patch-For-Review. Apr 26 2016, 10:17 PM

AQS is now fine.

Change 285535 merged by Joal:
Fix up deployment-prep scap config