Page MenuHomePhabricator

Graphoid deploy on beta cluster is failing / salt problems in Beta
Closed, ResolvedPublic

Description

I tried to sync graphoid from deployment-bastion.deployment-prep.eqiad.wmflabs:/srv/deployment/graphoid/deploy via git deploy sync, and it keep showing me 0/2. Detailed report showed that two servers were deployment-sca01 and deployment-sca02, whereas I expected scb.

deployment-bastion:/srv/deployment/graphoid/deploy$ git deploy report sync --detailed
Repo: graphoid/deploy
Tag: graphoid/deploy-sync-20160128-102408

0/2 minions completed fetch; 0/2 minions completed checkout

Details:

deployment-sca01.deployment-prep.eqiad.wmflabs: 
	fetch status: 0 [started: 236 mins ago, last-return: 919 mins ago]
	checkout status: 0 [started: 747 mins ago, last-return: 918 mins ago]
deployment-sca02.deployment-prep.eqiad.wmflabs: 
	fetch status: 0 [started: 236 mins ago, last-return: 919 mins ago]
	checkout status: 0 [started: 747 mins ago, last-return: 918 mins ago]
$ redis-cli SMEMBERS "deploy:graphoid/deploy:minions"
1) "deployment-sca01.deployment-prep.eqiad.wmflabs"
2) "deployment-sca02.deployment-prep.eqiad.wmflabs"

Event Timeline

Yurik assigned this task to mobrovac.
Yurik raised the priority of this task from to Needs Triage.
Yurik updated the task description. (Show Details)
Yurik added subscribers: Yurik, GWicke.
hashar set Security to None.

I have updated the task details to show the git deploy status and the list of minions registered in Trebuchet redis server.

The Trebuchet target are registered via the puppet class role::graphoid which is applied on deployment-sca01 and deployment-sca02. That eventually includes service::node which uses package { provider => 'trebuchet' }.

There is no scb instance.


Unrelated, but puppet hiera has restbase::graphoid_uri solely pointing to sca01:

In hieradata/labs/deployment-prep/common.yaml at line 91
    restbase::graphoid_uri: http://deployment-sca01.deployment-prep.eqiad.wmflabs:19000

And the saltstack grain deployment_target seems to be properly setup:

root@deployment-salt:~ # salt 'deployment-sca0*' grains.get 'deployment_target'
deployment-sca02.deployment-prep.eqiad.wmflabs:
    - citoid/deploy
    - graphoid/deploy
deployment-sca01.deployment-prep.eqiad.wmflabs:
    - citoid/deploy
    - graphoid/deploy

On deployment-sca01 in /var/log/salt/minion:

2016-01-28 15:09:45,290 [salt.minion      ][ERROR   ] The return failed for job 20160128150944662238 global name '__pillar__' is not defined
2016-01-28 15:09:45,291 [salt.minion      ][ERROR   ] Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/salt/minion.py", line 1089, in _thread_return
    )](ret)
  File "/var/cache/salt/minion/extmods/returners/deploy_redis.py", line 53, in returner
    serv = _get_serv()
  File "/var/cache/salt/minion/extmods/returners/deploy_redis.py", line 28, in _get_serv
    deployment_config = __pillar__.get('deployment_config')
NameError: global name '__pillar__' is not defined

Sounds bad :-D

On sca01:

# ls -l /var/cache/salt/minion/extmods/returners/deploy_redis.py*
-rw-r--r-- 1 root root 2947 Jan 28 01:43 /var/cache/salt/minion/extmods/returners/deploy_redis.py
-rw-r--r-- 1 root root 3166 Jan 28 01:51 /var/cache/salt/minion/extmods/returners/deploy_redis.pyc

I have pruned the cache from deployment-salt :

salt 'deployment-sca0*' cmd.run 'rm -fR /var/cache/salt/minion/*'
salt 'deployment-sca0*' cmd.run '/etc/init.d/salt-minion restart'
salt 'deployment-sca0*' saltutil.sync_all

I tried deploying again and the minion log on sca01 reports the same trace NameError: global name '__pillar__' is not defined.

This seems to be an issue related to reporting, because the code is actually deployed on those servers.

mobrovac renamed this task from Graphoid deploy on beta cluster is failing to Graphoid deploy on beta cluster is failing / salt problems in Beta.Jan 28 2016, 6:44 PM
mobrovac removed mobrovac as the assignee of this task.
mobrovac triaged this task as Medium priority.
mobrovac added a subscriber: thcipriani.

I first tried a fetch of the repo on deployment-sca01, which seemed just fine:

thcipriani@deployment-sca01:~$ sudo tail /var/log/salt/minion                                                                                                                         
^C
thcipriani@deployment-sca01:~$ sudo salt-call deploy.fetch 'graphoid/deploy'                                                                                                          
[INFO    ] Executing command '/usr/bin/git fetch' in directory '/srv/deployment/graphoid/deploy'
[INFO    ] Executing command '/usr/bin/git fetch --tags' in directory '/srv/deployment/graphoid/deploy'
[INFO    ] Executing command '/usr/bin/git submodule status --quiet' in directory '/srv/deployment/graphoid/deploy'
[INFO    ] Executing command '/usr/bin/git checkout .gitmodules' in directory '/srv/deployment/graphoid/deploy'
[INFO    ] Executing command '/usr/bin/git config remote.origin.url' in directory '/srv/deployment/graphoid/deploy'
[INFO    ] Executing command '/usr/bin/git submodule sync' in directory '/srv/deployment/graphoid/deploy'
[INFO    ] Executing command '/usr/bin/git submodule foreach --recursive git fetch' in directory '/srv/deployment/graphoid/deploy'
[INFO    ] Executing command '/usr/bin/git submodule foreach --recursive git fetch --tags' in directory '/srv/deployment/graphoid/deploy'
[INFO    ] Executing command '/usr/bin/git show-ref refs/tags/graphoid/deploy-sync-20160128-203831' in directory '/srv/deployment/graphoid/deploy'
local:
    ----------
    dependencies:
    repo:
        graphoid/deploy
    status:
        0
    tag:
        graphoid/deploy-sync-20160128-203831

So the actual deploy.fetch command is working.

I thought the culprit must then be the redis returner.

I can get to the redis instance on deployment-prep:

thcipriani@deployment-sca01:~$ nc -vz deployment-bastion.deployment-prep.eqiad.wmflabs -w1 6379
deployment-bastion.deployment-prep.eqiad.wmflabs [10.68.16.58] 6379 (?) open

So I thought, maybe the deployment is just getting recorded wrong, so I went through a "fetch" for graphoid

thcipriani@deployment-bastion:/srv/deployment/graphoid/deploy$ git deploy start
Deployment started.
thcipriani@deployment-bastion:/srv/deployment/graphoid/deploy$ git deploy sync
Repo: graphoid/deploy
Tag: graphoid/deploy-sync-20160128-203831

0/2 minions completed fetch
Continue? ([d]etailed/[C]oncise report,[y]es,[n]o,[r]etry): n
Not continuing to checkout phase. A deployment is still underway, please finish, sync, or abort.
thcipriani@deployment-bastion:/srv/deployment/graphoid/deploy$ git deploy abort
Deployment aborted.
thcipriani@deployment-bastion:/srv/deployment/graphoid/deploy$ redis-cli hget deploy:graphoid/deploy:minions:deployment-sca01.deployment-prep.eqiad.wmflabs 'fetch_checkin_timestamp' 
"1454013516.869637"
thcipriani@deployment-bastion:/srv/deployment/graphoid/deploy$ date -d@$(redis-cli hget deploy:graphoid/deploy:minions:deployment-sca01.deployment-prep.eqiad.wmflabs 'fetch_checkin_timestamp')                                                                                                                                                                          
Thu Jan 28 20:38:36 UTC 2016
thcipriani@deployment-bastion:/srv/deployment/graphoid/deploy$ date
Thu Jan 28 20:41:16 UTC 2016

So what I've found:

  1. The fetch seems to actually work
  2. The recording of the timestamp of the fetch to the redis returner seems to work
  3. The output back to the git-deploy client does not work.

Still digging...

So it seems that the redis fetch_tag is not being updated, even thought the fetch_checkin_timestamp gets updated on deploy, which is why the reporter is showing 0/2:

thcipriani@deployment-bastion:/srv/deployment/graphoid/deploy$ jq '.' < .git/deploy/deploy                                                                                            
{
  "time": "20160128-203831",
  "user": "Marko Obrovac",
  "tag": "graphoid/deploy-sync-20160128-203831",
  "sync-time": "20160128-203831"
}
thcipriani@deployment-bastion:/srv/deployment/graphoid/deploy$ redis-cli hget deploy:graphoid/deploy:minions:deployment-sca01.deployment-prep.eqiad.wmflabs 'fetch_tag'
"graphoid/deploy-sync-20160126-230523"
thcipriani claimed this task.

Got it working through some combination of:

  1. sudo salt '*' saltutil.refresh_pillar
  2. sudo salt '*' saltutil.sync_all
  3. Reinstalling salt-minion on the two hosts
thcipriani@deployment-bastion:/srv/deployment/graphoid/deploy$ git deploy sync
Repo: graphoid/deploy
Tag: graphoid/deploy-sync-20160129-000958

2/2 minions completed fetch
Continue? ([d]etailed/[C]oncise report,[y]es,[n]o,[r]etry): n
Not continuing to checkout phase. A deployment is still underway, please finish, sync, or abort.
thcipriani@deployment-bastion:/srv/deployment/graphoid/deploy$ git deploy abort
Deployment aborted.

I guess salt got upgraded / not restarted / corrupted somehow :-/ Thank you @thcipriani , I was myself out of ideas.