
find a way to systematically update the deployment server name across all repos
Open, High, Public

Description

After our recent migration of our deployment server from tin to deploy1001 (T175288) there were reports from several users
about being blocked from deploying because files in their local repos still referred to the old deployment server name.

There were different categories of this issue. Some were .config files in the "deployment-cache" directory which contained the string "tin.eqiad.wmnet". As the name implies, these are cached files. One way to fix the issue was to manually edit the file and replace the host name. Another was apparently to just delete the file and have it recreated by scap and/or by running scap with --refresh-config.

Separate from this, there was another category where .config files were not in the deployment-cache directory and still contained the old host name. It has been reported that this happened after a fresh OS install.
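To get a picture of how many hosts and repos are affected by either category, something like the following could list every .config file that still mentions the old name. This is only a sketch in standalone Python 3 and not part of scap; the function name `find_stale_configs` is made up for illustration, and the `/srv/deployment` root is assumed from the paths quoted in this task.

```python
import os

OLD_HOST = "tin.eqiad.wmnet"  # old deployment server name from this task


def find_stale_configs(root, old_host=OLD_HOST):
    """Return sorted paths of files named .config under root that mention old_host."""
    stale = []
    for dirpath, _dirnames, filenames in os.walk(root):
        if ".config" not in filenames:
            continue
        path = os.path.join(dirpath, ".config")
        # errors="replace" keeps the scan going if a file is not valid UTF-8.
        with open(path, encoding="utf-8", errors="replace") as handle:
            if old_host in handle.read():
                stale.append(path)
    return sorted(stale)


if __name__ == "__main__":
    # Assumed root; adjust to wherever the repos live on the target host.
    for path in find_stale_configs("/srv/deployment"):
        print(path)
```

It only reports files and changes nothing, so it should be safe to run before deciding whether to edit or delete the cached copies.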

Also there were comments about a fix inside scap that is needed for this but still needs to be deployed.

This ticket is for all of that, and for finding a clean way to handle this the next time we have to switch from, say, deploy1001 to deploy1002.

Event Timeline

Dzahn updated the task description.
Joe triaged this task as High priority. Jun 18 2018, 8:41 AM
Vvjjkkii renamed this task from find a way to systematically update the deployment server name across all repos to 1taaaaaaaa. Jul 1 2018, 1:03 AM
Vvjjkkii updated the task description.
Vvjjkkii removed a subscriber: Aklapper.
thcipriani added a subscriber: thcipriani.

There are a couple of different issues here.

The tin and deployment-cache issues that came up after the initial move to deploy1001 are fixed by the 3.8.2-1 scap release. These issues were due to repositories with submodules issuing a fetch (recursively) before remapping submodules to look at the deployment server (see T196663#4265139 for details). That is fixed.

The second issue is that some repositories override deployment_server; that is tracked in T162814. The only outstanding patch I'm aware of is for 3d2png (https://gerrit.wikimedia.org/r/#/c/3d2png/deploy/+/441234/).

Mentioned in SAL (#wikimedia-operations) [2019-08-27T11:51:31Z] <mutante> miscweb1001 - manually remove tin.eqiad.wmnet (!) from /srv/iegreview/iegreview-cache/.config and replace with deploy1001 after first puppet run. still existing bug that tin is not fully removed (T224247, T175288, T197470)

This issue recurred when we moved to deploy1002, for a bunch of services that use deploy-local for initial deployments via puppet. There's an open change relating to it: https://gerrit.wikimedia.org/r/c/operations/puppet/+/670784

Just ran into this today on an install of new snapshot1011,12,13: got the dreaded

Apr 28 11:47:52 snapshot1011 puppet-agent[8409]: Execution of '/usr/bin/scap deploy-local --repo dumps/dumps -D log_json:False' returned 70: #007
Apr 28 11:47:52 snapshot1011 puppet-agent[8409]: (/Stage[main]/Profile::Dumps::Generation::Worker::Common/Scap::Target[dumps/dumps]/Package[dumps/dumps]/ensure) change from 'absent' to 'present' failed: Execution of '/usr/bin/scap deploy-local --repo dumps/dumps -D log_json:False' returned 70: #007

for the dumps repo. I have manually edited the dumps-cache/config files on those hosts and left the DEPLOY_HEAD file in the dumps repo on deploy1002 untouched, so that any proposed solution can be tested there. I have two more hosts yet to roll out, so we can definitely check what works.

I also see the above with --refresh-config added:

root@snapshot1011:~# sudo -u dumpsgen /usr/bin/scap deploy-local --refresh-config --repo dumps/dumps -D log_json:False
13:20:24 Fetch from: http://deploy1001.eqiad.wmnet/dumps/dumps/.git
13:20:24 Unhandled error:
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/scap/cli.py", line 348, in run
    exit_status = app.main(app.extra_arguments)
  File "/usr/lib/python2.7/dist-packages/scap/deploy.py", line 156, in main
    getattr(self, stage)()
  File "/usr/lib/python2.7/dist-packages/scap/deploy.py", line 300, in fetch
    git.fetch(self.context.cache_dir, git_remote)
  File "/usr/lib/python2.7/dist-packages/scap/git.py", line 371, in fetch
    gitcmd("clone", *cmd)
  File "/usr/lib/python2.7/dist-packages/scap/runcmd.py", line 82, in gitcmd
    return _runcmd(["git", subcommand] + list(args), **kwargs)
  File "/usr/lib/python2.7/dist-packages/scap/runcmd.py", line 72, in _runcmd
    raise FailedCommand(" ".join(argv), p.returncode, stdout, stderr)
FailedCommand: Command 'git clone --jobs 70 http://deploy1001.eqiad.wmnet/dumps/dumps/.git /srv/deployment/dumps/dumps-cache/cache' failed with exit code 128; stderr:
Cloning into '/srv/deployment/dumps/dumps-cache/cache'...
fatal: unable to access 'http://deploy1001.eqiad.wmnet/dumps/dumps/.git/': Could not resolve host: deploy1001.eqiad.wmnet

13:20:24 deploy-local failed: <FailedCommand> Command 'git clone --jobs 70 http://deploy1001.eqiad.wmnet/dumps/dumps/.git /srv/deployment/dumps/dumps-cache/cache' failed with exit code 128; stderr:
Cloning into '/srv/deployment/dumps/dumps-cache/cache'...
fatal: unable to access 'http://deploy1001.eqiad.wmnet/dumps/dumps/.git/': Could not resolve host: deploy1001.eqiad.wmnet

Interestingly, after editing the cached copy of the repo's .config file, running with --refresh-config actually pulls the config from the deploy server again, restoring the old hostname.

Hit this today when setting up new thumbor servers. What I don't really understand is where it's getting deploy1001 these days:

Notice: /Stage[main]/Scap/Package[scap]/ensure: created
Notice: /Stage[main]/Scap/File[/etc/scap.cfg]/ensure: defined content as '{md5}dea3d6b74e34fba71d5e575c7999c87a'
Notice: /Stage[main]/Scap/Package[python-psutil]/ensure: created
Notice: /Stage[main]/Scap/Package[python-netifaces]/ensure: created
Notice: /Stage[main]/Scap/Package[python-yaml]/ensure: created
Notice: /Stage[main]/Scap/Package[python-requests]/ensure: created
Notice: /Stage[main]/Scap/Package[python-jinja2]/ensure: created
Notice: /Stage[main]/Threedtopng::Deploy/Scap::Target[3d2png/deploy]/Group[mwdeploy]/ensure: created
Notice: /Stage[main]/Threedtopng::Deploy/Scap::Target[3d2png/deploy]/User[mwdeploy]/ensure: created
Notice: /Stage[main]/Threedtopng::Deploy/Scap::Target[3d2png/deploy]/File[/var/lib/mwdeploy]/ensure: created
Error: Execution of '/usr/bin/scap deploy-local --repo 3d2png/deploy -D log_json:False' returned 70: 
Error: /Stage[main]/Threedtopng::Deploy/Scap::Target[3d2png/deploy]/Package[3d2png/deploy]/ensure: change from 'absent' to 'present' failed: Execution of '/usr/bin/scap deploy-local --repo 3d2png/deploy -D log_json:False' returned 70:

scap.cfg had the correct deploy1002 host and was created before the deploy-local attempt.

Also, why do we need to use specific server hostnames? deployment.eqiad.wmnet will always point to the active deploy server, regardless of DC (yes, the "eqiad" part is wrong).

Hit this today when setting up new thumbor servers. What I don't really understand is where it's getting deploy1001 these days:

It's the default coded into scap itself:
https://gerrit.wikimedia.org/g/mediawiki/tools/scap/+/10ae5db4e1911208ed01e6ca2841a71fc6e892e9/scap/config.py#70

This happens because DEPLOY_HEAD retains the last-used deploy server name. Unless scap is explicitly told to ignore it, that name is used after the first clone:

grep deploy1001 /srv/deployment/3d2png/deploy/.git/DEPLOY_HEAD
git_server: deploy1001.eqiad.wmnet
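A pre-flight check along these lines might catch the stale name before deploy-local fails in puppet. It is a rough sketch in standalone Python 3, not scap code: it assumes DEPLOY_HEAD holds one "key: value" pair per line, as the grep output above suggests, and the helper names `read_git_server` and `resolves` are hypothetical.

```python
import socket


def read_git_server(deploy_head_path):
    """Return the value of the first 'git_server:' line, or None if absent."""
    with open(deploy_head_path, encoding="utf-8") as handle:
        for line in handle:
            key, sep, value = line.partition(":")
            if sep and key.strip() == "git_server":
                return value.strip()
    return None


def resolves(hostname, resolver=lambda host: socket.getaddrinfo(host, None)):
    """Return True if hostname resolves; resolver is injectable for testing."""
    try:
        resolver(hostname)
    except OSError:  # socket.gaierror is a subclass of OSError
        return False
    return True


if __name__ == "__main__":
    # Path taken from the grep command above.
    server = read_git_server("/srv/deployment/3d2png/deploy/.git/DEPLOY_HEAD")
    if server is None:
        print("no git_server recorded")
    elif not resolves(server):
        print("stale git_server: " + server)
```

A decommissioned host that no longer has a DNS record, like deploy1001 in the traceback earlier in this task, would be reported as stale; a host that still resolves but is no longer the active deploy server would not be caught.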

The deploy server can be overridden on the command line (see https://gerrit.wikimedia.org/r/c/operations/puppet/+/670784/1#message-96812b744409eea5e74a934fc912834fed0e7e9b).

For the immediate term, if there are no objections, I will replace all instances of git_server: deploy1001.eqiad.wmnet with git_server: deploy1002.eqiad.wmnet on deploy1002 (there are about 27 with the old host defined)

No objection.

This is done
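For the next server switch, the replacement described above could be scripted roughly as follows. This is an illustrative sketch in standalone Python 3, not the commands that were actually run: it assumes DEPLOY_HEAD files sit directly inside each repo's .git directory and contain a plain "git_server: <host>" line, as in the grep output earlier in this task, and both function names are invented for the example.

```python
import os
import re


def rewrite_git_server(path, old_host, new_host):
    """Rewrite 'git_server: old_host' lines in one file; return True if changed."""
    # Match the whole line so a longer hostname sharing the prefix is left alone.
    pattern = re.compile(
        r"^([ \t]*git_server:[ \t]*)" + re.escape(old_host) + r"[ \t]*$",
        re.MULTILINE,
    )
    with open(path, encoding="utf-8") as handle:
        original = handle.read()
    updated = pattern.sub(lambda match: match.group(1) + new_host, original)
    if updated == original:
        return False
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(updated)
    return True


def update_deploy_heads(root, old_host, new_host):
    """Rewrite every .git/DEPLOY_HEAD under root; return the changed paths."""
    changed = []
    for dirpath, _dirnames, filenames in os.walk(root):
        if os.path.basename(dirpath) == ".git" and "DEPLOY_HEAD" in filenames:
            path = os.path.join(dirpath, "DEPLOY_HEAD")
            if rewrite_git_server(path, old_host, new_host):
                changed.append(path)
    return sorted(changed)


if __name__ == "__main__":
    # /srv/deployment is assumed from the paths quoted in this task.
    for path in update_deploy_heads(
        "/srv/deployment", "deploy1001.eqiad.wmnet", "deploy1002.eqiad.wmnet"
    ):
        print("updated " + path)
```

Running it a second time changes nothing, so it can be repeated safely. Files are rewritten in place without a backup, so a dry run that only prints the matching paths may be preferable first.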