Maniphest T197470

find a way to systematically update the deployment server name across all repos
Open, Low, Public

Assigned To

None

Authored By

	Dzahn
	Jun 15 2018, 2:46 PM

Description

After the recent migration of our deployment server from tin to deploy1001 (T175288), several users reported being blocked from deploying because files in their local repos still referred to the old deployment server name.

This issue fell into different categories. Some cases involved .config files in the "deployment-cache" directory that contained the string "tin.eqiad.wmnet". As the name implies, these are cached files. One way to fix the issue was to manually edit the file and replace the host name. Another was apparently to delete the file and have scap recreate it, and/or to run scap with --refresh-config.
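The two manual workarounds described above can be sketched in Python. This is only an illustration and not part of scap: the helper name fix_cached_config is hypothetical, and the replacement is a plain-text substitution that makes no assumption about the file's format.

```python
from pathlib import Path


def fix_cached_config(path, old_host, new_host, delete=False):
    """Apply one of the two manual workarounds to a cached .config file.

    Hypothetical helper, not part of scap. Returns True if the file was
    changed or removed, False if it needed no change.
    """
    config = Path(path)
    if delete:
        # Workaround 2: remove the file so that scap recreates it.
        config.unlink()
        return True
    text = config.read_text()
    if old_host not in text:
        return False
    # Workaround 1: replace the stale host name in place.
    config.write_text(text.replace(old_host, new_host))
    return True
```

Editing in place keeps any other cached settings, while deleting relies on scap writing a fresh copy on its next run.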

Separate from this, there was another category where .config files outside the deployment-cache directory still contained the old host name; this was reported to happen after a fresh OS install.

Also there were comments about a fix inside scap that is needed for this but still needs to be deployed.

This ticket covers all of that, plus finding a clean way to handle this the next time we have to switch, say, from deploy1001 to deploy1002.

Details

	Title	Reference	Author	Source Branch	Dest Branch
	Remove obsolete defaults for git_server	repos/releng/scap!256	hashar	cfg_git_server	master


Related Objects

		Status	Subtype	Assigned	Task
		Open		None	T197470 find a way to systematically update the deployment server name across all repos
		Resolved		thcipriani	T162814 Ensure deployment_server is global

Event Timeline

Dzahn created this task.Jun 15 2018, 2:46 PM

Restricted Application added a subscriber: Aklapper.Jun 15 2018, 2:46 PM

Dzahn edited projects, added Scap; removed Deployments.Jun 15 2018, 2:47 PM

Dzahn updated the task description.

Dzahn added a project: Release-Engineering-Team.Jun 15 2018, 2:49 PM

Joe triaged this task as High priority.Jun 18 2018, 8:41 AM

Joe subscribed.Jun 18 2018, 10:47 AM

• Vvjjkkii renamed this task from find a way to systematically update the deployment server name across all repos to 1taaaaaaaa.Jul 1 2018, 1:03 AM

• Vvjjkkii added projects: CheckUser, Connected-Open-Heritage-Batch-uploads (RAÄ-KMB_1_2017-02), Tamil-Sites, Gamepress, Hashtags, Jade, KartoEditor, Language-2018-Apr-June, New-Editor-Experiences, Mail, TCB-Team (now WMDE-TechWish).

• Vvjjkkii updated the task description.

• Vvjjkkii removed a subscriber: Aklapper.

CommunityTechBot renamed this task from 1taaaaaaaa to find a way to systematically update the deployment server name across all repos.Jul 2 2018, 12:07 PM

CommunityTechBot updated the task description.

CommunityTechBot removed projects: TCB-Team (now WMDE-TechWish), Mail, New-Editor-Experiences, Language-2018-Apr-June, KartoEditor, Jade, Hashtags, Gamepress, Tamil-Sites, Connected-Open-Heritage-Batch-uploads (RAÄ-KMB_1_2017-02), CheckUser.

CommunityTechBot added a subscriber: Aklapper.

thcipriani added a subtask: T162814: Ensure deployment_server is global.Jul 10 2018, 11:38 PM

There are a couple of different issues here.

The tin and deployment-cache issues that came up after the initial move to deploy1001 are fixed by the 3.8.2-1 scap release. These issues were due to repositories with submodules issuing a fetch (recursively) before remapping submodules to look at the deployment server (see T196663#4265139 for details). That is fixed.

The second issue is that some repositories override deployment_server; that is tracked in T162814. The only outstanding patch I'm aware of is for 3d2png (https://gerrit.wikimedia.org/r/#/c/3d2png/deploy/+/441234/).

thcipriani closed subtask T162814: Ensure deployment_server is global as Resolved.Aug 29 2018, 6:31 PM

greg moved this task from INBOX to Backlog on the Release-Engineering-Team board.Nov 21 2018, 12:34 AM

greg edited projects, added Release-Engineering-Team (Backlog); removed Release-Engineering-Team.

• Phabricator_maintenance moved this task from Backlog to Acknowledged on the SRE board.Jan 26 2019, 10:06 PM

• Phabricator_maintenance edited projects, added Release-Engineering-Team-TODO; removed Release-Engineering-Team (Backlog).Jun 12 2019, 11:51 PM

• Phabricator_maintenance moved this task from Should be empty (use Release-Engineering-Team) to Later / Need volunteer on the Release-Engineering-Team-TODO board.Jun 12 2019, 11:55 PM

greg added a project: Release-Engineering-Team.Jun 21 2019, 10:35 PM

greg edited projects, added Release-Engineering-Team (Deployment services); removed Release-Engineering-Team.Jun 24 2019, 9:20 PM

Mentioned in SAL (#wikimedia-operations) [2019-08-27T11:51:31Z] <mutante> miscweb1001 - manually remove tin.eqiad.wmnet (!) from /srv/iegreview/iegreview-cache/.config and replace with deploy1001 after first puppet run. still existing bug that tin is not fully removed (T224247, T175288, T197470)

Dzahn mentioned this in T245757: Upgrade MediaWiki clusters to Debian Buster (debian 10).Mar 30 2021, 6:56 PM

thcipriani removed a project: Release-Engineering-Team (Deployment services).Apr 20 2021, 1:10 AM

thcipriani edited projects, added Release-Engineering-Team (thcipriani-workboard-fiddling); removed Release-Engineering-Team-TODO.Apr 20 2021, 3:42 AM

thcipriani moved this task from thcipriani-workboard-fiddling to Seen (ARCHIVE) on the Release-Engineering-Team board.Apr 20 2021, 3:54 AM

thcipriani edited projects, added Release-Engineering-Team; removed Release-Engineering-Team (thcipriani-workboard-fiddling).

thcipriani edited projects, added Release-Engineering-Team (Seen); removed Release-Engineering-Team.Apr 20 2021, 3:23 PM

This issue recurred when we moved to deploy1002, for a bunch of services that use deploy-local for initial deployments via puppet. There's an open change relating to it here: https://gerrit.wikimedia.org/r/c/operations/puppet/+/670784

Just ran into this today on an install of new snapshot1011,12,13: got the dreaded

Apr 28 11:47:52 snapshot1011 puppet-agent[8409]: Execution of '/usr/bin/scap deploy-local --repo dumps/dumps -D log_json:False' returned 70: #007
Apr 28 11:47:52 snapshot1011 puppet-agent[8409]: (/Stage[main]/Profile::Dumps::Generation::Worker::Common/Scap::Target[dumps/dumps]/Package[dumps/dumps]/ensure) change from 'absent' to 'present' failed: Execution of '/usr/bin/scap deploy-local --repo dumps/dumps -D log_json:False' returned 70: #007

for the dumps repo. I have manually edited the dumps-cache/config files on those hosts and left the DEPLOY_HEAD file in the dumps repo on deploy1002 untouched, so that any proposed solution can be tested there. I have two more hosts yet to roll out, so we can definitely check what works.

ArielGlenn mentioned this in T281330: deploy three new snapshots as replacements for snapshot1005,6,7 and set 1005,6,7 as spare.Apr 28 2021, 1:00 PM

I also see the above with --refresh-config added:

root@snapshot1011:~# sudo -u dumpsgen /usr/bin/scap deploy-local --refresh-config --repo dumps/dumps -D log_json:False
13:20:24 Fetch from: http://deploy1001.eqiad.wmnet/dumps/dumps/.git
13:20:24 Unhandled error:
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/scap/cli.py", line 348, in run
    exit_status = app.main(app.extra_arguments)
  File "/usr/lib/python2.7/dist-packages/scap/deploy.py", line 156, in main
    getattr(self, stage)()
  File "/usr/lib/python2.7/dist-packages/scap/deploy.py", line 300, in fetch
    git.fetch(self.context.cache_dir, git_remote)
  File "/usr/lib/python2.7/dist-packages/scap/git.py", line 371, in fetch
    gitcmd("clone", *cmd)
  File "/usr/lib/python2.7/dist-packages/scap/runcmd.py", line 82, in gitcmd
    return _runcmd(["git", subcommand] + list(args), **kwargs)
  File "/usr/lib/python2.7/dist-packages/scap/runcmd.py", line 72, in _runcmd
    raise FailedCommand(" ".join(argv), p.returncode, stdout, stderr)
FailedCommand: Command 'git clone --jobs 70 http://deploy1001.eqiad.wmnet/dumps/dumps/.git /srv/deployment/dumps/dumps-cache/cache' failed with exit code 128; stderr:
Cloning into '/srv/deployment/dumps/dumps-cache/cache'...
fatal: unable to access 'http://deploy1001.eqiad.wmnet/dumps/dumps/.git/': Could not resolve host: deploy1001.eqiad.wmnet

13:20:24 deploy-local failed: <FailedCommand> Command 'git clone --jobs 70 http://deploy1001.eqiad.wmnet/dumps/dumps/.git /srv/deployment/dumps/dumps-cache/cache' failed with exit code 128; stderr:
Cloning into '/srv/deployment/dumps/dumps-cache/cache'...
fatal: unable to access 'http://deploy1001.eqiad.wmnet/dumps/dumps/.git/': Could not resolve host: deploy1001.eqiad.wmnet

Interestingly, after editing the cached copy of the repo .config file, using --refresh-config will actually pull the config from the deploy server again, restoring the old hostname.

Hit this today when setting up new thumbor servers. What I don't really understand is where it's getting deploy1001 these days:

Notice: /Stage[main]/Scap/Package[scap]/ensure: created
Notice: /Stage[main]/Scap/File[/etc/scap.cfg]/ensure: defined content as '{md5}dea3d6b74e34fba71d5e575c7999c87a'
Notice: /Stage[main]/Scap/Package[python-psutil]/ensure: created
Notice: /Stage[main]/Scap/Package[python-netifaces]/ensure: created
Notice: /Stage[main]/Scap/Package[python-yaml]/ensure: created
Notice: /Stage[main]/Scap/Package[python-requests]/ensure: created
Notice: /Stage[main]/Scap/Package[python-jinja2]/ensure: created
Notice: /Stage[main]/Threedtopng::Deploy/Scap::Target[3d2png/deploy]/Group[mwdeploy]/ensure: created
Notice: /Stage[main]/Threedtopng::Deploy/Scap::Target[3d2png/deploy]/User[mwdeploy]/ensure: created
Notice: /Stage[main]/Threedtopng::Deploy/Scap::Target[3d2png/deploy]/File[/var/lib/mwdeploy]/ensure: created
Error: Execution of '/usr/bin/scap deploy-local --repo 3d2png/deploy -D log_json:False' returned 70: 
Error: /Stage[main]/Threedtopng::Deploy/Scap::Target[3d2png/deploy]/Package[3d2png/deploy]/ensure: change from 'absent' to 'present' failed: Execution of '/usr/bin/scap deploy-local --repo 3d2png/deploy -D log_json:False' returned 70:

scap.cfg had the correct deploy1002 host and was created before the deploy-local attempt.

Also, why do we need to use specific server hostnames? deployment.eqiad.wmnet will always point to the active deploy server, regardless of DC (yes, the "eqiad" part is wrong).
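To illustrate the suggestion: the fetch URL in the traceback above has the shape http://<git_server>/<repo>/.git, so only the host part would need to change. A small sketch, assuming that URL shape; fetch_url is a hypothetical helper and not a scap function.

```python
def fetch_url(git_server, repo):
    """Build a fetch URL shaped like the one in the log output above."""
    return "http://{}/{}/.git".format(git_server, repo)


# Host-specific name: has to change at every deployment server switch.
print(fetch_url("deploy1002.eqiad.wmnet", "dumps/dumps"))
# Service alias: stays the same whichever host is active.
print(fetch_url("deployment.eqiad.wmnet", "dumps/dumps"))
```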

In T197470#7505335, @Legoktm wrote:

Hit this today when setting up new thumbor servers. What I don't really understand is where it's getting deploy1001 these days:

It's the default coded into scap itself:
https://gerrit.wikimedia.org/g/mediawiki/tools/scap/+/10ae5db4e1911208ed01e6ca2841a71fc6e892e9/scap/config.py#70

This happens because DEPLOY_HEAD retains the last-used deploy server name, and unless explicitly told to ignore it, scap will use that name after the first clone:

grep deploy1001 /srv/deployment/3d2png/deploy/.git/DEPLOY_HEAD
git_server: deploy1001.eqiad.wmnet

The deploy server can be overridden on the command line: https://gerrit.wikimedia.org/r/c/operations/puppet/+/670784/1#message-96812b744409eea5e74a934fc912834fed0e7e9b
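The behaviour described in this comment can be modelled as a lookup with precedence. The sketch below is inferred from this thread only and does not reproduce scap's code: the function effective_git_server and the exact ordering (command-line override, then DEPLOY_HEAD, then /etc/scap.cfg, then the built-in default) are assumptions.

```python
def effective_git_server(default, scap_cfg=None, deploy_head=None, cli=None):
    """Return the first git_server that is set, highest precedence first.

    Assumed order, inferred from the thread: command-line override,
    then DEPLOY_HEAD, then /etc/scap.cfg, then the built-in default.
    """
    for value in (cli, deploy_head, scap_cfg):
        if value:
            return value
    return default


# scap.cfg is correct, but a stale DEPLOY_HEAD value still wins.
stale = effective_git_server(
    default="deploy1001.eqiad.wmnet",
    scap_cfg="deploy1002.eqiad.wmnet",
    deploy_head="deploy1001.eqiad.wmnet",
)
print(stale)
```

Under this model a correct /etc/scap.cfg is not enough on its own, which matches the thumbor report above.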

For the immediate term if there are no objections I will replace all instances of git_server: deploy1001.eqiad.wmnet with git_server: deploy1002.eqiad.wmnet on deploy1002 (there are about 27 with the old host defined)

In T197470#7506161, @hnowlan wrote:

For the immediate term if there are no objections I will replace all instances of git_server: deploy1001.eqiad.wmnet with git_server: deploy1002.eqiad.wmnet on deploy1002 (there are about 27 with the old host defined)

No objection.

• dancy edited projects, added Release-Engineering-Team (Priority Backlog 📥); removed Release-Engineering-Team (Seen).Nov 16 2021, 5:02 PM

In T197470#7507203, @dancy wrote:

In T197470#7506161, @hnowlan wrote:

For the immediate term if there are no objections I will replace all instances of git_server: deploy1001.eqiad.wmnet with git_server: deploy1002.eqiad.wmnet on deploy1002 (there are about 27 with the old host defined)

No objection.

This is done

jbond removed a project: SRE.Nov 4 2022, 1:20 PM

thcipriani edited projects, added Release-Engineering-Team (Onboarding 🚀); removed Release-Engineering-Team (Priority Backlog 📥).Jun 9 2023, 9:04 PM

The workaround for this is updating the git_server in <deployment-repo>/.git/DEPLOY_HEAD on the new deployment server.

The fix in scap is probably to have it stop paying attention to this value and instead use the puppet-provided value.
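One way to picture that fix, as a sketch only: drop the git_server key from whatever is read out of DEPLOY_HEAD before merging it into the configuration, so that the puppet-provided value is kept. The function name merge_deploy_head and the dict-based model of the configuration are assumptions, not scap's implementation.

```python
def merge_deploy_head(puppet_config, deploy_head):
    """Merge DEPLOY_HEAD values into the config, ignoring its git_server.

    Both arguments are plain dicts; this models the proposed behaviour
    and is not scap's actual code.
    """
    merged = dict(puppet_config)
    for key, value in deploy_head.items():
        if key == "git_server":
            # Keep the puppet-provided server; skip the recorded one.
            continue
        merged[key] = value
    return merged
```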

kamila subscribed.Sep 22 2023, 9:35 AM

kamila mentioned this in T348990: Simplify switchover of deployment server.Oct 16 2023, 2:29 PM

thcipriani edited projects, added Release-Engineering-Team (Priority Backlog 📥); removed Release-Engineering-Team (Onboarding 🚀).Nov 21 2023, 7:18 PM

For the record, I used sudo bash -c 'find /srv/deployment -name DEPLOY_HEAD | xargs sed -i "s/git_server: deploy1002.eqiad.wmnet/git_server: deploy2002.codfw.wmnet/"' (or the other way around) on all deployment servers to work around this.
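A Python equivalent of that one-liner may be handy when a dry run is wanted first. This is a sketch: rewrite_git_server is a hypothetical helper, and it assumes the "git_server: <host>" line format shown earlier in this task.

```python
import os


def rewrite_git_server(root, old_host, new_host, dry_run=True):
    """Replace 'git_server: old_host' in every DEPLOY_HEAD under root.

    Returns the sorted list of files that contain the old value. Files
    are only rewritten when dry_run is False.
    """
    old_line = "git_server: " + old_host
    new_line = "git_server: " + new_host
    matched = []
    for dirpath, _dirnames, filenames in os.walk(root):
        if "DEPLOY_HEAD" not in filenames:
            continue
        path = os.path.join(dirpath, "DEPLOY_HEAD")
        with open(path) as handle:
            text = handle.read()
        if old_line not in text:
            continue
        matched.append(path)
        if not dry_run:
            with open(path, "w") as handle:
                handle.write(text.replace(old_line, new_line))
    return sorted(matched)
```

Calling it with dry_run=True lists the affected files without touching them; a second call with dry_run=False performs the replacement.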