Page MenuHomePhabricator

"scap deploy"'s config-deploy should check for broken symlinks
Open, In Progress, Needs TriagePublic

Description

We've been provisioning new bullseye wdqs hosts and ran into some issues with scap; namely "incomplete" deploys that leave the host in a non-functional state.

Scap works as we expect when we run with the --force flag:

scap deploy --force -l 'wdqs2016.codfw.wmnet' ${VERSION}

However if we run it without the --force flag, and the directory /srv/deployment/wdqs/wdqs-cache/revs/$CURRENT_DEPLOY_COMMIT_HASH/.git/config-files/etc/query_service is not present, scap will silently "fail" the deploy but not raise an error:

Here's a log from wdqs2016 where its .git/config-files/etc/query_service directory has been manually deleted by the operator:
https://phabricator.wikimedia.org/P49572

And here's a log from the same host, same directory missing, but with the --force flag:
https://phabricator.wikimedia.org/P49573


I think based off this log message:
{"name": "target.wdqs2016.codfw.wmnet.deploy-local", "msg": "Revision directory already exists (use --force to override)", "args": [], "levelno": 20, "filename": "deploy.py", "exc_text": null, "lineno": 336, "funcName": "fetch", "created": 1689706482.8972232, "msecs": 897.2232341766357, "relativeCreated": 743.9661026000977, "host": "wdqs2016.codfw.wmnet"}

The issue is possibly that scap is happy as long as /srv/deployment/wdqs/wdqs-cache/revs/$CURRENT_DEPLOY_COMMIT_HASH/ exists, but doesn't check that all relevant subdirectories are as they should.

AC

  • When necessary subdirectories of the main git directory are not present and/or have different contents than expected, scap will actually raise an error when the --force flag is not present

Event Timeline

First draft of this ticket up. There's a couple things that aren't perfect:

(1) I think the current description is a bit confusing, I'm not sure if another team has enough context in the ticket's current state.

(2) I haven't tagged the team who would work this ticket. Probably rel-eng?

I think I have the context to understand this.

It looks like /srv/deployment/wdqs/wdqs-cache/revs/$CURRENT_DEPLOY_COMMIT_HASH/.git/config-files/etc/query_service is symlinked at /etc/query_service/ldf-config.json is that true?

I see this line in log output:

/etc/query_service/ldf-config.json is already linked to current rev(use --force to override)

Which is exactly what you're describing. It comes from the check in scap deploy here: https://gitlab.wikimedia.org/repos/releng/scap/-/blob/master/scap/deploy.py#L212

The assumption is that if a file is already symlinked, there's no need to regenerate the file. But it sounds like it's a bad assumption in this case, is that true?

@thcipriani that's correct. For context, we are manually deleting the directory because puppet runs scap before git-fat is installed, and it results in an incomplete deploy (more details here ) . The deletion results in a broken symlink, maybe scap could check for that?

Also, if we try and run scap before deleting these directories, scap does not detect the missing files (large files provided by git-fat) and reports a successful deployment.

I'm not trying to cross the streams with what might be 2 different bugs (or even expected behavior), just trying to explain why we're manually deleting that directory.

I think I have the context to understand this.

It looks like /srv/deployment/wdqs/wdqs-cache/revs/$CURRENT_DEPLOY_COMMIT_HASH/.git/config-files/etc/query_service is symlinked at /etc/query_service/ldf-config.json is that true?

I see this line in log output:

/etc/query_service/ldf-config.json is already linked to current rev(use --force to override)

Which is exactly what you're describing. It comes from the check in scap deploy here: https://gitlab.wikimedia.org/repos/releng/scap/-/blob/master/scap/deploy.py#L212

The assumption is that if a file is already symlinked, there's no need to regenerate the file. But it sounds like it's a bad assumption in this case, is that true?

Right, I think we want to check that the file associated with the symlink actually exists, in addition to checking that the symlink itself is present.

Config-deploy/symlink checking

@thcipriani that's correct. For context, we are manually deleting the directory because puppet runs scap before git-fat is installed, and it results in an incomplete deploy (more details here ) . The deletion results in a broken symlink, maybe scap could check for that?

Yeah, I think that makes sense for this task

Right, I think we want to check that the file associated with the symlink actually exists, in addition to checking that the symlink itself is present.

Agreed.

I'll re-title this task to reflect this symlink-checking bug in scap deploy.

Other parts of what you're seeing could use a different ticket. There are a few issues there. Summary below.


Other issues

Also, if we try and run scap before deleting these directories, scap does not detect the missing files (large files provided by git-fat) and reports a successful deployment.

I'm not trying to cross the streams with what might be 2 different bugs (or even expected behavior), just trying to explain why we're manually deleting that directory.

That makes sense. Scap tries to avoid doing work where possible, so exiting early is "expected" But(!) it sounds like some of the checks are bad/confusing—which is a bug. Possibly one of several bugs here 😅

Additional thoughts:

Initial run (as I understand it from T331300#8983005)

  1. Puppet runs scap
  2. scap's git checkout succeeds
  3. git checkout triggers the smudge filter, which calls git-fat
  4. git-fat fails (since the executable doesn't exist)

Is that right?

Subsequent scap deploy runs

  1. scap deploy scap doesn't re-try the git checkout (unless you use --force) due to this check (which is different than the symlink check above. The logic of this check is: git rev-parse --verify HEAD == <rev you're trying to deploy> ? nothing to do && exit )
  2. Smudge is never triggered
  3. git-fat is never called to do it's thing
  4. git-fat never pulls the actual files

Bugs in this process

I suppose there are a few problems:
0. scap is trying to be too smart/helpful and exiting in confusing ways

  1. git-fat should exist before the scap run, so there's some missing dependency/race condition/erroneous graph in the initial puppet run
  2. scap should probably check if git-fat was successful, otherwise the bad initial puppet run is unrecoverable without manual intervention
  3. maybe we caused 1 because we're phasing out git-fat in favor of git-lfs.

Anyway, this one is more gnarly than adding a symlink check—let's make a separate task. :)

thcipriani renamed this task from Scap deploy does not raise error on missing directory to "scap deploy"'s config-deploy should check for broken symlinks.Jul 18 2023, 10:20 PM
Sandeeps changed the task status from Open to In Progress.Mar 26 2024, 5:41 PM