
RESTBase scap deployment failed
Open, Needs Triage, Public

Description

Our last scap deployment for RESTBase failed with the following errors:

== CANARY1 ==
:* restbase2030.codfw.wmnet
15:23:27 ['/usr/bin/scap', 'deploy-local', '-v', '--repo', 'restbase/deploy', '-g', 'canary', 'fetch', '--refresh-config'] (ran as deploy-service@restbase2030.codfw.wmnet) returned [70]: Registering scripts in directory '/srv/deployment/restbase/deploy-cache/revs/c4d19d7c4e2b4e8d16cd12677d44b42962c05917/scap/scripts'
Fetch from: http://deploy1002.eqiad.wmnet/restbase/deploy/.git
Running ['git', 'remote', 'set-url', 'origin', 'http://deploy1002.eqiad.wmnet/restbase/deploy/.git'] with {'cwd': '/srv/deployment/restbase/deploy-cache/cache', 'stdout': -1, 'stderr': -1, 'text': True, 'stdin': -3}
Running ['git', 'fetch', '--tags', '--jobs', '62', '--no-recurse-submodules'] with {'cwd': '/srv/deployment/restbase/deploy-cache/cache', 'stdout': -1, 'stderr': -1, 'text': True, 'stdin': -3}
Command exited with code 1
Unhandled error:
deploy-local failed: <FailedCommand> {'exitcode': 1, 'stdout': '', 'stderr': 'From http://deploy1002.eqiad.wmnet/restbase/deploy/\n   e5ed8d0f..c4d19d7c  master                    -> origin/master\n ! [rejected]          scap/sync/2024-02-19/0011 -> scap/sync/2024-02-19/0011  (would clobber existing tag)\n * [new tag]           scap/sync/2024-03-25/0001 -> scap/sync/2024-03-25/0001\n * [new tag]           scap/sync/2024-03-25/0002 -> scap/sync/2024-03-25/0002\n * [new tag]           scap/sync/2024-03-25/0003 -> scap/sync/2024-03-25/0003\n * [new tag]           scap/sync/2024-04-02/0001 -> scap/sync/2024-04-02/0001\n'}

I think the problem is the following, and it needs some manual intervention:

would clobber existing tag
stderr
From http://deploy1002.eqiad.wmnet/restbase/deploy/
   e5ed8d0f..c4d19d7c  master                    -> origin/master
 ! [rejected]          scap/sync/2024-02-19/0011 -> scap/sync/2024-02-19/0011  (would clobber existing tag)
 * [new tag]           scap/sync/2024-03-25/0001 -> scap/sync/2024-03-25/0001  
 * [new tag]           scap/sync/2024-03-25/0002 -> scap/sync/2024-03-25/0002
 * [new tag]           scap/sync/2024-03-25/0003 -> scap/sync/2024-03-25/0003
 * [new tag]           scap/sync/2024-04-02/0001 -> scap/sync/2024-04-02/0001
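
The rejected tag can be picked out of the fetch output mechanically. A minimal sketch in Python, assuming the stderr format shown above stays stable; the name clobbered_tags is made up for illustration and is not part of scap:

```python
from __future__ import annotations

import re

# Matches lines of the form:
#  ! [rejected]   <tag> -> <tag>  (would clobber existing tag)
REJECTED_TAG = re.compile(
    r"^\s*!\s+\[rejected\]\s+(?P<tag>\S+)\s+->\s+\S+\s+"
    r"\(would clobber existing tag\)\s*$"
)


def clobbered_tags(stderr: str) -> list[str]:
    """Return the tag names that git fetch refused to update."""
    tags = []
    for line in stderr.splitlines():
        match = REJECTED_TAG.match(line)
        if match:
            tags.append(match.group("tag"))
    return tags


if __name__ == "__main__":
    # Excerpt of the stderr from this task.
    sample = (
        "From http://deploy1002.eqiad.wmnet/restbase/deploy/\n"
        "   e5ed8d0f..c4d19d7c  master                    -> origin/master\n"
        " ! [rejected]          scap/sync/2024-02-19/0011 -> "
        "scap/sync/2024-02-19/0011  (would clobber existing tag)\n"
        " * [new tag]           scap/sync/2024-03-25/0001 -> "
        "scap/sync/2024-03-25/0001\n"
    )
    for tag in clobbered_tags(sample):
        print(tag)
```

Only the "[rejected]" lines carrying the "would clobber existing tag" reason are reported; "[new tag]" lines and branch updates are ignored.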

Event Timeline

hashar added subscribers: dancy, hashar.

Recent Git versions refuse to update an existing tag on fetch unless forced.

That is supposedly fixed by e98df2095d16c1cba159f425f77c263b243f1a7e / T311336 :(

The faulty tag is dated 2024-02-19, and the fix in scap was written in Feb 23 and, I guess, deployed not long after, but definitely AFTER the faulty tag was created. Once the faulty tag is present, it keeps producing the issue. It has to be cleaned up manually on the targeted hosts (under /srv/deployment/restbase/deploy-cache/cache, run git fetch --tags -f as whichever user owns the repo).

Alternatively, remove the faulty tag from the repository on the deployment server: git tag -d scap/sync/2024-02-19/0011. Once it is no longer on the deployment server, the targets will not fetch it and will thus no longer complain about the tag suddenly changing.
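
For reference, the two cleanup options above can be written out as commands. This is a small Python sketch that only prints them and runs nothing. The helper names are hypothetical, the cache path is taken from the log in the description, and the deployment-server checkout is assumed to live at /srv/deployment/restbase/deploy:

```python
import shlex

# Path on the target hosts, as seen in the scap log in the description.
CACHE_DIR = "/srv/deployment/restbase/deploy-cache/cache"
# Assumed location of the repository on the deployment server.
DEPLOY_DIR = "/srv/deployment/restbase/deploy"


def remediation_commands(tag):
    """Return (working directory, argv) for each of the two options."""
    return {
        # Option 1: on each target, force the local tags to match the server.
        "target": (CACHE_DIR, ["git", "fetch", "--tags", "-f"]),
        # Option 2: on the deployment server, delete the offending tag.
        "deploy_server": (DEPLOY_DIR, ["git", "tag", "-d", tag]),
    }


def render(cwd, argv):
    """Format a command as a copy-pasteable shell line."""
    return f"cd {shlex.quote(cwd)} && {shlex.join(argv)}"


if __name__ == "__main__":
    commands = remediation_commands("scap/sync/2024-02-19/0011")
    for where, (cwd, argv) in commands.items():
        print(f"{where}: {render(cwd, argv)}")
```

Printing instead of executing is deliberate: the commands have to run as whichever user owns the repo on each host, which this sketch does not try to handle.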

Thanks @hashar

jgiannelos@deploy1002:/srv/deployment/restbase/deploy$ git tag -d scap/sync/2024-02-19/0011
Deleted tag 'scap/sync/2024-02-19/0011' (was d3425717)

I will try to deploy on Monday.

It failed again but with a different error:

:* restbase1030.eqiad.wmnet

Setting lfs.url of restbase to https://gerrit.wikimedia.org/r/mediawiki/services/restbase/info/lfs
Running ['git', 'config', 'lfs.url', 'https://gerrit.wikimedia.org/r/mediawiki/services/restbase/info/lfs'] with {'cwd': '/srv/deployment/restbase/deploy-cache/cache/restbase', 'stdout': -1, 'stderr': -1, 'text': True, 'stdin': -3}
Unhandled error:
deploy-local failed: <FileNotFoundError> {}

Mentioned in SAL (#wikimedia-operations) [2024-04-08T15:03:55Z] <dancy@deploy1002> Started deploy [restbase/deploy@c4d19d7]: testing T361608

Mentioned in SAL (#wikimedia-operations) [2024-04-08T15:17:54Z] <dancy@deploy1002> Finished deploy [restbase/deploy@c4d19d7]: testing T361608 (duration: 13m 59s)

I resolved this by deleting the empty /srv/deployment/restbase/deploy-cache/revs/c4d19d7c4e2b4e8d16cd12677d44b42962c05917 directory on restbase1030.eqiad.wmnet. The date on the directory was April 3, the time of the original problem reported in this ticket. It's unclear to me how we ended up with an empty directory, though.
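
In case other targets are affected the same way, leftover empty revision directories can be listed before the next deploy. A rough sketch in Python, assuming the revs layout shown in the logs; empty_rev_dirs is a hypothetical helper, not something scap provides, and it only reports directories without deleting anything:

```python
from pathlib import Path


def empty_rev_dirs(revs_dir):
    """Return the sorted sub-directories of revs_dir that have no entries."""
    root = Path(revs_dir)
    return sorted(
        path
        for path in root.iterdir()
        # Skip symlinks so only real revision directories are inspected.
        if path.is_dir() and not path.is_symlink() and not any(path.iterdir())
    )


if __name__ == "__main__":
    revs = Path("/srv/deployment/restbase/deploy-cache/revs")
    if revs.is_dir():
        for path in empty_rev_dirs(revs):
            print(path)
```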

Here's the log of the deployment after the cleanup: P59861