Git refusing to clone some ORES submodules
Closed, ResolvedPublic

Description

Every time I try to deploy to our ores* cluster, I'm blocked by this error on each machine:

Cloning into 'submodules/ores'...
Submodule path 'submodules/ores': checked out '2c54f421755349c10b83da4ce35e6ce92d2bfd92'
Cloning into 'submodules/wheels'...
Submodule path 'submodules/wheels': checked out 'd7fa640c59aebc3c080b43516c76235e1fca5cf9'
Cloning into 'submodules/wikiclass'...
error: RPC failed; result=22, HTTP code = 504
fatal: The remote end hung up unexpectedly
Clone of 'https://phabricator.wikimedia.org/source/wikiclass.git' into submodule path 'submodules/wikiclass' failed

I don't think the wikiclass repo is particularly offensive; it's about 460 MB of git data.

This may be exacerbated by our new parallel scap process.

Event Timeline

awight updated the task description.

This could be another submodule rewriting problem. I see that .gitmodules has been modified on the target machine to point to http://tin.eqiad.wmnet/ores/deploy/.git/modules/submodules/draftquality. However, the error indicates that "git submodule sync" never happened, and we're still trying to swarm Phabricator git.
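One way to confirm the missing sync on a target is to compare the submodule URLs in .gitmodules with the ones recorded in .git/config, since "git submodule sync" is what copies the former into the latter. The following is a minimal sketch under that assumption, not anything scap does itself; the function names submodule_urls and unsynced_submodules are made up for illustration, and the parser only handles the simple `[submodule "name"]` / `url = ...` layout.

```python
import re

# Matches section headers such as: [submodule "submodules/wikiclass"]
SECTION_RE = re.compile(r'^\[submodule\s+"(?P<name>[^"]+)"\]$')
# Matches lines such as: url = https://example.org/repo.git
URL_RE = re.compile(r'^url\s*=\s*(?P<url>\S+)$')


def submodule_urls(config_text):
    """Map submodule name -> url from .gitmodules- or .git/config-style text."""
    urls = {}
    current = None
    for line in config_text.splitlines():
        stripped = line.strip()
        section = SECTION_RE.match(stripped)
        if section:
            current = section.group("name")
            continue
        if stripped.startswith("["):
            # Some other section, e.g. [core] or [remote "origin"].
            current = None
            continue
        url = URL_RE.match(stripped)
        if url and current is not None:
            urls[current] = url.group("url")
    return urls


def unsynced_submodules(gitmodules_text, git_config_text):
    """Return names whose URL in .git/config differs from the one in .gitmodules."""
    wanted = submodule_urls(gitmodules_text)
    recorded = submodule_urls(git_config_text)
    return sorted(
        name
        for name, url in wanted.items()
        if name in recorded and recorded[name] != url
    )
```

Feeding it the contents of .gitmodules and .git/config from the revision directory should list every submodule that would still be cloned from the old URL. An empty list means the two files agree, and the problem is elsewhere.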

@thcipriani @mmodell Is the fix for T179013 deployed to production? I'm hoping the fix will be that simple.

I might nudge this up to UBN today: this blocks us from deploying to new hardware, and something's up with our old clusters. The reason I'm not bumping priority now is that I'm focused on fixing the issue that's overloading the old hardware.

No, the fix is not in production. It's tangled up in a lot of other changes, but effectively it just moves around a git submodule sync, so backporting for a minor version bump may not be too involved.

It is noteworthy that this bug only happens when redeploying a revision that has already been deployed with --force. A quick workaround would be to remove the revision directory, e.g., /srv/deployment/ores/deploy-cache/revs/[sha1] from the target.
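For anyone scripting the workaround, here is a rough sketch of removing one cached revision directory. It assumes the directory name is the full 40-character SHA-1, which the thread does not confirm; the helper name remove_cached_revision and its dry_run flag are hypothetical, and it defaults to a dry run so nothing is deleted by accident.

```python
import re
import shutil
from pathlib import Path

# Assumption: revision directories are named by the full 40-hex-digit SHA-1.
SHA1_RE = re.compile(r"[0-9a-f]{40}")


def remove_cached_revision(
    sha1,
    revs_dir="/srv/deployment/ores/deploy-cache/revs",
    dry_run=True,
):
    """Remove revs_dir/sha1 and return its path, or None if it does not exist.

    With dry_run=True (the default) the directory is only located, not deleted.
    """
    # Reject anything that is not a plain SHA-1 so the path cannot escape revs_dir.
    if not SHA1_RE.fullmatch(sha1):
        raise ValueError(f"not a full SHA-1: {sha1!r}")
    target = Path(revs_dir) / sha1
    if not target.is_dir():
        return None
    if not dry_run:
        shutil.rmtree(target)
    return target
```

Calling it once with the default dry_run=True shows which directory would be removed; passing dry_run=False performs the deletion, after which the next deploy has to fetch that revision again from scratch.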

@thcipriani OK thank you for the workaround. I'll note that I don't have permissions to do that myself, but I'll ask an opsen to do so, or just trivially increment the revision.

I'm not sure what to make of this one. I don't think T179013 ever affected production, so I'm not sure that we can backport a fix for it. If anything, we should just push a new scap version with the latest code, which should be stable at this point.

I haven't seen this issue in a few weeks, so I'm closing this task. Thank you!