ORES deploy submodule 504
Closed, ResolvedPublic

Description

awight@tin:/srv/deployment/ores/deploy$ scap deploy -f -l "ores1002.eqiad.wmnet" "Smoke test for timeout bug on ores1002 (non-production)"
19:55:28 Started deploy [ores/deploy@a0f7d5c]
Deleted tag 'scap/sync/2017-09-12/0001' (was f75e5ca)
Entering 'submodules/draftquality'
Entering 'submodules/editquality'
Entering 'submodules/ores'
Entering 'submodules/wheels'
Entering 'submodules/wikiclass'
19:55:28 Started deploy [ores/deploy@a0f7d5c]: Smoke test for timeout bug on ores1002 (non-production)
19:55:28
== CLUSTER ==
:* ores1002.eqiad.wmnet
19:58:29 ['/usr/bin/scap', 'deploy-local', '-v', '--repo', 'ores/deploy', '--force', '-g', 'cluster', 'fetch', '--refresh-config'] on ores1002.eqiad.wmnet returned [70]: http://tin.eqiad.wmnet/ores/deploy/.git
From http://tin.eqiad.wmnet/ores/deploy/
 * [new branch] STABLE_REVSCORING_1 -> origin/STABLE_REVSCORING_1
 * [new branch] master -> origin/master
 * [new tag] scap/sync/2017-10-30/0005 -> scap/sync/2017-10-30/0005
/srv/deployment/ores/deploy-cache/cache
From /srv/deployment/ores/deploy-cache/cache
 * [new branch] master -> origin/master
 * [new tag] scap/sync/2017-10-30/0005 -> scap/sync/2017-10-30/0005
Synchronizing submodule url for 'submodules/draftquality'
Synchronizing submodule url for 'submodules/editquality'
Synchronizing submodule url for 'submodules/ores'
Synchronizing submodule url for 'submodules/wheels'
Synchronizing submodule url for 'submodules/wikiclass'
error: RPC failed; result=22, HTTP code = 504
fatal: The remote end hung up unexpectedly
Unable to fetch in submodule path 'submodules/editquality'

ores/deploy: fetch stage(s): 100% (ok: 0; fail: 1; left: 0)
19:58:29 1 targets had deploy errors
19:58:29 1 targets failed
19:58:29 1 of 1 cluster targets failed, exceeding limit

I think Apache on tin is timing out when trying to fetch editquality, whose .git objects directory is 2.2GB.

Event Timeline

A short-term workaround may be to set git_upstream_submodules: True in the scap.cfg file. This would make the targets fetch the submodules from whatever URLs are in the .gitmodules file in the repo on tin. It also means that any local changes on tin won't be reflected in the checkout on the targets, but hopefully this is a workaround that won't have to stay in place forever.
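
To see where such a workaround would fetch from, one can list the URLs recorded in .gitmodules. This is only a sketch: the helper name list_gitmodules_urls and the throwaway .gitmodules file are made up for illustration, and the URL is borrowed from the .git/config paste on this task on the assumption that it matches the upstream entry.

```shell
# Print every submodule URL recorded in a checkout's .gitmodules file.
list_gitmodules_urls() {
    # $1: path to a checkout containing a .gitmodules file
    git config -f "$1/.gitmodules" --get-regexp '^submodule\..*\.url$'
}

# Demo against a throwaway directory (illustrative content, not the real repo).
demo=$(mktemp -d)
cat > "$demo/.gitmodules" <<'EOF'
[submodule "submodules/editquality"]
    path = submodules/editquality
    url = https://phabricator.wikimedia.org/source/editquality.git
EOF

urls=$(list_gitmodules_urls "$demo")
echo "$urls"
```

Run against the real repo on tin, the same call would show the hosts the targets would contact if submodules were resolved from upstream.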

hrm. I was able to clone this locally on tin FWIW:

[thcipriani@tin ~]$ git clone http://tin.eqiad.wmnet/ores/deploy/.git/modules/submodules/editquality/ test
Cloning into 'test'...
Note: checking out '789f51a06e0c06c5b0571c1deb71a06ecb7ba40b'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by performing another checkout.

If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -b with the checkout command again. Example:

  git checkout -b <new-branch-name>

Checking out files: 100% (184/184), done
[thcipriani@tin ~]$ du -chs test/
2.5G    test/
2.5G    total

Just a point of information: we have three large repos, which add up to c. 3GB and will only grow. Our deployment cluster has 9 machines, so that's 27GB requested from the git host on a full fetch. We're currently hosting on Phabricator; @mmodell might want to comment on whether that's acceptable.
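
The back-of-the-envelope figure can be recomputed as the repos grow. A trivial sketch, using the approximate numbers from this comment (the variable names are just for illustration):

```shell
repos_total_gb=3   # approximate combined size of the three large repos
targets=9          # machines in the deployment cluster

# Worst case: every target fetches every repo in full.
total_gb=$((repos_total_gb * targets))
echo "full fetch across the cluster: ~${total_gb} GB"
```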

In some fiddling I realized this error message is coming from phab and not tin.

Found via GIT_TRACE=1 directly on the ores1002 server:

deploy-service@ores1002:/srv/deployment/ores/deploy-cache/revs/a0f7d5ca81e6ed857f67cafa37ee046e20fd7d2d$ GIT_TRACE=1 git submodule update --init --recursive submodules/editquality
21:06:10.216044 git.c:554               trace: exec: 'git-submodule' 'update' '--init' '--recursive' 'submodules/editquality'
21:06:10.216104 run-command.c:341       trace: run_command: 'git-submodule' 'update' '--init' '--recursive' 'submodules/editquality'
21:06:10.224877 git.c:349               trace: built-in: git 'rev-parse' '--git-dir'
21:06:10.229008 git.c:349               trace: built-in: git 'rev-parse' '-q' '--git-dir'
21:06:10.232962 git.c:349               trace: built-in: git 'rev-parse' '--show-prefix'
21:06:10.234804 git.c:349               trace: built-in: git 'rev-parse' '--show-toplevel'
21:06:10.238711 git.c:349               trace: built-in: git 'rev-parse' '--sq' '--prefix' '' '--' 'submodules/editquality'
21:06:10.241355 git.c:349               trace: built-in: git 'ls-files' '-z' '--error-unmatch' '--stage' '--' 'submodules/editquality'
21:06:10.247684 git.c:349               trace: built-in: git 'config' '-f' '.gitmodules' '--get-regexp' '^submodule\..*\.path$'
21:06:10.251356 git.c:349               trace: built-in: git 'config' 'submodule.submodules/editquality.url'
21:06:10.253172 git.c:349               trace: built-in: git 'config' '-f' '.gitmodules' 'submodule.submodules/editquality.update'
21:06:10.255572 git.c:349               trace: built-in: git 'rev-parse' '--sq' '--prefix' '' '--' 'submodules/editquality'
21:06:10.257986 git.c:349               trace: built-in: git 'ls-files' '-z' '--error-unmatch' '--stage' '--' 'submodules/editquality'
21:06:10.263854 git.c:349               trace: built-in: git 'config' '-f' '.gitmodules' '--get-regexp' '^submodule\..*\.path$'
21:06:10.266877 git.c:349               trace: built-in: git 'config' 'submodule.submodules/editquality.url'
21:06:10.269093 git.c:349               trace: built-in: git 'config' 'submodule.submodules/editquality.branch'
21:06:10.271004 git.c:349               trace: built-in: git 'config' '-f' '.gitmodules' 'submodule.submodules/editquality.branch'
21:06:10.273096 git.c:349               trace: built-in: git 'config' 'submodule.submodules/editquality.update'
21:06:10.275631 git.c:349               trace: built-in: git 'rev-parse' '--local-env-vars'
21:06:10.277363 git.c:349               trace: built-in: git 'rev-parse' '--verify' 'HEAD'
21:06:10.279716 git.c:349               trace: built-in: git 'rev-parse' '--local-env-vars'
21:06:10.284659 git.c:349               trace: built-in: git 'fetch'
21:06:10.285220 run-command.c:341       trace: run_command: 'git-remote-https' 'origin' 'https://phabricator.wikimedia.org/source/editquality.git'
21:06:10.530510 run-command.c:341       trace: run_command: 'rev-list' '--objects' '--stdin' '--not' '--all' '--quiet'
21:06:10.533486 run-command.c:341       trace: run_command: 'fetch-pack' '--stateless-rpc' '--stdin' '--lock-pack' '--include-tag' '--thin' 'https://phabricator.wikimedia.org/source/editquality.git/'
21:06:10.534336 exec_cmd.c:134          trace: exec: 'git' 'fetch-pack' '--stateless-rpc' '--stdin' '--lock-pack' '--include-tag' '--thin' 'https://phabricator.wikimedia.org/source/editquality.git/'
21:06:10.535962 git.c:349               trace: built-in: git 'fetch-pack' '--stateless-rpc' '--stdin' '--lock-pack' '--include-tag' '--thin' 'https://phabricator.wikimedia.org/source/editquality.git/'
error: RPC failed; result=22, HTTP code = 504
fatal: The remote end hung up unexpectedly
Unable to fetch in submodule path 'submodules/editquality'
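
The telling line in that trace is the git-remote-https invocation, which names the host actually being contacted. Below is a small sketch for pulling that URL out of a trace. It assumes the trace format shown in the paste above (other git versions may format these lines differently), and extract_remote_urls is a hypothetical helper name.

```shell
# Read GIT_TRACE output on stdin and print the URL handed to any
# git-remote-<transport> helper (format as in the paste above).
extract_remote_urls() {
    sed -n "s/.*run_command: 'git-remote-[a-z]*' '[^']*' '\([^']*\)'.*/\1/p"
}

# Demo on the relevant line copied from the trace.
sample="21:06:10.285220 run-command.c:341       trace: run_command: 'git-remote-https' 'origin' 'https://phabricator.wikimedia.org/source/editquality.git'"
remote_url=$(printf '%s\n' "$sample" | extract_remote_urls)
echo "$remote_url"
```

On a target it could be fed live, since the trace goes to stderr: GIT_TRACE=1 git submodule update --init --recursive submodules/editquality 2>&1 | extract_remote_urls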

We should be:

  1. Writing .gitmodules
  2. Calling git submodule update --init

This should have the effect of rewriting all of the submodule URLs in .git/config to point at tin. But when I got to ores1002 I saw:

deploy-service@ores1002:/srv/deployment/ores/deploy-cache/revs/a0f7d5ca81e6ed857f67cafa37ee046e20fd7d2d$ cat .git/config
[core]
        repositoryformatversion = 0
        filemode = true
        bare = false
        logallrefupdates = true
[branch "master"]
[submodule "submodules/draftquality"]
        url = https://phabricator.wikimedia.org/source/draftquality.git
[submodule "submodules/editquality"]
        url = https://phabricator.wikimedia.org/source/editquality.git
[submodule "submodules/ores"]
        url = https://phabricator.wikimedia.org/source/ores.git
[submodule "submodules/wheels"]
        url = https://phabricator.wikimedia.org/source/ores-deploy-wheels.git
[submodule "submodules/wikiclass"]
        url = https://phabricator.wikimedia.org/source/wikiclass.git
[remote "origin"]
        url = /srv/deployment/ores/deploy-cache/cache
        fetch = +refs/heads/*:refs/remotes/origin/*
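
A quick way to spot this kind of drift is to compare each submodule URL in .gitmodules with what .git/config holds. This is a rough sketch, not what scap does: check_submodule_urls is a hypothetical helper, the demo repo is a throwaway, and the tin URL is the one cloned from earlier in this task, assumed here to be what the rewritten .gitmodules would contain.

```shell
# Report submodules whose URL in .git/config differs from .gitmodules.
check_submodule_urls() {
    # $1: top level of a checkout
    git config -f "$1/.gitmodules" --get-regexp '^submodule\..*\.url$' |
    while read -r key wanted; do
        # Empty if the key is not set in the repository's own config.
        actual=$(git -C "$1" config --local --get "$key" || true)
        if [ "$actual" != "$wanted" ]; then
            echo "MISMATCH $key: .git/config=$actual .gitmodules=$wanted"
        fi
    done
}

# Demo: .gitmodules points at tin, .git/config still points at Phabricator.
repo=$(mktemp -d)
git init -q "$repo"
cat > "$repo/.gitmodules" <<'EOF'
[submodule "submodules/editquality"]
    path = submodules/editquality
    url = http://tin.eqiad.wmnet/ores/deploy/.git/modules/submodules/editquality
EOF
git -C "$repo" config submodule.submodules/editquality.url \
    https://phabricator.wikimedia.org/source/editquality.git

report=$(check_submodule_urls "$repo")
echo "$report"
```

For reference, git submodule sync is the command that copies URLs from .gitmodules into .git/config; the "Synchronizing submodule url for ..." lines in the deploy log are its output.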

Mentioned in SAL (#wikimedia-operations) [2017-10-30T21:53:19Z] <awight@tin> Started deploy [ores/deploy@a0f7d5c]: Fun with deployments (non-production) T179336

Mentioned in SAL (#wikimedia-operations) [2017-10-30T21:55:22Z] <awight@tin> Finished deploy [ores/deploy@a0f7d5c]: Fun with deployments (non-production) T179336 (duration: 02m 03s)

Phabricator has this really horrible bug (T4369): you can't fetch git repositories larger than 2 GB over HTTPS, though git+ssh works fine.
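
To check which repos are at risk, something like the following could flag git directories above a size threshold. It is a sketch under assumptions: the 2 GiB threshold is only an approximation of the limit described above, on-disk size is only a proxy for what a fetch transfers, and exceeds_kib is a made-up helper name.

```shell
# Succeed if the directory's disk usage is above the given threshold.
exceeds_kib() {
    # $1: directory to measure, $2: threshold in KiB
    size_kib=$(du -sk "$1" | cut -f1)
    [ "$size_kib" -gt "$2" ]
}

limit_kib=$((2 * 1024 * 1024))   # 2 GiB expressed in KiB (assumed threshold)

# Demo on an empty throwaway directory; point it at a real .git dir instead.
demo=$(mktemp -d)
if exceeds_kib "$demo" "$limit_kib"; then
    verdict="over"
else
    verdict="under"
fi
echo "$demo is $verdict the limit"
```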

I can't figure out why this rewriting of the submodule URLs in .git/config to tin doesn't seem to be working anymore since {D826}.

thcipriani claimed this task.

The problem this task deals with was fixed by removing the revision being deployed from the one target server that was affected. This may have been a weird interaction between scap and an ORES check: https://github.com/wikimedia/mediawiki-services-ores-deploy/blob/master/scap/cmd_worker.sh#L4-L5

Closing this task as the specific problem for which I created this task is resolved, and there is a bit of a rewrite of how submodules are checked out in scap master (on beta) that may make anything done here moot. Reopen if problems persist.