
Support git-lfs in scap
Closed, Resolved (Public)

Description

What it says on the tin. git-lfs has become the de facto standard for large file support in Git: it works on GitHub and is supported in both Gerrit and Phabricator, and its clients are far easier to install than git-fat or git-annex.

Documentation (Work in progress) on wikitech: https://wikitech.wikimedia.org/wiki/Git-lfs

This task is complete when scap can deploy a new or existing repository that uses git-lfs for large files, including in submodules.
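
For context, enabling LFS tracking in a repository before scap deploys it looks roughly like the sketch below. This is a minimal, hedged example assuming git-lfs is installed on the workstation; the submodules/assets path and *.bin.gz pattern mirror the ORES test case discussed later and are illustrative only.

# Minimal sketch: track a large file with git-lfs inside a submodule (paths illustrative).
cd submodules/assets
git lfs install --local      # set up the clean/smudge filters for this clone only
git lfs track "*.bin.gz"     # writes the pattern to .gitattributes
git add .gitattributes test.bin.gz
git commit -m "Track binary assets with git-lfs"
git push origin HEAD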


Event Timeline


Here's the current scap error from beta cluster deployment:

cd /srv/deployment/ores/deploy
git fetch
# This branch has a git-lfs submodule, in "submodules/assets"
git checkout "git-lfs"
git submodule sync
git submodule update -i
scap deploy "Test git-lfs in ORES"

17:23:21 [deployment-ores01.deployment-prep.eqiad.wmflabs] deploy-local failed: <ErrorReturnCode_1> {u'full_cmd': u'/usr/bin/git submodule update --init --recursive --jobs 2 --reference /srv/deployment/ores/deploy-cache/cache', u'stderr': u"Submodule 'submodules/assets' (http://deployment-tin.deployment-prep.eqiad.wmflabs/ores/deploy/.git/modules/submodules/assets) registered for path 'submodules/assets'
Submodule 'submodules/draftquality' (http://deployment-tin.deployment-prep.eqiad.wmflabs/ores/deploy/.git/modules/submodules/draftquality) registered for path 'submodules/draftquality'
Submodule 'submodules/editquality' (http://deployment-tin.deployment-prep.eqiad.wmflabs/ores/deploy/.git/modules/submodules/editquality) registered for path 'submodules/editquality'
Submodule 'submodules/ores' (http://deployment-tin.deployment-prep.eqiad.wmflabs/ores/deploy/.git/modules/submodules/ores) registered for path 'submodules/ores'
Submodule 'submodules/wheels' (http://deployment-tin.deployment-prep.eqiad.wmflabs/ores/deploy/.git/modules/submodules/wheels) registered for path 'submodules/wheels'
Submodule 'submodules/wikiclass' (http://deployment-tin.deployment-prep.eqiad.wmflabs/ores/deploy/.git/modules/submodules/wikiclass) registered for path 'submodules/wikiclass'
Cloning into '/srv/deployment/ores/deploy-cache/revs/d54a2577626b7164404ea8135d8a5cb5f4c00da8/submodules/assets'...
Cloning into '/srv/deployment/ores/deploy-cache/revs/d54a2577626b7164404ea8135d8a5cb5f4c00da8/submodules/draftquality'...
Cloning into '/srv/deployment/ores/deploy-cache/revs/d54a2577626b7164404ea8135d8a5cb5f4c00da8/submodules/editquality'...
Cloning into '/srv/deployment/ores/deploy-cache/revs/d54a2577626b7164404ea8135d8a5cb5f4c00da8/submodules/ores'...
Cloning into '/srv/deployment/ores/deploy-cache/revs/d54a2577626b7164404ea8135d8a5cb5f4c00da8/submodules/wheels'...
Cloning into '/srv/deployment/ores/deploy-cache/revs/d54a2577626b7164404ea8135d8a5cb5f4c00da8/submodules/wikiclass'... 
Downloading test.bin.gz (38 B)
Error downloading object: test.bin.gz (90c9e0f): Smudge error: Error downloading test.bin.gz (90c9e0ffe481dcee7a8246579eaa0c15f3e2f0931dc1fe98452ce5abc713cc2c): batch response: Repository or object not found: http://deployment-tin.deployment-prep.eqiad.wmflabs/ores/deploy/.git/modules/submodules/assets.git/info/lfs/objects/batch
Check that it exists and that you have proper access to it 

Errors logged to /srv/deployment/ores/deploy-cache/revs/d54a2577626b7164404ea8135d8a5cb5f4c00da8/.git/modules/submodules/assets/lfs/objects/logs/20180409T172311.966800243.log
Use `git lfs logs last` to view the log.
error: external filter 'git-lfs filter-process' failed
fatal: test.bin.gz: smudge filter lfs failed
Unable to checkout '5d3147d4330c9c5caf3ce37fef6e9366639ed3a5' in submodule path 'submodules/assets'

@mmodell I'm still stuck, see the previous comment. Maybe it has to do with URL rewriting?

@awight: thanks, I haven't had much chance to work on this due to the train taking up most of my time these past two weeks. I'm not deploying the train this week so I'll be able to focus 100% on git-lfs.

Indeed, the URL rewriting seems like it might be the issue. I think we'll have to work around the rewriting for LFS purposes. I'll give it a try and let you know ASAP.
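
For reference, one conceivable form such a workaround could take (a sketch only, not necessarily what was tried): pin the LFS endpoint explicitly so it doesn't follow the rewritten submodule URL. lfs.url is a standard git-lfs config key; the Gerrit endpoint shown below is an assumption for illustration.

# Hypothetical: point the assets submodule's LFS client at the upstream endpoint
# instead of the rewritten deployment-tin URL. The endpoint shown is illustrative.
cd /srv/deployment/ores/deploy/submodules/assets
git config lfs.url https://gerrit.wikimedia.org/r/scoring/ores/assets.git/info/lfs
git lfs pull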

Change 425710 had a related patch set uploaded (by Awight; owner: Awight):
[mediawiki/services/ores/deploy@master] Enable git_upstream_submodules

https://gerrit.wikimedia.org/r/425710

Mentioned in SAL (#wikimedia-cloud) [2018-04-11T21:04:35Z] <awight> ORES experiment with git-lfs, T180627

Errors thrown by my experiment with git_upstream_submodules set to True:

scap deploy-log -v -f scap/log/scap-sync-2018-04-11-0001.log
....
  RAN: /usr/bin/git submodule update --init --recursive --jobs 2 --reference /srv/deployment/ores/deploy-cache/cache
....
Cloning into '/srv/deployment/ores/deploy-cache/revs/4edd34d71671fd211be478305b21db634f5f0cd0/submodules/draftquality'...
error: RPC failed; HTTP 504 curl 22 The requested URL returned error: 504 Gateway Time-out
fatal: The remote end hung up unexpectedly
fatal: clone of 'https://phabricator.wikimedia.org/source/draftquality.git' into submodule path '/srv/deployment/ores/deploy-cache/revs/4edd34d71671fd211be478305b21db634f5f0cd0/submodules/draftquality' failed
Failed to clone 'submodules/draftquality'. Retry scheduled
Cloning into '/srv/deployment/ores/deploy-cache/revs/4edd34d71671fd211be478305b21db634f5f0cd0/submodules/editquality'...
error: RPC failed; HTTP 504 curl 22 The requested URL returned error: 504 Gateway Time-out
fatal: The rem... (1864 more, please see e.stderr)
error: RPC failed; HTTP 504 curl 22 The requested URL returned error: 504 Gateway Time-out

Herein lies the clue for this failure: cloning the submodules from Phabricator took too long and hit the 504 gateway timeout.

Change 425717 had a related patch set uploaded (by Awight; owner: Awight):
[mediawiki/services/ores/deploy@master] Point submodules at gerrit

https://gerrit.wikimedia.org/r/425717
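
For reference, pointing the submodules at Gerrit amounts to editing the URLs in .gitmodules and re-syncing. A rough sketch of how one might verify which remotes the submodules actually resolve to after such a change (standard git commands, run from the deploy checkout):

# Sketch: compare the URLs recorded in .gitmodules with the ones git is actually using.
git config -f .gitmodules --get-regexp 'submodule\..*\.url'   # URLs committed in .gitmodules
git config --get-regexp 'submodule\..*\.url'                  # URLs currently in effect for this clone
git submodule sync --recursive                                # copy the .gitmodules URLs into .git/config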

Mentioned in SAL (#wikimedia-cloud) [2018-04-12T23:01:04Z] <awight> Try gerrit-based submodules for ORES, T180627

awight renamed this task from [Blocked] Support git-lfs to Support git-lfs. Apr 12 2018, 11:01 PM

With the gerrit-based submodule workaround, git-lfs is in business on the beta cluster! We have a normal, working ORES install with a small LFS-hosted gzip file in an LFS-enabled submodule.

I'll leave this task open until we demonstrate LFS working in production.

The production deployment was unsuccessful, but we're really close!

ssh tin

git fetch https://gerrit.wikimedia.org/r/mediawiki/services/ores/deploy refs/changes/13/419613/5 && git checkout FETCH_HEAD
scap deploy -l "ores1001.*" -v "Canary ores1001 only: Limited test of git-lfs for ORES"
ssh ores1001

cd /srv/deployment/ores/deploy
awight@ores1001:/srv/deployment/ores/deploy$ file submodules/assets/test.bin.gz
submodules/assets/test.bin.gz: ASCII text
awight@ores1001:/srv/deployment/ores/deploy$ cat !$
cat submodules/assets/test.bin.gz
version https://git-lfs.github.com/spec/v1
oid sha256:90c9e0ffe481dcee7a8246579eaa0c15f3e2f0931dc1fe98452ce5abc713cc2c
size 38
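
The three lines above are a git-lfs pointer file (version / oid / size), which means the smudge filter never ran during checkout. As a rough recovery sketch (assuming the filters are configured and the LFS endpoint is reachable), the real content could be materialized with:

# Sketch: replace leftover pointer files with the real LFS objects.
cd /srv/deployment/ores/deploy/submodules/assets
git lfs fetch        # download the objects referenced by the pointers into the local LFS store
git lfs checkout     # swap the pointer files in the working tree for the real content
file test.bin.gz     # should now report gzip compressed data instead of ASCII text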

Change 425931 had a related patch set uploaded (by Awight; owner: Awight):
[scoring/ores/assets@test_lfs_data] LFS enabled, word2vec upload

https://gerrit.wikimedia.org/r/425931

Change 425931 merged by Awight:
[scoring/ores/assets@test_lfs_data] LFS enabled, word2vec upload

https://gerrit.wikimedia.org/r/425931

@mmodell Do you have an idea how the various git caches will react to the large files? I noticed that the .git/modules/submodules/assets directory is 2x the size of the first checked-out file, and now I'm worried about the system-wide git cache. Is it possible that we'll be storing 4 or 5 full copies of these large files, on each machine?

@awight: hrm, well, no, the whole point of git-lfs is to avoid that! AFAIK it doesn't keep stuff around in git cache because the files' contents don't actually exist in git... Am I wrong?

It's reasonable that git might keep copies of the file around, for example to make it possible to switch branches without incurring huge bandwidth. I went looking for the data; the results are below.

One option is that we remove .git from each revision after deployment; it's currently taking up 3.5GB per revision in production and is never used. There might also be some inefficiencies in how git-lfs caches things, and I'd like to understand that part better.

Here's the deployed profile of a pre-lfs repo,

4.4G  47d9a6bad234c16f622d75d7ae8bd9245f94e3df/
  3.4G  47d9a6bad234c16f622d75d7ae8bd9245f94e3df/.git/

and the deployed repo after adding the word2vec file via LFS,

7.6G  70cdbb2c89b179da3094c0e80959e304487d1a93/
  5.0G  70cdbb2c89b179da3094c0e80959e304487d1a93/.git/
    1.6G  70cdbb2c89b179da3094c0e80959e304487d1a93/.git/modules/submodules/assets/
  1.6G  70cdbb2c89b179da3094c0e80959e304487d1a93/submodules/assets/

Where would I look for other git caches?
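
As a hedged starting point, git-lfs keeps its own object store under .git/lfs/objects (or .git/modules/<path>/lfs/objects for submodules), which is why the content shows up twice in the profile above. Something like the following, run from inside a deployed rev directory, could measure and trim it; git lfs prune only drops objects not referenced by recent commits.

# Sketch: measure the LFS object store and the working-tree copy, then trim old objects.
du -sh .git/modules/submodules/assets/lfs/objects   # local LFS object store for the assets submodule
du -sh submodules/assets                            # checked-out copy in the working tree
cd submodules/assets && git lfs prune --verbose     # delete cached objects no longer referenced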

Change 425717 abandoned by Awight:
Point submodules at gerrit

https://gerrit.wikimedia.org/r/425717

Change 419642 abandoned by Awight:
[DNM] Update to assets with an LFS file

https://gerrit.wikimedia.org/r/419642

Change 419759 abandoned by Awight:
[DNM] Configure scap to do git-lfs

https://gerrit.wikimedia.org/r/419759

Change 425710 abandoned by Awight:
Enable git_upstream_submodules

https://gerrit.wikimedia.org/r/425710

Change 429846 had a related patch set uploaded (by Awight; owner: Awight):
[mediawiki/services/ores/deploy@master] Point submodules at gerrit

https://gerrit.wikimedia.org/r/429846

Change 429859 had a related patch set uploaded (by Awight; owner: Awight):
[scoring/ores/assets@master] LFS enabled, word2vec upload

https://gerrit.wikimedia.org/r/429859

Change 429859 merged by Awight:
[scoring/ores/assets@master] LFS enabled, word2vec upload

https://gerrit.wikimedia.org/r/429859

Change 419637 abandoned by Awight:
LFS enabled, word2vec upload

https://gerrit.wikimedia.org/r/419637

@awight: I think the latest problem is that we don't run git lfs install automatically on targets. I've run the command manually on ores1001; can you test another deployment and see if it gets the LFS objects now?
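
For reference, a sketch of what running git lfs install on a target does: by default it writes the filter configuration into the invoking user's ~/.gitconfig, so on scap targets it would need to run as the deploy user before any submodule checkout.

# Sketch: install the LFS filters and confirm what landed in the gitconfig.
git lfs install
git config --global --get-regexp '^filter\.lfs'
# Expected entries, roughly:
#   filter.lfs.clean    git-lfs clean -- %f
#   filter.lfs.smudge   git-lfs smudge -- %f
#   filter.lfs.process  git-lfs filter-process
#   filter.lfs.required true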

Mentioned in SAL (#wikimedia-operations) [2018-05-01T21:48:10Z] <awight@tin> Started deploy [ores/deploy@4601497]: Test LFS deployment for ORES; T180627

Mentioned in SAL (#wikimedia-operations) [2018-05-01T21:48:16Z] <awight@tin> Started deploy [ores/deploy@4601497]: Test LFS deployment for ORES; T180627

Mentioned in SAL (#wikimedia-operations) [2018-05-01T21:48:42Z] <awight@tin> Finished deploy [ores/deploy@4601497]: Test LFS deployment for ORES; T180627 (duration: 00m 26s)

Mentioned in SAL (#wikimedia-operations) [2018-05-01T21:50:53Z] <awight@tin> Started deploy [ores/deploy@52347e0]: Test LFS deployment for ORES; T180627

Mentioned in SAL (#wikimedia-operations) [2018-05-01T21:54:15Z] <awight@tin> Finished deploy [ores/deploy@52347e0]: Test LFS deployment for ORES; T180627 (duration: 03m 21s)

Good news! We've done an initial LFS deployment of the 1.6GB word2vec binary and it landed successfully on ores1001! There's one last detail to clean up: git lfs install must be run by scap.

mmodell added a revision: Restricted Differential Revision. May 1 2018, 11:27 PM

FWIW, this means we're waiting for scap version 3.8.1 to land in production.

15:50 < awight> twentyafterfour: bad news, my test LFS deployment failed to pull the files again
15:51 < awight> tin:/srv/deployment/ores/deploy/scap/log/scap-sync-2018-05-01-0003-1-gae55746.log
15:51 < awight> git lfs install --global did happen
15:51 < awight> ah.
15:52 < awight> It's probably because git lfs install happened after submodule update -i.
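
In other words, the ordering the target needs is roughly the sketch below: the filters must be configured before any submodule checkout so the smudge step can run. This is only about the ordering issue noted above, not a complete fix.

# Sketch of the required ordering on a target, per the observation above.
git lfs install                          # write the smudge/clean filters to the user's gitconfig
git submodule update --init --recursive  # now LFS pointers get smudged into real files on checkout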

You know, we should probably install git-lfs everywhere we can, just like git, and do git lfs install --global as part of it.

That would work for me. I think I'm about to simulate that by running a dummy deployment on the ORES boxes, scap deploy -s fetch.

Change 429846 merged by Awight:
[mediawiki/services/ores/deploy@master] Point submodules at gerrit

https://gerrit.wikimedia.org/r/429846

Change 419613 merged by Awight:
[mediawiki/services/ores/deploy@master] Add the assets submodule and word2vec, git-lfs enabled

https://gerrit.wikimedia.org/r/419613

+1 to installing git-lfs everywhere; we could just handle the git-lfs config via Puppet.

awight renamed this task from Support git-lfs to Support git-lfs in scap. Jun 4 2018, 9:28 PM
awight raised the priority of this task from Medium to High.
awight updated the task description. (Show Details)

I tried to do a deployment to ores2001 today, running three passes thus:

  1. Deploy the master code (d77e52c) in case LFS works "out of the box"
    • LFS file is not pulled down.
  2. Deploy using "-f" to attempt to refresh the repo, assuming "git lfs init" has been run at some point during step (1)
    • Scap returns after only a few seconds of "fetch" stage, so it seems the "-f" flag doesn't force a full new checkout as I'd hoped.
    • The LFS file is still not pulled down.
  3. Commit a dummy revision (rORESDEPLOYa2d440bb77db), re-deploy a full checkout.
    • LFS file still not there.

I'm happy to do forensics on ores2001, if that's any help.
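
If it helps, a rough checklist of things worth checking on ores2001 (standard git-lfs commands; the path assumes the ORES layout used above):

# Sketch: forensics on the target to see why the LFS content is missing.
cd /srv/deployment/ores/deploy/submodules/assets
git lfs env                              # shows the endpoint and the local LFS storage paths
git lfs ls-files                         # a "-" next to a file means only the pointer is present
git config --get-regexp 'filter\.lfs'    # empty output would mean git lfs install never ran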

Thanks for your tests, @awight, and for bumping the prioritization. FWIW, we have had this model deployed in our Cloud VPS cluster (using fabric rather than scap) for a couple of weeks now, so it seems this is our last blocker for getting these models out to the world.

I don't quite know what could be going wrong but I'm looking into it.

OK, it seems that the .gitconfig for git-lfs wasn't installed. We should really do this via Puppet.

The magic command to initialize LFS on a given $target:

SSH_AUTH_SOCK=/run/keyholder/proxy.sock ssh -l deploy-service $target git lfs install

This should be happening automatically; I'm not really sure why it isn't.

@mmodell Awesome, thanks for this workaround. I confirmed that running your command from deploy1001 made the subsequent scap deployment correctly clone our repo. I may end up applying the workaround on all ores* boxes, but I'm also writing a Puppet patch for posterity.
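
Purely as a sketch, applying the same workaround across the targets could look like the loop below; the host names are illustrative only and would need to match the actual ores* fleet.

# Hypothetical loop over targets; the host list is illustrative only.
for target in ores1001.eqiad.wmnet ores2001.codfw.wmnet; do
  SSH_AUTH_SOCK=/run/keyholder/proxy.sock ssh -l deploy-service "$target" git lfs install
done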

Change 437719 had a related patch set uploaded (by Awight; owner: Awight):
[operations/puppet@production] Initialize LFS on scap targets

https://gerrit.wikimedia.org/r/437719

Change 437719 abandoned by Awight:
Install LFS on scap targets

Reason:
This might be enough, Icde5d4e6d9c6b7

https://gerrit.wikimedia.org/r/437719