
Support git-lfs in scap
Closed, Resolved (Public)

Description

What it says on the tin. git-lfs has become the de facto standard for large file support in Git: it works on GitHub and is supported in both Gerrit and Phabricator, and its clients are far easier to install than git-fat or git-annex.

Documentation (Work in progress) on wikitech: https://wikitech.wikimedia.org/wiki/Git-lfs

This task is complete when scap can deploy a new or existing repository that uses git-lfs for large files, including in submodules.
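
For context, enabling LFS tracking in a repository before scap deploys it looks roughly like the sketch below. This is a minimal, hedged example assuming git-lfs is installed on the workstation; the submodules/assets path and *.bin.gz pattern mirror the ORES test case discussed later and are illustrative only.

# Minimal sketch: track a large file with git-lfs inside a submodule (paths illustrative).
cd submodules/assets
git lfs install --local      # set up the clean/smudge filters for this clone only
git lfs track "*.bin.gz"     # writes the pattern to .gitattributes
git add .gitattributes test.bin.gz
git commit -m "Track binary assets with git-lfs"
git push origin HEAD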


Event Timeline


Here's the current scap error from beta cluster deployment:

cd /srv/deployment/ores/deploy
git fetch
# This branch has a git-lfs submodule, in "submodules/assets"
git checkout "git-lfs"
git submodule sync
git submodule update -i
scap deploy "Test git-lfs in ORES"

17:23:21 [deployment-ores01.deployment-prep.eqiad.wmflabs] deploy-local failed: <ErrorReturnCode_1> {u'full_cmd': u'/usr/bin/git submodule update --init --recursive --jobs 2 --reference /srv/deployment/ores/deploy-cache/cache', u'stderr': u"Submodule 'submodules/assets' (http://deployment-tin.deployment-prep.eqiad.wmflabs/ores/deploy/.git/modules/submodules/assets) registered for path 'submodules/assets'
Submodule 'submodules/draftquality' (http://deployment-tin.deployment-prep.eqiad.wmflabs/ores/deploy/.git/modules/submodules/draftquality) registered for path 'submodules/draftquality'
Submodule 'submodules/editquality' (http://deployment-tin.deployment-prep.eqiad.wmflabs/ores/deploy/.git/modules/submodules/editquality) registered for path 'submodules/editquality'
Submodule 'submodules/ores' (http://deployment-tin.deployment-prep.eqiad.wmflabs/ores/deploy/.git/modules/submodules/ores) registered for path 'submodules/ores'
Submodule 'submodules/wheels' (http://deployment-tin.deployment-prep.eqiad.wmflabs/ores/deploy/.git/modules/submodules/wheels) registered for path 'submodules/wheels'
Submodule 'submodules/wikiclass' (http://deployment-tin.deployment-prep.eqiad.wmflabs/ores/deploy/.git/modules/submodules/wikiclass) registered for path 'submodules/wikiclass'
Cloning into '/srv/deployment/ores/deploy-cache/revs/d54a2577626b7164404ea8135d8a5cb5f4c00da8/submodules/assets'...
Cloning into '/srv/deployment/ores/deploy-cache/revs/d54a2577626b7164404ea8135d8a5cb5f4c00da8/submodules/draftquality'...
Cloning into '/srv/deployment/ores/deploy-cache/revs/d54a2577626b7164404ea8135d8a5cb5f4c00da8/submodules/editquality'...
Cloning into '/srv/deployment/ores/deploy-cache/revs/d54a2577626b7164404ea8135d8a5cb5f4c00da8/submodules/ores'...
Cloning into '/srv/deployment/ores/deploy-cache/revs/d54a2577626b7164404ea8135d8a5cb5f4c00da8/submodules/wheels'...
Cloning into '/srv/deployment/ores/deploy-cache/revs/d54a2577626b7164404ea8135d8a5cb5f4c00da8/submodules/wikiclass'... 
Downloading test.bin.gz (38 B)
Error downloading object: test.bin.gz (90c9e0f): Smudge error: Error downloading test.bin.gz (90c9e0ffe481dcee7a8246579eaa0c15f3e2f0931dc1fe98452ce5abc713cc2c): batch response: Repository or object not found: http://deployment-tin.deployment-prep.eqiad.wmflabs/ores/deploy/.git/modules/submodules/assets.git/info/lfs/objects/batch
Check that it exists and that you have proper access to it 

Errors logged to /srv/deployment/ores/deploy-cache/revs/d54a2577626b7164404ea8135d8a5cb5f4c00da8/.git/modules/submodules/assets/lfs/objects/logs/20180409T172311.966800243.log
Use `git lfs logs last` to view the log.
error: external filter 'git-lfs filter-process' failed
fatal: test.bin.gz: smudge filter lfs failed
Unable to checkout '5d3147d4330c9c5caf3ce37fef6e9366639ed3a5' in submodule path 'submodules/assets'

@mmodell I'm still stuck, see the previous comment. Maybe it has to do with URL rewriting?

@awight: thanks, I haven't had much chance to work on this due to the train taking up most of my time these past two weeks. I'm not deploying the train this week so I'll be able to focus 100% on git-lfs.

Indeed, the URL rewriting seems like it might be the issue. I think we'll have to work around the rewriting for LFS purposes. I'll give it a try and let you know ASAP.
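
For reference, one conceivable form such a workaround could take (a sketch only, not necessarily what was tried): pin the LFS endpoint explicitly so it doesn't follow the rewritten submodule URL. lfs.url is a standard git-lfs config key; the Gerrit endpoint shown below is an assumption for illustration.

# Hypothetical: point the assets submodule's LFS client at the upstream endpoint
# instead of the rewritten deployment-tin URL. The endpoint shown is illustrative.
cd /srv/deployment/ores/deploy/submodules/assets
git config lfs.url https://gerrit.wikimedia.org/r/scoring/ores/assets.git/info/lfs
git lfs pull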

Change 425710 had a related patch set uploaded (by Awight; owner: Awight):
[mediawiki/services/ores/deploy@master] Enable git_upstream_submodules

https://gerrit.wikimedia.org/r/425710

Mentioned in SAL (#wikimedia-cloud) [2018-04-11T21:04:35Z] <awight> ORES experiment with git-lfs, T180627

Errors thrown by my experiment with git_upstream_submodules set to True:

scap deploy-log -v -f scap/log/scap-sync-2018-04-11-0001.log
....
  RAN: /usr/bin/git submodule update --init --recursive --jobs 2 --reference /srv/deployment/ores/deploy-cache/cache
....
Cloning into '/srv/deployment/ores/deploy-cache/revs/4edd34d71671fd211be478305b21db634f5f0cd0/submodules/draftquality'...
error: RPC failed; HTTP 504 curl 22 The requested URL returned error: 504 Gateway Time-out
fatal: The remote end hung up unexpectedly
fatal: clone of 'https://phabricator.wikimedia.org/source/draftquality.git' into submodule path '/srv/deployment/ores/deploy-cache/revs/4edd34d71671fd211be478305b21db634f5f0cd0/submodules/draftquality' failed
Failed to clone 'submodules/draftquality'. Retry scheduled
Cloning into '/srv/deployment/ores/deploy-cache/revs/4edd34d71671fd211be478305b21db634f5f0cd0/submodules/editquality'...
error: RPC failed; HTTP 504 curl 22 The requested URL returned error: 504 Gateway Time-out
fatal: The rem... (1864 more, please see e.stderr)
error: RPC failed; HTTP 504 curl 22 The requested URL returned error: 504 Gateway Time-out

Herein lies the clue for this failure: cloning the submodules from Phabricator took too long and hit the 504 gateway timeout.

Change 425717 had a related patch set uploaded (by Awight; owner: Awight):
[mediawiki/services/ores/deploy@master] Point submodules at gerrit

https://gerrit.wikimedia.org/r/425717
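
For reference, pointing the submodules at Gerrit amounts to editing the URLs in .gitmodules and re-syncing. A rough sketch of how one might verify which remotes the submodules actually resolve to after such a change (standard git commands, run from the deploy checkout):

# Sketch: compare the URLs recorded in .gitmodules with the ones git is actually using.
git config -f .gitmodules --get-regexp 'submodule\..*\.url'   # URLs committed in .gitmodules
git config --get-regexp 'submodule\..*\.url'                  # URLs currently in effect for this clone
git submodule sync --recursive                                # copy the .gitmodules URLs into .git/config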

Mentioned in SAL (#wikimedia-cloud) [2018-04-12T23:01:04Z] <awight> Try gerrit-based submodules for ORES, T180627

awight renamed this task from [Blocked] Support git-lfs to Support git-lfs. Apr 12 2018, 11:01 PM

With the gerrit-based submodule workaround, git-lfs is in business on the beta cluster! We have a normal, working ORES install with a small LFS-hosted gzip file in an LFS-enabled submodule.

I'll leave this task open until we demonstrate LFS working in production.

The production deployment was unsuccessful, but we're really close!

ssh tin

git fetch https://gerrit.wikimedia.org/r/mediawiki/services/ores/deploy refs/changes/13/419613/5 && git checkout FETCH_HEAD
scap deploy -l "ores1001.*" -v "Canary ores1001 only: Limited test of git-lfs for ORES"
ssh ores1001

cd /srv/deployment/ores/deploy
awight@ores1001:/srv/deployment/ores/deploy$ file submodules/assets/test.bin.gz
submodules/assets/test.bin.gz: ASCII text
awight@ores1001:/srv/deployment/ores/deploy$ cat !$
cat submodules/assets/test.bin.gz
version https://git-lfs.github.com/spec/v1
oid sha256:90c9e0ffe481dcee7a8246579eaa0c15f3e2f0931dc1fe98452ce5abc713cc2c
size 38
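
The three lines above are a git-lfs pointer file (version / oid / size), which means the smudge filter never ran during checkout. As a rough recovery sketch (assuming the filters are configured and the LFS endpoint is reachable), the real content could be materialized with:

# Sketch: replace leftover pointer files with the real LFS objects.
cd /srv/deployment/ores/deploy/submodules/assets
git lfs fetch        # download the objects referenced by the pointers into the local LFS store
git lfs checkout     # swap the pointer files in the working tree for the real content
file test.bin.gz     # should now report gzip compressed data instead of ASCII text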

Change 425931 had a related patch set uploaded (by Awight; owner: Awight):
[scoring/ores/assets@test_lfs_data] LFS enabled, word2vec upload

https://gerrit.wikimedia.org/r/425931

Change 425931 merged by Awight:
[scoring/ores/assets@test_lfs_data] LFS enabled, word2vec upload

https://gerrit.wikimedia.org/r/425931

@mmodell Do you have an idea how the various git caches will react to the large files? I noticed that the .git/modules/submodules/assets directory is 2x the size of the first checked-out file, and now I'm worried about the system-wide git cache. Is it possible that we'll be storing 4 or 5 full copies of these large files, on each machine?

@awight: hrm, well, no, the whole point of git-lfs is to avoid that! AFAIK it doesn't keep stuff around in git cache because the files' contents don't actually exist in git... Am I wrong?

It's reasonable that git might keep copies of the file around, for example to make it possible to switch branches without incurring huge bandwidth. I went looking for the data; the results are below.

One option is that we remove .git from each revision after deployment; it's currently taking up 3.5GB per revision in production and is never used. There might also be some inefficiencies in how git-lfs caches things, and I'd like to understand that part better.

Here's the deployed profile of a pre-lfs repo,

4.4G  47d9a6bad234c16f622d75d7ae8bd9245f94e3df/
  3.4G  47d9a6bad234c16f622d75d7ae8bd9245f94e3df/.git/

and the deployed repo after adding the word2vec file via LFS,

7.6G  70cdbb2c89b179da3094c0e80959e304487d1a93/
  5.0G  70cdbb2c89b179da3094c0e80959e304487d1a93/.git/
    1.6G  70cdbb2c89b179da3094c0e80959e304487d1a93/.git/modules/submodules/assets/
  1.6G  70cdbb2c89b179da3094c0e80959e304487d1a93/submodules/assets/

Where would I look for other git caches?
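
As a hedged starting point, git-lfs keeps its own object store under .git/lfs/objects (or .git/modules/<path>/lfs/objects for submodules), which is why the content shows up twice in the profile above. Something like the following, run from inside a deployed rev directory, could measure and trim it; git lfs prune only drops objects not referenced by recent commits.

# Sketch: measure the LFS object store and the working-tree copy, then trim old objects.
du -sh .git/modules/submodules/assets/lfs/objects   # local LFS object store for the assets submodule
du -sh submodules/assets                            # checked-out copy in the working tree
cd submodules/assets && git lfs prune --verbose     # delete cached objects no longer referenced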

Change 425717 abandoned by Awight:
Point submodules at gerrit

https://gerrit.wikimedia.org/r/425717

Change 419642 abandoned by Awight:
[DNM] Update to assets with an LFS file

https://gerrit.wikimedia.org/r/419642

Change 419759 abandoned by Awight:
[DNM] Configure scap to do git-lfs

https://gerrit.wikimedia.org/r/419759

Change 425710 abandoned by Awight:
Enable git_upstream_submodules

https://gerrit.wikimedia.org/r/425710

Change 429846 had a related patch set uploaded (by Awight; owner: Awight):
[mediawiki/services/ores/deploy@master] Point submodules at gerrit

https://gerrit.wikimedia.org/r/429846

Change 429859 had a related patch set uploaded (by Awight; owner: Awight):
[scoring/ores/assets@master] LFS enabled, word2vec upload

https://gerrit.wikimedia.org/r/429859

Change 429859 merged by Awight:
[scoring/ores/assets@master] LFS enabled, word2vec upload

https://gerrit.wikimedia.org/r/429859

Change 419637 abandoned by Awight:
LFS enabled, word2vec upload

https://gerrit.wikimedia.org/r/419637

@awight: I think the latest problem is that we don't run git lfs install automatically on targets. I've run the command manually on ores1001; can you test another deployment and see if it gets the LFS objects now?
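
For reference, a sketch of what running git lfs install on a target does: by default it writes the filter configuration into the invoking user's ~/.gitconfig, so on scap targets it would need to run as the deploy user before any submodule checkout.

# Sketch: install the LFS filters and confirm what landed in the gitconfig.
git lfs install
git config --global --get-regexp '^filter\.lfs'
# Expected entries, roughly:
#   filter.lfs.clean    git-lfs clean -- %f
#   filter.lfs.smudge   git-lfs smudge -- %f
#   filter.lfs.process  git-lfs filter-process
#   filter.lfs.required true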

Mentioned in SAL (#wikimedia-operations) [2018-05-01T21:48:10Z] <awight@tin> Started deploy [ores/deploy@4601497]: Test LFS deployment for ORES; T180627

Mentioned in SAL (#wikimedia-operations) [2018-05-01T21:48:16Z] <awight@tin> Started deploy [ores/deploy@4601497]: Test LFS deployment for ORES; T180627

Mentioned in SAL (#wikimedia-operations) [2018-05-01T21:48:42Z] <awight@tin> Finished deploy [ores/deploy@4601497]: Test LFS deployment for ORES; T180627 (duration: 00m 26s)

Mentioned in SAL (#wikimedia-operations) [2018-05-01T21:50:53Z] <awight@tin> Started deploy [ores/deploy@52347e0]: Test LFS deployment for ORES; T180627

Mentioned in SAL (#wikimedia-operations) [2018-05-01T21:54:15Z] <awight@tin> Finished deploy [ores/deploy@52347e0]: Test LFS deployment for ORES; T180627 (duration: 03m 21s)

Good news! We've done an initial LFS deployment of the 1.6GB word2vec binary and it landed successfully on ores1001! There's one last detail to clean up: git lfs install must be run by scap.

mmodell added a revision: Restricted Differential Revision. May 1 2018, 11:27 PM

FWIW, this means we're waiting for scap version 3.8.1 to land in production.

15:50 < awight> twentyafterfour: bad news, my test LFS deployment failed to pull the files again
15:51 < awight> tin:/srv/deployment/ores/deploy/scap/log/scap-sync-2018-05-01-0003-1-gae55746.log
15:51 < awight> git lfs install --global did happen
15:51 < awight> ah.
15:52 < awight> It's probably because git lfs install happened after submodule update -i.
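
In other words, the ordering the target needs is roughly the sketch below: the filters must be configured before any submodule checkout so the smudge step can run. This is only about the ordering issue noted above, not a complete fix.

# Sketch of the required ordering on a target, per the observation above.
git lfs install                          # write the smudge/clean filters to the user's gitconfig
git submodule update --init --recursive  # now LFS pointers get smudged into real files on checkout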

You know, we should probably install git-lfs everywhere we can, just like git, and do git lfs install --global as part of it.

That would work for me. I think I'm about to simulate that by running a dummy deployment on the ORES boxes, scap deploy -s fetch.

Change 429846 merged by Awight:
[mediawiki/services/ores/deploy@master] Point submodules at gerrit

https://gerrit.wikimedia.org/r/429846

Change 419613 merged by Awight:
[mediawiki/services/ores/deploy@master] Add the assets submodule and word2vec, git-lfs enabled

https://gerrit.wikimedia.org/r/419613

+1 to installing git-lfs everywhere; we could just handle the git-lfs config via Puppet.

awight renamed this task from Support git-lfs to Support git-lfs in scap. Jun 4 2018, 9:28 PM
awight raised the priority of this task from Medium to High.
awight updated the task description. (Show Details)

I tried to do a deployment to ores2001 today, running three passes thus:

  1. Deploy the master code (d77e52c) in case LFS works "out of the box"
    • LFS file is not pulled down.
  2. Deploy using "-f" to attempt to refresh the repo, assuming "git lfs init" has been run at some point during step (1)
    • Scap returns after only a few seconds of "fetch" stage, so it seems the "-f" flag doesn't force a full new checkout as I'd hoped.
    • The LFS file is still not pulled down.
  3. Commit a dummy revision (rORESDEPLOYa2d440bb77db), re-deploy a full checkout.
    • LFS file still not there.

I'm happy to do forensics on ores2001, if that's any help.
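
If it helps, a rough checklist of things worth checking on ores2001 (standard git-lfs commands; the path assumes the ORES layout used above):

# Sketch: forensics on the target to see why the LFS content is missing.
cd /srv/deployment/ores/deploy/submodules/assets
git lfs env                              # shows the endpoint and the local LFS storage paths
git lfs ls-files                         # a "-" next to a file means only the pointer is present
git config --get-regexp 'filter\.lfs'    # empty output would mean git lfs install never ran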

Thanks for your tests, @awight, and for bumping the prioritization. FWIW, we have had this model deployed in our Cloud VPS cluster (using fabric rather than scap) for a couple of weeks now, so it seems this is our last blocker for getting these models out to the world.

I don't quite know what could be going wrong but I'm looking into it.

OK, it seems that the .gitconfig for git-lfs wasn't installed. We should really do this via Puppet.

The magic command to initialize LFS on a given $target:

SSH_AUTH_SOCK=/run/keyholder/proxy.sock ssh -l deploy-service $target git lfs install

This should be happening automatically; I'm not really sure why it isn't.

@mmodell Awesome, thanks for this workaround. I confirmed that running your command from deploy1001 made the subsequent scap deployment correctly clone our repo. I may end up applying the workaround on all ores* boxes, but I'm also writing a Puppet patch for posterity.
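
Purely as a sketch, applying the same workaround across the targets could look like the loop below; the host names are illustrative only and would need to match the actual ores* fleet.

# Hypothetical loop over targets; the host list is illustrative only.
for target in ores1001.eqiad.wmnet ores2001.codfw.wmnet; do
  SSH_AUTH_SOCK=/run/keyholder/proxy.sock ssh -l deploy-service "$target" git lfs install
done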

Change 437719 had a related patch set uploaded (by Awight; owner: Awight):
[operations/puppet@production] Initialize LFS on scap targets

https://gerrit.wikimedia.org/r/437719

Change 437719 abandoned by Awight:
Install LFS on scap targets

Reason:
This might be enough, Icde5d4e6d9c6b7

https://gerrit.wikimedia.org/r/437719