Use `git lfs` for large binary files of Design Style Guide
Open, Medium, Public

Description

In response to https://github.com/wikimedia/WikimediaUI-Style-Guide/issues/232: large binary files that are updated and committed often cause the .git directory to grow excessively, because Git cannot store them as efficient deltas.
We should use git lfs for

  • .ai
  • .sketch and
  • .zip files in the resources directory.
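
As a minimal sketch of the tracking rules this implies, assuming the three extensions are matched by file name anywhere in the repository (the real repository may scope them differently, for example only under `resources/`): the helper name `write_lfs_attributes` is hypothetical, and the attribute lines are the ones `git lfs track "*.ai"` would write.

```shell
#!/bin/sh
# Hypothetical helper: append Git LFS tracking rules for the binary design
# formats named in the task to <repo>/.gitattributes.
# Each line matches what `git lfs track "*.<ext>"` would add.
write_lfs_attributes() {
  repo_dir="$1"
  for ext in ai sketch zip; do
    printf '*.%s filter=lfs diff=lfs merge=lfs -text\n' "$ext"
  done >> "$repo_dir/.gitattributes"
}
```

Running `write_lfs_attributes .` in a clone and committing `.gitattributes` would only affect files added from then on; blobs already in history stay ordinary Git objects unless the history is rewritten.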

Idea by @Ladsgroup


Dev/contributor notes

If Git LFS is not installed on your system, follow https://github.com/git-lfs/git-lfs/wiki/Installation
If it is, run `git lfs pull` after updating your clone.
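
The two notes above could be scripted roughly as follows. This is a sketch, not a tested workflow for this repository: the function names are hypothetical, and only `git lfs install` and `git lfs pull` are actual Git LFS commands.

```shell
#!/bin/sh
# Sketch of the contributor steps; function names are hypothetical.
INSTALL_DOC='https://github.com/git-lfs/git-lfs/wiki/Installation'

# Succeeds if the git-lfs binary is on PATH.
have_git_lfs() {
  command -v git-lfs >/dev/null 2>&1
}

# Run inside an existing clone: fetch the LFS objects for the current checkout.
sync_lfs_files() {
  if ! have_git_lfs; then
    printf 'Git LFS not found; see %s\n' "$INSTALL_DOC" >&2
    return 1
  fi
  # Set up the LFS filters for this repository only, then download the
  # LFS objects and replace the pointer files in the working tree.
  git lfs install --local && git lfs pull
}
```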

Event Timeline

Volker_E created this task. Oct 9 2019, 4:30 AM
Restricted Application added a subscriber: Aklapper. Oct 9 2019, 4:30 AM
Restricted Application added a project: User-Ladsgroup. Oct 9 2019, 7:05 PM

I rewrote all of the commits. If you have an existing clone, throw it away and make a fresh one. I think more is needed (branches need to be deleted).

Volker_E triaged this task as Medium priority. Oct 15 2019, 4:02 AM
Volker_E updated the task description.
Volker_E added a comment. Edited Oct 15 2019, 6:23 PM

In the process we also had to force-push to the Gerrit repo again, as the history had diverged.
An LFS config for the design/style-guide repo also needed to be added: https://gerrit.wikimedia.org/r/#/c/All-Projects/+/543173/ – thanks @mmodell

In the process, the force-push privs for the Gerrit repo had to be removed temporarily in order to update the repo.

Dzahn added a subscriber: Dzahn. Edited Oct 16 2019, 5:29 PM

@Ladsgroup git-lfs is not installed on the prod servers, and the puppet git::clone class does not support changing the command yet. So this breaks cloning on the prod servers.

Change 547778 had a related patch set uploaded (by 20after4; owner: 20after4):
[operations/puppet@production] Add git::lfs on design/style-guide targets

https://gerrit.wikimedia.org/r/547778

It would be very good to get clarification if Git LFS is going to be supported long-term or not. @20after4 @akosiaris

@Volker_E: I agree, some clarity would be good.

Technically:

  • scap nominally supports git-lfs but we have very little experience with it in production.
  • Gerrit supports git-lfs but it's not the greatest implementation.

Beyond that, as far as institutional support, I really don't know where we stand.

cc: @thcipriani

I'd love to hear what @akosiaris and @thcipriani think about it.

Also I want to voice my huge appreciation to @Dzahn for all the help so far on this. Daniel has gone above and beyond all reasonable expectations, worked through lunch, answered tons of questions, and generally been an awesome SRE.

brennen added a subscriber: brennen. Nov 1 2019, 9:03 PM
Volker_E updated the task description. Nov 1 2019, 9:51 PM

@Volker_E: I agree, some clarity would be good.

Technically:

  • scap nominally supports git-lfs but we have very little experience with it in production.
  • Gerrit supports git-lfs but it's not the greatest implementation.

Beyond that, as far as institutional support, I really don't know where we stand.

cc: @thcipriani

I'd love to hear what @akosiaris and @thcipriani think about it.

I don't know if anyone is using gerrit and scap's "support" for git-lfs or not in production(?). ORES uses something for large files, but I've never looked into what exactly they do; maybe @Ladsgroup or @akosiaris could speak to that?

My impression has been that git-lfs is not well supported. We support git-fat pretty well in scap, though I'm not sure if that could be made to work in this case.

Also I want to voice my huge appreciation to @Dzahn for all the help so far on this. Daniel has gone above and beyond all reasonable expectations, worked through lunch, answered tons of questions, and generally been an awesome SRE.

+1 <3 @Dzahn :)

Same here, both @mmodell and @Dzahn have provided valuable input and hands-on help extensively.

The Design Team's goals and requirements have been to provide history for the whole Style Guide, including the binary design files used by collaborators and volunteers inside and outside the Foundation, and to make the hurdle for contributing to the repository as low as possible. The choice of git lfs came out of a recommendation to avoid blowing up the Git history, given Git's known performance and bandwidth limitations with large binary files.

mmodell added a comment. Edited Nov 1 2019, 11:22 PM

Some of the confusion is my fault: I've been saying that git-lfs works, and every indication I've seen has led me to believe that it does. There are, however, some unknowns:

  • Nobody in Release-Engineering-Team is quite sure whether ores ever used git-lfs for real in production or if it was just a failed experiment.
  • scap's support for lfs was committed by @chad with relatively little testing
  • I'm not 100% sure that the target machines (vega and bromine) can talk to gerrit directly. There may be a firewall in the way. For git-lfs to work, the scap deploy-local process on the targets needs to call git-lfs which will make an https request back to gerrit to download large files.
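
The connectivity question in the last point could be probed with something like the sketch below. It assumes the git-lfs client's default server-discovery rule (append `.git/info/lfs` to an HTTPS remote); a server can advertise a different LFS URL, so the derived endpoint is an assumption, and both function names are hypothetical.

```shell
#!/bin/sh
# Derive the default LFS endpoint a git-lfs client would try for an HTTPS
# remote, following the <remote>.git/info/lfs discovery convention.
lfs_endpoint() {
  remote="${1%/}"   # drop a trailing slash, if any
  case "$remote" in
    *.git) printf '%s/info/lfs\n' "$remote" ;;
    *)     printf '%s.git/info/lfs\n' "$remote" ;;
  esac
}

# Hypothetical probe: succeeds if an HTTP(S) response arrives from the URL
# within 10 seconds. It checks connectivity only, not authentication or
# whether LFS downloads actually work.
can_reach() {
  curl --silent --output /dev/null --max-time 10 "$1"
}
```

On a deploy target, one might run `can_reach "$(lfs_endpoint https://gerrit.wikimedia.org/r/design/style-guide)"`, where the clone URL is assumed from the repository name mentioned earlier in this task.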

As stated by @thcipriani, git-fat is rather well tested and doesn't have so many unknowns. Since that is the more well worn path, it's probably safer to use that instead of git-lfs, at least for now. There are instructions for setting it up on wikitech: https://wikitech.wikimedia.org/wiki/Archiva#Setting_up_git-fat_for_your_project and it's not much different from git-lfs. The two have a similar design and essentially equivalent functionality.

I will modify the git-lfs page on wikitech to indicate that it's currently unsupported (e.g. use at your own risk) and that we have no immediate plans to invest in developing it further.

Note: This does not preclude working on it in the future, it just isn't on the roadmap for this quarter and would need to be planned / prioritized appropriately if we decide to do it.

Cons for git-lfs: it doesn't support ssh, only http(s), although apparently there's an authentication shim "git-lfs-authenticate". Pros for git-fat: it uses rsync for delivery of the large files, which can mean significant bandwidth and time savings.
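
For comparison, a git-fat setup for the same file types might look roughly like this. It is a sketch based on git-fat's documented `.gitfat` and `.gitattributes` conventions, not on the wikitech instructions; `FAT_REMOTE` is a placeholder rsync location and `write_git_fat_config` is a hypothetical helper.

```shell
#!/bin/sh
# Placeholder rsync store, not a real host or path.
FAT_REMOTE='rsync.example.org:/srv/fat-store/design-style-guide'

# Hypothetical helper: write the two files git-fat reads in a repository.
write_git_fat_config() {
  repo_dir="$1"
  # .gitfat tells git-fat where the rsync store lives.
  printf '[rsync]\nremote = %s\n' "$FAT_REMOTE" > "$repo_dir/.gitfat"
  # .gitattributes routes the binary formats through the "fat" filter.
  for ext in ai sketch zip; do
    printf '*.%s filter=fat -crlf\n' "$ext"
  done >> "$repo_dir/.gitattributes"
}
```

After committing both files, each clone would run `git fat init` once and then use `git fat pull` and `git fat push` to move the large files over rsync.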

Do we have a logical place to put files for git-fat, separated nicely by repo name? Is this a leap we want to make generally for our repos?

@Ladsgroup git-lfs is not installed on the prod servers, and the puppet git::clone class does not support changing the command yet. So this breaks cloning on the prod servers.

That's not entirely correct. git-lfs is installed everywhere that scap::target is used, that is 598/1371 hosts. But that's tangential and more a byproduct of how we config scap and less a conscious choice.

ORES does indeed use git-lfs, but it's the only thing I know of that uses it.

I'm not 100% sure that the target machines (vega and bromine) can talk to gerrit directly. There may be a firewall in the way. For git-lfs to work, the scap deploy-local process on the targets needs to call git-lfs which will make an https request back to gerrit to download large files.

They can.

Do we have a logical place to put files for git-fat, separated nicely by repo name? Is this a leap we want to make generally for our repos?

This has been historically connected to archiva, so that jar files are deployed without needing to have them in the git repo itself. And as far as I know it can be separated by repo name.

Do note that github.com does limit git lfs bandwidth[1] to 1G and used space to 1G. It's surprisingly easy to reach those limits. Note also that this might cause issues for contributors even if we manage to avoid them ourselves (e.g. by using gerrit).

I will modify the git-lfs page on wikitech to indicate that it's currently unsupported (e.g. use at your own risk) and that we have no immediate plans to invest in developing it further.

I think this line alone responds to the question of whether we should be investing in git-lfs more or not. Unless some other team can show up and take over the work of supporting git-lfs.

[1] https://help.github.com/en/github/managing-large-files/about-storage-and-bandwidth-usage

Change 547778 abandoned by 20after4:
Add git::lfs on design/style-guide targets

Reason:
unneeded

https://gerrit.wikimedia.org/r/547778

Dzahn added a comment. Nov 6 2019, 7:30 PM

Just wanted to add something I recently noticed while upgrading the Gerrit replica: 44% of the entire Gerrit git data size was "design", due to the large files.

But when I checked again just now, I saw it has been fixed and we are back to a much more reasonable 3.5%.

The size of /srv/gerrit/git (all of Gerrit git data) is now down to just 38G again.

So thanks to whoever pruned the history there or did something similar.

But when I checked again just now, I saw it has been fixed and we are back to a much more reasonable 3.5%.

The size of /srv/gerrit/git (all of Gerrit git data) is now down to just 38G again.

Interesting.

Much appreciation to @akosiaris and @ArielGlenn for further clarity on this! You've cleared up a lot of confusion on this issue.

Dzahn added a comment. Edited Nov 6 2019, 8:15 PM

Interesting.

Actually 1001 and 2001 are different. I noticed it when upgrading 2001.

[gerrit1001:/] $ sudo du -hs /srv/gerrit/git
38G	/srv/gerrit/git
[gerrit1001:/] $ sudo du -hs /srv/gerrit/git/design
1.3G	/srv/gerrit/git/design
[gerrit2001:~] $ sudo du -hs /srv/gerrit/git
67G	/srv/gerrit/git
[gerrit2001:~] $ sudo du -hs /srv/gerrit/git/design
30G	/srv/gerrit/git/design
[gerrit1001:~] $ du -hs /srv/gerrit/git/design/**
13M	/srv/gerrit/git/design/landing-page.git
39M	/srv/gerrit/git/design/strategy.git
1.3G	/srv/gerrit/git/design/style-guide.git
[gerrit2001:/srv/gerrit/git] $ du -hs design/**
13M	design/landing-page.git
39M	design/strategy.git
30G	design/style-guide.git
Paladox added a subscriber: Paladox. Nov 6 2019, 9:00 PM

So I'm not entirely sure how gerrit2001 managed to grow like this. I wonder if we should delete the repo on gerrit2001 and just replicate a fresh copy to see.

Most of that 30GB is in packfiles. Those packfiles all seem to contain the same objects:

thcipriani@gerrit2001:~$ du -chs /srv/gerrit/git/design/style-guide.git/objects/pack/* | sort -rh | head -4
30G     total
220M    /srv/gerrit/git/design/style-guide.git/objects/pack/pack-dc13f6820e29fa1245d9ecbb804edb77e63224ee.pack
220M    /srv/gerrit/git/design/style-guide.git/objects/pack/pack-69e918208071f2dac83e1ab9d3adc9d85f239a5c.pack
220M    /srv/gerrit/git/design/style-guide.git/objects/pack/pack-687d80a14d5590ade85bc101eb042671d187bf64.pack
thcipriani@gerrit2001:~$ git -C /srv/gerrit/git/design/style-guide.git verify-pack -v /srv/gerrit/git/design/style-guide.git/objects/pack/pack-dc13f6820e29fa1245d9ecbb804edb77e63224ee.idx | grep -v chain | sort -k3nr | head -n2
198f171c78e5b4f6bc37db2311cdb2435c80937f blob   39633398 29773531 13361086
4461a416b7039735425b8ac366a49f650ad6f3fc blob   39235142 38823785 187428478
thcipriani@gerrit2001:~$ git -C /srv/gerrit/git/design/style-guide.git verify-pack -v /srv/gerrit/git/design/style-guide.git/objects/pack/pack-69e918208071f2dac83e1ab9d3adc9d85f239a5c.idx | grep -v chain | sort -k3nr | head -n2
198f171c78e5b4f6bc37db2311cdb2435c80937f blob   39633398 29773531 13361827
4461a416b7039735425b8ac366a49f650ad6f3fc blob   39235142 38823785 187423835
thcipriani@gerrit2001:~$ git -C /srv/gerrit/git/design/style-guide.git verify-pack -v /srv/gerrit/git/design/style-guide.git/objects/pack/pack-687d80a14d5590ade85bc101eb042671d187bf64.idx | grep -v chain | sort -k3nr | head -n2
198f171c78e5b4f6bc37db2311cdb2435c80937f blob   39633398 29773531 13361066
4461a416b7039735425b8ac366a49f650ad6f3fc blob   39235142 38823785 187419655

^ there are just a whole bunch of pack files with similar (large) contents. My guess is that these are all generated by fetches by git's smart-protocol and will be gc'd eventually.
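
To check whether those packfiles do get consolidated over time, one could record the pack count and size before and after garbage collection. A minimal sketch, assuming a bare repository layout with `objects/pack` directly under the given path; `pack_summary` is a hypothetical helper that only reads the filesystem.

```shell
#!/bin/sh
# Count the packfiles in a bare repository and print the size of its
# pack directory (which also includes the .idx files).
pack_summary() {
  pack_dir="$1/objects/pack"
  count=$(find "$pack_dir" -type f -name '*.pack' | wc -l | tr -d ' ')
  size=$(du -sh "$pack_dir" | cut -f1)
  printf '%s packfiles, %s total\n' "$count" "$size"
}
```

For example, `pack_summary /srv/gerrit/git/design/style-guide.git` could be run on both hosts and compared; `git count-objects -v` reports a similar `packs` figure from Git itself.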

I'm not sure about git fat: the original repo is on GitHub, and using git fat with GitHub doesn't look super easy to me.

Do note that github.com does limit git lfs bandwidth[1] to 1G and used space to 1G. It's surprisingly easy to reach those limits. Note also that this might cause issues for contributors even if we manage to avoid them ourselves (e.g. by using gerrit).

[1] https://help.github.com/en/github/managing-large-files/about-storage-and-bandwidth-usage

We have a coupon that makes everything in github.com/wikimedia free forever. We already exceed that limit for ores, so I "bought" a base-level git lfs package for $5/month, which came to $0.00 because of the coupon. It required me to add a billing method, though, so if you go to the settings of the wikimedia org on GitHub you will see my PayPal there (so please don't remove the coupon). I have asked WMF people to fix this for years and nothing has happened. I'm going to remove my billing details if nothing happens, and then git lfs on github.com will be broken.