
[Epic] ORES should use a git large file plugin for storing serialized binaries
Closed, Resolved · Public

Description

It looks as though ORES is currently deploying large binary blobs from git via scap3. Unfortunately, git does not scale when large binary objects are checked in: they do not diff well, so git can only pack them so tightly. Performance quickly becomes an issue.

Scap has the ability to move large binaries about by using git-fat to fetch them over rsync from some source. We should figure out where to fetch these files from so we can have a much smaller (and usable) repository.

Docs on setting up git-fat (don't worry, it's not really archiva-specific).
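
As a rough starting point for deciding which files to hand over to git-fat, here is a small Python sketch that walks a working tree and prints one `.gitattributes` line per extension that has large files. The function names and the 1 MB default threshold are made up for illustration, and the `filter=fat -crlf` attribute form is what I recall from the git-fat README, so please check it against the docs before relying on it.

```python
import os
from collections import defaultdict


def large_file_extensions(root, min_bytes=1_000_000):
    """Return {extension: total_bytes} for files of at least min_bytes under root."""
    totals = defaultdict(int)
    for dirpath, dirnames, filenames in os.walk(root):
        # Do not descend into git's own object store.
        dirnames[:] = [d for d in dirnames if d != ".git"]
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                continue
            size = os.path.getsize(path)
            ext = os.path.splitext(name)[1]
            # Files without an extension are skipped; they need a hand-written pattern.
            if size >= min_bytes and ext:
                totals[ext] += size
    return dict(totals)


def gitattributes_lines(extensions):
    """Format one filter line per extension (assumed git-fat attribute form)."""
    return ["*%s filter=fat -crlf" % ext for ext in sorted(extensions)]
```

Calling `gitattributes_lines(large_file_extensions("."))` from the repository root gives candidate lines to review by hand; it only looks at the current checkout, not at history.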

Event Timeline

Restricted Application added a subscriber: Aklapper.

Striker could use this too. It has the same sort of wheel blob repo as ORES.

I don't want to hijack this task, but for the record our binary problem is much more severe in the editquality repo, where a few dozen 10MB models change rapidly, swelling our git repo to >1.5GB and bringing tears to our eyes during both development and deployment. Also noting that the blocker to this task as stated (also the broader task) is that we deploy these repos to both labs and production, and according to oral tradition git-fat doesn't support labs deployments yet.

Yeah, let's do it for all the repos with binaries :) And we'll figure out something re labs + prod re git-fat access.

Will it work with our GitHub repos? We do development mainly against them and mirror to Diffusion/Gerrit.

For checking in, yes without issue (that's all client side). However fetch/pull requires rsync access to fatten the blobs (internal term is hydrate).
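
To illustrate why check-in is purely client side while checkout needs access to the blob store, here is a toy Python sketch of the clean/smudge idea. It is not git-fat's code or its on-disk format: the `STUB_PREFIX` marker, the function names, and the flat store directory are all hypothetical. In the real setup the store would be filled over rsync, which is the step that needs network access.

```python
import hashlib
import os

STUB_PREFIX = b"#stub "  # hypothetical marker, not git-fat's actual stub format


def clean(data, store_dir):
    """Check-in side: stash the blob in a local store and return a small stub."""
    digest = hashlib.sha1(data).hexdigest()
    os.makedirs(store_dir, exist_ok=True)
    with open(os.path.join(store_dir, digest), "wb") as f:
        f.write(data)
    return STUB_PREFIX + digest.encode("ascii") + b"\n"


def smudge(stub, store_dir):
    """Checkout side: swap the stub for the real blob if the store has it."""
    if not stub.startswith(STUB_PREFIX):
        return stub  # ordinary content, pass through unchanged
    digest = stub[len(STUB_PREFIX):].strip().decode("ascii")
    path = os.path.join(store_dir, digest)
    if not os.path.exists(path):
        return stub  # blob not fetched yet, so the file stays "thin"
    with open(path, "rb") as f:
        return f.read()
```

In this sketch `clean` never touches the network, and `smudge` can only hydrate a file once the matching object is already in `store_dir`; a checkout without the fetched objects just leaves the stub in place.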

This complicates things a tad--RelEng doesn't officially support deploying from GitHub--but it's not undoable.

> but it's not undoable.

What?

We do GitHub-based deploys in our Cloud VPS cluster (not beta) using Fabric. Part of that Fabric script performs a git pull from our GitHub repos.

AFAICT, there's no good option for doing scap-based deploys in Cloud VPS (outside of beta). Is that still true?

> AFAICT, there's no good option for doing scap-based deploys in Cloud VPS (outside of beta). Is that still true?

It is certainly possible, but you need to run a deploy server in your Cloud VPS project. There is no good way for Cloud Services to provide a shared scap deployment server for all/multiple Cloud VPS projects. If you are interested in running your own deploy server I have some basic instructions at https://wikitech.wikimedia.org/wiki/User:BryanDavis/Scap3_in_a_Labs_project that I wrote up when I built the service out for https://tools.wmflabs.org/openstack-browser/project/striker.

Gotcha. Maybe we could stick that in our ores-staging project.

@demon, if that makes things easier for you, we can block this on getting scap set up for VPS deployments. If we do, we would presumably be doing things the same way we would in beta-labs. Git-fat must work there, right?

I talked to @Paladox -- who offered that we could make use of phab-tin. Would it be better to set up our own deployment server within ores or ores-staging?

@demon We're fine with deploying from WMF production repos, I'm sure we can figure something out to push mirrored code or just make these the masters for deployment. In other words, we're not trying to deploy directly from GitHub.

I think the question is, can WMF host the git-fat server in a way that we can pull from it for both production and labs deployment? That should solve 90% of our woes.

(Just doing some project management, don't worry about the "watching" bit, we'll just create a (sub)task for any bits of this that we need to do.)

>> but it's not undoable.
>
> What?

Bad choice of words. I meant it's not impossible. Sorry for the confusion.

> @demon We're fine with deploying from WMF production repos, I'm sure we can figure something out to push mirrored code or just make these the masters for deployment. In other words, we're not trying to deploy directly from GitHub.
>
> I think the question is, can WMF host the git-fat server in a way that we can pull from it for both production and labs deployment? That should solve 90% of our woes.

Yes. That's the issue we need to tackle--and exactly what I meant by "complicates things". We need to simplify the usage of git-fat so things outside of production can make use of it.

Halfak renamed this task from ORES should use git-fat for wheel deployments to ORES should use git-fat for binaries. Jul 27 2017, 2:52 PM
Halfak changed the task status from Open to Stalled.
Halfak raised the priority of this task from Medium to High.
Halfak changed the status of subtask T171758: Support git-lfs files in gerrit from Open to Stalled.
Halfak moved this task from Unsorted to Maintenance/cleanup on the Machine-Learning-Team board.
awight renamed this task from ORES should use git-fat for binaries to ORES should use a git large file plugin for storing serialized binaries. Sep 9 2017, 12:54 AM
Paladox changed the task status from Stalled to Open. Nov 29 2017, 8:49 PM
awight renamed this task from ORES should use a git large file plugin for storing serialized binaries to [Epic] ORES should use a git large file plugin for storing serialized binaries. Oct 10 2018, 11:24 PM
awight moved this task from Parked to Non-Epic on the Machine-Learning-Team (Active Tasks) board.
awight added a project: Epic.
Halfak claimed this task.

Seems like this is done.