git-fat replacement/removal
Open, High, Public

Description

We run a custom git-fat package, which is in standard packages and deployed on every host. It's written in Python 2, which is EOL. Py3 support isn't completed upstream: https://github.com/jedbrown/git-fat/issues/92, so we could also collaborate with them if we want to continue to use git-fat.

We need to stop using git-fat. It's antiquated, based on Python 2, and there are better ways to do this. Mostly, this means using git-lfs, but it may also mean just not using large object storage in-tree at all, or just saying screw it, the blobs aren't so big. There are subtasks for each usage we have.

Details

Title | Reference | Author | Source Branch | Dest Branch
deploy: Fix git-lfs support (submodules) | repos/releng/scap!226 | dancy | master-Id6607a5b8dea1a56b0fd0abf543ff14adafaf63d | master
deploy: Fix git-lfs support | repos/releng/scap!223 | dancy | master-I3d0dd46a4cffe6c89c1d9652ad0575587e267bc9 | master

Event Timeline

git-fat is the only package requiring Python 2 in a base bullseye setup at this point.

Is there a way to migrate to git-lfs instead?

I'm not familiar in detail with the current use cases of git-fat, but moving to a supported different tool is probably the better path forward than porting git-fat ourselves. Both git-lfs and git-annex seem like viable alternatives to explore (both are already packaged in Debian)

I can't say for sure, especially since it's part of the base packages and so could be used anywhere, but the only explicit usage is archiva, and I hope we can find a way to just avoid using that. git-lfs seems to be the industry standard these days.

thcipriani added subscribers: dancy, hashar.

In our team meeting we talked about the possibility of migrating git-fat (600 lines of python2 → python3) vs. making the needed changes in scap and archiva to support git-lfs.

Tagging in @hashar and @dancy for their thoughts on this task.

@thcipriani Based on reading about git-lfs and git-fat (including outstanding issues on GitHub), I'm in favor of migrating to git-lfs and updating scap and archiva as needed. I can help on the scap side. I haven't touched archiva yet.

T214229 - scap3 + git-fat results in git status with permissions errors

T202100 - Intermittent git-fat failure during deploy

T147856 - Scap deploy failed to sync git-fat artifacts

T155856 - Package + deploy new version of git-fat

I can help on the scap side. I haven't touched archiva yet.

There is support in scap3 for git-lfs, but it's not used (as far as I'm aware) or well-tested. It *might* already work.

I honestly hadn't touched archiva either. There's a shell script (originally written by @Ottomata judging from git-blame) that moves java jars to the place git-fat expects to find them. Maybe we can just ditch that script and deploy directly from Gerrit (given we have the git-lfs extension for gerrit installed and gitlab has git-lfs support as well).

The last time we talked about git-lfs in detail that I can recall is T235013: Use `git lfs` for large binary files of Design Style Guide

deploy directly from Gerrit

...say more :)

The jar binaries are built by maven-release-plugin in a jenkins job and then uploaded to Archiva using the Archiva API. They are then synced into a git fat repo. Deploy repos then git fat add them, and scap can rsync them (via git fat) to their target hosts on deploy.

archiva-gitfat-link just scans the archiva repository directory for artifact files, and then makes symlinks to them in a git-fat folder named by their shasum, as git fat expects. I'm not familiar with how git-lfs works, but perhaps it can be made to work the same way? Is it an rsync remote?
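The linking step described above can be sketched roughly as follows. This is not the actual archiva-gitfat-link script: the function name, the directory arguments, the `.jar` filter, and the use of SHA-1 as the digest are all assumptions made for illustration.

```python
import hashlib
import os


def link_artifacts(archiva_dir, gitfat_dir, suffix=".jar"):
    """Symlink each artifact under archiva_dir into gitfat_dir, named by digest.

    Hypothetical sketch: SHA-1 and the suffix filter are assumptions.
    Returns a dict mapping hex digest -> source path.
    """
    os.makedirs(gitfat_dir, exist_ok=True)
    linked = {}
    for root, _dirs, files in os.walk(archiva_dir):
        for name in files:
            if not name.endswith(suffix):
                continue
            source = os.path.join(root, name)
            digest = hashlib.sha1()
            with open(source, "rb") as fh:
                # Hash in chunks so large jars are not read into memory at once.
                for chunk in iter(lambda: fh.read(1 << 20), b""):
                    digest.update(chunk)
            target = os.path.join(gitfat_dir, digest.hexdigest())
            # Skip links that already exist so the scan can be re-run safely.
            if not os.path.lexists(target):
                os.symlink(os.path.abspath(source), target)
            linked[digest.hexdigest()] = source
    return linked
```

Because the link name is derived only from file content, re-running the scan is idempotent, and identical artifacts published under different paths end up sharing one entry.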

I think the only transfer adapter that's Officially® supported is the http basic transfer. Our gerrit has the lfs plugin installed, so that implements the server side of git-lfs.

So rather than build jar files and upload to archiva, we'd build jar files and add .jar to .gitattributes to be managed via git-lfs, then those jars would get stored on the gerrit host. On deployment (or fetch), each target would fetch the jar via a GET request to gerrit (is my rough mental model).
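As a rough illustration of that mental model: with a line such as `*.jar filter=lfs diff=lfs merge=lfs -text` in .gitattributes, git commits a small pointer file instead of the jar, and the client asks the server's batch endpoint where to download the real object. The sketch below builds both pieces following the published git-lfs pointer and batch API formats. The helper names are hypothetical, and none of this has been checked against our Gerrit setup.

```python
import hashlib
import json


def lfs_pointer(content: bytes) -> str:
    """Return the pointer file text that git-lfs commits in place of content."""
    oid = hashlib.sha256(content).hexdigest()
    return (
        "version https://git-lfs.github.com/spec/v1\n"
        f"oid sha256:{oid}\n"
        f"size {len(content)}\n"
    )


def parse_pointer(text: str) -> dict:
    """Extract the object id and size from a pointer file."""
    fields = dict(line.split(" ", 1) for line in text.splitlines() if line)
    return {"oid": fields["oid"].split(":", 1)[1], "size": int(fields["size"])}


def batch_download_request(pointer_text: str) -> str:
    """Build the JSON body a client POSTs to <repo>/info/lfs/objects/batch."""
    return json.dumps({
        "operation": "download",
        "transfers": ["basic"],  # the basic HTTP transfer adapter
        "objects": [parse_pointer(pointer_text)],
    })
```

The server's response to that request lists a download URL per object, which the client then fetches with a plain GET; that second step is the per-target load on the LFS host.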

Unknowns:

  • Changes to maven-release-plugin CI job—Maven supports uploading to archiva, but probably not to lfs (plus we probably don't want repo push creds in CI?)
  • Gerrit has a lot of disk space, but how much disk space do we use in archiva?
  • How many hosts are deployed in parallel and what kind of load will that put on gerrit?
  • Are targets allowed to make outbound connections to gerrit?
  • Unknowns around protocol/network traffic changes. Not expecting issues, really, but it's a change.

Mentioned in SAL (#wikimedia-operations) [2022-08-02T20:38:01Z] <mutante> re-imaging gerrit2002 with buster - because it's on bullseye, needs git-fat and that has not been ported to python3 yet which blocks upgrading gerrit machines otherwise T313250 T243027 T279509

This depends on whether we stick with git-fat (in which case we might need to do the porting, though even that is not immediately needed since Debian Bullseye still provides Python 2.7) or whether we migrate to git-lfs.

Bullseye doesn't ship Python 2.7 in a supported version, it's only included to _build_ a few packages (e.g. qtwebkit).

Oops my bad sorry :-\

demon renamed this task from "git-fat needs to be ported to Python 3" to "git-fat replacement/removal". Sep 1 2022, 1:34 PM
demon updated the task description.

additional affected projects:

  • search/MjoLniR/deploy
  • search/airflow

At a higher level, what we need is a good way to deploy wheels and jars as part of a deployment. I'm not particularly picky about whether that's git-fat, git-lfs, or some other solution (a Python package index? The archiva + git-fat solution for Python packages was always a hack).

edit: I have moved the lfs config hints to the task description

Yeah, I think the newly formed Data Platform Engineering group needs to work with RelEng to build a good artifact deployment system for gitlab. We have a home grown one for airflow in our workflow_utils repo that uses scap's hooks to run a post deploy script to pull down artifacts.

We could move refinery and other deployments to use this same system instead, but perhaps RelEng will have a better way? cc @lbowmaker, @BTullis, @Stevemunene, @Antoine_Quhen, @JAllemandou

Just re-read Tyler's comment from last year.

those jars would get stored on the gerrit host. On deployment (or fetch), each target would fetch the jar via a GET request to gerrit (is my rough mental model).

I think what he proposed with git-lfs should work. Can we amend his proposal and do it from gitlab instead? Gitlab's package registries are working well for us so far.

A nice thing about Archiva (and any other package repo) is the browsability. We could still do this with gitlab, but I think we'd need a central gitlab repository that could host all artifacts and packages, rather than hosting them in each independent repo.

Also, a nice feature of our home grown artifact-syncing stuff, is that it uses fsspec, and works with any supported fsspec urls. This is actually an important feature for us; it allows us to sync artifacts directly to HDFS.

@Ottomata the context for this task is scap and deployment repositories rely on git-fat which does not support Python 3. Migrating the deployment repositories to use git-lfs is almost a drop-in replacement (both Scap and Gerrit have support for it). That short migration path is certainly less costly than migrating to Gitlab.

I am not sure how this task will progress, it will probably resurface as the target hosts are upgraded:

  • Debian Bullseye has an unsupported python 2.7 which is solely used to build some legacy packages. Though in practice it works
  • Debian Bookworm, released last week, no longer includes Python 2.7

In my experience Archiva has been largely a pain: it needs a different authentication system, the web UI is passable, requests made to it have a high latency, etc. I have moved Gerrit deployment from Archiva to git-lfs and it gives a much better experience (git push refs/for/master → magic happens → done). Then again, my use case was pretty narrow and I only used Archiva for deployment of Gerrit.

HDFS / fsspec, that sounds off topic? I don't think deploying a repository with scap has anything to do with those? Or at least if we move from git-fat to git-lfs, I don't expect anything to change.

HDFS / fsspec, that sounds off topic?

Probably a bit. We have an extra manual deploy step right now to deploy to HDFS. But also, there are probably more needs for deployment in our near future than regular filesystems (ceph?).

Archiva

We also use Archiva for publishing library artifacts that other code uses as dependencies. E.g. eventutilities is used by refinery-source. Gitlab has maven package registries, so we could probably pretty easily just publish to gitlab and use maven instead, but it would be nice to have a centralized place for this.

Anyway, I'm all for getting rid of Archiva if we have a working replacement.

git-lfs is almost a drop-in replacement (both Scap and Gerrit have support for it)

Sounds good, I'm all for it! However, I think it's likely that we move to gitlab relatively soon, so maybe we can make git-lfs work with gitlab, and/or we can just use our artifact syncing stuff now for our purposes and be done with git-fat.

We ran into this issue with git-fat today, attempting to deploy analytics/refinery to hadoop-test, which has been upgraded to bullseye.

I manually installed git-fat on an-test-coord1001 (which already has python2.7 retained because of hive and hive-hcatalog). That fixed the immediate issue, but it's not a great fix and I agree that we need to find a solution to this soon.

That could be:

  1. migrating to git-lfs with Gerrit
  2. migrating to git-lfs with GitLab
  3. getting rid of git-fat and using our home-grown artifact syncing mechanism as described here T279509#8936938 by @Ottomata
  4. something else altogether

I'll make a ticket so that we can look into it for the analytics/refinery use case.

Ah, I see now that @demon has already helpfully created a ticket for us in T328472: analytics/refinery: Stop using git-fat - I'll use this.