
analytics/refinery: Stop using git-fat
Closed, ResolvedPublic

Description

Currently analytics/refinery uses this scap config to deploy.

The git-binary-manager is defined as git-fat, but this tool is already deprecated and requires Python 2.7.

We need to replace it, and several options have been discussed here: T279509#9109981

  1. migrating to git-lfs with Gerrit
  2. migrating to git-lfs with GitLab
  3. getting rid of git-fat and using our home-grown artifact syncing mechanism as described here T279509#8936938 by @Ottomata
  4. something else altogether

Event Timeline

BTullis added subscribers: Milimetric, JAllemandou, mforns and 7 others.

Adding more tags and subscribers so that we can gain visibility and prioritise this within the DPE group.
I'm not sure yet who is best placed to carry out the work to replace git-fat with something else, but we can discuss options here first.

I would rephrase the problem as: Why do we need to keep generated artifacts inside our version control solution?

If we migrate analytics/refinery to GitLab, we can leverage GitLab's Generic Package Repository to deal with artifacts separately. This is what we do in projects such as conda-analytics and airflow-dags. It works great, and keeps those two concerns separate: source control and artifact control.

This sounds like a great solution. Is it the refinery-source project that publishes the artifacts, rather than refinery?

If so, would we have to migrate both projects at the same time, or would it be easier to do it piecemeal?

We have jobs relying on the jars that are not yet on Airflow (and therefore using the artifacts mechanism). The biggest and most important set of those jobs is the refine jobs class, but there might be others.
Also, we need a way to provide access to the jars to users of the cluster, for manual jobs (using UDFs, or other code from the jar). Currently this is done through the refine-repo deployment (on stats machine and on HDFS). Before removing the artifacts from the refine repo or deprecating it, we would need to find another way for users to access artifacts.

Agreed that we would need to brainstorm how to do the transition. One quick idea would be to modify our scap scripts to pull artifacts from GitLab.
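
As a rough illustration of the idea above, a deployment script could fetch a jar from GitLab's generic package registry with a plain HTTP request. The sketch below only builds the download URL, following GitLab's documented generic packages API path. The host, project ID, package name, version, and file name are hypothetical placeholders, not values taken from this task, and the function names are made up for the example.

```shell
#!/bin/sh
set -eu

# Build a download URL for GitLab's generic package registry.
# Path layout: /api/v4/projects/<id>/packages/generic/<name>/<version>/<file>
generic_package_url() {
  base_url="$1"; project_id="$2"; pkg_name="$3"; pkg_version="$4"; file_name="$5"
  printf '%s/api/v4/projects/%s/packages/generic/%s/%s/%s\n' \
    "$base_url" "$project_id" "$pkg_name" "$pkg_version" "$file_name"
}

# Download one artifact into a target directory.
# Defined for illustration only; not called here because it needs network access.
fetch_artifact() {
  url="$1"; dest_dir="$2"
  mkdir -p "$dest_dir"
  curl --fail --silent --show-error --location \
    --output "$dest_dir/$(basename "$url")" "$url"
}

# All values below are hypothetical placeholders.
url="$(generic_package_url "https://gitlab.example.org" 1234 \
  refinery-job 0.0.1 refinery-job-0.0.1.jar)"
echo "$url"
```

A scap check or deploy hook could call something like fetch_artifact for each required jar. Whether that fits the existing scap setup is an open question in this discussion.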

Certainly such a migration would be fun! :D

Hi folks. scap's git-lfs support has been fixed so we can migrate analytics/refinery to git-lfs. To enable LFS for this repo in Gerrit, I need to know what your maximum object size is so I can set a limit.

Never mind. I reconstituted the analytics/refinery repo and checked out the fat objects. It looks like the biggest single object is ~157MB, so I'll set the initial object size limit to 500MB.

Btw, I notice that there are many versions of some artifacts stored in the repository. Are they all used at runtime? If not, it would be better to include only the ones that are actually used, to avoid spending time pulling down the rest.
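
For anyone wanting to repeat the object size check mentioned above, here is a minimal sketch of one way to find the largest file in a working tree. It is not necessarily how the ~157MB figure was obtained. The function name and the demo directory are made up for the example, and a real check would run against a checkout in which the large objects have already been pulled.

```shell
#!/bin/sh
set -eu

# Print "<size-in-bytes><TAB><path>" for the largest regular file under a
# directory, skipping the .git directory itself.
largest_file() {
  dir="$1"
  find "$dir" -path "$dir/.git" -prune -o -type f -print |
    while IFS= read -r f; do
      printf '%s\t%s\n' "$(wc -c < "$f" | tr -d ' ')" "$f"
    done |
    sort -n | tail -n 1
}

# Demo on a throwaway directory with two files of known size.
demo_dir="$(mktemp -d)"
printf '%10s' '' > "$demo_dir/small.bin"    # 10 bytes
printf '%100s' '' > "$demo_dir/large.bin"   # 100 bytes
result="$(largest_file "$demo_dir")"
echo "$result"
```

Running largest_file against the repository root instead of the demo directory gives the size to compare with the configured limit.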

Change 1007666 had a related patch set uploaded (by Ahmon Dancy; author: Ahmon Dancy):

[All-Projects@refs/meta/config] Add LFS config for analytics/refinery

https://gerrit.wikimedia.org/r/1007666

Change 1007666 merged by Ahmon Dancy:

[All-Projects@refs/meta/config] Add LFS config for analytics/refinery

https://gerrit.wikimedia.org/r/1007666

dancy triaged this task as Medium priority.

Change 1007690 had a related patch set uploaded (by Ahmon Dancy; author: Ahmon Dancy):

[analytics/refinery@master] Switch from git-fat to git-lfs

https://gerrit.wikimedia.org/r/1007690

Change 1007692 had a related patch set uploaded (by Ahmon Dancy; author: Ahmon Dancy):

[analytics/refinery/scap@master] scap.cfg: Use git-lfs instead of git-fat

https://gerrit.wikimedia.org/r/1007692

Change #1007690 merged by Btullis:

[analytics/refinery@master] Switch from git-fat to git-lfs

https://gerrit.wikimedia.org/r/1007690

Mentioned in SAL (#wikimedia-analytics) [2024-03-28T16:22:20Z] <btullis> deploying refinery to test the git-lfs integration with scap for T328472

Change #1007692 merged by Btullis:

[analytics/refinery/scap@master] scap.cfg: Use git-lfs instead of git-fat

https://gerrit.wikimedia.org/r/1007692

Change #1017027 had a related patch set uploaded (by Hashar; author: Hashar):

[integration/config@master] dockerfiles: jar-updater remove git fat and update OS

https://gerrit.wikimedia.org/r/1017027

Change #1017027 merged by jenkins-bot:

[integration/config@master] dockerfiles: jar-updater remove git fat and update OS

https://gerrit.wikimedia.org/r/1017027

Change #1017051 had a related patch set uploaded (by Hashar; author: Hashar):

[integration/config@master] dockerfiles: set git remote in ci-src-setup

https://gerrit.wikimedia.org/r/1017051

Change #1017053 had a related patch set uploaded (by Hashar; author: Hashar):

[integration/config@master] jjb: update-jars no more need to set the git remote

https://gerrit.wikimedia.org/r/1017053

@Sfaci mentioned the analytics-refinery-update-jars-docker job ended up failing with:

+ git fetch --quiet --update-head-ok --depth 2 https://maven-release-user@gerrit.wikimedia.org/r/analytics/refinery +master:master
+ [[ master == '' ]]
+ git checkout -B master FETCH_HEAD
Downloading artifacts/article-recommender/venv-0.0.1.zip (55 MB)
Error downloading object: artifacts/article-recommender/venv-0.0.1.zip (e2e8ee8): Smudge error: Error downloading artifacts/article-recommender/venv-0.0.1.zip (e2e8ee8719c7dc4bec09612e9a58af21e932c14b4bfe44e6a15a876644ad2b28): batch request: missing protocol: ""

Errors logged to /src/.git/lfs/logs/20240403T083712.948717901.log
Use `git lfs logs last` to view the log.
error: external filter 'git-lfs filter-process' failed
fatal: artifacts/article-recommender/venv-0.0.1.zip: smudge filter lfs failed
Build step 'Execute shell' marked build as failure

The important part is missing protocol: "", which indicates that LFS was not able to determine the remote endpoint to talk to. Short of adding a .lfsconfig file as recommended at https://wikitech.wikimedia.org/wiki/Git-lfs, we can rely on Git LFS deriving the endpoint from the git remote. The twist is that our CI script does a git init && git fetch && git checkout and never sets a git remote, which leads to the issue.
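
For reference, the .lfsconfig alternative mentioned above amounts to committing a small git-config-style file that sets lfs.url. A minimal sketch follows. The endpoint URL is an assumption made for illustration (repository URL plus /info/lfs), and the wikitech page linked above remains the authority on the correct value.

```shell
#!/bin/sh
set -eu

# Assumed endpoint, for illustration only; check the wikitech page for the real one.
lfs_url="https://gerrit.wikimedia.org/r/analytics/refinery/info/lfs"

# Work in a scratch directory so nothing real is modified.
scratch_dir="$(mktemp -d)"
cd "$scratch_dir"

# Write lfs.url into a .lfsconfig file using git's own config writer.
git config --file .lfsconfig lfs.url "$lfs_url"

# Read it back to confirm what Git LFS would see.
configured_url="$(git config --file .lfsconfig --get lfs.url)"
echo "$configured_url"
```

With such a file committed at the repository root, Git LFS does not need a git remote to work out where to send its requests.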

I have made two patches which should address it:
Gerrit 1017051 (dockerfiles: set git remote in ci-src-setup): runs git remote add before fetching. That is done in the image.
Gerrit 1017053 (jjb: update-jars no more need to set the git remote): switches the job to the new image and removes the now-redundant git remote add from the job, since that is done earlier by the container.
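
To make the remote issue concrete, here is a minimal sketch of the checkout sequence with the remote registered before fetching. It is an illustration only, not the actual ci-src-setup script. The remote name origin and the scratch directory are assumptions, and the fetch and checkout steps are left as comments because they need network access and git-lfs.

```shell
#!/bin/sh
set -eu

repo_url="https://gerrit.wikimedia.org/r/analytics/refinery"

# Scratch directory standing in for the CI workspace.
src_dir="$(mktemp -d)"
cd "$src_dir"
git init --quiet

# Before: no remote is configured, so there is nothing for Git LFS
# to derive an endpoint from.
remotes_before="$(git remote)"

# The fix: register the remote before fetching (remote name is an assumption).
git remote add origin "$repo_url"
remote_after="$(git remote get-url origin)"

echo "remotes before: '${remotes_before}'"
echo "origin after:   ${remote_after}"

# The fetch and checkout from the failing job would follow:
#   git fetch --quiet --update-head-ok --depth 2 "$repo_url" +master:master
#   git checkout -B master FETCH_HEAD
```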

Change #1017051 merged by jenkins-bot:

[integration/config@master] dockerfiles: set git remote in ci-src-setup

https://gerrit.wikimedia.org/r/1017051

Mentioned in SAL (#wikimedia-releng) [2024-04-04T14:59:47Z] <dancy> Updating docker-pkg files on contint primary for T328472

Change #1017053 merged by jenkins-bot:

[integration/config@master] jjb: update-jars no more need to set the git remote

https://gerrit.wikimedia.org/r/1017053

Change #1017098 had a related patch set uploaded (by Ahmon Dancy; author: Ahmon Dancy):

[integration/config@master] dockerfiles/ci-src-setup-simple: Update to 0.6.1

https://gerrit.wikimedia.org/r/1017098

Change #1017098 merged by jenkins-bot:

[integration/config@master] dockerfiles/ci-src-setup-simple: Update to 0.6.1

https://gerrit.wikimedia.org/r/1017098

Change #1017100 had a related patch set uploaded (by Ahmon Dancy; author: Ahmon Dancy):

[integration/config@master] jjb: Bump ci-src-setup-simple from 0.6.0 to 0.6.1

https://gerrit.wikimedia.org/r/1017100

Change #1017100 merged by jenkins-bot:

[integration/config@master] jjb: Bump ci-src-setup-simple from 0.6.0 to 0.6.1

https://gerrit.wikimedia.org/r/1017100

Mentioned in SAL (#wikimedia-releng) [2024-04-04T16:00:55Z] <dancy> Updating docker-pkg files on contint primary for T328472

Change #1017103 had a related patch set uploaded (by Ahmon Dancy; author: Ahmon Dancy):

[integration/config@master] dockerfiles: Bump ci-buster and ci-src-setup-simple

https://gerrit.wikimedia.org/r/1017103

Change #1017103 merged by jenkins-bot:

[integration/config@master] dockerfiles: Bump ci-buster and ci-src-setup-simple

https://gerrit.wikimedia.org/r/1017103

Change #1017104 had a related patch set uploaded (by Ahmon Dancy; author: Ahmon Dancy):

[integration/config@master] jjb: Bump ci-src-setup-simple from 0.6.1 to 0.6.2

https://gerrit.wikimedia.org/r/1017104

Change #1017104 merged by jenkins-bot:

[integration/config@master] jjb: Bump ci-src-setup-simple from 0.6.1 to 0.6.2

https://gerrit.wikimedia.org/r/1017104

Change #1017108 had a related patch set uploaded (by Ahmon Dancy; author: Ahmon Dancy):

[integration/config@master] dockerfiles: jar-updater 0.1.2: Include curl package

https://gerrit.wikimedia.org/r/1017108

Change #1017108 merged by jenkins-bot:

[integration/config@master] dockerfiles: jar-updater 0.1.2: Include curl package

https://gerrit.wikimedia.org/r/1017108

Change #1017110 had a related patch set uploaded (by Ahmon Dancy; author: Ahmon Dancy):

[analytics/refinery@master] Convert remaining git-fat files to git-lfs

https://gerrit.wikimedia.org/r/1017110

Change #1017110 merged by Ahmon Dancy:

[analytics/refinery@master] Convert remaining git-fat files to git-lfs

https://gerrit.wikimedia.org/r/1017110

Change #1017112 had a related patch set uploaded (by Ahmon Dancy; author: Ahmon Dancy):

[integration/config@master] jjb: analytics.yaml: Bump to docker-registry.wikimedia.org/releng/jar-updater:0.1.2

https://gerrit.wikimedia.org/r/1017112

Change #1017112 merged by jenkins-bot:

[integration/config@master] jjb: analytics.yaml: Bump to docker-registry.wikimedia.org/releng/jar-updater:0.1.2

https://gerrit.wikimedia.org/r/1017112

Change #1017113 had a related patch set uploaded (by Ahmon Dancy; author: Ahmon Dancy):

[analytics/refinery@master] bin/update-refinery-source-jars: git-fat to git-lfs

https://gerrit.wikimedia.org/r/1017113

Change #1017113 merged by Ahmon Dancy:

[analytics/refinery@master] bin/update-refinery-source-jars: git-fat to git-lfs

https://gerrit.wikimedia.org/r/1017113

Change #1017114 had a related patch set uploaded (by Ahmon Dancy; author: Ahmon Dancy):

[analytics/refinery@master] bin/update-refinery-source-jars: Pass --local to git lfs install

https://gerrit.wikimedia.org/r/1017114

Change #1017114 merged by Ahmon Dancy:

[analytics/refinery@master] bin/update-refinery-source-jars: Pass --local to git lfs install

https://gerrit.wikimedia.org/r/1017114

@Sfaci The issue with the analytics-refinery-update-jars-docker job should be resolved now. https://integration.wikimedia.org/ci/job/analytics-refinery-update-jars-docker/108/console has the output of a test run. Note that it created https://gerrit.wikimedia.org/r/c/analytics/refinery/+/1017074.

The job finished successfully! I can keep working on the deployment that was blocked by this.
Thank you very much!!

Thank you so much @hashar for unblocking us!

Do celebrate @dancy who has done the first pass of the migration and reviewed+deployed the patches I have sent :-]

Hopefully that is enough to mark this Resolved.

Thanks so much @dancy and @hashar and everyone else who has helped.
I believe that this is resolved. If I'm wrong about that, please feel free to reopen.