
Create a Maven package registry in Gitlab
Closed, Resolved, Public

Description

We want to migrate away from Archiva and replace it with Gitlab as a Maven package registry. See parent task (T367315) and doc for more details.

We expect to host ~50G of packages over the next 5 years. We will not migrate the historical releases from Archiva to Gitlab.

My understanding is that we can create a global package registry in Gitlab, but if that's not the case, we could create one that is part of the ci-tools group or maybe a dedicated group. We do NOT want to have a registry per repo, and we would like to avoid having multiple registries linked to multiple groups. Having a single registry allows us to keep that configuration centralized and transparent to all projects that need to use those dependencies.

Eventually, we want CI to upload artifacts to the registry, so most users should not need write access. During the implementation phase, to allow for experimentation, we want a few users to be able to upload packages manually for testing.

The registry should be readable anonymously by anyone.

AC:

Follow up steps for backups, can be used here or moved to a separate follow up task (copying to description so that it doesn't get lost):

  • Once Archiva content is static, make a snapshot
  • Disable regular backup jobs
  • Once mirrored content is purged, make a snapshot
  • Ensure the backups are no longer rotated / are retained forever

Event Timeline


I like this idea.

Because a repo can host different kinds of packages (npm, pip, maven, etc.), perhaps this global registry repo could be used for more than just maven/java?

I don't think there would be anything different to do for the archiva replacement other than do a little name bikeshedding. Actually using the repo for other things would be in different tasks, e.g. T366612: Publish Data Engineering maintained NodeJS packages to GitLab and use them in depender code.

Because a repo can host different kinds of packages (npm, pip, maven, etc.), perhaps this global registry repo could be used for more than just maven/java?

We should probably try to isolate different types of packages. If I understand correctly, Gitlab has native support for a number of package repositories (npm, python, ...).

We should probably try to isolate different types of packages

Possibly! Why do you think so?

IIUC, this repo would mostly be just a place to publish packages. Archiva was also previously used to host python pip packages.

I think each repo can have multiple kinds of package registries. I think (could be wrong here) the registries themselves would still be 'separate', they'd just be contained in one gitlab repo/project.

Yes, if we keep different kinds of artifacts separated in different package registries, I have no objection. At the moment, I'd like to focus on the Maven part and hope that we don't abuse it as much as we've abused Archiva!

A potential use case we have:

We'd like to host eventutilities-python and wikimedia-eventutilities (Java) in the same repo in gitlab, and publish both maven and pip packages from it. These could be configured to publish to different global GitLab repos, but centralizing instructions for how to publish and depend on Wikimedia published packages sounds nice.

At the moment, I'd like to focus on the Maven part and hope that we don't abuse it as much as we've abused Archiva!

Agree. I think the only work-to-do implications of my suggestion are:

  • Investigate the feasibility/desirability of using one global gitlab repo for many types of packages
  • If we want it, then do a little repo name bikeshedding.
Gehel triaged this task as High priority. Jun 13 2024, 1:16 PM

a few thoughts here:

  • I like the idea of moving from Archiva to GitLab, it's a good opportunity to consolidate
  • Archiva looks like it's currently 120G, we currently have ~500G of space on the gitlab boxes, so this would be 20% of our available space. There was some preliminary discussion of adding object storage for GitLab artifacts (but I'd like collaboration-services to comment on that as well as what this would do for backup timing)—it seems possible to do this given space constraints.
  • It might be tricky to get permissions and repository structure right, we could talk about that more.

Questions:

  • Does the 120G include maven mirroring? Will that continue?
  • Archiva looks like it's currently 120G, we currently have ~500G of space on the gitlab boxes, so this would be 20% of our available space. There was some preliminary discussion of adding object storage for GitLab artifacts (but I'd like collaboration-services to comment on that as well as what this would do for backup timing)—it seems possible to do this given space constraints.

We expect 50G of space used over the next 5 years. We don't want to migrate existing releases and we will stop doing mirroring.

Questions:

  • Does the 120G include maven mirroring? Will that continue?

No, we will stop doing mirroring.

I should have been clear that GitLab has ~500GB remaining not total, so that 20% from my above comment is not 20% of the disk, but 20% of remaining disk:

Filesystem                       Size  Used Avail Use% Mounted on
/dev/mapper/gitlab1004--vg-root  813G  187G  585G  25% /
/dev/md1                         1.8T  324G  1.4T  20% /srv/gitlab-backup

Adding 50GB over five years sounds reasonable. Tagging in collaboration-services to weigh in.
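For reference, the percentages discussed above can be recomputed from the df output. This is only a back-of-the-envelope sketch: the 813G and 585G figures come from the root filesystem line above, the 120G and 50G figures from earlier comments, and the helper name `share` is made up for illustration.

```python
# Figures taken from the df output above (as printed by df, in G).
ROOT_SIZE_G = 813    # total size of the root filesystem
ROOT_AVAIL_G = 585   # space still available on the root filesystem


def share(part_g, whole_g):
    """Return part/whole as a percentage, rounded to one decimal place."""
    return round(100 * part_g / whole_g, 1)


# Current Archiva size (120G) compared with the remaining space.
archiva_vs_avail = share(120, ROOT_AVAIL_G)

# Planned growth (50G over five years) compared with remaining and total space.
planned_vs_avail = share(50, ROOT_AVAIL_G)
planned_vs_total = share(50, ROOT_SIZE_G)

if __name__ == "__main__":
    print(f"120G is {archiva_vs_avail}% of the remaining space")
    print(f"50G is {planned_vs_avail}% of the remaining space")
    print(f"50G is {planned_vs_total}% of the whole disk")
```

So the full 120G would be roughly a fifth of the remaining space, while the planned 50G is under a tenth of it.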

A few more questions around backups, to make sure we don't surprise Data Persistence. What are the expectations for backing up the artifacts, would they be easier to restore or regenerate? Is the current ~120GB backed up and if yes, can it be purged after the migration or should it be retained?

One other clarifying question, does "50G of space used over the next 5 years" mean:

  • Start from 0 and get to 50 over 5 years
  • Start from 50 and maintain this over 5 years

A few more questions around backups, to make sure we don't surprise Data Persistence. What are the expectations for backing up the artifacts, would they be easier to restore or regenerate? Is the current ~120GB backed up and if yes, can it be purged after the migration or should it be retained?

Having backups would be super useful. We should be able to create new releases if needed (but it is a cumbersome process, where we would need to identify dependencies and rebuild in the correct order). Regenerating old artifacts is not possible (at least not as bit-for-bit identical artifacts, which is a problem).

I'm checking if we currently have backups on Archiva.

One other clarifying question, does "50G of space used over the next 5 years" mean:

  • Start from 0 and get to 50 over 5 years
  • Start from 50 and maintain this over 5 years

This means starting from 0 and getting to 50 over the next 5 years (obviously an estimate; predictions are hard, especially about the future).

Current archiva is backed up. We should keep those backups. Note that the archiva data will be kept and exposed with a simple web server. If size is an issue, we could exclude mirrored artifacts from the backups.

So just to make sure it's 100% clear, would the preference be to keep a backup of the final state of Archiva (minus mirrored artifacts) and also separately keep backups of what's in GitLab?

So just to make sure it's 100% clear, would the preference be to keep a backup of the final state of Archiva (minus mirrored artifacts) and also separately keep backups of what's in GitLab?

Yes, correct.

Minor precision, the final state of Archiva will also be kept available publicly behind a simple web server.

Backup steps that would likely be needed:

  • Once Archiva content is static, make a snapshot
  • Disable regular backup jobs
  • Once mirrored content is purged, make a snapshot
  • Ensure the backups are no longer rotated / are retained forever

This task is mostly done, but let's keep it open until we've published at least 1 project to Gitlab, with a CI workflow, to validate that this works outside of a simple test project uploaded manually.

LSobanski updated the task description.
Gehel renamed this task from Create a global Maven package registry in Gitlab to Create a Maven package registry in Gitlab. Jul 16 2024, 2:13 PM

It seems that the Gitlab model is to publish artifacts per project. At the group level, a read-only package registry is available, which aggregates all the projects under that group. This works well for projects hosted on Gitlab, but raises the question of what we do for projects hosted on Gerrit. We still need to publish them to Gitlab. Note that I don't fully understand how Gitlab works, so I might be missing something.

Proposal:

  • create a specific project in gitlab for hosting external artifacts
  • create a system user in gitlab for uploading artifacts to the package registry of that project
  • modify the current Gerrit / Zuul / Jenkins workflows to use that new user to upload artifacts to gitlab

@brennen: does the above sound reasonable?
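To make the proposal concrete, here is a rough sketch of what an upload by such a system user could look like at the HTTP level. It assumes GitLab's project-level Maven endpoint (`PUT /projects/:id/packages/maven/*path/:file_name`) and a token sent in a `Private-Token` header; the host, the project path `repos/external-artifacts`, and the function name are placeholders, not anything decided in this task. In practice `mvn deploy` would perform the upload; the sketch only builds the request and does not send it.

```python
import urllib.parse
import urllib.request


def build_maven_upload_request(base_url, project_path, group_id, artifact_id,
                               version, filename, payload, token,
                               token_header="Private-Token"):
    """Build (but do not send) a PUT request for one Maven artifact file.

    project_path is the hypothetical GitLab project that hosts the registry.
    token_header is an assumption: GitLab documents Private-Token,
    Deploy-Token and Job-Token depending on the kind of token used.
    """
    # The project path is URL-encoded so that it can stand in for the project ID.
    project = urllib.parse.quote(project_path, safe="")
    # Standard Maven layout: groupId dots become slashes, then artifact/version/file.
    artifact_path = "/".join(
        [group_id.replace(".", "/"), artifact_id, version, filename]
    )
    url = f"{base_url}/api/v4/projects/{project}/packages/maven/{artifact_path}"
    return urllib.request.Request(
        url, data=payload, method="PUT", headers={token_header: token}
    )


if __name__ == "__main__":
    request = build_maven_upload_request(
        "https://gitlab.example.org", "repos/external-artifacts",
        "org.example", "demo", "1.0.0", "demo-1.0.0.jar",
        b"placeholder bytes", "not-a-real-token",
    )
    print(request.get_method(), request.full_url)
```

Whoever holds the token can publish to that one project, which is why the question of where the token lives matters.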

does the above sound reasonable?

I think so, yes. I'm not quite sure where the token would need to live in that scenario.

Things currently in the registry under /repos: https://gitlab.wikimedia.org/groups/repos/-/packages

Wow TIL. Very cool. Is it possible to use this as a global place to depend on packages from? Could I use this somehow instead of a project specific registry, e.g. here?

  • create a specific project in gitlab for hosting external artifacts
  • create a system user in gitlab for uploading artifacts to the package registry of that project

+1. Can we consider naming these so that they are not java/maven specific, as asked here? https://phabricator.wikimedia.org/T367322#9884999

Investigate the feasibility/desirability of using one global gitlab repo for many types of packages
If we want it, then do a little repo name bikeshedding.

+1. Can we consider naming these so that they are not java/maven specific, as asked here? https://phabricator.wikimedia.org/T367322#9884999

I'm not going to stop anyone from using a single project as a catch-all, but I do wonder if we'd come to regret this in a "working against the grain of the tool" sort of way. I don't have enough experience with the package registry feature to say one way or the other. Maybe it's totally fine, but I'm curious what upstream would suggest...

...

Well, I looked around and there's this: https://docs.gitlab.com/ee/user/packages/workflows/project_registry.html

Store all of your packages in one GitLab project

You can store all of your packages in one project’s package registry. Rather than using a GitLab repository to store code, you can use the repository to store all your packages. Then you can configure your remote repositories to point to the project in GitLab.

You might want to do this because:

  • You want to publish your packages in GitLab, but to a different project from where your code is stored.
  • You want to group packages together in one project. For example, you might want to put all npm packages, or all packages for a specific department, or all private packages in the same project.
  • When you install packages for other projects, you want to use one remote.
  • You want to migrate your packages from a third-party package registry to a single place in GitLab and do not want to worry about setting up separate projects for each package.
  • You want to have your CI/CD pipelines build all of your packages to one project, so the person responsible for validating packages can manage them all in one place.

So it seems like it's at least a supported workflow.

So it seems like it's at least a supported workflow.

So...let's do it! :)

We have a registry in place for a test project. The scheme can be replicated to more projects as we go.

For packages hosted on Gerrit, we should publish all of them to the same package registry on Gitlab: https://gitlab.wikimedia.org/repos/maven/

Okay, so it looks like the answer to my question

+1. Can we consider naming these so that they are not java/maven specific, as asked here?
https://phabricator.wikimedia.org/T367322#9884999

was no?

When we want to have a global registry for python or javascript packages, we should then create a new repo in gitlab to host them?

Reposting a use case from https://phabricator.wikimedia.org/T367322#9884996

wikimedia-event-utilities is both a Java and a Python project. The python code currently lives in a separate repo, but we would like to have them in the same repo one day. It would be nice if they published to the same global package registry.

For packages hosted on Gerrit, we should publish all of them to the same package registry on Gitlab: https://gitlab.wikimedia.org/repos/maven/

That's actually a meta-registry, which aggregates the projects underneath it. If I follow the docs correctly, we can only publish directly to a project and not to the meta-registry. So we would need to choose an appropriate sub-project.

I'm also not quite clear on where the secret lives on the Jenkins side, or how it flows through the system. Afaict the archiva credentials were provided (T132178) at https://integration.wikimedia.org/ci/configfiles/ for storage. I don't have access to see or change that, but my understanding of how it worked for archiva is that we provide a name there, and the jjb config can then write out a .xml file with the secret, which is used during the build. The patches I've seen so far on the new release process provide a CI_RELEASE_TOKEN env-var during the build, but I'm not sure where that would be configured. I'm imagining that gets integrated into the existing {name}-maven-release jjb template? One thing to keep in mind is that the token is specific to the project being published to, so it might make our lives a bit easier to have a single project, and thus a single token, for packages published from gerrit to gitlab.
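As an illustration of the "write out a .xml file with the secret" step, a build wrapper could turn the CI_RELEASE_TOKEN env-var into a Maven settings.xml along these lines. This is a sketch under assumptions, not the existing jjb mechanism: the `gitlab-maven` server id and the `Private-Token` header follow the shape shown in GitLab's Maven documentation, and the function name is made up.

```python
import os
import xml.etree.ElementTree as ET


def render_settings_xml(token, server_id="gitlab-maven",
                        header_name="Private-Token"):
    """Return a minimal Maven settings.xml that sends `token` as an HTTP header.

    server_id must match the <id> of the repository declared in the pom;
    both default values here are assumptions, not confirmed configuration.
    """
    settings = ET.Element("settings")
    servers = ET.SubElement(settings, "servers")
    server = ET.SubElement(servers, "server")
    ET.SubElement(server, "id").text = server_id
    configuration = ET.SubElement(server, "configuration")
    http_headers = ET.SubElement(configuration, "httpHeaders")
    prop = ET.SubElement(http_headers, "property")
    ET.SubElement(prop, "name").text = header_name
    ET.SubElement(prop, "value").text = token
    return ET.tostring(settings, encoding="unicode")


if __name__ == "__main__":
    # CI_RELEASE_TOKEN is the env-var named in the patches mentioned above.
    token = os.environ.get("CI_RELEASE_TOKEN")
    if token is None:
        raise SystemExit("CI_RELEASE_TOKEN is not set")
    with open("settings.xml", "w", encoding="utf-8") as handle:
        handle.write(render_settings_xml(token))
```

The build would then run Maven with `-s settings.xml`, so the secret stays out of the pom and out of the repository.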

In terms of releasing a java package from gerrit to the gitlab repository, my current understanding of the desired process from a package maintainer perspective (of something already releasing through archiva) is as follows. Is this correct?

  • Update the package to a recent version of wmf-jvm-parent-pom
  • Set gitlab.* (from parent-pom) maven parameters to the appropriate project to publish to
  • Ensure the {name}-maven-release jjb template is applied
  • Trigger the build from jenkins ui

Reposting a use case from https://phabricator.wikimedia.org/T367322#9884996

wikimedia-event-utilities is both a Java and a Python project. The python code currently lives in a separate repo, but we would like to have them in the same repo one day. It would be nice if they published to the same global package registry.

Naming is such a hard thing. If I had to guess, some of the name requirements might be:

  • Not overly specific to gerrit, but not suggesting gitlab-based repos need to publish there
  • Not overly specific to a particular packaging method, gitlab handles multiple types of repositories and it would be nice to take advantage of that

Naming goals:

  • Should be related to packaging / publishing
  • How to refer to gerrit but not specifically? Can call them external, but that feels a little like external to wmf.
  • Keep in mind gitlab offers the meta-repos, so a/b/c can be accessed as a, a/b, or a/b/c.

Maybe we could do something like repos/packages/gerrit, typically accessed as repos/packages. There is always the option to access via the repos level meta-repository, which will include everything published under it.
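To show how the same package could be reached at several levels, here is a small sketch that builds the Maven URLs for a project and for its parent groups. It assumes GitLab's documented project-level (`/projects/:id/packages/maven`) and group-level (`/groups/:id/-/packages/maven`) endpoints, and that a URL-encoded path is accepted in place of a numeric ID, which may not hold for every endpoint. The function names are illustrative, and repos/packages/gerrit is only the name floated above.

```python
from urllib.parse import quote

GITLAB = "https://gitlab.wikimedia.org"


def project_maven_url(project_path, base=GITLAB):
    """Project-level registry: the only level that accepts uploads."""
    return f"{base}/api/v4/projects/{quote(project_path, safe='')}/packages/maven"


def group_maven_url(group_path, base=GITLAB):
    """Group-level registry: read-only view over the projects below the group."""
    return f"{base}/api/v4/groups/{quote(group_path, safe='')}/-/packages/maven"


if __name__ == "__main__":
    # Publish to the project, depend on it via either enclosing group.
    print("publish:", project_maven_url("repos/packages/gerrit"))
    print("consume:", group_maven_url("repos/packages"))
    print("consume:", group_maven_url("repos"))
```

Dependent projects that point at a group-level URL would not need to change if packages later move between projects under that group.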

I think we should have 1 WMF wide global package registry in GitLab.
I think one registry would help simplify configuration. Publishing and dependency configuration could then all use the same package registry location.

Are there any downsides of having 1 global package registry?

If we do multiple registries, at least we can declare dependencies using a meta-repo or global level. If we decide to do multiple registries, we should have documented guidance on how developers should depend on packages. It would be better to depend on them from a global or meta level, rather than hardcoding specific project registry URLs across dependent poms and setup.pys and package.jsons out there.

I suppose I was thinking the default in gitlab would be to publish to the repo itself, using the meta-repo to pull related packages together. For example, repos/search-platform/HebMorph would publish to itself and typically be accessed from the repos/search-platform meta-repo. I like this in part because it keeps everything self-contained. I have very different requirements around reuse though; most of my packages aren't going to be referenced via maven.

In existing gitlab projects I've done this by creating a per-project access token and making it accessible to the release publishing task as a per-project ci variable. This access token is already needed to push git tags, so it's not really any additional CI configuration beyond adding the write_api bit to the token. This is entirely self-service as long as you have maintainer access to the repo. You can read the docs and publish a package in an afternoon.

I'm not opposed to having a shared repo and having gitlab projects publish to it as well, but I would want to see it retain that same ease of integration. I expect the access token to the central repo would be the only real stumbling block. I see in the gitlab docs we can have instance, group, or project level ci variables. I suppose in theory a group-level ci variable at the repos/ level could provide the publishing token to everything that needs it. The token would need api write access, which means everything under repos/ will have api write access to the publishing repo. Probably acceptable?

I expect the access token to the central repo would be the only real stumbling block.

Indeed, there would be some process overhead for a new package to be published to the central registry. This is probably good. While a package is more in development or for single use, they could still be published to a specific repo.

For packages intended for broad use, a single central registry would be nice.

This comment was removed by Gehel.

We still need a name to move this forward.

/repos/packages?

If @Gehel and @brennen generally agree, that seems okay? Perhaps putting WMF or wikimedia in the name could be useful?

  • /repos/wmf-packages
  • /repos/wikimedia-packages

Then to configure pip to use the package registry:

--extra-index-url https://gitlab.wikimedia.org/api/v4/projects/repos%2Fwikimedia-packages/packages/pypi/simple

And for maven, e.g.:

<repositories>
  <repository>
    <id>gitlab-maven</id>
    <url>https://gitlab.wikimedia.org/api/v4/projects/repos%2Fwikimedia-packages/packages/maven</url>
  </repository>
</repositories>

Sounds reasonable to me, I would go with wmf-packages to keep things slightly shorter.

Sounds reasonable to me, I would go with wmf-packages to keep things slightly shorter.

+1