
[NEEDS GROOMING] Data pipelines should be published to archiva
Open, Needs Triage, Public

Description

User Story
As a platform engineer I need to deploy data pipeline code and dependencies to production infra, using generally available tools and practices
Success Criteria
  • Data pipeline modules and dependencies are published to Archiva
  • Data pipeline modules and dependencies are automatically deployed on build

Notes for grooming

  • This is a requirement for https://phabricator.wikimedia.org/T295807
  • Need to make sure we can auth to archiva
  • The dep management model won't change; we'll just need to publish conda-env artifacts and jars to archiva instead of Gitlab.

Event Timeline

I dunno, @mforns we might be able to support a Gitlab artifact source too.

Why not! I don't think it would be super hard, no?
The important thing, I guess, is to come up with a set of guidelines for what a packaged env should look like.
If a package fulfills those, then where it comes from doesn't matter that much, no?

@Ottomata @mforns happy to let you guys drive here.

If archiva is the canonical place to store deps, we'll adopt it. But if you feel like experimenting with Gitlab in your toolchain, happy to be your guinea pig :)

What I like about Gitlab is the easy CI integration (once we have runners), and having everything in one place. What I don't like is that the registry is defined per repo (might be configurable at group/org level, don't know).
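For illustration, the reason GitLab's registry is effectively per-repo is that its package APIs scope everything under a single project ID. A minimal sketch of how a download URL for GitLab's generic package registry is built (host, project ID, and package names below are hypothetical examples, not real artifacts):

```python
def gitlab_generic_package_url(host, project_id, package, version, filename):
    """Build a download URL for GitLab's generic package registry.

    GitLab scopes packages under a single project ID, which is why the
    registry is defined per repo rather than per group/org.
    """
    return (
        f"https://{host}/api/v4/projects/{project_id}"
        f"/packages/generic/{package}/{version}/{filename}"
    )

# Hypothetical example: project 42 hosting a packed conda env.
url = gitlab_generic_package_url(
    "gitlab.wikimedia.org", 42, "my-pipeline-env", "0.1.0", "env.tar.gz"
)
print(url)
```

Since these are plain HTTPS URLs, a URL-based dependency resolver could consume them the same way it consumes any other artifact URL.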

@mforns +1 on focusing on guidelines.

So far, our WIP dependency resolution mostly just works with URLs, so if gitlab has an accessible URL for your artifact, I think you'll be able to use it.

The Archiva/Maven integration we are working on just resolves Maven coordinates to URLs on archiva.wikimedia.org.
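For context, the standard Maven repository layout makes that resolution a simple string mapping from coordinates to a URL path. A minimal sketch (the repository base URL and coordinates below are illustrative, not real artifacts):

```python
def maven_artifact_url(repo_base, group_id, artifact_id, version, packaging="jar"):
    """Resolve Maven coordinates to a URL under the standard repository layout:

    <base>/<groupId with dots as slashes>/<artifactId>/<version>/
        <artifactId>-<version>.<packaging>
    """
    group_path = group_id.replace(".", "/")
    return (
        f"{repo_base}/{group_path}/{artifact_id}/{version}/"
        f"{artifact_id}-{version}.{packaging}"
    )

# Illustrative coordinates only.
url = maven_artifact_url(
    "https://archiva.wikimedia.org/repository/releases",
    "org.wikimedia", "example-pipeline", "1.0.0",
)
print(url)
```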

Soooo...TBD!

@gmodena can you give me an example of an existing gitlab-hosted artifact URL?

gmodena renamed this task from Data pipelines should be published to archiva to [NEEDS GROOMING] Data pipelines should be published to archiva.Mar 1 2022, 8:24 AM
gmodena added a project: Data Pipelines.

Not sure if this is still relevant since Gitlab provides registries and artifacts management.

Ya, I'd say archiva will be most useful if you are already publishing JVM artifacts there. You can upload other things (conda envs, etc.), but if you can do that from gitlab, gitlab might even be better. Archiva is hosted on a single node (with backups), so if it goes down for any reason, artifacts won't be available. I don't know what kind of HA gitlab provides, though.