Migrate analytics/datahub pipeline to GitLab
Closed, Resolved, Public

Description

We are ready to migrate the pipeline that builds our DataHub (https://datahub.wikimedia.org) packages to GitLab.

The source repository is https://gerrit.wikimedia.org/r/analytics/datahub and the .pipeline directory is in a branch named wmf.

However, we would like to migrate only the .pipeline directory, which contains eight subdirectories, each of which contains only a blubber.yaml file.
The existing repository analytics/datahub was created as a fork of the upstream project: https://github.com/datahub-project/datahub

From that point until now, we have maintained our PipelineLib files and very minor modifications to the upstream source in this wmf branch.
Upgrading DataHub required pulling from upstream, rebasing the wmf branch against a new tag, fixing issues, force-pushing the wmf branch etc.
This process of maintaining our own fork was a little unwieldy, so we have now modified all of the blubber.yaml files to make them totally independent of any accompanying source files.
They now clone the upstream datahub source from GitHub as part of the build process and we build against these pristine sources.

Therefore, I believe that we are now ready to migrate the eight blubber files to GitLab and the .pipeline/config.yaml to .gitlab-ci.yml.

I can create a repository at https://gitlab.wikimedia.org/repos/data-engineering/datahub and begin the process myself, if it helps, or I can pause and await further guidance from Release-Engineering-Team.

Event Timeline

BTullis moved this task from Incoming to Ready for Work on the Data-Platform-SRE board.

@thcipriani - @hashar - are you both happy for me to start this process, as described above?
i.e. I'll make a new repo at: https://gitlab.wikimedia.org/repos/data-engineering/datahub and carry out the steps shown here.

We'll then be able to disable/archive analytics/datahub in gerrit.

If so, is there anything else I should know about before starting?

I believe everything needed for migration of service pipelines exists and should work as documented, please proceed.

Tagging in @dancy and @dduvall from Release-Engineering-Team in case you run into any problems—they're the best resource for blubber/kokkuri/gitlab-ci.yml and admins in GitLab and Gerrit.

I have now prepared a merge request to enable the DataHub build pipeline under GitLab.
https://gitlab.wikimedia.org/repos/data-engineering/datahub/-/merge_requests/1

It essentially duplicates the functionality of this: https://gerrit.wikimedia.org/r/plugins/gitiles/analytics/datahub/+/refs/heads/wmf/.pipeline/config.yaml
...although I have attempted to make a few improvements along the way.

I'd be very grateful if @dancy and/or @dduvall (or anyone else) would be kind enough to review it, please.

There are 8 containers built by the pipeline on each new commit.
When the commit is to a branch other than main, only the build stage runs.
When the commit is the result of a merge to main, the publish stage runs instead.
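
The branch rule described above can be sketched as a small function. This is only an illustration of the behaviour described in this comment, not the actual .gitlab-ci.yml configuration; the function name and the `default_branch` parameter are hypothetical.

```python
def stages_for_commit(branch: str, default_branch: str = "main") -> list:
    """Return the pipeline stages expected to run for a commit on `branch`.

    Assumption: a merge to the default branch runs the publish stage
    instead of the build stage; every other branch runs only build.
    """
    if branch == default_branch:
        return ["publish"]
    return ["build"]
```

For example, `stages_for_commit("main")` selects only the publish stage, while any other branch name selects only the build stage.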

I have configured https://gitlab.wikimedia.org/repos/data-engineering/datahub with the following settings:

  • branch protection on the main branch
  • fast-forward-only merges permitted
  • squashing encouraged
  • pipelines must succeed

The README.md is a bit basic at the moment. I was wondering about adding a rule to prevent rebuilding the containers when only the README is updated, but I haven't done so yet.
Once it is merged, we will be able to archive/disable the analytics/datahub repository in gerrit.
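
As a rough sketch of the README rule being considered above, the decision could look like the following. It only models the idea (skip the container rebuild when nothing but documentation changed); it is not a GitLab CI rule, and the names `DOCS_ONLY_FILES` and `needs_rebuild`, as well as the example file path, are hypothetical.

```python
# Assumption: README.md is the only file whose changes never affect the images.
DOCS_ONLY_FILES = {"README.md"}


def needs_rebuild(changed_files):
    """Return True if any changed file could affect the built containers."""
    return any(path not in DOCS_ONLY_FILES for path in changed_files)
```

With this sketch, a commit touching only README.md would not trigger a rebuild, while a commit that also touches a blubber.yaml file would.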

Thank you for adding trusted runners to this repo.
On first run, it looks like there are a couple of issues to fix with the publish pipelines.

  1. It looks like gradle isn't happy with the proxy servers e.g. https://gitlab.wikimedia.org/repos/data-engineering/datahub/-/jobs/126792#L501
  2. It looks like there is a problem with the tag format that I have attempted to use to keep it the same as our current containers: e.g. https://gitlab.wikimedia.org/repos/data-engineering/datahub/-/jobs/126797#L748

I'll come back to these tomorrow.

The tag is okay but the image path (/wikimedia/datahub-kafka-setup) is not. Since your repo is named repos/data-engineering/datahub you can only publish images named repos/data-engineering/datahub or repos/data-engineering/datahub/something-below/....
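
The naming constraint described above can be expressed as a simple path check. This is a minimal sketch of the rule as stated in this comment, not the registry's actual enforcement logic; the function name is hypothetical.

```python
def is_publishable(repo_path: str, image_name: str) -> bool:
    """Check that an image name equals the repo path or sits beneath it."""
    repo = repo_path.strip("/")
    image = image_name.strip("/")
    # Require a "/" after the repo path so that a sibling such as
    # "repos/data-engineering/datahub-other" is not accepted by prefix match.
    return image == repo or image.startswith(repo + "/")
```

Under this check, `/wikimedia/datahub-kafka-setup` is rejected for the repo `repos/data-engineering/datahub`, whereas a name nested below the repo path is accepted.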

I have created this merge request to address the two outstanding issues.
Unfortunately, I couldn't find a way to keep the build stage running on the shared runners with the additional options being passed to gradle, so now both build and publish stages are configured to run on the trusted runners.

If anyone would care to review, then we can merge it and verify that the publish stage works as expected. Thanks.

Change 942687 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] Use the new DataHub images built with GitLab-CI

https://gerrit.wikimedia.org/r/942687

Change 942689 had a related patch set uploaded (by Btullis; author: Btullis):

[integration/config@master] Disable the build pipeline in zuul for datahub

https://gerrit.wikimedia.org/r/942689

I'll aim to deploy the newly built DataHub containers with this change on Monday.
We can archive the Gerrit repository now.

Awesome! We can take care of that.

Change 942687 merged by jenkins-bot:

[operations/deployment-charts@master] Use the new DataHub images built with GitLab-CI

https://gerrit.wikimedia.org/r/942687

Change 943515 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] Bump all datahub chart versions

https://gerrit.wikimedia.org/r/943515

Change 943515 merged by jenkins-bot:

[operations/deployment-charts@master] Bump all datahub chart versions

https://gerrit.wikimedia.org/r/943515

Change 943524 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] Remove unnecessary datahub upgrade repository overrides

https://gerrit.wikimedia.org/r/943524

Change 942689 merged by jenkins-bot:

[integration/config@master] Disable the build pipeline in zuul for datahub

https://gerrit.wikimedia.org/r/942689

Change 943524 merged by jenkins-bot:

[operations/deployment-charts@master] Remove unnecessary datahub upgrade repository overrides

https://gerrit.wikimedia.org/r/943524

Mentioned in SAL (#wikimedia-releng) [2023-07-31T11:09:33Z] <James_F> Zuul: [analytics/datahub] Mark as archived, moved to GitLab in T341194

Mentioned in SAL (#wikimedia-releng) [2023-07-31T11:16:10Z] <James_F> jjb: Manually deleted ~30 jjb definitions for analytics/datahub, plus the now-empty datahub view in Jenkins, for T341194

Oh, I'm getting a crashloop from the mce-container:
It looks like the jmx_prometheus_javaagent.jar isn't present or is in the wrong location.

btullis@deploy1002:~$ kubectl logs -f datahub-mce-consumer-main-86486f9c95-hgfh7 datahub-mce-consumer-main 
2023/07/31 11:14:50 Waiting for: tcp://kafka-test1006.eqiad.wmnet:9092
2023/07/31 11:14:50 Connected to tcp://kafka-test1006.eqiad.wmnet:9092
Error opening zip file or JAR manifest missing : jmx_prometheus_javaagent.jar
Error occurred during initialization of VM
agent library failed to init: instrument
2023/07/31 11:14:50 Command exited with error: exit status 1

Change 943567 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] Deploy new datahub images

https://gerrit.wikimedia.org/r/943567

Change 943567 merged by jenkins-bot:

[operations/deployment-charts@master] Deploy new datahub images

https://gerrit.wikimedia.org/r/943567

This is now done. We have fully deployed the new images generated by GitLab-CI.

Archived this today. Everything still exists in this repo, but it's now read-only and HEAD is set to a new branch with a pointer to the location on GitLab: https://gerrit.wikimedia.org/r/plugins/gitiles/analytics/datahub/