
Review and improve the build process for DataHub containers
Closed, Resolved · Public

Description

Background

We have selected DataHub as the third-party tool that we wish to use as a Data Catalog.
It facilitates data lifecycle management, governance procedures, data discovery, self-documentation, and a host of other features.
It supports our required licensing model, being Apache 2.0 licensed.

Deployment

The front end is available at https://datahub.wikimedia.org and the documentation is on Wikitech: https://wikitech.wikimedia.org/wiki/Data_Engineering/Systems/DataHub

It is deployed in a hybrid model: partly on wikikube, with persistent datastores on bare metal and VMs.
We currently run DataHub in the production realm, since it is integrated with various back-end data sources and sinks, such as Kafka, Druid, Hive, and Superset.
It neither handles nor provides proxy access to any private data itself; rather, it functions as a metadata store describing these various data stores and the pipelines that move data between them.

For the most part, Kafka-jumbo is used as a message bus to get metadata changes in and out of DataHub, although applications and pipelines may also push data directly to the DataHub GMS API at https://datahub-gms.discovery.wmnet:30443
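
For illustration, here is a minimal sketch of pushing metadata directly to the GMS API using the acryl-datahub Python client mentioned below; the dataset URN, platform, and description are purely illustrative assumptions:

```
# A minimal sketch, assuming the acryl-datahub Python client is installed.
# The dataset name and description below are illustrative, not real entities.
from datahub.emitter.mce_builder import make_dataset_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import DatasetPropertiesClass

# Point the emitter at our GMS endpoint.
emitter = DatahubRestEmitter(gms_server="https://datahub-gms.discovery.wmnet:30443")

# Describe a (hypothetical) Hive table and emit the metadata change proposal.
mcp = MetadataChangeProposalWrapper(
    entityUrn=make_dataset_urn(platform="hive", name="example_db.example_table", env="PROD"),
    aspect=DatasetPropertiesClass(description="An example table, for illustration only."),
)
emitter.emit(mcp)
```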

Upstream Release Artifacts

The project publishes the following release artifacts:

  1. Source code zips and tarballs on GitHub: https://github.com/datahub-project/datahub/releases
  2. Docker containers on Docker Hub, published by acryldata: https://hub.docker.com/u/acryldata
  3. Python clients are published on PyPI: https://pypi.org/project/acryl-datahub/
  4. Several small binary jars for specific features, such as Spark integration and Airflow integration, appear to be published, but I am not sure where.

Building DataHub

Given that we run DataHub in the production realm, our security policies prevent us from running the Docker containers that are published to Docker Hub.
This means that the only way we can obtain DataHub containers and the binaries they contain is to build them ourselves.

In order to support the deployment so far, we have therefore created our own fork of the datahub repository.
We have created a wmf branch, to which we have committed our changes on top of the upstream main branch.

We have used Blubber and the Deployment Pipeline to base our build of DataHub on Debian and our pre-existing production images.
With the exception of a change to the DataHub logo and favicon, which are not essential, every change we have so far made to the upstream code has been in support of this parallel build process. We do not anticipate diverging from the upstream codebase in any way other than to support this build process.

Problem Statement

The problem is that the upstream project develops quickly, which makes it increasingly difficult and time-consuming for us to maintain this parallel build process.
The upstream build process, which is part of the main codebase, is closely tied to Docker and Alpine Linux. On each point release we therefore have to assess what changes have been made, backport those changes to the Blubber files, and then backport them to our bespoke Helm chart.

This is time-consuming, not because we have to wait for many builds to complete in CI (although we do), but because it requires us to examine the changes carefully and replicate their behaviour with Blubber and the deployment pipeline.

Potential solutions

It is understandable that the SRE team does not wish for a proliferation of operating systems in the production realm.
Current policy requires that every container be based on a Debian-based production image.

We could investigate any of the following methods to address the problem:

  1. Consider the Alpine-based images released on Docker Hub to be a permissible source of application binaries and supporting files. This commit provides an example of how we might do that with the Docker Hub based images, simply copying binaries from Alpine images to Debian images. Initial inspection suggests that they *should* work, given that they are Java binaries and do not use JNI (see the sketch after this list).
  2. Replicate the upstream build process on our own version of Alpine and then copy the necessary files from those ephemeral artifacts to Debian images.
  3. Move our deployment of DataHub out of the production realm, instead running it under WMCS where the risks associated with using third-party containers are lower.
  4. Review the policy around Alpine containers in production and make an exception for DataHub, allowing us to build our own version with an unmodified build process from upstream.
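
To check the claim in option 1 that the published binaries are pure Java, here is a rough sketch (my own, not part of any existing tooling) that scans jars extracted from the Alpine images for bundled native libraries, which would be the main portability risk when moving from musl to glibc:

```
# A hedged sketch: scan jars for bundled native libraries (JNI), which would
# be the main risk when copying binaries from Alpine (musl) to Debian (glibc).
# Usage: python3 check_native.py path/to/*.jar  (the script name is hypothetical)
import sys
import zipfile

NATIVE_SUFFIXES = (".so", ".dll", ".dylib", ".jnilib")

def native_entries(jar_path):
    """Return any entries in the jar that look like native libraries."""
    with zipfile.ZipFile(jar_path) as jar:
        return [name for name in jar.namelist() if name.endswith(NATIVE_SUFFIXES)]

for path in sys.argv[1:]:
    hits = native_entries(path)
    print(f"{path}: {'native code found' if hits else 'appears to be pure Java'}")
    for name in hits:
        print(f"  {name}")
```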

I'm sure that there are other potential solutions and I welcome any suggestions.

Event Timeline

Another idea: don't build the project when building the images.
Our ‘fork’ would just have some tooling to build (and upload?) a release of the build artifacts. A separate repo would be responsible for the .pipeline/ Blubber stuff, which would know how to download the build artifacts and put them in the images. This would avoid having to run gradlew every time you make a change to .pipeline-related stuff.
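
As a sketch of what the consumer side of that tooling might look like (the artifact URL and checksum below are placeholders, not real release locations):

```
# A hypothetical sketch of the "download prebuilt artifacts" step. The URL and
# checksum are placeholders; real values would come from our release tooling.
import hashlib
import urllib.request

ARTIFACT_URL = "https://releases.example.wikimedia.org/datahub/datahub-gms.tar.gz"  # placeholder
EXPECTED_SHA256 = "0" * 64  # placeholder

def fetch_and_verify(url, expected_sha256, dest):
    """Download an artifact and fail loudly if its checksum does not match."""
    with urllib.request.urlopen(url) as resp:
        data = resp.read()
    digest = hashlib.sha256(data).hexdigest()
    if digest != expected_sha256:
        raise ValueError(f"checksum mismatch for {url}: got {digest}")
    with open(dest, "wb") as f:
        f.write(data)

fetch_and_verify(ARTIFACT_URL, EXPECTED_SHA256, "datahub-gms.tar.gz")
```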

Yes, that sounds like it would work.
In fact I'm not sure that we're going to have to modify the code much anyway for the MVP.
The one file that I know we will have to change is the jaas.conf file, which configures authentication for the frontend.
With that one though, perhaps we could just keep this file somewhere else and mount it as a volume from the Helm chart.

So maybe this streamlining is something that can wait until the MVP is done anyway, and we can put it in the productionization column.

Oh look, upstream has some GitHub workflow configuration for publishing jars:
https://github.com/linkedin/datahub/blob/master/.github/workflows/publish-datahub-jars.yml

I don't think that they're using it for publishing any public artifacts on GitHub, but we might be able to adapt it to publish to GitLab or Archiva.

EChetty set the point value for this task to 2.5.
EChetty changed the point value for this task from 2.5 to 1.5.

Repurposing this ticket to find a solution to the problem of the build process of DataHub.

BTullis renamed this task from Streamline CI for our fork of DataHub to Review and improve the build process for DataHub. (Mar 23 2023, 5:59 PM)
BTullis triaged this task as High priority.
BTullis removed a subscriber: EChetty.

Change 903262 had a related patch set uploaded (by Btullis; author: Btullis):

[analytics/datahub@wmf] Tweak the build process and fix local container builds

https://gerrit.wikimedia.org/r/903262

Change 903262 merged by jenkins-bot:

[analytics/datahub@wmf] Tweak the build process and fix local container builds

https://gerrit.wikimedia.org/r/903262

BTullis renamed this task from Review and improve the build process for DataHub to Review and improve the build process for DataHub containers. (Mar 30 2023, 3:13 PM)
BTullis claimed this task.
BTullis updated the task description.
BTullis removed the point value for this task.
BTullis added subscribers: Joe, akosiaris, elukey and 2 others.
BTullis added a subscriber: ayounsi.
BTullis added a subscriber: Stevemunene.
BTullis added a subscriber: JMeybohm.

Change 935788 had a related patch set uploaded (by Btullis; author: Btullis):

[analytics/datahub@wmf] Begin un-forking datahub from the upstream

https://gerrit.wikimedia.org/r/935788

Change 935788 merged by jenkins-bot:

[analytics/datahub@wmf] Begin un-forking datahub from the upstream

https://gerrit.wikimedia.org/r/935788

I'm going to say that this is done. It's much better than it was: cleaner and easier to maintain.
I've effectively un-forked our build process from the DataHub source code, so that each Blubber definition uses a pristine copy of the upstream DataHub source.
As a result, we no longer have to maintain these Blubber files within a branch in our own fork of DataHub.

The next step will be to complete the un-forking by moving all of the blubber files into a new repository as part of T332953: Migrate PipelineLib repos to GitLab

I've created this ticket T341194: Migrate analytics/datahub pipeline to GitLab for migrating the DataHub build pipeline to GitLab.