Background
We have selected DataHub as the third-party tool that we wish to use as a Data Catalog.
It facilitates data lifecycle management, governance procedures, data discovery, self-documentation, and a host of other features.
It supports our required licensing model, being Apache 2.0 licensed.
Deployment
The front end is available at https://datahub.wikimedia.org and the documentation is on Wikitech: https://wikitech.wikimedia.org/wiki/Data_Engineering/Systems/DataHub
It is deployed in a hybrid model, partly on wikikube, with persistent datastores on bare metal and VMs.
We currently run DataHub in the production realm, since it is integrated with various back-end data sources and sinks, such as Kafka, Druid, Hive, and Superset.
It neither handles nor provides proxy access to any private data itself; rather, it functions as a metadata store describing these various data stores and the pipelines that move data between them.
For the most part, Kafka-jumbo is used as a message bus to get metadata changes in and out of DataHub, although applications and pipelines may also push metadata directly to the DataHub GMS API at https://datahub-gms.discovery.wmnet:30443
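As an illustration, a pipeline that pushes metadata directly over REST could use the acryl-datahub Python client listed below under Upstream Release Artifacts. This is a minimal sketch, not a sanctioned pattern; the dataset name and description are hypothetical:

```python
from datahub.emitter.mce_builder import make_dataset_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import DatasetPropertiesClass

# Point the emitter at the GMS API endpoint mentioned above.
emitter = DatahubRestEmitter(gms_server="https://datahub-gms.discovery.wmnet:30443")

# Hypothetical example: attach a description aspect to a Hive table.
mcp = MetadataChangeProposalWrapper(
    entityUrn=make_dataset_urn(platform="hive", name="example_db.example_table", env="PROD"),
    aspect=DatasetPropertiesClass(description="An example table"),
)
emitter.emit(mcp)
```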
Upstream Release Artifacts
The project publishes the following release artifacts:
- Source code zips and tarballs on GitHub: https://github.com/datahub-project/datahub/releases
- Docker containers on Docker Hub, published by acryldata: https://hub.docker.com/u/acryldata
- Python clients on PyPI: https://pypi.org/project/acryl-datahub/
- Several small binary jars for specific features, such as Spark integration and Airflow integration, appear to be published as well, but I am not sure where.
Building DataHub
Given that we run DataHub in the production realm, our security policies prevent us from running the Docker containers that are published to Docker Hub.
This means that the only way we can obtain DataHub containers and the binaries they contain is to build them ourselves.
To support the deployment so far, we have therefore created our own fork of the datahub repository.
We have created a wmf branch, to which we have committed our changes on top of the upstream main branch.
We have used Blubber and the Deployment Pipeline to base our build of DataHub on Debian and our pre-existing production images.
With the exception of a change to the DataHub logo and favicon, which are not essential, every change we have made to the upstream code so far has been in support of this parallel build process. We do not anticipate diverging from the upstream codebase in any way other than to support this build process.
Problem Statement
The problem today is that the upstream project develops quickly, making it increasingly difficult and time-consuming for us to maintain this parallel build process.
The upstream build process, which is part of the main codebase, is closely tied to Docker and Alpine Linux. This means that on each point release we have to assess what changes might have been made, backport those changes to the Blubber files, and then backport them again to our bespoke Helm chart.
This is time-consuming not because we have to wait for many builds to complete in CI (although we do), but because it requires us to examine the changes carefully and replicate their behaviour with Blubber and the Deployment Pipeline.
Potential solutions
It is understandable that the SRE team does not wish to see a proliferation of operating systems in the production realm.
Current policy requires that every container be based on a Debian-based production image.
We could investigate any of the following methods to address the problem:
- Consider the Alpine-based images released on Docker Hub to be a permissible source of application binaries and supporting files. This commit provides an example of how we might do that with the Docker Hub images, simply copying binaries from Alpine images to Debian images; see the sketch after this list. Initial inspection suggests that they *should* work, given that they are Java binaries and do not use JNI.
- Replicate the upstream build process on our own version of Alpine and then copy the necessary files from those ephemeral artifacts to Debian images.
- Move our deployment of DataHub out of the production realm, instead running it under WMCS where the risks associated with using third-party containers are lower.
- Review the policy around Alpine containers in production and make an exception for DataHub, allowing us to build our own version with an unmodified build process from upstream.
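As an illustration of the first option, here is a minimal sketch of how binaries might be extracted from an upstream image without ever running upstream code, together with a crude check that the extracted archives contain no native (JNI) libraries. The image tag and in-container path are assumptions for illustration, not values from our actual build:

```python
#!/usr/bin/env python3
"""Illustrative sketch only: copy artifacts out of an upstream Alpine-based
DataHub image, then scan them for bundled native libraries."""
import subprocess
import zipfile
from pathlib import Path

UPSTREAM_IMAGE = "acryldata/datahub-gms:v0.12.0"       # hypothetical tag
ARTIFACT_PATHS = ["/datahub/datahub-gms/bin/war.war"]  # assumed path

def extract_from_image(image: str, paths: list[str], dest: str = "extracted") -> None:
    """Copy files out of a *stopped* container, so no upstream code runs."""
    Path(dest).mkdir(exist_ok=True)
    container = subprocess.run(
        ["docker", "create", image],
        check=True, capture_output=True, text=True,
    ).stdout.strip()
    try:
        for path in paths:
            subprocess.run(["docker", "cp", f"{container}:{path}", dest], check=True)
    finally:
        subprocess.run(["docker", "rm", container], check=True, capture_output=True)

def native_entries(archive: Path) -> list[str]:
    """List shared libraries at the top level of a jar/war; a non-empty result
    would mean the 'pure Java, no JNI' assumption needs a closer look."""
    with zipfile.ZipFile(archive) as zf:
        return [n for n in zf.namelist() if n.endswith((".so", ".dll", ".dylib"))]

if __name__ == "__main__":
    extract_from_image(UPSTREAM_IMAGE, ARTIFACT_PATHS)
    for archive in Path("extracted").rglob("*.war"):
        print(archive, "->", native_entries(archive) or "no native libraries found")
```

Note that this only inspects the top level of each archive; nested jars inside the war would need recursive scanning for a thorough check.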
I'm sure that there are other potential solutions and I welcome any suggestions.