We currently run DataHub 0.13.3, which was released a little over a year ago. Since then, the upstream changelogs have made some pretty bold claims, such as:
- A completely redesigned user experience focused on simplified navigation and a visually stunning interface.
- Unified support for Data & AI, including AI Model Group Versions, AI Model Lineage, Model Stats, and Experiment/Run ingestion.
- DataHub Iceberg Catalog, allowing users to manage Iceberg tables directly from DataHub.
DataHub is currently an SRE pain point:
- Our chart has incrementally diverged from upstream
- The dependency constraints of the acryl-datahub library conflict with other libraries installed in the Airflow image (see https://phabricator.wikimedia.org/T402306#11098814)
- We don't deploy/run the ingestion backend, meaning we can only ingest from the CLI
- We currently ingest Hive/Kafka/Druid data by running the datahub CLI in a Skein job using the datahub-cli-0.10.4.tgz venv artifact, whose packaging documentation is out of date (see https://phabricator.wikimedia.org/T402306)
We would benefit from upgrading DataHub: it would keep our tooling from drifting further from upstream and let us react quickly to CVEs. It would also empower the WMF Data community to use DataHub to its full capacity.