
[Event Platform] Event Platform and DataHub Integration
Closed, Resolved · Public

Description

Background/Goal

Build a poll-based ingestion that iterates over the declared streams in EventStreamConfig, looks up the schemas for those streams, and then posts the updates to DataHub.

Ideally, DataHub would represent these as 'stream' datasets rather than just Kafka topics, since a 'stream' is made up of multiple Kafka topics. In the interim, we may want to use DataHub's built-in support for Kafka topics and post updates to Kafka topic schemas.

We have a Java library that automates interacting with EventStreamConfig and schema repositories, making it easier to iterate through declared streams and look up schemas. We could use this Java library, or alternatively write Event Platform utility library code in another language (Python?).


There are two types of DataHub ingestion: poll-based and event-based. Thus far, we have not been using event-based ingestion. While event-based ingestion would be much nicer, emitting metadata events for changes to Event Platform streams and schemas might be difficult: we'd need to emit an event anytime EventStreamConfig is deployed, which would somehow have to be linked to MediaWiki config deployments.
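For illustration, a minimal Python sketch of such a poll-based loop is below. The EventStreamConfig and schema repository endpoints, the "event_streams" platform name, and the DataHub address are assumptions for the sketch, not a settled design:

import requests
from datahub.emitter.mce_builder import make_dataset_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import DatasetPropertiesClass

STREAM_CONFIG_API = "https://meta.wikimedia.org/w/api.php"  # assumed endpoint
SCHEMA_REPO = "https://schema.wikimedia.org/repositories/primary/jsonschema"  # assumed endpoint

emitter = DatahubRestEmitter(gms_server="http://localhost:8080")  # assumed GMS address

# 1. Poll EventStreamConfig for all declared streams.
streams = requests.get(
    STREAM_CONFIG_API,
    params={"action": "streamconfigs", "format": "json"},
).json()["streams"]

# 2. For each stream, look up its schema and post an update to DataHub.
for stream_name, stream_config in streams.items():
    schema_title = stream_config["schema_title"]  # e.g. "analytics/mediawiki/..."
    schema = requests.get(f"{SCHEMA_REPO}/{schema_title}/latest").json()

    dataset_urn = make_dataset_urn(platform="event_streams", name=stream_name, env="PROD")
    emitter.emit(
        MetadataChangeProposalWrapper(
            entityUrn=dataset_urn,
            # Start simple: publish the schema's description as dataset properties.
            # A fuller version would also build a schemaMetadata aspect from the JSON schema.
            aspect=DatasetPropertiesClass(
                name=stream_name,
                description=schema.get("description", ""),
            ),
        )
    )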

User Story
As an event platform engineer, I need to keep DataHub up to date with new and existing event streams and schemas

Key Tasks/Dependencies

  • Add Event Stream custom platform to DataHub
  • Deploy the metadata ingestion pipeline for the event stream schemas
  • Link the corresponding topics and downstream datasets to the stream (TBD: via lineage and/or metadata replication; this falls into the larger consideration of how to propagate metadata between equivalent datasets stored across different platforms and refinements.)

Acceptance Criteria

  • A custom platform called Event Streams is listed in DataHub
  • The Event Streams platform in DataHub lists all the event stream data assets
  • All the schema elements are documented
  • Top level schema description is imported as the top level dataset documentation

Details

Title | Reference | Author | Source Branch | Dest Branch
Support DataHub transformers | repos/data-engineering/airflow-dags!507 | tchin | support-datahub-transformers | main
Add event streams datahub transformation | repos/data-engineering/airflow-dags!498 | tchin | event-platform-integration | main

Event Timeline

Restricted Application added a subscriber: Aklapper.

Hm, not quite!

This task is about cataloging the streams that exist in Kafka, with their event schemas.

T307040 is talking about the event tables in Hive, which are created from the streams, but are not exactly the same thing (e.g. there are enriched fields in the Hive tables, like geocoded_data, etc.). But I agree the descriptions should be propagated from the event schemas into the Hive schemas...and then also into DataHub. I'll comment on that task.

Ohhhh, thank you for clarification!

No clue if this is the right approach, but perhaps we could use ingestion transforms to augment the existing Kafka ingestion with Event Platform event schemas? From 5 minutes of reading docs, I think we'd do this by implementing a transformer that can transform the schemaMetadata aspect of a dataset entity?

    def entity_types(self) -> List[str]:
        return ["dataset"]

    def aspect_name(self) -> str:
        return "schemaMetadata"

@Ottomata I think the above is the right approach (if we decide to do it)

JArguello-WMF updated the task description.

@Htriedman we are picking this work up again. Is the POC that you did available in a repository on gitlab?

Since DataHub has the concept of platforms, I think the best way forward is to have a separate platform called Event Streams, where the datasets under it are the streams defined in the stream config. We can then keep the Kafka platform for all the individual Kafka topics. We can then attach a transform to the current Kafka ingestion recipe that attaches the schemas to the individual topics where supported, and at the same time inserts the streams into the Event Streams platform. This way we can have the schemas on both the stream and its topics.
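
Roughly, attaching such a transform to the Kafka recipe could look like this if the recipe were driven through DataHub's Python Pipeline API (a sketch only; the broker address, transformer module path, and sink address are placeholders, and the real recipe lives in the airflow-dags MRs):

from datahub.ingestion.run.pipeline import Pipeline

pipeline = Pipeline.create(
    {
        "source": {
            "type": "kafka",
            "config": {"connection": {"bootstrap": "kafka-jumbo1001.eqiad.wmnet:9092"}},  # placeholder
        },
        "transformers": [
            {
                # Placeholder fully-qualified class name for the custom transformer.
                "type": "event_platform_datahub.EventStreamsTransformer",
                "config": {},
            }
        ],
        "sink": {
            "type": "datahub-rest",
            "config": {"server": "http://localhost:8080"},  # placeholder
        },
    }
)
pipeline.run()
pipeline.raise_from_status()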

@tchin as discussed today, that sounds like a good approach. Before deploying to production, let's wipe out the Kafka metadata, given that the original POC was imported under the Kafka platform. I'll add these to the acceptance criteria.


Here are some considerations that we discussed and that we need to further explore and decide on:

  • Explore creating a custom platform for Event Streams
  • Add the top-level event schema description as the dataset documentation. TBD how to accomplish this given the import options.
  • The schema import automatically adds subgroups under Kafka based on the first dot segment of the schema name. In the production instance of DataHub there are also streams with names like analytics/mediawiki/web_ab_test_enrollment. Can “/” be used as a separator to designate the top-level category?
  • Can we import Gobblin lineage to propagate lineage from Kafka -> Hive?
  • There would be value in importing the Hive event_raw database to complete the lineage of events.
  • Can we add a link to the Event Platform schema/DataHub documentation to the Hive tables in the event and event_sanitized databases? Lineage would be one way to trace this. Another would be to add links in the documentation to datasets with equivalent schemas both upstream and downstream. This falls into the larger consideration of how to propagate metadata between equivalent datasets stored across different platforms and refinements.
  • Some of the Kafka topics are remnants of tests and misconfigurations/misnamings. Ideally we'd delete these in Kafka; otherwise, there is an option to add them to an exclusion list.
  • Given that the prod DataHub already has the event streams in the current Kafka metadata, can we delete and reimport all the Kafka metadata? If a fresh backup is not available, it would be good to have one handy.
  • Is there a way to add ownership data to the event schema JSON and import it from there? This would benefit Metrics Platform work and allow alerting the right parties about event publishing errors. Some discussion about adding this data already happened: https://phabricator.wikimedia.org/T201063#4546544
  • What is the best way to ingest the metadata? DataHub transformer vs. Airflow vs. TBD?

After experimenting a lot, I have a DataHub transformer for Kafka that generates an Event Streams platform and adds descriptions, schemas, and paths. However, I don't know if it should be a transformer, since it's doing a bit more than just transforming.

Explore creating a custom platform for Event Streams

Done fairly easily. The documentation shows curl or some ingestion job, but there's an equivalent way to do it through DataHub's Python library.
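
For reference, one way to do it with the Python emitter looks roughly like this (a sketch; the platform ID, platform type, and delimiter are assumptions, not necessarily what the MR does):

from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import DataPlatformInfoClass, PlatformTypeClass

emitter = DatahubRestEmitter(gms_server="http://localhost:8080")  # placeholder address

# Register a custom "Event Streams" data platform by emitting its info aspect.
emitter.emit(
    MetadataChangeProposalWrapper(
        entityUrn="urn:li:dataPlatform:event_streams",
        aspect=DataPlatformInfoClass(
            name="event_streams",
            displayName="Event Streams",
            type=PlatformTypeClass.MESSAGE_BROKER,  # assumed platform type
            datasetNameDelimiter=".",
        ),
    )
)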

Add the top-level event schema description as the dataset documentation. TBD how to accomplish this given the import options.

I discovered that DataHub actually distinguishes between user-edited metadata and ingestion-created metadata, so re-running ingestion won't clobber manual edits. This makes any kind of ingestion we try worry-free.

The schema import automatically adds subgroups under Kafka based on the first dot segment of the schema name. In the production instance of DataHub there are also streams with names like analytics/mediawiki/web_ab_test_enrollment. Can “/” be used as a separator to designate the top-level category?

The slashes can be used as a separator. However, the slashes in this case are part of the schema names, not Kafka topic names (Kafka doesn't allow slashes in its topic names afaik). I don't know why it's in DataHub. I think it would be cool if there were another platform that showed which streams are using which schemas. Maybe linking them up via lineage?

Speaking of which, connecting the topics to the stream via lineage seems a bit weird to me, but it's probably better than just having links in the description. Are streams considered upstream or downstream of topics? Would it look something like this:

Screenshot 2023-08-22 at 1.10.20 PM.png (mock-up of the proposed stream/topic lineage view)

The streams and the topics might be able to be compressed in the lineage view if they're marked as related to each other. Would have to look into this further.

Given that the prod DataHub already has the event streams in the current Kafka metadata, can we delete and reimport all the Kafka metadata? If a fresh backup is not available, it would be good to have one handy.

From what I can see, there is a way in DataHub to mark things as 'removed' so they don't show up in the UI without actually deleting them. We could perhaps run that over the entirety of the Kafka metadata, and then, when ingesting the topics we want, check whether each one already exists in DataHub and mark it as not removed.
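
As a sketch, that soft-removal flip would be an emit of the Status aspect per entity (assuming the Status aspect is what backs the 'removed' flag; the topic URN below is illustrative):

from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import StatusClass

emitter = DatahubRestEmitter(gms_server="http://localhost:8080")  # placeholder address

# Hide a stale topic from the UI without hard-deleting its metadata...
emitter.emit(
    MetadataChangeProposalWrapper(
        entityUrn="urn:li:dataset:(urn:li:dataPlatform:kafka,eqiad.some_test_topic,PROD)",
        aspect=StatusClass(removed=True),
    )
)
# ...and emit StatusClass(removed=False) again if the topic is re-ingested and wanted.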

From the recent meeting:

  • Event Streams will be the name of the platform
  • Streams are upstream of Kafka topics (see the lineage sketch below)
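
For illustration, emitting that stream -> topic relationship could look roughly like this with DataHub's lineage aspect (the stream/topic names and the lineage type are assumptions, not the agreed implementation):

from datahub.emitter.mce_builder import make_dataset_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.metadata.schema_classes import (
    DatasetLineageTypeClass,
    UpstreamClass,
    UpstreamLineageClass,
)

stream_urn = make_dataset_urn("event_streams", "mediawiki.page_change.v1", "PROD")

# Emit one upstreamLineage aspect per topic, pointing back at its stream.
for topic in ("eqiad.mediawiki.page_change.v1", "codfw.mediawiki.page_change.v1"):
    topic_urn = make_dataset_urn("kafka", topic, "PROD")
    mcp = MetadataChangeProposalWrapper(
        entityUrn=topic_urn,
        aspect=UpstreamLineageClass(
            upstreams=[UpstreamClass(dataset=stream_urn, type=DatasetLineageTypeClass.COPY)]
        ),
    )
    # emitter.emit(mcp), with a DatahubRestEmitter as in the earlier sketches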

@BTullis we'll need the SRE team's help with the deployment of the event platform schema ingestion into DataHub. The deployment involves a) creating the Event Streams custom platform and
b) deploying the ingestion code/transformer.

A couple of questions:

  • Do we have database backups, and/or can we take one before we deploy the above? The risk that something will go wrong is small, but this is a good precaution.
  • Is there an option for us to delete/wipe the Kafka data platform schemas so that they are re-ingested afresh? As part of the previous POC we imported the event schemas under Kafka, and we would like to remove them permanently since they will now reside in a custom platform.

While adding a workaround to T344235, I noticed that additionalProperties isn't very well represented in DataHub.

"custom_data": {
    "additionalProperties": {
        "properties": {
            "data_type": {
                "type": "string",
                "enum": ["number", "string", "boolean", "null"],
            }
        }
    },
    "propertyNames": {
        "maxLength": 255,
        "minLength": 1,
        "pattern": "^[$a-z]+[a-z0-9_]*$",
    },
},

It just shows up in DataHub as a Struct with no defined nested fields (which I guess makes sense, but is not helpful).
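
One possible workaround (a sketch, not what was implemented) would be to model such a field explicitly as a map when building the schemaMetadata aspect, so DataHub at least shows the key/value types:

from datahub.metadata.schema_classes import (
    MapTypeClass,
    SchemaFieldClass,
    SchemaFieldDataTypeClass,
)

# Represent "custom_data" as a map of string keys to small structs instead of
# an opaque Struct with no nested fields.
custom_data_field = SchemaFieldClass(
    fieldPath="custom_data",
    type=SchemaFieldDataTypeClass(type=MapTypeClass(keyType="string", valueType="record")),
    nativeDataType="object (additionalProperties)",
    description="Free-form map of custom data; keys match ^[$a-z]+[a-z0-9_]*$.",
)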

Some of the Kafka topics are remnants of tests and misconfigurations/misnamings. Ideally we'd delete these in Kafka; otherwise, there is an option to add them to an exclusion list.

Could we import a topic only if there is a corresponding event stream config entry?
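
A sketch of that filter, assuming the usual "<datacenter>.<stream>" topic naming (the naming convention is an assumption of the sketch):

def topic_is_declared(topic: str, declared_streams: set[str]) -> bool:
    # Topics are typically either "<stream>" or "<datacenter>.<stream>", e.g.
    # "eqiad.mediawiki.page_change.v1" for the stream "mediawiki.page_change.v1".
    if topic in declared_streams:
        return True
    _, _, without_dc_prefix = topic.partition(".")
    return without_dc_prefix in declared_streams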

Is there a way to add ownership data to the event schema JSON and import it from there?

We shouldn't do this in the schema, but this could be added to event stream config.
https://phabricator.wikimedia.org/T273235#6791925
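
If stream config did grow an owners field, the ingestion side could map it to DataHub ownership roughly like this (the "owners" field name and the group URNs are hypothetical):

from datahub.emitter.mce_builder import make_dataset_urn, make_group_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.metadata.schema_classes import OwnerClass, OwnershipClass, OwnershipTypeClass

def ownership_mcp(stream_name: str, stream_config: dict) -> MetadataChangeProposalWrapper:
    # "owners" is a hypothetical stream config field listing owning teams.
    owners = [
        OwnerClass(owner=make_group_urn(team), type=OwnershipTypeClass.TECHNICAL_OWNER)
        for team in stream_config.get("owners", [])
    ]
    return MetadataChangeProposalWrapper(
        entityUrn=make_dataset_urn("event_streams", stream_name, "PROD"),
        aspect=OwnershipClass(owners=owners),
    )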

Ahoelzl renamed this task from Event Platform and DataHub Integration to [Event Platform] Event Platform and DataHub Integration. Oct 20 2023, 4:51 PM