
[EPIC] Streaming and event driven Python services
Closed, Resolved · Public

Description

This is a parent task to capture efforts related to the development and operation of Python-based stateless streaming services.

Background

This Epic was informed by the following SPIKEs:

Goals

This Epic spans the following tasks:

  • Flink wrappers and helper libraries should be moved into a dedicated git repo with packaging and CI. https://phabricator.wikimedia.org/T324746
  • Flink wrappers and helper libraries should integrate with Table API. We should allow injection of UDFs (ideally cross language). https://phabricator.wikimedia.org/T324953
  • We should provide scaffolding to bootstrap Python-based services.
  • We should provide utilities for local experimentation and unit testing. For instance, I would like to be able to inject mocked Sources/Sinks and operate with local JSON files before rolling out to YARN. https://phabricator.wikimedia.org/T324951
  • We should streamline packaging of pyflink applications, and ideally integrate with the shared Flink Docker images.
  • Side-output error reporting should be made composable and more robust.
  • Metrics and monitoring should be standardized.
  • Deployment should be standardized using WMF's Deployment Pipeline.

Done is:

  • Implement a 'production' version of the MediaWiki enrichment service in PyFlink, using the utilities and capabilities implemented as part of this Epic ticket, running on YARN
  • The Java/Scala implementation of the enrichment service is archived/switched off

Details

Title: Add link to WIP documentation
Reference: repos/data-engineering/eventutilities-python!11
Author: gmodena
Source Branch: add-doc-link
Dest Branch: main

Event Timeline

Restricted Application added a subscriber: Aklapper.
gmodena renamed this task from "[EPIC] Streaming and event eriven Python services" to "[EPIC] Streaming and event driven Python services". · Dec 7 2022, 4:46 PM

@gmodena, I've been trying to write tests for eventutilities-python, so that we could more easily improve and add things (like error event side output, etc.).

I've spent two days struggling with writing and testing a simple enrichment pipeline. I've even simplified further, and am now just trying to write a simple Python DataStream enrichment map function that goes from one stream with a schema to an output with a different schema. I could not get this to work!

It took me forever to realize how this works for the page content enrich pipeline: the output schema is the same as the input schema! If the output schema is different, we can't just treat the Row in the map function as a dict. Doing so results in a ValueError being thrown when trying to assign a field to a Row that the Row doesn't already know about.
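For illustration, here's a minimal sketch of that failure mode (assuming pyflink's Row from pyflink.common; the field names are made up):

from pyflink.common import Row

# A Row built with named fields only knows those fields.
event = Row(page_id=1, page_title='Example')

event['page_title'] = 'Renamed'  # fine: the field already exists

# Assigning a field the Row doesn't already know about raises ValueError,
# because the field name can't be found in the Row's field list.
event['enriched_field'] = 'enriched value'  # ValueError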

I tried to work around this by recursively converting the Row to a dict before providing it to the map function. This works fine, but we have no way to convert a nested dict back to a Row recursively (unless we write it ourselves). (I tried RowTypeInfo.from_internal_type, but I couldn't get it to work?).

I think if we want to be able to treat the event like a Python dict, we are going to have to implement custom recursive converters in Python between pyflink Row and Python dict.

I'd hope we can do something like this:

def enrich_fn(event: dict) -> dict:
    event['enriched_field'] = 'enriched value'
    return event

# gets the source and sink row types via EventDataStreamFactory
with stream_manager(source_stream_name='...', sink_stream_name='...', sink='kafka...?') as stream:
    stream.map(enrich_fn)
    stream.execute()

To do this, the functions users implement all have to work with dicts, which will require conversion of the input data stream to dicts, and also conversion of the final output datastream back to Rows of the output RowTypeInfo.

Okay, I think I have something working?

We can already easily recursively convert Rows to dicts.

dict_to_row will recursively convert any dicts that should be Rows to Rows using the RowTypeInfo.
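Roughly, the converter looks like this (a sketch of the idea, not the final code; it assumes RowTypeInfo exposes get_field_names()/get_field_types() and that Row supports set_field_names()):

from pyflink.common import Row
from pyflink.common.typeinfo import RowTypeInfo

def dict_to_row(event: dict, type_info: RowTypeInfo) -> Row:
    """Recursively convert a dict to a Row matching the given RowTypeInfo."""
    field_names = type_info.get_field_names()
    field_types = type_info.get_field_types()
    values = []
    for name, field_type in zip(field_names, field_types):
        value = event.get(name)
        # Nested struct fields arrive as dicts; convert them to Rows too.
        if isinstance(field_type, RowTypeInfo) and isinstance(value, dict):
            value = dict_to_row(value, field_type)
        values.append(value)
    # Build positionally to preserve the RowTypeInfo field order, then attach names.
    row = Row(*values)
    row.set_field_names(field_names)
    return row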

This allowed me to make the stream manager stuff always pre-convert the datastream to dicts, and then convert back to the output Row type before sending to the sink.

In this way, all user provided map, filter, etc. functions work with dicts.
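Concretely, the wiring is something like this (sketch only; row_stream, enrich_fn and sink_type_info are placeholder names, dict_to_row is the converter sketched above, and Row.as_dict(recursive=True) is pyflink's built-in Row-to-dict conversion):

# Pre-convert right after the source: user code only ever sees plain dicts.
dict_stream = row_stream.map(lambda row: row.as_dict(recursive=True))

# User-provided functions stay plain dict -> dict.
enriched_stream = dict_stream.map(enrich_fn)

# Post-convert right before the sink, back to the output Row type.
output_stream = enriched_stream.map(
    lambda event: dict_to_row(event, sink_type_info),
    output_type=sink_type_info)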

Still SUPER WIP, but I got it working with our Kafka sink stuff, and was able to output using a different schema.

> It took me forever to realize how this works for the page content enrich pipeline: the output schema is the same as the input schema! If the output schema is different, we can't just treat the Row in the map function as a dict. Doing so results in a ValueError being thrown when trying to assign a field to a Row that the Row doesn't already know about.
>
> I tried to work around this by recursively converting the Row to a dict before providing it to the map function. This works fine, but we have no way to convert a nested dict back to a Row recursively (unless we write it ourselves). (I tried RowTypeInfo.from_internal_type, but I couldn't get it to work?)
>
> I think if we want to be able to treat the event like a Python dict, we are going to have to implement custom recursive converters in Python between pyflink Row and Python dict.

Correct, this is not implemented yet. Last quarter we ended up prioritising the rollout of a Python version of MediaWiki stream enrichment over API completeness.
Let's address this missing bit ASAP, though. We'll need it, among other things, to produce messages to an error topic.

I was wondering if we could reuse the projection/row creation primitives from JVM eventutilities, but it turned out that would require additional (non-trivial) wrapping. Doing the conversion in Python should be more straightforward.

> Okay, I think I have something working?

Terrific!

LGTM. Maybe we can test it out within the scope of https://phabricator.wikimedia.org/T326536?
Would be great to have some unit tests (more in general), but that's tricky with the current CI setup / Java deps. It may be time to fix that properly.

I do wonder how much overhead this conversion will introduce. It should not be too bad (especially at low throughput), but it's something we might want to instrument.
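Even something as simple as a timing wrapper around the user function and converters would give us a first signal (a stdlib-only sketch; record_latency stands in for whatever metrics hook we standardize on):

import time

def timed(fn, record_latency):
    """Wrap a per-record function, reporting wall-clock latency in seconds."""
    def wrapper(event):
        start = time.monotonic()
        result = fn(event)
        record_latency(time.monotonic() - start)
        return result
    return wrapper

# e.g. stream.map(timed(enrich_fn, record_latency=print))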

Ottomata claimed this task.
Ottomata updated the task description.

Being bold and resolving.