[SPIKE] Evaluate a pyflink version of Mediawiki Stream Enrichment
Closed, Resolved · Public

Assigned To

Authored By

	gmodena
	Nov 16 2022, 1:24 PM

Description

This should be a quick spike to understand how hard it would be to replicate Mediawiki Stream Enrichment with pyflink. The goals are to:

Run a read-only Python implementation of Mediawiki Stream Enrichment on YARN (https://gitlab.wikimedia.org/-/snippets/42).
Collect resource allocation and latency metrics for a long-running pyflink job.
Help inform integration paths with the upcoming Flink catalog (https://phabricator.wikimedia.org/T322022).
Help collect requirements for https://phabricator.wikimedia.org/T322125.

Related Objects

Mentioned In: T324951: We should provide utilities for local development and unit testing of Python streaming services
Mentioned Here: T322022: Flink SQL queries should access Kafka topics from a Catalog
T322125: [NEEDS GROOMING] Improve reliability of simple stateless services

Event Timeline

gmodena created this task. Nov 16 2022, 1:24 PM

Restricted Application added a project: Data-Engineering. Nov 16 2022, 1:24 PM

Restricted Application added a subscriber: Aklapper.

Best SQL Example here. Will be much better with a catalog.

I don't love how weird this is to do in SQL, with the nested query and a UDF based on the content_slots map field instead of just the lower-level content_body. That experiment made me think that staying in Python is going to be easier than focusing on full SQL support.

lbowmaker moved this task from Backlog to Sprint 05 on the Event-Platform board. Nov 16 2022, 3:04 PM

lbowmaker edited projects, added Event-Platform (Sprint 05); removed Event-Platform.

gmodena claimed this task. Nov 17 2022, 8:48 AM

gmodena moved this task from Next Up to In Progress on the Event-Platform (Sprint 05) board. Nov 28 2022, 2:04 PM

gmodena moved this task from In Progress to Next Up on the Event-Platform (Sprint 05) board. Nov 28 2022, 2:07 PM

gmodena renamed this task from [NEEDS GROOMING][SPIKE} Evaluate a pyflink version of Mediawiki Stream Enrichment to [SPIKE] Evaluate a pyflink version of Mediawiki Stream Enrichment. Nov 28 2022, 3:01 PM

gmodena updated the task description.

gmodena moved this task from Next Up to In Progress on the Event-Platform (Sprint 05) board. Nov 29 2022, 12:35 PM

• EChetty edited projects, added Data-Engineering-Planning; removed Data-Engineering. Dec 1 2022, 2:12 PM

A pyflink implementation of Mediawiki Stream Enrichment has been developed and deployed on YARN. While this implementation did not write to a Kafka topic directly, all enriched messages (48 hours' worth of data) passed jsonschema validation. The Python implementation has feature parity with the Scala one. In particular:

It is built atop the DataStream API and operates on DataStream[Row] (note that Row here is a pure Python object, not a JVM one).
Errors are reported to a side output of String type.
The HTTP client implements retry logic with backoff.
Latency, resource consumption and GC footprint are similar between the two implementations.
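
As a rough illustration of the retry and error-reporting points above, here is a minimal plain-Python sketch. It is not the code from the spike: the names call_with_backoff, enrich and fetch_content are hypothetical, the retry parameters are arbitrary, and using content_body as the enriched field is an assumption. Returning the error as a string alongside the result stands in for what a Flink job would emit to a String-typed side output.

```python
import random
import time
from typing import Callable, Optional, Tuple


def call_with_backoff(
    fetch: Callable[[], str],
    max_attempts: int = 4,
    base_delay: float = 0.5,
    max_delay: float = 8.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call fetch(), retrying on any exception with exponential backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            return fetch()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: let the caller handle the failure
            # The delay doubles on each attempt and is capped at max_delay;
            # the random factor spreads concurrent retries apart.
            delay = min(max_delay, base_delay * (2 ** attempt))
            sleep(delay * random.uniform(0.5, 1.0))
    raise ValueError("max_attempts must be at least 1")


def enrich(
    event: dict,
    fetch_content: Callable[[dict], str],
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[Optional[dict], Optional[str]]:
    """Return (enriched_event, None) on success or (None, error_string) on failure."""
    try:
        body = call_with_backoff(lambda: fetch_content(event), sleep=sleep)
    except Exception as exc:
        # Hypothetical stand-in for a String-typed side output record.
        return None, f"{type(exc).__name__}: {exc}"
    # content_body as the target field is an assumption for this sketch.
    return {**event, "content_body": body}, None
```

Passing sleep as a parameter keeps the sketch easy to unit test without real waiting; a streaming job would keep the default time.sleep or use an HTTP client with built-in retry support.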

Relevant code: