
[SPIKE] Evaluate a pyflink version of Mediawiki Stream Enrichment
Closed, ResolvedPublic


This should be a quick spike to understand how hard it would be to replicate Mediawiki Stream Enrichment with pyflink. The goals are to:

  1. Run a read-only Python implementation of Mediawiki Stream Enrichment on YARN.
  2. Collect resource allocation and latency metrics for a long running pyflink job.
  3. Help inform integration paths with the upcoming Flink catalog.
  4. Help with requirements collection for .

Event Timeline

Restricted Application added a subscriber: Aklapper.

Best SQL example here. It will be much better with a catalog.

I don't love how awkward this is to do in SQL: it requires a nested query and a UDF operating on the content_slots map field, rather than working directly with the lower-level content_body. That experiment made me think that staying in Python is going to be easier than focusing on full SQL support.
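To illustrate the contrast: in Python the nested map access that needs a UDF in SQL is a couple of dictionary lookups. This is a hypothetical sketch, not the actual enrichment code; the event shape and the "main" slot key are assumptions based on the field names mentioned above.

```python
from typing import Optional


def extract_content_body(event: dict) -> Optional[str]:
    """Pull content_body out of the nested content_slots map.

    Hypothetical sketch: in SQL this access pattern required a nested
    query plus a UDF; in Python it is plain dict navigation.
    """
    slots = event.get("content_slots", {})
    main = slots.get("main", {})
    return main.get("content_body")


# Example event with the assumed nested shape.
event = {
    "content_slots": {
        "main": {"content_body": "== Heading ==\nsome wikitext ..."},
    }
}
body = extract_content_body(event)
```

Missing slots simply yield `None`, which is easy to route to error handling without any SQL-level null gymnastics.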

gmodena renamed this task from [NEEDS GROOMING][SPIKE} Evaluate a pyflink version of Mediawiki Stream Enrichment to [SPIKE] Evaluate a pyflink version of Mediawiki Stream Enrichment.Nov 28 2022, 3:01 PM
gmodena updated the task description. (Show Details)

A pyflink implementation of Mediawiki Stream Enrichment has been developed and deployed on YARN. While this implementation did not write to a Kafka topic directly, all enriched messages (48 hours' worth of data) passed jsonschema validation. The Python implementation has feature parity with the Scala one. In particular:

  • It is built atop the DataStream API and operates on DataStream[Row] (note that Row here is a pure Python object, not a JVM one).
  • Errors are reported to a side output of String type.
  • The HTTP client implements retry logic with backoff.
  • Latency, resource consumption, and GC footprint are similar between the two implementations.
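The retry-with-backoff behavior mentioned above can be sketched in plain Python. This is a simplified, hypothetical helper, not the job's actual client code; function names and backoff parameters are assumptions for illustration.

```python
import time


def with_retries(fn, max_attempts=3, base_delay=0.1, sleep=time.sleep):
    """Call fn(), retrying with exponential backoff on exception.

    Hypothetical sketch of the retry logic described above; the real
    enrichment job's HTTP client and parameters may differ. The sleep
    function is injectable so the backoff schedule can be tested.
    """
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                # Out of attempts: surface the failure (in the real job,
                # the error is reported to the side output instead).
                raise
            # Exponential backoff: base_delay, 2*base_delay, 4*base_delay, ...
            sleep(base_delay * (2 ** attempt))
```

Injecting `sleep` keeps the helper deterministic under test while defaulting to real waits in production.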

Relevant code:

A writeup of the outcome of this spike is available; feedback is welcome on the discussion page.