Explore how to migrate PyFlink to Java/Scala
Open, Needs Triage · Public · Spike

Description

Currently, we have a repository with 2 PyFlink pipelines. This repository was created with PyFlink rather than vanilla Flink with the idea that other teams could easily create their own pipelines.
It turns out that no one is building their own pipelines, and the integration between Python and Flink has created many issues in the past, as well as complications while upgrading versions.

We want to consider how hard it would be to move these 2 pipelines (content_history.py and page_content_change.py) to Java or Scala, and whether that would make our lives easier in the future.

We need to consider at least these things:

  • Language: Java or Scala
  • Build system: Maven, Gradle, SBT or others.

@dcausse has suggested avoiding the Scala Flink API, as it is now deprecated. We could still consider using Scala, but with the Java API. They are also using Maven to build their project, so we could use it too unless there are reasons to choose otherwise. We can also take their Search update pipeline as an example.

This task is completed if:

  • The work to move the PyFlink pipelines is assessed and described.
  • A decision is made: Whether to move the pipelines out of Python or not.
  • If doing the migration: Java or Scala is chosen. -> Latest Java possible.
  • If doing the migration: Maven, Gradle, SBT or others. -> Maven
  • If doing the migration: Tasks have been created for the pending work.

Event Timeline

Restricted Application changed the subtype of this task from "Task" to "Spike". · Dec 5 2025, 3:09 PM

If we do this... Java for sure! We built all of our Flink library tooling in Java. I like Scala too, and while it makes coding some things easier, it makes integrating with different unexpected things harder.

The choice of PyFlink was to help solve a problem: to enable teams to build and own their (realtime) derived data pipelines. But, as you say, no one is doing this. So, before we make a decision like this, I’d really like to work with @GGoncalves-WMF on the broader derived data problem from a platform product management perspective. What do our users need and what do we want to provide for them? So, I’d prefer if we moved a bit slowly and carefully on this. There are lots of questions about how to do the data transfer between data platform and online storage for serving, as well as for streaming enrichment, etc.

See also:

(and many more links I could drop ... :) )

After a conversation in Slack, it looks like the majority of the people agree with at least a few things: We'll do the migration, and we'll use Java and Maven. I'll create some subtasks describing the work needed.

Sorry I'm late to this, but I basically second Andrew's comment. I think there are two things at play here:

  • Do we want to reimplement our current two pipelines (content_history.py and page_content_change.py) in Java?
  • Do we want to offer PyFlink as a platform capability to other teams?

The second is the one that factors into derived data use cases and should have product input, and I agree we shouldn't make that call yet.

The first is not entirely independent (if we run PyFlink pipelines, we become better prepared to support other people's pipelines), but I don't really oppose it. We should clearly document the issues we're encountering with PyFlink though, so we know what to come back to if we decide to support it. Are those documented anywhere yet?

Good point @GGoncalves-WMF, I got lost after many Slack messages. I'll keep Java and Maven as "decided", and we can discuss more about the use cases.

After some conversations, we think there are different paths related to this:

  • How we support other teams to work on pipelines. As @Ottomata said: what do our users need, and what do we want to provide for them? This isn't clear yet, and we think it might be clearer during the next quarter.
  • How we want to build our own pipelines, which, judging by the comments from many team members, seems to be in Java.

As we have a pending task to build a new enrichment pipeline, we think it is a good idea to build this new pipeline in Java. That lays the groundwork for a possible migration of the other pipelines in the future, but it doesn't close the door to PyFlink for now. During the next quarter we can think about what we want to provide to other teams and what we want to do with our PyFlink pipelines. Also, with this approach the new pipeline is not blocked until a migration is done.

Please proceed with the plan!

But I want to support @GGoncalves-WMF thoughts and comment on this reasoning too:

[...] there are different paths related to this:

  • How we support other teams to work on pipelines [...]
  • How we want to build our own pipelines [...]

I don't think these are necessarily separate considerations. They might be; e.g. we want to support the Search team with capabilities to build complex realtime pipelines, and to do that PyFlink is not the way. This is why we built as much PyFlink library support in Java as we could.

But, for simpler things that we suspect have many use cases, it would be better to use the same technology for simple pipelines we own and simple pipelines that other teams own. "Streaming enrichment" was this simple use case.

The ML team is relying on change-propagation for streaming enrichment now. I think they will have many more of these use cases in the future too. How are we going to support them and other teams that need to do this?

If we use different technologies for this use case than other teams do, the experience that other teams have will be worse. We will not be dogfooding our own platform.

Please proceed with the plan!

I've been reading the MRs for T360794, and this has made me wonder about the decision to do this now.
I'm sorry I haven't been in meetings where we have been discussing this, but are we sure this is the right decision? Have we justified it well given the larger context and product needs?

We (Thomas, Javier, me and a little bit of Joseph) discussed this in today's DE Engineer sync. It is a tricky decision to make for sure. The DE team (myself included) would prefer to work in Java Flink rather than PyFlink. However, I'm not sure deploying T360794: Event stream with latest revision HTML & parent revision HTML diff in Java Flink now is an efficient use of our time.

I agree with Fabian's unanswered comment in T360794#11433691. Doing Java now for the HTML stream work is a lot of work, whereas we have had a functional MR for this for a while now.

It would be a more efficient use of our time to do the required data modeling work for rendered page content, and emit HTML using our existing PyFlink tooling.

Why is Java Flink now inefficient work?

On Dec 3, @tchin wrote:

We're considering moving off of PyFlink and this would be a good opportunity to spike on a Java pipeline instead of a quick implementation now and then the complexities of dealing with any migration pains later

A "quick implementation now" (in PyFlink) is based on all the prior platform work done to support PyFlink based enrichment pipelines. There is more underneath than just Kafka -> Kafka. The eventutilities-python stuff makes it easy to implement and deploy simple enrichment pipelines. Some examples:

  • The sources and sinks do not have to be Kafka, and in practice this makes testing the pipelines and backfilling quite nice and easy.
    • We are able to test pipeline code by configuring them to use file based sources and sinks.
    • We are able to backfill simply by changing the sources and sink configuration. We can run a pipeline in Hadoop, source from HDFS, and produce to Kafka or HDFS with 0 code changes.
    • We can configure the Kafka offsets used, allowing for backfill from Kafka history.
  • The jsonargparse based configuration allows us to merge configs from several sources, making it easy to change the way the pipeline behaves in different environments. E.g. use the same config file but override one of the args on the CLI.
  • Side output error sinks are optional and configurable. (Produce enriched events to Kafka but write error events to stdout or to a file? Okay!) This allows us to use them in production but not in development or staging.

etc.
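To illustrate the layered configuration idea, here is a minimal sketch using only the Python standard library. It is not the jsonargparse API and not the eventutilities-python code; the names `deep_merge`, `parse_overrides`, `load_config` and the `--set KEY=VALUE` flag are hypothetical, chosen only to show defaults, a config file, and CLI overrides being merged in that order.

```python
import argparse
import json
from typing import Any


def deep_merge(base: dict[str, Any], override: dict[str, Any]) -> dict[str, Any]:
    """Return a new dict where values in `override` win; nested dicts are merged."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


def parse_overrides(pairs: list[str]) -> dict[str, Any]:
    """Turn ["sink.type=file", ...] into a nested dict like {"sink": {"type": "file"}}."""
    result: dict[str, Any] = {}
    for pair in pairs:
        dotted_key, _, raw_value = pair.partition("=")
        *parents, leaf = dotted_key.split(".")
        node = result
        for part in parents:
            node = node.setdefault(part, {})
        node[leaf] = raw_value
    return result


def load_config(defaults: dict[str, Any], config_text: str, argv: list[str]) -> dict[str, Any]:
    """Merge three layers, lowest priority first: defaults, config file, CLI."""
    parser = argparse.ArgumentParser()
    # Hypothetical flag: repeatable KEY=VALUE overrides with dotted keys.
    parser.add_argument("--set", action="append", default=[], metavar="KEY=VALUE")
    args = parser.parse_args(argv)
    file_config = json.loads(config_text) if config_text else {}
    return deep_merge(deep_merge(defaults, file_config), parse_overrides(args.set))
```

With this shape, the same config file can be reused across environments and a single argument such as `--set sink.type=file` changes where the output goes, which is the behaviour described in the list above.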

Testing, backfilling, error events, etc. are supported via configuration now, and it was a lot of work to build this. Do we plan to do this work in Java too? If not, is it worth losing these existing features?
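As a rough picture of what "supported via configuration" means, the sketch below selects the source, the sink, and an optional error sink purely from a config dict. It is plain Python with no Flink or Kafka involved, and every name in it (`make_source`, `make_sink`, `run_pipeline`, the `"memory"`, `"file"` and `"list"` types) is hypothetical rather than taken from eventutilities-python.

```python
import json
from typing import Callable, Iterator, Optional

Event = dict


def read_json_lines(path: str) -> Iterator[Event]:
    """Yield one event per non-empty line of a JSON-lines file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield json.loads(line)


def make_source(config: dict) -> Iterator[Event]:
    """Pick a source from config; a real job would also have a Kafka branch."""
    if config["type"] == "file":
        return read_json_lines(config["path"])
    if config["type"] == "memory":
        return iter(config["events"])
    raise ValueError(f"unknown source type: {config['type']}")


def make_sink(config: Optional[dict]) -> Callable[[Event], None]:
    """Pick a sink from config; None means the sink is disabled."""
    if config is None:
        return lambda event: None
    if config["type"] == "list":
        return config["target"].append
    if config["type"] == "file":
        path = config["path"]

        def write(event: Event) -> None:
            with open(path, "a", encoding="utf-8") as handle:
                handle.write(json.dumps(event) + "\n")

        return write
    raise ValueError(f"unknown sink type: {config['type']}")


def run_pipeline(config: dict, enrich: Callable[[Event], Event]) -> None:
    """Enrich each event; failures go to the error sink if one is configured."""
    sink = make_sink(config["sink"])
    error_sink = make_sink(config.get("error_sink"))
    for event in make_source(config["source"]):
        try:
            enriched = enrich(event)
        except Exception as exc:
            error_sink({"error": str(exc), "event": event})
            continue
        sink(enriched)
```

In this sketch a test run and a backfill differ only in the config passed to `run_pipeline`; the enrichment function itself is unchanged, and leaving out `error_sink` simply drops error events.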

the complexities of dealing with any migration pains later

Given the larger picture still has many unknowns (what platform support do we need to provide? What to do about Change Prop?, etc. etc.) I'd say we are not even sure that Flink is the best solution for WMF's streaming enrichment needs. It may be, but we need to do more research and evaluations to find out. It may be possible that we will migrate all of our enrichment jobs to something that is not Flink. If we do Java for T360794 now, we may have to migrate that Java work too.

But what about PyFlink/upgrade woes?

There are some, but I am not aware of many.

integration between Python and Flink has created many issues in the past

I know specifically of T347282: [Event Platform] eventutilites-python: improve consistency guarantees of async process functions, but its impact is quite minimal. Also:

FLINK-38190 Support async function in Python DataStream API is now available in Flink 2.2.0.

So an upgrade and change to use the async API should hopefully solve this one.

complications (T408918) while upgrading versions.

T408918: Upgrade mediawiki-event-enrichment jobs to >= Flink 1.20.3 and Java 17 is not really about Flink or PyFlink, but about complications between OS-provided packages and WMF-provided packaging systems. Java and Python are indeed having conflicts here, but not because of Flink things. The ticket that blocks this work, T406872: Fix mediawiki event enrichment to work with newest version of Blubber, is about the way eventutilities-python uses jsonargparse to parse configs. I don't think this would be a difficult fix: we'd just have to change the mode with which we check for file existence. Then we can upgrade Blubber, and then we can upgrade Java, Python and Flink. The blocker is our use of jsonargparse, not Python. A Java arg parser (or any lib that accessed /srv) could have caused the same bug and blocked the upgrade.

Are there any woes I'm forgetting?


tl;dr
Please note that I am certainly not opposed to dropping PyFlink support one day, but I think we need some more research and justification to do it! It is very true that as of yet no one else has deployed their own PyFlink based pipeline. Switching to Java for ourselves may be fine, but we should recognize that we are discarding more than just 'python' for ourselves. By doing T360794: Event stream with latest revision HTML & parent revision HTML diff in Java now, we are rebuilding (rather than building) many things that already exist, and we may someday soon(ish) discard them anyway.