Upgrade mediawiki-event-enrichment jobs to >= Flink 1.20.3 and Java 17
Closed, Resolved (Public)

Description

Currently the enrichment jobs run on Flink 1.20.1 and Java 11, using the last available Docker image that has Java 11. An upgrade to Flink 1.20.3 requires Java 17.

Simply bumping the image isn't enough, because the distro in the latest images triggers PEP 668 (Marking Python base environments as "externally managed"), which breaks our build step. I think the way to fix this is to update Blubber and use its new use-system-site-packages option, but that is blocked by T406872.

Details

Other Assignee
Ottomata
Related Changes in Gerrit:
Related Changes in GitLab:
Title | Reference | Author | Source Branch | Dest Branch
Bump Flink to 1.20.3 | repos/data-engineering/mediawiki-event-enrichment!121 | javiermonton | feature/bump-eventutilitiles-python | main
Bump Flink to 1.20.3 | repos/data-engineering/eventutilities-python!97 | javiermonton | feature/bump-flink-to-1.20.3 | main
Fix: java.net.MalformedURLException: no protocol | repos/data-engineering/eventutilities-python!96 | javiermonton | fix/malformed-url-exception | main
Bump eventutilities-python to 0.21.0 | repos/data-engineering/mediawiki-event-enrichment!117 | javiermonton | feature/bump-eventutilities-python | main
Upgrade to flink 1.20.2 image which uses Java 17. | repos/data-engineering/mediawiki-event-enrichment!112 | otto | jdk-17 | main
Use Java 17 | repos/data-engineering/eventutilities-python!93 | tchin | use-jdk-17 | main

Event Timeline

Okay, I've submitted another MR!94 that I think addresses a lot of the in-our-face CI problems.

The gist of the MR so far:

  • Fix the jsonargparse bug (originally solved by @tchin in his MR!93).
  • Refactor the Dockerfile, .gitlab-ci and Makefile so that the Makefile is used consistently for every job and target, however it is executed. That is, a Python environment with tox installed executes the tox commands. A Python venv is created for local Docker builds and GitLab CI to address PEP 668.
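For readers unfamiliar with PEP 668, here is a minimal sketch of the condition the venv works around. PEP 668 has installers refuse to touch an interpreter whose stdlib directory contains an EXTERNALLY-MANAGED marker file, unless they are running inside a virtual environment. The function names below are hypothetical and are not taken from the MR; the Makefile in the MR may do this quite differently.

```python
import sys
import sysconfig
import venv
from pathlib import Path


def pip_install_blocked(stdlib_dir=None, in_venv=None) -> bool:
    """Return True if a PEP 668 style installer would refuse to install here.

    stdlib_dir and in_venv can be overridden for testing; by default they
    describe the running interpreter.
    """
    if stdlib_dir is None:
        stdlib_dir = sysconfig.get_path("stdlib")
    if in_venv is None:
        # Inside a venv, sys.prefix differs from sys.base_prefix.
        in_venv = sys.prefix != sys.base_prefix
    marker = Path(stdlib_dir) / "EXTERNALLY-MANAGED"
    return marker.exists() and not in_venv


def create_build_venv(path: str) -> None:
    """Create a venv with pip available (hypothetical helper, path is up to the caller)."""
    venv.create(path, with_pip=True)


if __name__ == "__main__":
    print("system pip install blocked:", pip_install_blocked())
```

Installing tox and the project dependencies into such a venv, rather than into the base environment, sidesteps the marker check without overriding it.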

And now! Finally we are hitting some PyFlink + Java 17 (?) related issues! Everything so far has been other python dependency or CI env problems.

This test job has failed test output. I can reproduce this in local Docker as well.

It would be nice if we could have separated the CI problems from the Java upgrade, but we can't build on top of our Java 11 images anymore. They are based on Debian buster, which no longer works because the buster-backports Debian repo is gone.

So, we have to solve Java 17 issues now.

Next up: dive into the test failures to figure out what is going on.

It was apache-beam using pkg_resources after all. Pinning setuptools < 72 did it! Yeehaw!
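As a rough illustration of guarding that pin, the sketch below checks a setuptools version string against the `< 72` bound mentioned above. The helper names are hypothetical; in practice the pin itself would normally live in a requirements or constraints file as `setuptools<72`.

```python
from importlib import metadata


def setuptools_pin_ok(version: str, upper_major: int = 72) -> bool:
    """Return True if the given version string has a major version below the bound."""
    major = int(version.split(".")[0])
    return major < upper_major


def installed_setuptools_version():
    """Return the installed setuptools version, or None if it is not installed."""
    try:
        return metadata.version("setuptools")
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_setuptools_version()
    if found is None:
        print("setuptools is not installed")
    else:
        print(found, "satisfies pin:", setuptools_pin_ok(found))
```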

There was another issue with the new eventutilities-python version, where the argparse library was checking if a folder was writable or not, and it fails on K8s because /srv/app is not writable. It was addressed in this MR and released in version 0.20.0.
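To make the failure mode concrete, here is a minimal sketch of a writability check with a fallback, assuming a temp directory is an acceptable substitute. This is not the change made in the MR, and `pick_writable_dir` is a hypothetical name; it only shows why a read-only /srv/app trips a check of this kind.

```python
import os
import tempfile
from pathlib import Path


def pick_writable_dir(preferred: str) -> Path:
    """Return `preferred` if it is an existing writable directory, else the system temp dir."""
    candidate = Path(preferred)
    if candidate.is_dir() and os.access(candidate, os.W_OK):
        return candidate
    # Fall back instead of failing, e.g. when the app dir is mounted read-only.
    return Path(tempfile.gettempdir())


if __name__ == "__main__":
    print(pick_writable_dir("/srv/app"))
```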

Still, after deploying 0.20.0, there is another issue. It seems to be similar to this Flink issue, although according to the docs that one was fixed.

This is the log:

java.net.MalformedURLException: no protocol: ['file:/usr/local/lib/python3.11/dist-packages/pyflink/opt/flink-python-1.20.2.jar', 'file:/usr/local/lib/python3.11/dist-packages/pyflink/opt/flink-python-1.20.2.jar', 'file:///opt/lib/venv/lib/python3.11/site-packages/eventutilities_python/lib/flink-connector-kafka-3.3.0-1.20.jar', 'file:///opt/lib/venv/lib/python3.11/site-packages/eventutilities_python/lib/eventutilities-flink-1.4.0-jar-with-dependencies.jar', 'file:///opt/lib/venv/lib/python3.11/site-packages/eventutilities_python/lib/flink-python-1.20.0-tests.jar', 'file:///opt/lib/venv/lib/python3.11/site-packages/eventutilities_python/lib/kafka-clients-3.4.0.jar']
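The message reads as if the string form of a Python list (note the leading `[`) reached a Java URL parser as a single value. Flink's `pipeline.jars` option expects a semicolon-separated string of jar URLs, so a sketch of the safe conversion looks like the following. The helper `to_pipeline_jars` is hypothetical and is not the code from the MR; the shortened jar paths are placeholders.

```python
from typing import Iterable
from urllib.parse import urlparse


def to_pipeline_jars(jar_urls: Iterable[str]) -> str:
    """Join jar URLs into one semicolon-separated string, dropping duplicates.

    Raises ValueError for entries without a URL scheme, since those would
    fail on the Java side with "no protocol".
    """
    unique = []
    for url in jar_urls:
        if not urlparse(url).scheme:
            raise ValueError(f"jar URL has no protocol: {url!r}")
        if url not in unique:
            unique.append(url)
    return ";".join(unique)


if __name__ == "__main__":
    jars = [
        "file:/tmp/example/flink-python.jar",   # placeholder path
        "file:/tmp/example/flink-python.jar",   # duplicate, as in the log
        "file:///tmp/example/kafka-clients.jar",  # placeholder path
    ]
    print("broken:", str(jars))          # list repr, not a valid URL
    print("fixed: ", to_pipeline_jars(jars))
```

Passing `str(jars)` instead of the joined string is one way to end up with the exception shown in the log.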

This MR addresses the latest issue: https://gitlab.wikimedia.org/repos/data-engineering/eventutilities-python/-/merge_requests/96. I have published a dev0 version of this change and tested it in K8s, and it is now working fine.
We still need to merge this MR, create a proper release, and include it in the applications that use the failing version.

Change #1248808 had a related patch set uploaded (by JavierMonton; author: JavierMonton):

[operations/deployment-charts@master] stream: mw-page-html-content-change-enrich-next

https://gerrit.wikimedia.org/r/1248808

Related to that previous MR, I believe that the issue was solved in Flink 1.20.3 with this commit. Maybe we could ensure we use that one, rather than 1.20.2.

Change #1248808 merged by jenkins-bot:

[operations/deployment-charts@master] stream: mw-page-html-content-change-enrich-next

https://gerrit.wikimedia.org/r/1248808

@JMonton-WMF in eventutilities-python MR96 you wrote:

It seems to be similar or related to https://issues.apache.org/jira/browse/FLINK-36457, which, if I'm not reading it wrong, was fixed in version 1.20.3

Should we just upgrade to 1.20.3? If available in PyPI, it should be pretty easy to do. I can make it available today probably.

Ottomata renamed this task from Upgrade mediawiki-event-enrichment jobs to Flink 1.20.2 and Java 17 to Upgrade mediawiki-event-enrichment jobs to >= Flink 1.20.2 and Java 17. (Mar 9 2026, 2:00 PM)

Change #1249312 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/docker-images/production-images@master] flink - bump to 1.20.3 to pick up fix for FLINK-36457

https://gerrit.wikimedia.org/r/1249312

Ottomata updated Other Assignee, added: Ottomata.

Sounds good. I was a bit blocked because the previous version was failing, and I ended up doing that workaround. But upgrading to 1.20.3 definitely looks better.

Change #1249312 merged by Btullis:

[operations/docker-images/production-images@master] flink - bump to 1.20.3 to pick up fix for FLINK-36457

https://gerrit.wikimedia.org/r/1249312

Change #1251456 had a related patch set uploaded (by JavierMonton; author: JavierMonton):

[operations/deployment-charts@master] stream: mw-page-html-content-change-enrich

https://gerrit.wikimedia.org/r/1251456

Change #1251456 merged by jenkins-bot:

[operations/deployment-charts@master] stream: mw-page-html-content-change-enrich

https://gerrit.wikimedia.org/r/1251456

The workaround added in eventutilities-python has been reverted; version 0.22 doesn't use it anymore.
Mediawiki-event-enrichment is now using the new Flink 1.20.3 docker image. It's deployed with the HTML enrichment pipeline, but not with the others yet.
I'll deploy the others on Monday.

Change #1253425 had a related patch set uploaded (by JavierMonton; author: JavierMonton):

[operations/deployment-charts@master] stream: mw-content-history-reconcile-enrich

https://gerrit.wikimedia.org/r/1253425

Change #1253426 had a related patch set uploaded (by JavierMonton; author: JavierMonton):

[operations/deployment-charts@master] stream: mw-page-content-change-enrich

https://gerrit.wikimedia.org/r/1253426

Change #1253425 merged by jenkins-bot:

[operations/deployment-charts@master] stream: mw-content-history-reconcile-enrich

https://gerrit.wikimedia.org/r/1253425

Change #1253426 merged by jenkins-bot:

[operations/deployment-charts@master] stream: mw-page-content-change-enrich

https://gerrit.wikimedia.org/r/1253426

mforns renamed this task from Upgrade mediawiki-event-enrichment jobs to >= Flink 1.20.2 and Java 17 to Upgrade mediawiki-event-enrichment jobs to >= Flink 1.20.3 and Java 17. (Mar 16 2026, 3:57 PM)
mforns updated the task description.

Folks, I just did a deployment for mw-content-history-reconcile-enrich to get https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1253633 into prod, but got this unexpected diff:

-   image: docker-registry.discovery.wmnet/repos/data-engineering/mediawiki-event-enrichment:v1.43.0
+   image: docker-registry.discovery.wmnet/repos/data-engineering/mediawiki-event-enrichment:v1.44.0

I should have followed my instinct but I didn't, and I pushed the change to prod. However, this image upgrade broke production.

I just reverted via 1253650: Revert "stream: mw-content-history-reconcile-enrich" | https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1253650.

I suggest we revert all other production pyflink image changes, and that we test this image change only in -next deployments until we sort the kinks out.

Slack thread with more details.

Being bold and reverting changes to mw-page-content-change-enrich to avoid inadvertently repeating T408918#11715866.

1253653: Revert "stream: mw-page-content-change-enrich" | https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1253653

Thanks @xcollazo, I was about to do that deployment today. I'll search logs to understand what happened.

Version 1.44.0, which includes the Flink upgrade, is already deployed on the HTML enrichment pipeline and it works, but we haven't tried it with page_content_change yet.

As you suggest, I'll deploy the -next version with the new image and test it. When it works, I'll do the same for the regular deployment.

Change #1254118 had a related patch set uploaded (by JavierMonton; author: JavierMonton):

[operations/deployment-charts@master] stream: mw-content-history-reconcile-enrich-next

https://gerrit.wikimedia.org/r/1254118

Change #1254118 merged by jenkins-bot:

[operations/deployment-charts@master] stream: mw-content-history-reconcile-enrich-next

https://gerrit.wikimedia.org/r/1254118

Change #1254132 had a related patch set uploaded (by JavierMonton; author: JavierMonton):

[operations/deployment-charts@master] stream: mw-content-history-reconcile-enrich

https://gerrit.wikimedia.org/r/1254132

Change #1254135 had a related patch set uploaded (by JavierMonton; author: JavierMonton):

[operations/deployment-charts@master] stream: mw-page-content-change-enrich

https://gerrit.wikimedia.org/r/1254135

Change #1254132 merged by jenkins-bot:

[operations/deployment-charts@master] stream: mw-content-history-reconcile-enrich

https://gerrit.wikimedia.org/r/1254132

Change #1254135 merged by jenkins-bot:

[operations/deployment-charts@master] stream: mw-page-content-change-enrich

https://gerrit.wikimedia.org/r/1254135

The issue was that some time ago, between v1.43 and v1.44, we moved all the code into the python/ folder, so the entry point in Helm needed an update too.
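A pre-deploy check along these lines could catch this kind of mismatch earlier. This is only a sketch under assumptions: the function name, the file name `job.py`, and the idea of checking against an unpacked image or source tree are all hypothetical and do not reflect the actual chart or repository layout.

```python
from pathlib import Path


def entry_point_exists(image_root: str, entry_point: str) -> bool:
    """Return True if the configured entry point is a file under the given root.

    A leading "/" is stripped so the path is always resolved relative to the root.
    """
    return (Path(image_root) / entry_point.lstrip("/")).is_file()


if __name__ == "__main__":
    import tempfile

    with tempfile.TemporaryDirectory() as root:
        # Simulate a layout where the code was moved into python/.
        (Path(root) / "python").mkdir()
        (Path(root) / "python" / "job.py").write_text("# placeholder\n")
        print("old entry point found:", entry_point_exists(root, "job.py"))
        print("new entry point found:", entry_point_exists(root, "python/job.py"))
```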

Now all PyFlink applications are on v1.44 and running on prod.