
[EPIC] Upgrade flink jobs to java 17
Open, High, Public

Description

After upgrading jobs to flink 1.20 (T376812) we should consider also upgrading the jobs to support java 17.
Java 17 is marked as experimental with flink 1.20 but fully supported in flink 2.
To ease operations of the flink image, where we'd like to no longer support java 11, we should test whether java 17 works for our jobs running flink 1.20.

Actual work will need to happen in separate subtasks:

  • T400600: flink 1.20 base image with java 17: docker-registry.wikimedia.org/flink:1.20.2-wmf1-20250912
  • [TODO] test eventutilities artifacts with java 17
  • [TODO] test eventutilities_python with java 17
  • T404417: upgrade SUP to java 17
  • T404944 (we will test the java 17 base image while migrating to flink 1.20.2)
  • T408918: upgrade dse enrichment jobs to java 17

AC:

  • all flink deployments run on java 17

Details

Related Changes in Gerrit:
Related Changes in GitLab:
  • Specify openjdk package as a gitlab variable (repos/data-engineering/workflow_utils!56, author: tchin, branch conda-setup-openjdk-variable → main)

Event Timeline

Restricted Application added a subscriber: Aklapper.
dcausse updated the task description.
pfischer moved this task from needs triage to Active Epics on the Discovery-Search board.

IIUC, recent 1.20.2 flink image versions are only Java 17?

Could we consider adding the java version to the image version too? E.g. something like '1.20.2-java17-wmf1-20250914'? That way we could choose which java version is being used.

Well, the goal of this task is to attempt to get rid of java11 so that we can stop backporting this java version. If you're willing to stick to java11, there's the option to stay on flink 1.17 or the previous flink 1.20.1 image. If there's a strong need to have both java11 and flink 1.20, then indeed we would have to tackle T401694 first.

A quick note on Java 17 being marked as experimental: this mostly applies to java17 language features. Say you're building your job with java17 and start using java records; Flink added some support for those. If you're building your job with java8 or java11, I believe this is less risky (see the sketch after the list below).

  • We'll port the search update pipeline to java17 and test (actually building it with java17).
  • The wdqs updater won't be java17 anytime soon and could serve as a test to see if a job built with java8 can run on top of this image.
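
To make the risk concrete, a minimal hypothetical sketch (the record name is illustrative, not from our code) of the java17-only construct at stake. A job compiled targeting java8 or java11 cannot contain records at all, so it never exercises Flink's experimental record serialization support:

```java
// Hypothetical event type, only possible when the job itself is built with java17.
public record PageChangeEvent(String wiki, long pageId, long revId) {}

// Elsewhere in the job the record would be used like any other POJO, e.g.:
// DataStream<PageChangeEvent> events =
//         env.fromData(new PageChangeEvent("enwiki", 42L, 1234L));
```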

Okay, ^ makes it sound much less risky. Thanks.

Oh, or do we need T401694 first?

JFTR, maintaining Java 11 in parallel is a significant time commitment; we'd mostly do it if needed for Hadoop, and until a decision on the Hadoop update is made I won't book the time for it. That said, if Flink isn't working with 17 at all, then ofc that is also a reason to move forward with T401694 (but let's finish the current tests first).

We need to spike readiness of eventutilities_python to adopt Java17. In _theory_ there should not be significant bytecode changes in our code paths. In practice I would not consider this transition entirely risk-free.

Sure, I added a bullet point regarding eventutilities_python and framed it as just "testing" that it can run with java17. This could possibly be done by adding an extra CI step with a java17 image?

I think so. The gitlab pipeline builds atop an openjdk11 image: https://gitlab.wikimedia.org/repos/data-engineering/eventutilities-python/-/blob/main/.gitlab-ci.yml?ref_type=heads#L13
cc @tchin @Ottomata
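
As a sketch of what such a CI step could assert (this check is hypothetical, not an existing test), a tiny class compiled and run on the java17 image would fail fast if the wrong JVM is picked up:

```java
// Hypothetical CI smoke test: fail fast if the runtime is not the expected JVM.
public class JavaVersionCheck {
    public static void main(String[] args) {
        int feature = Runtime.version().feature();
        if (feature != 17) {
            throw new IllegalStateException("expected Java 17, got Java " + feature);
        }
        System.out.println("running on " + Runtime.version());
    }
}
```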

re eventutilities_python: another point is that bookworm ships with python 3.11 instead of 3.9 (pyflink claims to be compatible with it)

I did a quick test using the search flink job; unfortunately it failed because some newer java17 JVM options are not passed properly, esp. https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/refs/heads/master/charts/flink-kubernetes-operator/conf/flink-conf.yaml#27 which is supposed to attach those options for any job running flink 1.19 or later. Overall this is good news, because these options are only useful for java17, which suggests that java17 is actively used by others. We just have to debug the operator chart or flink-app chart to understand why these options are not propagated to downstream jobs.
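
For context on what those options do: since JDK 16, reflective access into JDK internals (which Flink and its serializers still rely on in places) is blocked unless the module is explicitly opened. A minimal standalone probe (the module/flag names illustrate the pattern; they are not taken from the chart):

```java
import java.lang.reflect.Field;
import java.util.ArrayList;
import java.util.List;

// On Java 17 this throws InaccessibleObjectException unless run with e.g.
// --add-opens java.base/java.util=ALL-UNNAMED, which is the kind of flag
// the flink-conf default is supposed to attach to job JVMs.
public class AddOpensProbe {
    public static void main(String[] args) throws Exception {
        List<String> list = new ArrayList<>(List.of("a", "b"));
        Field elementData = ArrayList.class.getDeclaredField("elementData");
        elementData.setAccessible(true); // blocked by strong encapsulation
        System.out.println(((Object[]) elementData.get(list)).length);
    }
}
```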

Change #1189823 had a related patch set uploaded (by DCausse; author: DCausse):

[operations/deployment-charts@master] flink-operator: allow upstream config to be applied

https://gerrit.wikimedia.org/r/1189823

With some tweaks of the java options we have the search flink job running java17 and bookworm in wikikube staging; we'll let it run for a while and monitor various metrics.

Change #1189823 merged by Brouberol:

[operations/deployment-charts@master] flink-operator: allow upstream config to be applied

https://gerrit.wikimedia.org/r/1189823

The rdf-streaming-updater is now running on top of java17 with flink 1.20.2 on both codfw and eqiad. So far it looks stable.