
Flink docker image should work with pyflink
Closed, ResolvedPublic9 Estimated Story Points

Description

Background/Goal

This task contains some notes/WIP on behaviour experienced with Flink wmf images. Needs grooming/investigation to turn it into a feature request.

As a developer I would like to run pytest based CI jobs atop the Flink wmf image.

Current status

I came across some potential issues with PYTHONPATH (pyflink cannot be found) and missing python deps (protobuf, grpc) when trying to run pytest inside a flink wmf container.

Manually installing the python deps (pip install protobuf) and setting up a Python path along the lines of https://gitlab.wikimedia.org/-/snippets/55 fixes part of these issues.

When pyflink is provided (via pip install apache-flink or via eventutilities-python[provided]) pytest executes correctly.

I was also unable to reproduce this with a vanilla Python distribution (though I might be dealing with stale envs on my end).

Note: a similar issue with missing deps (protobuf) has been reported when running Flink datastream examples via the Helm chart. The issue was not encountered with other examples (datagen) or with mediawiki-stream-enrichment; both cases might be pulling in the transitive deps themselves.

Key Tasks/Dependencies

Acceptance Criteria

  • pyflink examples and tests work with base flink image
Related tasks:

Event Timeline

Restricted Application added a subscriber: Aklapper.

Summary of chat with @Ottomata.

There are two aspects to address:

  1. Missing python deps (e.g. protobuf) in our docker images. We should meet the install_requires declared in https://github.com/apache/flink/blob/master/flink-python/setup.py#L308.
  2. Flink env variables required for running tests (or any interactive session). We should provide a pytest plugin that sets FLINK_HOME and adds pyflink.zip (and the other runtime deps bundled with Flink) to PYTHONPATH, similar to https://github.com/malexer/pytest-spark. For the time being (= one single pyflink app to worry about) we can put the env variable config directly in the test suite.

We should meet the install_requires

https://gerrit.wikimedia.org/r/c/operations/docker-images/production-images/+/883278

Trying this, but not having a lot of luck.

All of the instructions and examples I can find around using pyflink simply pip install apache-flink. The flink-kubernetes-operator python example and the instructions on using pyflink on docker both do this, even at the expense of installing flink twice in the image! They inherit from the upstream flink image, which already has flink installed at /opt/flink!

I've asked the flink mailing list about this, we'll see what they say.

JArguello-WMF renamed this task from [NEEDS GROOMING] Set PYTHONPATH and FLINK_CLASSPATH in Flink docker images. to Set PYTHONPATH and FLINK_CLASSPATH in Flink docker images. · Jan 25 2023, 3:11 PM
JArguello-WMF set the point value for this task to 9.
Ottomata renamed this task from Set PYTHONPATH and FLINK_CLASSPATH in Flink docker images. to Flink docker image should work with pyflink. · Jan 25 2023, 3:12 PM

No response from the mailing list yet, but really, the pip-installed flink just works better. I'm going to submit a patch to production-images.

I'm a bit hesitant to use the distribution packaged for pyflink, as I'm not sure it's meant to be used that way for submitting production jobs, and I don't understand why they provide part of the flink distribution again in the python package tbh (for testing, perhaps?). Are there issues other than space usage with the method that upstream suggests at https://nightlies.apache.org/flink/flink-docs-release-1.15/docs/deployment/resource-providers/standalone/docker/#using-flink-python-on-docker ?

why they provide part of the flink distribution again in the python package tbh

Do you mean in the example Dockerfile? FROM flink + pip install apache-flink. I'd assume they did this for convenience. E.g. /docker-entrypoint.sh will be inherited from the upstream flink image, and flink and python deps will be installed by pip.

I'm a bit hesitant to use the distribution packaged for pyflink as I'm not sure it's meant to be used that way for submitting production jobs

I believe it is for pyflink based production jobs. FWIW, this is how we are distributing spark 3 in analytics cluster too.

The pyflink pip module includes all of the same lib jars and bin scripts that come with the flink tarball distribution. From my limited tests, it should behave the same, just with all the python stuff set up right too.

Are there other issues than space usage when using the method that upstream suggests

I don't know. Is it worth a try?

Another option is to make two different flink base images.

Talked to David in IRC, we are going to give the pyflink based image a go, as long as we include some plugin .jars we need in opt/. Namely the s3 presto connector.

Okay @gmodena @dcausse pyflink based flink-1.16.0-wmf4 image published. flink-fs-s3-presto is in /opt/flink/opt.

Got some responses to my Qs on the Flink mailing list.

[AO]: What is the reason for including opt/python/{pyflink.zip,cloudpickle.zip,py4j.zip} in the base distribution then? Oh, a guess: to make it easier for TaskManagers to run pyflink without having pyflink installed themselves? Somehow I'd guess this wouldn't work though; I'd assume TaskManagers would also need some python transitive dependencies, e.g. google protobuf.

It has some historical reasons. The first version (1.9.x) did not provide Python UDF support, so it was not necessary to install PyFlink on the TaskManager nodes. Since 1.10, which supports Python UDFs, users have to install PyFlink on the TaskManager nodes as there are many transitive dependencies, e.g. Apache Beam, protobuf, pandas, etc. However, we have not removed these packages as they are still useful for the client node, which is responsible for compiling jobs (it's not necessary to install PyFlink on the client node).

[AO]: Since we're building our own Docker image, I'm going the other way around: just install pyflink, and symlink /opt/flink -> /usr/lib/python3.7/dist-packages/pyflink. So far so good, but I'm worried that something will be fishy when trying to run JVM apps via pyflink.

Good idea! The PyFlink package contains everything needed to run JVM apps, so I think you could just try it this way.

So, we have received blessings from a member of Flink PMC, so fingers crossed it'll all work!
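The symlink approach from the exchange above can be sketched like this. Rather than hard-coding the dist-packages path from the quote, this resolves wherever pip actually installed pyflink; the /opt/flink target and running the symlink step as root at image build time are assumptions:

```python
import importlib.util
import os


def module_install_dir(name):
    """Return the directory pip installed `name` into, or None if absent.

    Uses importlib.util.find_spec so nothing from the module (or its
    possibly heavy transitive deps) is actually imported.
    """
    spec = importlib.util.find_spec(name)
    if spec is None or spec.origin is None:
        return None
    return os.path.dirname(spec.origin)


# At image build time, something along the lines of:
#   target = module_install_dir("pyflink")
#   os.symlink(target, "/opt/flink")   # so FLINK_HOME=/opt/flink keeps working
```

This keeps one flink installation in the image (the pip one) while tools that expect FLINK_HOME=/opt/flink continue to work through the symlink.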

Change 885361 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/deployment-charts@master] mw-page-content-change-enrichment - use correct image verison v1.0.4

https://gerrit.wikimedia.org/r/885361

Change 885361 merged by Ottomata:

[operations/deployment-charts@master] mw-page-content-change-enrichment - use correct image verison v1.0.4

https://gerrit.wikimedia.org/r/885361