
Refactor parameterization of eventutilities-python and mediawiki-event-enrichment
Closed, Resolved · Public

Description

Currently, eventutilities-python uses various environment variables for configuration, and still has some hardcoded config values (e.g. Kafka brokers).

We should refactor this so that config parameters can be passed in. Some environment variables might be necessary, but the fewer we have the better.

Doing this might be relevant for a larger question of how we'd like to parameterize py(Flink) jobs in general. It would be nice if we had a cool solution that allowed us to use a combination of config files and CLI opts. In refinery-source, we have a Scala ConfigHelper that does just this. E.g.

# config.yaml
my_opt_enabled: true
stream_config_uri: https://meta.wikimedia.org/w/api.php
# Load config from config.yaml, but override my_opt_enabled via CLI.
my_flink_job.py --config_file=./config.yaml --my_opt_enabled=false
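This pattern (a config file for defaults, CLI opts as overrides) is easy to sketch in plain Python. The option names below come from the example above; the flat `key: value` parser is a deliberate simplification, not real eventutilities-python code:

```python
import argparse

def parse_flat_config(text):
    """Parse flat 'key: value' lines (a subset of YAML, also compatible
    with flat java-properties-style files using ':' separators)."""
    config = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition(":")
        config[key.strip()] = value.strip()
    return config

def load_config(argv):
    """--config_file supplies defaults; explicit CLI flags override them."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--config_file")
    parser.add_argument("--my_opt_enabled")
    parser.add_argument("--stream_config_uri")
    args = parser.parse_args(argv)

    config = {}
    if args.config_file:
        with open(args.config_file) as f:
            config.update(parse_flat_config(f.read()))
    for key in ("my_opt_enabled", "stream_config_uri"):
        value = getattr(args, key)
        if value is not None:
            config[key] = value  # CLI wins over the config file
    return config
```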

Done is

  • All external configs to stream_manager are parameterized. No hardcoded values.

Details

Title | Reference | Author | Source Branch | Dest Branch
Support auto error file sink | repos/data-engineering/eventutilities-python!44 | otto | param_test_improvements | main
Add testing/utils.py for use of these tests in job repos | repos/data-engineering/eventutilities-python!43 | otto | param_test_improvements | main
Support relative uris in config files | repos/data-engineering/eventutilities-python!42 | otto | relative_param_paths | main
Parameterization of stream_manager | repos/data-engineering/eventutilities-python!38 | otto | refactor3 | main
Refactor on the way to better parameterization | repos/data-engineering/eventutilities-python!35 | otto | refactor1 | main
Add minimal CLI parameterization for stream descriptors | repos/data-engineering/mediawiki-event-enrichment!14 | otto | minimal-args | eventutilities-python-version-bump

Event Timeline


Thanks for this @Ottomata.

From the point of view of a k8s and helm operator, are there best practices we should follow for structuring configs (e.g. config files vs CLI opts)?

I'd assume that flink job submission on k8s would always be managed via helm/helmfile, rather than interactively as tends to happen on hadoop.

Things to be mindful of: we should distinguish flink and related runtime configs (e.g. kafka brokers, JVM) that right now we could already set via flink-conf.yaml, vs. library- and application-specific settings (schema uris & all that, source/destination streams).

In refinery-source, we have a Scala ConfigHelper that does just this.

Would it make sense to provide a Python wrapper to it?

In refinery-source, we have a Scala ConfigHelper that does just this.

Would it make sense to provide a Python wrapper to it?

Hm! Could be worth a try, but I have a feeling this will be more annoying than it's worth. The scala ConfigHelper does a lot of stuff with implicit typing to load the correct types from the CLI opts. In python this won't really matter. But...maybe for consistency it would be nice? We could move ConfigHelper to a library somewhere? I dunno, I'm sure there is some python arg parser that will do this nicely.
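For the typed-loading part, argparse's type= coercion gets reasonably close to what ConfigHelper's implicit typing does. A sketch (hypothetical option names, not real job options):

```python
import argparse

def str2bool(value):
    """Coerce common boolean spellings from CLI/config-file strings."""
    if value.lower() in ("true", "yes", "1"):
        return True
    if value.lower() in ("false", "no", "0"):
        return False
    raise argparse.ArgumentTypeError(f"not a boolean: {value!r}")

parser = argparse.ArgumentParser()
parser.add_argument("--my_opt_enabled", type=str2bool, default=True)
parser.add_argument("--parallelism", type=int, default=1)

# CLI strings are coerced to the declared types.
args = parser.parse_args(["--my_opt_enabled", "false", "--parallelism", "3"])
```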

From the point of view of a k8s and helm operator, are there best practices we should follow for structuring configs (e.g. config files vs CLI opts)?

I don't think so. Config files are nice because they are easier to read at a glance, but, for varying things, debugging, having to edit config files can be a pain. Being able to do both via ConfigHelper has been really really nice!

I'd assume that flink job submission on k8s would always be managed via helm/helmfile, rather than interactively as tends to happen on hadoop.

This is true, but the app code should be agnostic to how it is being submitted and run, no?

Things to be mindful of: we should distinguish flink and related runtime configs (e.g. kafka brokers, JVM) that right now we could already set via flink-conf.yaml, vs. library- and application-specific settings (schema uris & all that, source/destination streams).

+1. E.g. jobmanager.rpc.address is set in flink-conf.yaml. Some things, like checkpoint storage locations, can have defaults set in flink-conf.yaml but still be overridable by the app. Other things are all app specific, e.g. stream config uris. I'd argue the Kafka brokers are app specific too, as flink doesn't care what the data sources and sinks are, but the app does.

I think all the env vars that eventutilities-python has right now (except mayyyyybe EVENTUTILITIES_LIB_DIR...which I hope we can get rid of in T327251) are app settings.

We should ask @dcausse and @bking if they have thoughts too. How do you all handle Flink in k8s application parameterization? What do you think of Flink's Parameter Tool?

ParameterTool is nice, but @pfischer has found a really nice abstraction on top of it: you define your options using a set of ConfigOption. The nice thing is that they're strongly typed and can be used with ParameterTool#getConfiguration().
ParameterTool can be constructed from argv or a properties file. Note that it does seem to only support Map<String, String>, so I'd avoid using yaml as the default format and rather use the simpler java properties file format. They should be compatible though, in a way that you can easily use something like the {{- toYaml .Values.app.config_file.options }} template function to generate it.

Regarding how to pass the config file:

  • for wdqs, if the first arg is a readable file we use ParameterTool.fromPropertiesFile(firstArg)
  • for the search job, @pfischer has written a wrapper that applies a series of overrides; the config file is detected if it's not an option value and it contains :/ (i.e. has a kind of URL shape: file:///path/to/file.properties). Perhaps this wrapper, or a modified version of it, would make sense to move to event-utilities?
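The URL-shape detection described above is tiny to replicate. A sketch in Python (hypothetical helper name, not the actual wrapper):

```python
def split_config_arg(argv):
    """If the first argument looks like a URI (contains ':/' and isn't an
    option flag), treat it as a config file reference.
    Returns (config_uri_or_None, remaining_args)."""
    if argv and not argv[0].startswith("-") and ":/" in argv[0]:
        return argv[0], argv[1:]
    return None, argv
```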

From the point of view of a k8s and helm operator, are there best practices we should follow for structuring configs (e.g. config files vs CLI opts)?

I don't think so. Config files are nice because they are easier to read at a glance, but, for varying things, debugging, having to edit config files can be a pain. Being able to do both via ConfigHelper has been really really nice!

I think it's important for the flink-app chart to have a way to pass and/or populate a config_file, or have a dedicated feature for passing the script options. I was messing around with the args of the jobspec, but I think it might be hard and messy to apply the various values-release.yaml -> values-$dc.yaml -> values.yaml overrides with a plain yaml array... I also don't think you can define your own template once you're defining a helmfile.d service.

In refinery-source, we have a Scala ConfigHelper that does just this.

Would it make sense to provide a Python wrapper to it?

Hm! Could be worth a try, but I have a feeling this will be more annoying than it's worth. The scala ConfigHelper does a lot of stuff with implicit typing to load the correct types from the CLI opts.

Ack. I guess that in the end what matters most is the config format, rather than the tooling.

From the point of view of a k8s and helm operator, are there best practices we should follow for structuring configs (e.g. config files vs CLI opts)?

I don't think so. Config files are nice because they are easier to read at a glance, but, for varying things, debugging, having to edit config files can be a pain. Being able to do both via ConfigHelper has been really really nice!

Maybe https://pypi.org/project/ConfigArgParse/ does something along the lines of what you described? I used it a while ago, and I need to validate what config file formats it supports. One feature I missed at the time was the possibility of merging config files (with fallbacks). Does ConfigHelper provide that?
Not sure if it's relevant to our use case.

I'd assume that flink jobs submission on k8s would always be managed via helm/helmfile, rather than interactively like tends to happen on hadoop.

This is true, but the app code should be agnostic to how it is being submitted and run, no?

I'd say the app code is orthogonal to how it's configured, but the config parsing/merging logic might not be overly generic (I'm just thinking out loud).
Maybe there are tradeoffs we could make.

I think all the env vars that eventutilities-python has right now (except mayyyyybe EVENTUTILITIES_LIB_DIR...which I hope we can get rid of in T327251) are app settings.

+1.

I think it's important for the flink-app chart to have a way to pass and/or populate a config_file, or have a dedicated feature for passing the script options. I was messing around with the args of the jobspec, but I think it might be hard and messy to apply the various values-release.yaml -> values-$dc.yaml -> values.yaml overrides with a plain yaml array... I also don't think you can define your own template once you're defining a helmfile.d service.

Not sure I follow.
Would this config file provide application only options, or also contain options for kafka/flink?

I think it's important for the flink-app chart to have a way to pass and/or populate a config_file, or have a dedicated feature for passing the script options. I was messing around with the args of the jobspec, but I think it might be hard and messy to apply the various values-release.yaml -> values-$dc.yaml -> values.yaml overrides with a plain yaml array... I also don't think you can define your own template once you're defining a helmfile.d service.

Not sure I follow.
Would this config file provide application only options, or also contain options for kafka/flink?

This is for the application options yes.
I might have jumped into trying to find a solution to my problem before explaining what my problem actually is :)
Our jobs have plenty of args (30ish), and this rapidly becomes cumbersome to write in a helmfile values file (e.g. https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/886005/4/helmfile.d/dse-k8s-services/rdf-streaming-updater/values-dse-k8s-eqiad.yaml#11). Problem is that the JobSpec forces the use of an array for passing the args.
This leads to two problems I think:

  • it's easy to lose track of the argname and its corresponding argvalue when using arrays; only careful manual formatting helps
  • I doubt that helmfile will be able to properly merge this array while applying the different levels of values files

The first point is more of an annoyance than a serious problem; the ability to do overrides is a bit more critical to us, because we may have to repeat all these options for every environment/job we deploy. For wikidata/commons it's 2 jobs (wdqs/wcqs) * 3 environments (eqiad/codfw/staging), i.e. 2*3*30 = 180 key-value option pairs to write and maintain.
As of today we handle all these params in a yaml file plus the python script used to deploy the flink jobs to the session cluster; it's not something we'd like to keep, but it's reasonably DRY.
I wish that the flink-app chart provided some tooling to help with that.

I doubt that helmfile will be able to properly merge this array while applying the different levels of values files

True, I don't think it merges arrays, although there might be a way to make the flink-app chart do so? But, that would probably get extra confusing.

Our jobs have plenty of args (30ish) and this becomes rapidly cumbersome to write in helmfile values file

Maybe this is sort of possible now, if the app supports config files and with CLI opt overrides?

# values.yaml
...
args: [
  '--config_file', 'file:///srv/app/config.yaml'
]
# values-dse-k8s-eqiad.yaml
...
args: [
  '--config_file', 'file:///srv/app/config.yaml',
  '--kafka_brokers', 'kafka-jumbo1001.eqiad.wmnet,...',
  ...
]

So, all common configs are in config.yaml, but any deployment specific settings are in args?

I wish that the flink-app chart provided some tooling to help with that.

But yeah, I see what you are saying. Even if we had both config files and CLI opt overrides, because the JobSpec args won't be merged in any way, there is going to be repetition.

I think we could provide a flink-app helper template to make this easier. It would probably have to be opinionated about whatever common arg parser we'd like to use for Flink app code. This might be a good reason to rely on a JVM-based one for python apps too (unfortunately). Maybe we should try refinery-source's ConfigHelper? Oof, it's scala though, so I dunno?

Too bad cuz https://pypi.org/project/ConfigArgParse/ does look really nice.

command line > environment variables > config file values > defaults

Maybe there is a similar java parser?
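That precedence chain is simple enough to implement by hand if no Java parser offers it. A Python sketch (hypothetical helper; the upper-cased env var naming is an assumption):

```python
import os

def resolve_option(name, cli_args, config_file_values, default=None):
    """Resolve one option with ConfigArgParse-style precedence:
    command line > environment variables > config file values > defaults.
    The env var name is assumed to be the upper-cased option name."""
    if name in cli_args:
        return cli_args[name]
    env_var = name.upper()
    if env_var in os.environ:
        return os.environ[env_var]
    if name in config_file_values:
        return config_file_values[name]
    return default
```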

I wish that the flink-app chart provided some tooling to help with that.

But yeah, I see what you are saying. Even if we had both config files and CLI opt overrides, because the JobSpec args won't be merged in any way, there is going to be repetition.

I think we could provide a flink-app helper template to make this easier. It would probably have to be opinionated about whatever common arg parser we'd like to use for Flink app code. This might be a good reason to rely on a JVM-based one for python apps too (unfortunately). Maybe we should try refinery-source's ConfigHelper? Oof, it's scala though, so I dunno?

What I had in mind is some tooling available from the flink-app chart to create yaml files from the values file:

in the top-level values.yaml file you would set:

app:
   config_files:
      my_app_config:
          path: /srv/app/my_app_config.properties # this file would be generated from the flink-app chart using {{ toYaml }}
          content:
             option1: value1
             option2: value2
             brokers: changeme
   job:
      args: ["/srv/app/my_app_config.properties"] # or ["--config_file", "/srv/app/my_app_config.properties"] if you want, it'd be job specific

in values-eqiad.yaml:

app:
   config_files:
      my_app_config:
          content:
             brokers: "broker1.eqiad.wmnet:9093,broker2.eqiad.wmnet:9093"

Interesting, and let the helm dict merging of e.g. config_files.my_app_config.content handle the creation of merged config files? Then the app doesn't do any fancy config merging itself?

Might be a simple solution. Too bad Java doesn't have a better config_file + CLI opt parser. I suppose Java Properties kind of work, but then setting them on CLI with -D'...' is a little annoying.

This does mean that all flink apps that use this need to know how to read a single config file, right?

This might be a good reason to rely on a JVM based one for python apps too (unfortunetly). Maybe we should try refinery-sources ConfigHelper? Oof, its scala though, so I dunno?

On second thought, this won't work. Any parser that requires creation of a Java or Scala class won't work for python. We need a parser that returns something generic, perhaps java.util.Properties.

Or, we just use a different solution for python than for JVM.

Interesting, and let the helm dict merging of e.g. config_files.my_app_config.content handle the creation of merged config files? Then the app doesn't do any fancy config merging itself?

Might be a simple solution. Too bad Java doesn't have a better config_file + CLI opt parser. I suppose Java Properties kind of work, but then setting them on CLI with -D'...' is a little annoying.

This does mean that all flink apps that use this need to know how to read a single config file, right?

yes, the app would then be forced to have a feature to load options from a config file (in yaml that is compatible with the java properties syntax, if you keep your options flat) if it wants to benefit from the overrides done by helmfile. It can ultimately rely solely on the args param if not.

Or, we just use a different solution for python than for JVM.

As long as the two solutions are interoperable (I would like to account for polyglot code bases) and we stick to well-supported file formats, it might not be too bad.

I like the idea of using Java properties in this case, since we are already interfacing with JVM systems to begin with.

yes, the app would then be forced to have a feature to load options from a config file (in yaml that is compatible with the java properties syntax, if you keep your options flat) if it wants to benefit from the overrides done by helmfile. It can ultimately rely solely on the args param if not.

In this scenario, would helm be responsible for parsing my_app_config.properties and merging the output into its own dict?

I think that for the time being we might want to retain compatibility with Yarn (even if "unofficially"), even if it leads to some code duplication.

In this scenario, would helm be responsible for parsing my_app_config.properties and merging the output into its own dict?

I think the idea is to rely on helm to create a final merged my_app_config.properties file that the app reads in. This would still be compatible with Yarn, as long as you have one my_app_config.properties file and don't support any other CLI opts.
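Reading the merged, flat properties file from the app side is small either way. A Python sketch of a java-properties-compatible loader (hypothetical helper; it ignores the escape and line-continuation rules of the full properties format):

```python
def load_properties(path):
    """Load a flat java-properties-style file into a dict of strings.
    Supports '=' and ':' separators and '#'/'!' comment lines."""
    props = {}
    with open(path) as f:
        for raw in f:
            line = raw.strip()
            if not line or line[0] in "#!":
                continue
            # Split at the earliest '=' or ':' present in the line.
            sep = min((i for i in (line.find("="), line.find(":")) if i >= 0),
                      default=-1)
            if sep < 0:
                props[line] = ""
            else:
                props[line[:sep].strip()] = line[sep + 1:].strip()
    return props
```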

Change 913245 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/deployment-charts@master] page_content_change_enrichment - update with latest image and parameterization

https://gerrit.wikimedia.org/r/913245

Change 913245 merged by jenkins-bot:

[operations/deployment-charts@master] page_content_change_enrichment - update with latest image and parameterization

https://gerrit.wikimedia.org/r/913245