|Open||None||T244590 [Epic] Rework the WDQS updater as an event driven application|
|Open||None||T264006 Deploy Flink to kubernetes (k8s)|
|Open||Mstyles||T273098 High Availability Flink|
The changes for HA mostly amount to changing a few lines of the config file. The unclear part is whether we can get a service account for the pods that allows them to do CRUD operations on configmaps, since the pods will need to be able to update the configmaps. Details here. Another thing that would have to change is that JobManager pods should be started with their own IP address, instead of a Kubernetes service, as their jobmanager.rpc.address. I'm not sure how that would work.
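For reference, a rough sketch of the pieces involved, assuming Flink 1.12 or later with its built-in Kubernetes HA services and the stock Flink image entrypoint. The cluster ID, storage path, namespace, image tag, and object names below are placeholders, not what we actually deploy:

```yaml
# --- flink-conf.yaml fragment: the HA-related lines (placeholder values) ---
kubernetes.cluster-id: wdqs-streaming-updater
high-availability: org.apache.flink.kubernetes.highavailability.KubernetesHaServicesFactory
high-availability.storageDir: s3://example-bucket/flink/ha
---
# --- RBAC: let the pods' service account manage configmaps in its namespace ---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: flink-ha-configmaps        # hypothetical name
  namespace: example-namespace     # placeholder
rules:
  - apiGroups: [""]
    resources: ["configmaps"]
    verbs: ["create", "get", "list", "watch", "update", "patch", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: flink-ha-configmaps        # hypothetical name
  namespace: example-namespace     # placeholder
subjects:
  - kind: ServiceAccount
    name: flink                    # hypothetical service account used by the pods
    namespace: example-namespace
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: flink-ha-configmaps
---
# --- JobManager pod spec fragment: use the pod IP as jobmanager.rpc.address ---
containers:
  - name: jobmanager
    image: flink:1.12              # placeholder tag
    env:
      - name: POD_IP
        valueFrom:
          fieldRef:
            apiVersion: v1
            fieldPath: status.podIP
    # With the stock entrypoint, the second argument overrides the
    # jobmanager.rpc.address from the config file with the pod's own IP.
    args: ["jobmanager", "$(POD_IP)"]
```

The three documents are separate files or fragments shown together for brevity. Whether our own image and chart accept the pod IP argument this way would still need to be checked.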
The alternative to updating configmaps is to run a separate ZooKeeper cluster; more on that here.
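For comparison, the ZooKeeper variant would look roughly like this in flink-conf.yaml. This is only a sketch: the quorum hosts, root path, cluster ID, and storage path are made-up placeholders.

```yaml
# flink-conf.yaml fragment for ZooKeeper-based HA (placeholder values)
high-availability: zookeeper
high-availability.zookeeper.quorum: zk-1.example:2181,zk-2.example:2181,zk-3.example:2181
high-availability.zookeeper.path.root: /flink
high-availability.cluster-id: /wdqs-streaming-updater
high-availability.storageDir: s3://example-bucket/flink/ha
```

With this setup no extra Kubernetes permissions are needed for the pods, at the cost of operating the ZooKeeper ensemble.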
I do see that the configmap election method is appealing, as it is built in and does not require additional software to function. Unfortunately I was not able to work out (from briefly reading the docs) whether this uses a separate configmap or the one that is actually used for configuring Flink.
While the former would be okay-ish I guess, the latter will potentially cause problems, as every deployment will result in Helm re-creating said configmap and resetting it to whatever state the chart has defined.
Apart from potentially losing data in that case, I'm not 100% certain that Helm will handle that properly in every case, as I have seen too many weird issues with Helm and "manually" altered Kubernetes objects.
Three configmaps are created by the pods: dispatcher-leader, resourcemanager-leader, and restserver-leader. Helm doesn't reference any of these configmaps, so a Helm upgrade shouldn't affect them.
One problem with implementing HA without using the session cluster (T280166) is that Flink can't properly start the Streaming Updater job locally, since the jar needs to be stored in the storage directory. I don't know of a workaround if we don't end up using the session cluster for local/CI use cases.