When restarting eventgate-main using helmfile -e ENV_NAME --state-values-set roll_restart=1 sync, many client requests appear to fail during the process. For example, I saw a spike of 5k MW jobs failing with T249745 (Could not enqueue jobs) during the restart procedure.
Watching the pods during that procedure, I had the impression that many pods were restarted at once.
Some failed requests during a rolling restart are probably to be expected, but more than 5k does not seem normal.
In the discussion below, it was noted that this was solved for MW as part of T331609: Gracefully handle pod termination in mw-on-k8s. We should do the same for the eventgate Helm chart.
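
For illustration, here is a minimal sketch of what graceful pod termination commonly looks like in a Kubernetes Deployment. It is not taken from T331609 or from the eventgate chart. The container name, the sleep duration, the grace period, and the rollout limits are all assumptions, and it assumes the container image ships /bin/sh. The idea is to delay SIGTERM with a preStop hook so the pod is removed from the service endpoints before the process shuts down, and to limit how many pods are replaced at once.

```yaml
# Hypothetical fragment of a Deployment template; values are assumptions.
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0   # assumed: never drop below the desired replica count
      maxSurge: 25%       # assumed: bring up new pods in small batches
  template:
    spec:
      # Must be longer than the preStop sleep plus the time needed
      # to finish in-flight requests after SIGTERM.
      terminationGracePeriodSeconds: 30
      containers:
        - name: eventgate-main   # hypothetical container name
          lifecycle:
            preStop:
              exec:
                # Wait before SIGTERM is sent so that load balancers and
                # endpoints stop routing new requests to this pod first.
                command: ["/bin/sh", "-c", "sleep 10"]
```

Whether a sleep of this length is enough depends on how quickly endpoint changes propagate to the clients, so the numbers would need to be checked against the actual setup.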
