
[Event Platform] Gracefully handle pod termination in eventgate Helm chart
Closed, Resolved · Public

Description

When restarting eventgate-main using helmfile -e ENV_NAME --state-values-set roll_restart=1 sync, it seems that many client requests fail during the process; e.g. I've seen a spike of 5k MW jobs failing with T249745 (Could not enqueue jobs) during the restart procedure.

Watching the pods during that procedure, I had the impression that many pods were restarted at once.

Failing some requests during a rolling restart is probably to be expected, but more than 5,000 failures does not seem normal?

In discussion below, it was noted that this was solved for MW as part of T331609: Gracefully handle pod termination in mw-on-k8s. We should do the same for the eventgate Helm chart.

Event Timeline

Restricted Application added a subscriber: Aklapper.

The problem can also be that we have one component in front of the service (envoyproxy) that gets terminated immediately, while terminating the application on the backend takes more time, so connections are cut off right away. For MediaWiki, for example, we just added a pause before terminating envoy to overcome this problem.
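For illustration, that kind of pause is usually implemented as a preStop hook on the envoy sidecar. A minimal sketch at the pod-spec level; the container name, image, and sleep duration are assumptions, not the deployed configuration:

```yaml
# Minimal sketch: a preStop sleep delays SIGTERM to the envoy sidecar so it
# keeps proxying in-flight requests while the backend container shuts down.
# Container name, image, and duration are illustrative assumptions.
containers:
  - name: eventgate-main-tls-proxy
    image: envoyproxy/envoy:v1.23.0        # placeholder image
    lifecycle:
      preStop:
        exec:
          command: ["/bin/sleep", "5"]
```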

But also - why are we not retrying to enqueue jobs if it fails? We should probably add a retry on 5xx to the mesh component calling eventgate-main.
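For illustration, such a retry could be expressed as a retry_policy hash on the eventgate-main services_proxy listener. A sketch, with assumed key nesting and placeholder values:

```yaml
# Sketch only: a 5xx retry policy for the mesh (services_proxy) listener in
# front of eventgate-main. Key names follow envoy's route RetryPolicy fields;
# the exact nesting and values are assumptions, not deployed configuration.
eventgate-main:
  timeout: "61s"
  retry_policy:
    retry_on: "5xx"
    num_retries: 2
    per_try_timeout: "20s"
```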

This will be a problem not just for jobs, but for all events sent by EventBus.

Is there a way we could force envoyproxy to wait until other containers are shut down?

Also, this seems like it will be a problem for many services, not just EventGate, no? It's just acute in EventGate because it means messages are lost. Ideally k8s wouldn't shut down a pod until it has finished processing any in-flight messages (or after some timeout), right? Is there some way to remove ingress while the pod shuts down, so it won't receive any new requests?

> Is there some way to remove ingress while the pod shuts down, so it won't receive any new requests?

This happens by default; see https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#pod-termination. But of course there is no way for k8s to know whether a process is still serving in-flight requests, so that needs to be handled by the processes themselves. See T331609: Gracefully handle pod termination in mw-on-k8s / https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/928791 for what has been implemented for MediaWiki on k8s.

Ah, great. Okay, so IIUC, we should:

  • Upgrade relevant vendor templates in the eventgate chart,
  • Add support for main_app.prestop_sleep to the eventgate chart container, as done for MW here,
  • Add support for terminationGracePeriodSeconds on the eventgate chart container, as done for MW here,
  • Set terminationGracePeriodSeconds, mesh.prestop_sleep, and main_app.prestop_sleep in the chart's default values.yaml file, possibly overriding them in helmfiles if needed (a rough values.yaml sketch follows below).
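A rough sketch of what those chart defaults could look like (the key nesting and numbers are placeholders, not the final settings):

```yaml
# Placeholder sketch of eventgate chart defaults in values.yaml; the exact
# nesting is assumed and the numbers are not the values that were deployed.
main_app:
  prestop_sleep: 7                    # seconds the eventgate container sleeps before SIGTERM
mesh:
  prestop_sleep: 7                    # seconds the envoy sidecar sleeps before SIGTERM
terminationGracePeriodSeconds: 75     # must exceed the sleeps plus time to drain in-flight requests
```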

@JMeybohm does that sound right?

Ottomata renamed this task from A rolling restart of eventgate-main seems to cause many client failures to [Event Platform] Gracefully handle pod termination in eventgate Helm chart. · Oct 27 2023, 2:11 PM
Ottomata updated the task description.

> @JMeybohm does that sound right?

Yes, although this will still not make envoy actively drain connections. As you can see from that task, this is not what we went with in the end.

> But also - why are we not retrying to enqueue jobs if it fails? We should probably add a retry on 5xx to the mesh component calling eventgate-main.

Regardless of the above, this is still a valid question, I'd say.

> Regardless of the above, this is still a valid question, I'd say.

Indeed!

> Although this will still not make envoy actively drain connections

Right, but if we make this longer than the request timeout, I think the behavior should suffice?

> Although this will still not make envoy actively drain connections

> Right, but if we make this longer than the request timeout, I think the behavior should suffice?

Yes

> But also - why are we not retrying to enqueue jobs if it fails? We should probably add a retry on 5xx to the mesh component calling eventgate-main.

@Joe, past you did this already? :)

The envoy proxy request timeout to eventgate-main is currently 61 seconds though, and EventBus MW PHP sets its request timeout to 62 seconds in CommonSettings.php.
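So, roughly, the values need to nest like this for the sleep to cover in-flight requests (a back-of-the-envelope sketch with illustrative numbers; the deployed values are in the patches below):

```yaml
# Illustrative ordering of the timeouts involved, outermost first:
eventbus_client_timeout: 62      # CommonSettings.php request timeout
envoy_route_timeout: 61          # services_proxy timeout to eventgate-main
main_app_prestop_sleep: 62       # >= the request timeout, so in-flight requests
                                 # can finish before the container gets SIGTERM
termination_grace_period: 70     # > prestop_sleep plus time for a clean shutdown
```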

Also q: per_try_timeout: "20s" is set on eventgate-main services_proxy, but I don't see this key being used anywhere in the helm charts. I see it mentioned in several helmfile fixtures, but nowhere in the mesh configuration. Should we remove this?

> Also q: per_try_timeout: "20s" is set on eventgate-main services_proxy, but I don't see this key being used anywhere in the helm charts. I see it mentioned in several helmfile fixtures, but nowhere in the mesh configuration. Should we remove this?

No. We loop over the retry_policy hash in https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/refs/heads/master/modules/mesh/configuration_1.4.3.tpl#370
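That template renders the hash into the route's retry policy, so the per_try_timeout value does end up in the generated envoy config. Roughly, and purely as an illustration (not the literal template output):

```yaml
# Illustrative shape of the rendered envoy route config; see
# configuration_1.4.3.tpl (linked above) for the authoritative template.
route:
  cluster: eventgate-main
  timeout: 61s
  retry_policy:
    retry_on: "5xx"            # illustrative; whether 5xx retries are enabled is the question above
    num_retries: 1             # illustrative
    per_try_timeout: 20s       # the "20s" from the services_proxy values
```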

Change 971963 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/deployment-charts@master] eventgate chart - graceful restart policy with relative timeout/prestop_sleep values

https://gerrit.wikimedia.org/r/971963

Change 971963 merged by jenkins-bot:

[operations/deployment-charts@master] eventgate chart - graceful restart policy with relative timeout/prestop_sleep values

https://gerrit.wikimedia.org/r/971963

Change 971986 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/mediawiki-config@master] wgEventServices - add docs about timeout settings

https://gerrit.wikimedia.org/r/971986

Change 971986 merged by jenkins-bot:

[operations/mediawiki-config@master] wgEventServices - add docs about timeout settings

https://gerrit.wikimedia.org/r/971986

Change 971989 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/deployment-charts@master] eventgate-main - set prestop_sleep and termination timeouts

https://gerrit.wikimedia.org/r/971989

Change 971989 merged by jenkins-bot:

[operations/deployment-charts@master] eventgate-main - set prestop_sleep and termination timeouts

https://gerrit.wikimedia.org/r/971989

Mentioned in SAL (#wikimedia-operations) [2023-11-06T16:41:11Z] <ottomata> beginning deployments of eventgate clusters: mesh and cert chart updates, as well as sleep timeout values for graceful envoy+eventgate container termination - T349823 T300033 T346638

Change 972000 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/deployment-charts@master] eventgate chart - set default cpu limits to 1

https://gerrit.wikimedia.org/r/972000

Change 972000 merged by jenkins-bot:

[operations/deployment-charts@master] eventgate chart - set default cpu limits to 1

https://gerrit.wikimedia.org/r/972000

Okay, I just applied the prestop_sleep settings to all eventgates. eventgate-main sleeps the longest, a little over 1 minute.

I don't think we have enough evidence yet to call this a huge success, but [[ https://logstash.wikimedia.org/goto/97c528b5b468c1e86f5e285ee4937eb1 | I did not see any JobQueue related failures ]] during my deployment.

Did another eventgate-main deployment just now. I don't see any flood of delivery failures.