Gracefully handle pod termination in mw-on-k8s
Closed, Resolved (Public)

Description

Problem detection:

When httpbb checks run concurrently with a deployment, some connections are dropped:

Mar 09 10:24:20 cumin1001 systemd[1]: Starting Run httpbb appserver tests hourly on Kubernetes....
Mar 09 10:24:54 cumin1001 sh[3221221]: Sending to mw-web.discovery.wmnet...
Mar 09 10:24:54 cumin1001 sh[3221221]: https://checkuser.wikimedia.org/wiki/Main_Page (/srv/deployment/httpbb-tests/appserver/test_remnant.yaml:139)
Mar 09 10:24:54 cumin1001 sh[3221221]:     Status code: expected 200, got 503.
Mar 09 10:24:54 cumin1001 sh[3221221]:     Body: expected to contain 'CheckUser Wiki', got 'upstream connect error or disconnect/reset before '... (95 characters total).
Mar 09 10:24:54 cumin1001 sh[3221221]: ===
Mar 09 10:24:54 cumin1001 sh[3221221]: FAIL: 126 requests sent to mw-web.discovery.wmnet. 1 request with failed assertions.

Other failure mode:

May 30 14:31:47 cumin2002 sh[66204]: https://transitionteam.wikimedia.org/wiki/Main_Page (/srv/deployment/httpbb-tests/appserver/test_wikimania_wikimedia.yaml:26)
May 30 14:31:47 cumin2002 sh[66204]:     ERROR: HTTPSConnectionPool(host='mw-web.discovery.wmnet', port=4450): Max retries exceeded with url: /wiki/Main_Page (Caused by SSLError(SSLEOFError(8, 'EOF occurred in violation of protocol (_ssl.c:1123)')))

Investigation and possible solutions

It looks like we are terminating the pod before it has finished processing the requests, and not redirecting to the other pods early enough.

This happens because of the way Kubernetes terminates a pod, as described in https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#pod-termination

The SIGTERM to the pod's containers is sent in parallel with the order to remove the pod from EndpointSlices. Envoy shuts down immediately upon receiving SIGTERM, dropping all connections.

A viable strategy seems to be a preStop hook that drains connections using Envoy's connection draining, combined with a sleep before the php-fpm container shuts down.

The maximum time to termination is set by terminationGracePeriodSeconds, with a one-off 2s extension if the preStop hooks are still running when the grace period expires.
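
To make the timing concrete, here is a minimal Python sketch of the shutdown timeline for a single container. It is a simplified model of the behaviour described in the Kubernetes pod-termination docs, not the kubelet's actual logic, and the names (termination_timeline, TerminationTimeline) are hypothetical. It assumes the grace period starts counting when the pod is deleted, that SIGTERM is sent once the preStop hook returns, and that a hook still running at the deadline gets the 2s extension before the container is killed.

```python
from dataclasses import dataclass

# One-off extension granted when a preStop hook is still running at the deadline.
PRESTOP_EXTENSION_SECONDS = 2.0


@dataclass(frozen=True)
class TerminationTimeline:
    sigterm_at: float  # seconds after deletion at which the container gets SIGTERM
    sigkill_at: float  # latest time at which the container is force-killed


def termination_timeline(grace_period: float, prestop_duration: float) -> TerminationTimeline:
    """Simplified model; all times are seconds since the pod was deleted."""
    if prestop_duration < grace_period:
        # The hook finished in time: SIGTERM follows it, and the container
        # is killed when the grace period runs out if it has not exited.
        return TerminationTimeline(sigterm_at=prestop_duration, sigkill_at=grace_period)
    # The hook used up the whole grace period: apply the 2s extension.
    return TerminationTimeline(
        sigterm_at=grace_period,
        sigkill_at=grace_period + PRESTOP_EXTENSION_SECONDS,
    )
```

For example, termination_timeline(10, 10) gives a kill time of 12s, while termination_timeline(10, 7) sends SIGTERM at 7s and leaves the container 3s to exit on its own.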

Useful docs:
https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#pod-termination
https://www.envoyproxy.io/docs/envoy/latest/intro/arch_overview/operations/draining

Event Timeline

Clement_Goubert added a project: serviceops.
Clement_Goubert moved this task from Incoming to Mediawiki on the serviceops board.
Clement_Goubert renamed this task from httpbb fails requesting mw-web during deployments to httpbb fails requesting mw-on-k8s during deployments. May 30 2023, 3:14 PM
Clement_Goubert raised the priority of this task from Medium to High.
Clement_Goubert updated the task description.
Clement_Goubert updated the task description.
Clement_Goubert added subscribers: Volans, RLazarus.

Change 925776 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/deployment-charts@master] mediawiki: Add terminationGracePeriodSeconds

https://gerrit.wikimedia.org/r/925776

Change 925776 merged by jenkins-bot:

[operations/deployment-charts@master] mediawiki: Handle pod termination gracefully

https://gerrit.wikimedia.org/r/925776

Change 927599 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/deployment-charts@master] mw-debug: Bump envoy version for drain tests

https://gerrit.wikimedia.org/r/927599

Change 927599 merged by jenkins-bot:

[operations/deployment-charts@master] mw-debug: Bump envoy version for drain tests

https://gerrit.wikimedia.org/r/927599

Change 927685 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/deployment-charts@master] mediawiki: Test sleeping before draining in envoy

https://gerrit.wikimedia.org/r/927685

Change 927685 merged by jenkins-bot:

[operations/deployment-charts@master] mediawiki: Test sleeping before draining in envoy

https://gerrit.wikimedia.org/r/927685

Change 927722 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/deployment-charts@master] mediawiki: restore sleep after envoy drain

https://gerrit.wikimedia.org/r/927722

Change 927722 merged by jenkins-bot:

[operations/deployment-charts@master] mediawiki: restore sleep after envoy drain

https://gerrit.wikimedia.org/r/927722

Change 927999 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/deployment-charts@master] mediawiki: Graceful termination

https://gerrit.wikimedia.org/r/927999

Clement_Goubert renamed this task from httpbb fails requesting mw-on-k8s during deployments to Gracefully handle pod termination in mw-on-k8s. Jun 7 2023, 10:48 AM
Clement_Goubert claimed this task.
Clement_Goubert updated the task description.

Currently testing the following timings:

  • envoy: sleep 2s then drain with a 5s timeout, wait for incoming connections to reach 0 then SIGTERM
  • php-fpm: just sleep for 10s
  • terminationGracePeriodSeconds: set to 10s, so with the potential 2s extension for preStop execution the pod will be killed after 12s

Envoy's drain strategy is set to immediate, so it starts discouraging connections as soon as the drain begins.

Those timings will probably go down. The target time for termination is around 5s, while dropping as few connections as possible.
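
The envoy sequence being tested (sleep, start draining, wait for connections to reach 0 or for the timeout) can be sketched in Python as follows. This is only an illustration of the control flow, not the hook used in the chart: drain_then_wait and its parameters are hypothetical, and start_drain and active_connections are stand-ins for whatever calls the real hook makes against Envoy. Once such a hook returns, the kubelet sends SIGTERM to the container.

```python
import time
from typing import Callable


def drain_then_wait(
    start_drain: Callable[[], None],
    active_connections: Callable[[], int],
    pre_sleep: float = 2.0,
    drain_timeout: float = 5.0,
    poll_interval: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Return True if connections reached zero before the timeout, else False."""
    # Give the endpoint removal time to propagate before draining.
    sleep(pre_sleep)
    # Ask the proxy to start discouraging connections.
    start_drain()
    deadline = clock() + drain_timeout
    while clock() < deadline:
        if active_connections() == 0:
            return True
        sleep(poll_interval)
    # Timed out: report whether the last check happens to be clean.
    return active_connections() == 0
```

The sleep and clock parameters exist only so the loop can be exercised with a fake clock; in a real hook the defaults would be used.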

Change 927999 merged by jenkins-bot:

[operations/deployment-charts@master] mediawiki: Graceful termination

https://gerrit.wikimedia.org/r/927999

Change 928475 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/deployment-charts@master] mediawiki: Continue deployment downtime tests

https://gerrit.wikimedia.org/r/928475

Change 928475 merged by jenkins-bot:

[operations/deployment-charts@master] mediawiki: Continue deployment downtime tests

https://gerrit.wikimedia.org/r/928475

Change 928537 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/deployment-charts@master] mw-debug: Change sleep debug to 4s

https://gerrit.wikimedia.org/r/928537

Change 928537 merged by jenkins-bot:

[operations/deployment-charts@master] mw-debug: Change sleep debug to 4s

https://gerrit.wikimedia.org/r/928537

Change 928544 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/deployment-charts@master] mediawiki: Add redeployment annotation

https://gerrit.wikimedia.org/r/928544

Change 928544 merged by jenkins-bot:

[operations/deployment-charts@master] mediawiki: Add redeployment annotation

https://gerrit.wikimedia.org/r/928544

Change 928552 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/deployment-charts@master] mediawiki: Allow to choose between drain and sleep

https://gerrit.wikimedia.org/r/928552

Change 928552 merged by jenkins-bot:

[operations/deployment-charts@master] mediawiki: Allow to choose between drain and sleep

https://gerrit.wikimedia.org/r/928552

Change 928791 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/deployment-charts@master] Backport preStop sleep and draining changes

https://gerrit.wikimedia.org/r/928791

Change 929678 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/puppet@production] kubernetes: Bump envoy image version to 1.18.3-2-s2

https://gerrit.wikimedia.org/r/929678

Change 929678 merged by Clément Goubert:

[operations/puppet@production] kubernetes: Bump envoy image version to 1.18.3-2-s2

https://gerrit.wikimedia.org/r/929678

Change 929706 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/deployment-charts@master] mediawiki: Gracefully handle termination

https://gerrit.wikimedia.org/r/929706

Change 929706 merged by jenkins-bot:

[operations/deployment-charts@master] mediawiki: Gracefully handle termination

https://gerrit.wikimedia.org/r/929706

Mentioned in SAL (#wikimedia-operations) [2023-06-15T15:12:28Z] <claime> Deploying new mediawiki chart: Gracefully handle termination - T331609

In the end, we went with simple sleep calls and only left draining as an option if we want it.

  • envoy sleeps for 7 seconds in order to serve even the longest POST requests our appservers get (p99 spikes are around 4s)
  • all the other containers sleep for 8 seconds so envoy is the first to quit and we don't serve 503s or lose metrics.

These timings can be fine-tuned as needed through the various prestop_sleep configuration options.
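
When tuning these values, the ordering constraints described above can be checked with a small Python sketch like the one below. The function name and parameters are hypothetical and do not correspond to the chart's actual configuration keys. The grace period is passed in as an assumption, since the final terminationGracePeriodSeconds value is not stated here.

```python
from typing import Dict, List


def check_prestop_sleeps(
    envoy_sleep: float,
    other_sleeps: Dict[str, float],
    longest_request: float,
    grace_period: float,
) -> List[str]:
    """Return a list of problems; an empty list means the timings are consistent."""
    problems = []
    # envoy must stay up long enough to finish the slowest in-flight requests.
    if envoy_sleep <= longest_request:
        problems.append(
            f"envoy sleeps {envoy_sleep}s, not longer than the slowest request ({longest_request}s)"
        )
    # Every other container must outlast envoy so envoy is the first to quit.
    for name, sleep in other_sleeps.items():
        if sleep <= envoy_sleep:
            problems.append(f"{name} sleeps {sleep}s and may exit before envoy ({envoy_sleep}s)")
    # A sleep longer than the grace period would be cut short.
    for name, sleep in {"envoy": envoy_sleep, **other_sleeps}.items():
        if sleep > grace_period:
            problems.append(f"{name} sleeps {sleep}s, longer than the grace period ({grace_period}s)")
    return problems
```

With the values from this comment and an assumed 10s grace period, check_prestop_sleeps(7, {"php-fpm": 8}, 4, 10) reports no problems.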

Deployment testing shows no more dropped requests, resolving.

I noticed there are still differences in the mw-debug config between codfw and eqiad originating from this task (https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/928475/3/helmfile.d/services/mw-debug/values-codfw.yaml). What is the desired state here?

Change 968959 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/deployment-charts@master] mw-debug: Revert envoy draining tests

https://gerrit.wikimedia.org/r/968959

Change 968959 merged by jenkins-bot:

[operations/deployment-charts@master] mw-debug: Revert envoy draining tests

https://gerrit.wikimedia.org/r/968959

Thanks for the patch @Clement_Goubert, I just deployed it.