== Problem detection ==
When httpbb checks run concurrently with a deployment, we drop some connections:
```counterexample
Mar 09 10:24:20 cumin1001 systemd[1]: Starting Run httpbb appserver tests hourly on Kubernetes....
Mar 09 10:24:54 cumin1001 sh[3221221]: Sending to mw-web.discovery.wmnet...
Mar 09 10:24:54 cumin1001 sh[3221221]: https://checkuser.wikimedia.org/wiki/Main_Page (/srv/deployment/httpbb-tests/appserver/test_remnant.yaml:139)
Mar 09 10:24:54 cumin1001 sh[3221221]: Status code: expected 200, got 503.
Mar 09 10:24:54 cumin1001 sh[3221221]: Body: expected to contain 'CheckUser Wiki', got 'upstream connect error or disconnect/reset before '... (95 characters total).
Mar 09 10:24:54 cumin1001 sh[3221221]: ===
Mar 09 10:24:54 cumin1001 sh[3221221]: FAIL: 126 requests sent to mw-web.discovery.wmnet. 1 request with failed assertions.
```
A second failure mode:
```counterexample
May 30 14:31:47 cumin2002 sh[66204]: https://transitionteam.wikimedia.org/wiki/Main_Page (/srv/deployment/httpbb-tests/appserver/test_wikimania_wikimedia.yaml:26)
May 30 14:31:47 cumin2002 sh[66204]: ERROR: HTTPSConnectionPool(host='mw-web.discovery.wmnet', port=4450): Max retries exceeded with url: /wiki/Main_Page (Caused by SSLError(SSLEOFError(8, 'EOF occurred in violation of protocol (_ssl.c:1123)')))
```
== Investigation and possible solutions ==
It looks like we are terminating the pod before it has finished processing its in-flight requests, and not redirecting new requests to the other pods early enough.
This happens because of the way Kubernetes handles pod termination, as described in the [[ https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#pod-termination | Kubernetes pod termination docs ]].
The SIGTERM to the pod's containers is sent in parallel with the order to remove the pod from EndpointSlices. Envoy shuts down immediately upon receiving SIGTERM, dropping all connections.
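To make the race concrete, here is a minimal Python sketch of the timeline. It is a toy model, not a measurement: `endpoint_removal_delay` (how long it takes until clients stop routing to the pod) and the numbers used are assumptions chosen purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class Termination:
    """Timings in seconds, counted from the moment the pod is marked Terminating."""

    # Assumption: time until every client has stopped routing to this pod.
    endpoint_removal_delay: float
    # Time a preStop hook delays SIGTERM (0 when there is no hook).
    prestop_delay: float


def drop_window(t: Termination) -> float:
    """Seconds during which traffic can still reach a pod whose proxy has exited."""
    # In this model Envoy exits as soon as SIGTERM arrives, i.e. right after preStop.
    proxy_exit = t.prestop_delay
    return max(0.0, t.endpoint_removal_delay - proxy_exit)


# Without a hook, requests routed during the propagation delay hit a dead proxy.
print(drop_window(Termination(endpoint_removal_delay=3.0, prestop_delay=0.0)))  # 3.0
# Delaying SIGTERM past the propagation delay closes the window.
print(drop_window(Termination(endpoint_removal_delay=3.0, prestop_delay=5.0)))  # 0.0
```

Any non-zero window is enough to explain the occasional 503 or TLS EOF seen in the httpbb runs.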
Using a preStop hook that drains connections via [[ https://www.envoyproxy.io/docs/envoy/latest/intro/arch_overview/operations/draining | Envoy connection draining ]], combined with a sleep before the php-fpm container shuts down, seems like a viable strategy.
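A rough Python sketch of what such a preStop step could do is below. It is only an illustration under stated assumptions: the admin address `127.0.0.1:9901` and the function name `drain_then_wait` are hypothetical, and the two admin paths (`/healthcheck/fail` and `/drain_listeners?graceful`) are taken from the Envoy admin interface documentation and would need to be checked against the Envoy version we actually deploy.

```python
import time
import urllib.request

# Assumption: where the Envoy admin interface listens inside the pod.
ENVOY_ADMIN = "http://127.0.0.1:9901"
# Fail health checks first, then ask the listeners to drain gracefully.
DRAIN_PATHS = ("/healthcheck/fail", "/drain_listeners?graceful")


def post(url, timeout=2.0):
    """Send an empty POST and return the HTTP status code."""
    req = urllib.request.Request(url, data=b"", method="POST")
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.status


def drain_then_wait(drain_seconds, post_fn=post, sleep_fn=time.sleep, admin=ENVOY_ADMIN):
    """Ask Envoy to drain, then block so that SIGTERM is delayed by drain_seconds."""
    steps = []
    for path in DRAIN_PATHS:
        try:
            post_fn(admin + path)
            steps.append(path)
        except OSError:
            # URLError and socket timeouts are OSError subclasses; still sleep
            # afterwards so in-flight requests get a chance to finish.
            steps.append(path + " (failed)")
    sleep_fn(drain_seconds)
    return steps
```

The hook would call something like `drain_then_wait(10)`; the same two steps could just as well be a short shell command using curl. The php-fpm container would only need the sleep part.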
The maximum time to termination is set by `terminationGracePeriodSeconds`, with a one-off 2s extension if the preStop hooks are still running when the grace period expires.
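As a sanity check on the timing budget, the following Python sketch shows the arithmetic. It is a simplification under assumptions: the function names and all example numbers are made up, and it treats the 2s extension as applying only when preStop has outlasted the grace period.

```python
# One-off extension granted when preStop is still running at expiry.
PRESTOP_EXTENSION_SECONDS = 2


def fits_grace_period(prestop_seconds, shutdown_seconds, grace_period_seconds):
    """True if the preStop hook plus the container's own shutdown fit in the budget."""
    return prestop_seconds + shutdown_seconds <= grace_period_seconds


def hard_kill_deadline(prestop_seconds, grace_period_seconds):
    """Seconds after termination starts at which containers are force-killed."""
    if prestop_seconds > grace_period_seconds:
        return grace_period_seconds + PRESTOP_EXTENSION_SECONDS
    return grace_period_seconds


# Hypothetical values: 15s of draining, 10s for shutdown, 30s grace period.
print(fits_grace_period(15, 10, 30))  # True
# A 25s preStop leaves too little time for a 10s shutdown.
print(fits_grace_period(25, 10, 30))  # False
```

Whatever drain and sleep durations we pick, their sum has to stay below `terminationGracePeriodSeconds`, otherwise the containers are killed mid-drain and connections are dropped anyway.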
Useful docs:
https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#pod-termination
https://www.envoyproxy.io/docs/envoy/latest/intro/arch_overview/operations/draining