According to the Logstash log of 5xx responses served by Varnish, we serve about one 5xx per minute for intake-logging.wm.o requests.
They have a TTFB of 60 seconds or very close to it, which is consistent with some sort of timeout while contacting the backend service.
According to EventGate-exported metrics, EventGate itself is not serving any 5xx responses.
So the issue must be at a layer in between those two within the onion of production.
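One way to sanity-check the "looks like a timeout" reading is to measure how many of the 5xx responses have a TTFB clustered around 60 seconds. The following is only a rough sketch: the record shape (`status`, `ttfb`), the one-second tolerance, and the sample values are all made up for illustration and are not taken from the actual Logstash data.

```python
def near_timeout_fraction(records, timeout_s=60.0, tolerance_s=1.0):
    """Return the fraction of 5xx records whose TTFB is within
    tolerance_s of timeout_s. Field names are hypothetical."""
    errors = [r for r in records if 500 <= r["status"] <= 599]
    if not errors:
        return 0.0
    near = [r for r in errors if abs(r["ttfb"] - timeout_s) <= tolerance_s]
    return len(near) / len(errors)


# Invented sample records, for illustration only.
sample = [
    {"status": 503, "ttfb": 60.01},
    {"status": 503, "ttfb": 59.98},
    {"status": 503, "ttfb": 60.4},
    {"status": 503, "ttfb": 0.3},
    {"status": 200, "ttfb": 0.12},
]

print(near_timeout_fraction(sample))
```

A fraction close to 1 would support the timeout theory; a wide spread of TTFB values would point at something else.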
I thought that a small level of CPU throttling on the eventgate-logging-external-tls-proxy k8s job might be the issue, but I have mostly fixed that and the issue persists. (My theory was that the effect of CPU throttling is hard to model mentally, and the throttling was happening at approximately the 'right' location within the onion.)
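For reference, a rough way to quantify "a small level of CPU throttling" is the share of CFS scheduling periods in which the container was throttled, computed from the `nr_periods` and `nr_throttled` counters in the cgroup `cpu.stat` file. This is a sketch under assumptions: the sample text below is invented, and how the file is obtained from the pod is left out.

```python
def throttled_ratio(cpu_stat_text):
    """Return nr_throttled / nr_periods parsed from cpu.stat contents."""
    stats = {}
    for line in cpu_stat_text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            stats[parts[0]] = int(parts[1])
    periods = stats.get("nr_periods", 0)
    if periods == 0:
        return 0.0
    return stats.get("nr_throttled", 0) / periods


# Invented counter values, for illustration only.
sample_cpu_stat = "nr_periods 1000\nnr_throttled 25\n"

print(throttled_ratio(sample_cpu_stat))
```

Because the counters are cumulative, comparing two readings taken some time apart gives the throttling rate for that interval rather than since container start.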