In T326002 ([Event Platform] eventgate-wikimedia occasionally fails to produce events due to stream config fetch errors), we resolved event production failures caused by two bugs in stream config fetching:
- Dynamic stream config lookup (used by eventgate-analytics-external) was buggy, and also tried to fetch with too many stream query params.
- If stream config fetching failed, there was no retry.
Fixes for these bugs have been deployed. While verifying that they fixed our stream config fetching errors, I noticed similar-looking errors around fetching schemas: "Failed loading schema for ..."
We should have better retries when fetching schemas. But even so, I was curious as to why this happened so often.
eventgate uses an Envoy TLS proxy sidecar (the service mesh) to talk to remote services, including schema.discovery.wmnet. I looked at some of the service mesh logs and saw 503s on the schema lookups. 503s are strange here, as the remote schema service is definitely up and responding. So perhaps the Envoy proxy is failing somehow?
To test this, I used nsenter to assume the network namespace of an eventgate-analytics-external production pod in codfw. I then issued a GET for one of the schema fetch failures I've been seeing. I can reproduce the 503, not always but often.
root@kubernetes2041:/home/otto# docker ps | grep eventgate-production-8476f8c84f-nxp47 | grep tls-proxy
root@kubernetes2041:/home/otto# docker top a0938a0b03e5
root@kubernetes2041:/home/otto# nsenter -t 3257653 -n
root@kubernetes2041:/home/otto# curl -I 'http://127.0.0.1:6023/repositories/secondary/jsonschema//analytics/mobile_apps/android_daily_stats/1.0.0'
HTTP/1.1 503 Service Unavailable
I cannot reproduce a 503 on any attempt made directly against the backend servers (schema200[34]), schema.svc.codfw.wmnet, or schema.discovery.wmnet. I only get 503s when I issue this request via the eventgate-production-tls-proxy.
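To put a number on "not always but often", the same request can be repeated in a loop and the status codes tallied. This is only a rough sketch, not something that was run as part of this investigation: the function name tally_statuses and the default of 50 requests are made up for illustration, and it assumes it is run from the same network namespace as the curl above.

```shell
# Hypothetical helper: tally HTTP status codes from repeated HEAD requests.
# Usage: tally_statuses URL [COUNT]   (COUNT defaults to 50, an arbitrary choice)
tally_statuses() {
  url=$1
  count=${2:-50}
  i=0
  while [ "$i" -lt "$count" ]; do
    # -s: silent, -I: HEAD request, -o /dev/null: discard the headers,
    # -w: print only the status code (curl prints 000 if no HTTP response arrived).
    # '|| true' keeps the loop going when curl itself exits non-zero.
    curl -s -I -o /dev/null -w '%{http_code}\n' "$url" || true
    i=$((i + 1))
  done | sort | uniq -c
}
```

For example, `tally_statuses 'http://127.0.0.1:6023/repositories/secondary/jsonschema//analytics/mobile_apps/android_daily_stats/1.0.0' 100` would print one line per distinct status code with its count. Pointing the same function at the backend servers or the discovery endpoint, with whatever URL was used for those direct checks, gives a like-for-like comparison.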
The error message returned to eventgate is:
upstream connect error or disconnect/reset before headers. retried and the latest reset reason: connection failure
Why do requests to this service proxy fail occasionally?