
[Event Platform] eventgate-wikimedia occasionally fails to produce events due to schema fetch errors
Closed, Resolved · Public

Description

In T326002: [Event Platform] eventgate-wikimedia occasionally fails to produce events due to stream config fetch errors, we resolved event production failures caused by a couple of bugs related to stream config fetching:

  • dynamic stream config lookup (used by eventgate-analytics-external) was buggy, and also tried to fetch with too many stream query params
  • If stream config fetching failed, there was no retrying

Fixes for these bugs have been deployed. While verifying that they resolved our stream config fetching errors, I noticed similar-looking errors around fetching schemas: "Failed loading schema for ..."

We should have better retries when fetching schemas. But even so, I was curious as to why this happened so often.

eventgate uses an envoy TLS proxy sidecar (our service mesh) to talk to other remote services, including schema.discovery.wmnet. I looked at some of the service mesh logs and saw these 503s while looking up schemas. 503s are strange here, as the remote schema service is definitely up and responding. So, perhaps the envoy proxy is failing somehow?

To test this, I used nsenter to enter the network namespace of an eventgate-analytics-external production pod in codfw, then issued a GET for one of the schema fetch failures I've been seeing. I can reproduce the 503: not always, but often.

root@kubernetes2041:/home/otto# docker ps | grep eventgate-production-8476f8c84f-nxp47 | grep tls-proxy
root@kubernetes2041:/home/otto# docker top a0938a0b03e5
root@kubernetes2041:/home/otto# nsenter -t 3257653 -n
root@kubernetes2041:/home/otto# curl -I  'http://127.0.0.1:6023/repositories/secondary/jsonschema//analytics/mobile_apps/android_daily_stats/1.0.0'
HTTP/1.1 503 Service Unavailable

I cannot reproduce a 503 on any requests made directly to the backend servers (schema200[34]), schema.svc.codfw.wmnet, or schema.discovery.wmnet. Only when I issue this request via the eventgate-production-tls-proxy do I get 503s.

The error message returned to eventgate is:

upstream connect error or disconnect/reset before headers. retried and the latest reset reason: connection failure

Why do requests to this service proxy fail occasionally?

Event Timeline

Ottomata updated the task description.

I also see these 503s from the envoy schema proxy on :6023 in eventstreams-internal. They are more visible there because eventstreams-internal is configured to re-request its stream configs and schemas every 5 minutes.

I turned on envoy debug logging on an eventstreams-internal eventstreams-main-tls-proxy container.

I see a burst of 503s when the service refetches schemas. It does not 503 on all schema requests during this refetch, only a handful of them.

Here's the request and response debug log for one of those requests:

[2023-11-08T15:23:21.452Z] "GET /repositories/secondary/jsonschema/analytics/mediawiki/searchpreview/latest HTTP/1.1" 503 UC 0 95 0 - "-" "-" "059f6a8f-c6d4-481b-91ae-1b7ec8ea3e5d" "127.0.0.1:6023" "10.2.1.43:443"
...
[2023-11-08 15:23:21.452][31][debug][http] [source/common/http/conn_manager_impl.cc:909] [C150661][S18441316538769435508] request headers complete (end_stream=true):
':authority', '127.0.0.1:6023'
':path', '/repositories/secondary/jsonschema/analytics/mediawiki/searchpreview/latest'
':method', 'GET'
'accept-encoding', 'gzip, deflate'
'content-length', '0'
'connection', 'keep-alive'

[2023-11-08 15:23:21.452][31][debug][http] [source/common/http/filter_manager.cc:790] [C150661][S18441316538769435508] request end stream

[2023-11-08 15:23:21.453][31][debug][http] [source/common/http/filter_manager.cc:901] [C150661][S18441316538769435508] Sending local reply with details upstream_reset_before_response_started{connection_termination}
[2023-11-08 15:23:21.453][31][debug][http] [source/common/http/conn_manager_impl.cc:1533] [C150661][S18441316538769435508] encoding headers via codec (end_stream=false):
':status', '503'
'content-length', '95'
'content-type', 'text/plain'
'date', 'Wed, 08 Nov 2023 15:23:21 GMT'
'server', 'envoy'

I'm not certain, but I'm pretty sure this is the corresponding nginx access.log entry on the backend (schema2004.codfw.wmnet):

10.192.48.162 - - [08/Nov/2023:15:23:21 +0000] "GET /repositories/secondary/jsonschema/analytics/mediawiki/searchpreview/latest HTTP/1.1" 200 4590 "-" "-"

So, from the backend's perspective, this was a 200 response. It looks like something between the backend and the app's envoy TLS proxy is prematurely terminating the connection?

Some googling suggests this could be caused by a keepalive timeout mismatch between envoy and the upstream backend.

/etc/nginx/nginx.conf has keepalive_timeout 65 on the backend (the default should be 75; I don't know where this 65 comes from).

IIUC, the relevant envoy setting is common_http_protocol_options.idle_timeout, which defaults to 1 hour. It looks like we configure this setting in puppet hieradata via the profile::services_proxy::envoy::listeners timeout setting. The schema listener does not configure this, so I guess its idle_timeout is 1h?
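
For reference, a minimal sketch of where that option lives in a plain envoy v3 cluster definition (cluster name, endpoint, and surrounding structure are illustrative, not our actual generated service-mesh config):

clusters:
  - name: schema                      # illustrative cluster for the schema service
    type: STRICT_DNS
    connect_timeout: 1s
    common_http_protocol_options:
      idle_timeout: 3600s             # envoy's default when unset is 1h;
                                      # nginx on the backend closes idle
                                      # keepalive connections after 65s
    load_assignment:
      cluster_name: schema
      endpoints:
        - lb_endpoints:
            - endpoint:
                address:
                  socket_address: { address: schema.discovery.wmnet, port_value: 443 }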

I don't think this adds up, though: the upstream is disconnected during a burst of requests, which means the connection is not idle at all and shouldn't be terminated. Plus, we don't see any logs on the upstream about it terminating the request, so something in the middle must be doing this?

Change 973186 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/puppet@production] schema.svc - add keeplive_timeout param, default to 1h

https://gerrit.wikimedia.org/r/973186

Change 973188 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/puppet@production] Set envoy schema proxy keepalive timeout to 10s

https://gerrit.wikimedia.org/r/973188

Change 973186 abandoned by Ottomata:

[operations/puppet@production] schema.svc - add keeplive_timeout param, default to 1h

Reason:

https://gerrit.wikimedia.org/r/c/operations/puppet/+/973188 should be better

https://gerrit.wikimedia.org/r/973186

Change 973188 merged by Ottomata:

[operations/puppet@production] Set envoy schema proxy keepalive timeout to 10s

https://gerrit.wikimedia.org/r/973188

@Joe's suggestion:

my suggestion with envoy is to set its idle timeout well below the limit of the keepalive on its upstream. so set it to 10 seconds, say
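
In other words (values taken from this task; the snippet is a sketch, not our actual config): keep envoy's upstream idle timeout well below the backend's keepalive, so envoy retires idle connections before nginx closes them out from under it.

# backend nginx (schema200x):  keepalive_timeout 65;
# envoy schema cluster:
common_http_protocol_options:
  idle_timeout: 10s                   # well under nginx's 65s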

As far as I can tell, that suggestion worked. eventstreams-internal's periodic 503s are gone. I just redeployed eventgate-analytics-external and saw no schema svc 503s.

Hooray!

Ah, nope. I spoke too soon:

eventstreams: https://grafana.wikimedia.org/goto/3SjKIj4Ik?orgId=1
eventgate: https://grafana.wikimedia.org/goto/1Gj2Ij4Sk?orgId=1

The keepalive timeout no longer seems related to me. eventstreams-internal is re-requesting schemas every 5 minutes. After I redeployed with the envoy -> schema.svc keepalive_timeout set to 10s, it took about 23 minutes before 503s started happening again. After I redeployed eventgate with the same change, it took about 18 minutes before 503s started happening again.

Still stumped...

Change 973223 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/puppet@production] schema service proxy - Add retry on 5xx

https://gerrit.wikimedia.org/r/973223

Change 973223 merged by Ottomata:

[operations/puppet@production] schema service proxy - Add retry on 5xx

https://gerrit.wikimedia.org/r/973223

Added a retry_policy for 5xx, but still getting 503s.
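
For context, an envoy route-level retry policy for this looks roughly like the following (num_retries and per_try_timeout values are illustrative, not necessarily what the patch set):

route:
  cluster: schema                     # cluster name assumed for illustration
  retry_policy:
    retry_on: "5xx"                   # also covers connect-failure and refused-stream
    num_retries: 2
    per_try_timeout: 2s

Given that the original error message already said "retried and the latest reset reason: connection failure", a retry can evidently land on another stale pooled connection, so retries alone not fixing this seems at least consistent.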

Change 973234 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/puppet@production] schema service proxy - set max_requests_per_connection: 1 to avoid upstream disconnect

https://gerrit.wikimedia.org/r/973234

Change 973234 merged by Ottomata:

[operations/puppet@production] schema service proxy - set keepalive to 10s

https://gerrit.wikimedia.org/r/973234

Okay, it turns out I set the wrong value for the keepalive timeout on envoy in our service_proxies settings. Fixed in https://gerrit.wikimedia.org/r/973234, deployed to eventstreams-internal yesterday, and there have been no 503s since!
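
For the record, a hypothetical sketch of the shape of that change (key names and structure are guesses based on the patch title, not the literal contents of https://gerrit.wikimedia.org/r/973234):

profile::services_proxy::envoy::listeners:
  - name: schema
    port: 6023
    keepalive: "10s"                  # envoy-side upstream keepalive/idle timeout,
                                      # kept well below nginx's keepalive_timeout 65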

Just deployed to eventgate-analytics-external, and unlike the previous deploys, there are no huge spikes of 503s! YESSSSSS

Let's watch https://logstash.wikimedia.org/goto/cbd123c70c83e0dad4a522b00bceaed8 for 503s over the weekend, and also watch for any new produce_canary_events email alerts. I suspect/hope they will be gone!