Page MenuHomePhabricator

Upstream connect error on API Portal wiki
Closed, ResolvedPublic

Description

I intermittently get an almost blank page that has the following message at the top when visiting a page on the production API Portal wiki:

{"httpCode":503,"httpReason":"upstream connect error or disconnect/reset before headers. reset reason: connection termination"}

Event Timeline

I've gotten the same error when trying to access the main page a few times. It usually resolves itself on refresh.

Some logs:

[2020-09-10 20:01:03.438][69][debug][pool] [source/common/conn_pool/conn_pool_base.cc:121] [C13338] using existing connection
[2020-09-10 20:01:03.438][69][debug][pool] [source/common/conn_pool/conn_pool_base.cc:73] [C13338] creating stream
[2020-09-10 20:01:03.438][69][debug][router] [source/common/router/upstream_request.cc:342] [C13917][S16695905485085360567] pool ready
[2020-09-10 20:01:03.439][69][debug][connection] [source/extensions/transport_sockets/tls/ssl_socket.cc:225] [C13338] 
[2020-09-10 20:01:03.439][69][debug][connection] [source/common/network/connection_impl.cc:584] [C13338] remote close
[2020-09-10 20:01:03.439][69][debug][connection] [source/common/network/connection_impl.cc:208] [C13338] closing socket: 0
[2020-09-10 20:01:03.439][69][debug][connection] [source/extensions/transport_sockets/tls/ssl_socket.cc:297] [C13338] SSL shutdown: rc=-1
[2020-09-10 20:01:03.439][69][debug][connection] [source/extensions/transport_sockets/tls/ssl_socket.cc:225] [C13338] 
[2020-09-10 20:01:03.439][69][debug][client] [source/common/http/codec_client.cc:92] [C13338] disconnect. resetting 1 pending requests
[2020-09-10 20:01:03.439][69][debug][client] [source/common/http/codec_client.cc:115] [C13338] request reset
[2020-09-10 20:01:03.439][69][debug][router] [source/common/router/router.cc:1022] [C13917][S16695905485085360567] upstream reset: reset reason connection termination
[2020-09-10 20:01:03.439][69][debug][http] [source/common/http/conn_manager_impl.cc:1520] [C13917][S16695905485085360567] Sending local reply with details upstream_reset_before_response_started{connection termination}
[2020-09-10 20:01:03.439][69][debug][http] [source/common/http/conn_manager_impl.cc:1777] [C13917][S16695905485085360567] encoding headers via codec (end_stream=false):
':status', '503'
'content-length', '128'
'content-type', 'application/json'
'date', 'Thu, 10 Sep 2020 20:01:03 GMT'
'server', 'envoy'

So, envoy is trying to reuse connection to MW. We can set the max_requests_per_connection to 1 for requests going to MW clusters. @Joe is there a way to reuse connections to MW? I believe not...

Change 627322 had a related patch set uploaded (by Hnowlan; owner: Hnowlan):
[operations/deployment-charts@master] api-gateway: disable connection reuse

https://gerrit.wikimedia.org/r/627322

I don't understand SSL shutdown: rc=-1 - doesn't look like anyone has ever hit this before. Weirdly ssl_socket.cc:225 _should_ be reporting details about an error https://github.com/envoyproxy/envoy/blob/v1.15.0/source/extensions/transport_sockets/tls/ssl_socket.cc#L225

That error in turn comes from openssl's SSL_shutdown failing:

-1  The shutdown was not successful because a fatal error occurred
      either at the protocol level or a connection failure occurred. It
      can also occur if action is need to continue the operation for non-
      blocking BIOs.  Call SSL_get_error(3) with the return value ret to
      find out the reason.

Getting a consistent reproduction on this would be really handy :(

My guess is that the empty message happens because this loop never gets an iteration, hence the empty string. https://github.com/envoyproxy/envoy/blob/v1.15.0/source/extensions/transport_sockets/tls/ssl_socket.cc#L207

Change 627559 had a related patch set uploaded (by Hnowlan; owner: Hnowlan):
[operations/deployment-charts@master] api-gateway: limit persistent connections to attempt to address SSL issues.

https://gerrit.wikimedia.org/r/627559

Change 627322 abandoned by Hnowlan:
[operations/deployment-charts@master] api-gateway: disable connection reuse

Reason:

https://gerrit.wikimedia.org/r/627322

Change 627559 merged by jenkins-bot:
[operations/deployment-charts@master] api-gateway: limit persistent connections to attempt to address SSL issues.

https://gerrit.wikimedia.org/r/627559

I'm still seeing this error in the logs, which is concerning. However, less concerning is not seeing any more active connections being closed: https://grafana.wikimedia.org/d/UOH-5IDMz/api-gateway?viewPanel=38&orgId=1&from=1600161750381&to=1600248150381&refresh=30s&var-datasource=codfw%20prometheus%2Fk8s

I'm not sure if this is resolved

Based on the graph linked above and the lack of reports I'm going to close this for now - if anyone sees this again please reopen and let me know a rough timestamp!

Thanks, Hugh! I haven't seen the error recently, but I'll let you know if it comes up again.