
Lots of "EventBus: Unable to deliver all events"
Closed, Resolved · Public

Description

Error
Unable to deliver all events: 503: Service Unavailable
raw_events (1)
[{"$schema":"/mediawiki/api/request/0.0.1","meta":{"request_id":"XmmUqQpAIDcAAI3IAl4AAADB","id":"8af294a9-00e1-4826-bd8b-82ede83893d1","dt":"2020-03-12T01:47:21Z","domain":"it.wikipedia.org","stream":"mediawiki.api-request"},"http":{"method":"GET","client_ip":" … ,"request_headers":{ …
raw_events (2)
[{"$schema":"/mediawiki/cirrussearch/request/0.0.1","meta":{"request_id":"XmmUkApAICAAAGWZbPUAAAAW","id":"18cfd929-1f76-45b6-a6e0-1a857f67d735","dt":"2020-03-12T01:47:55Z","domain":"ru.wikipedia.org","stream":"mediawiki.cirrussearch-request"},"http":{"method":"POST","client_ip":" … ","has_cookies":false,"request_headers":{"user-agent":"ChangePropagation-JobQueue/WMF"}},"database":"ruwiki","mediawiki_host":"mw1338","search_id":" … ","source":"cli","identity":" … ","request_time_ms":3616,"all_elasticsearch_requests_cached":false,"elasticsearch_requests":[{"query_type":"send_data_write","indices":["ruwiki_content_1567039279"],"request_time_ms":84,"suggestion_requested":false,"cached":false}]}]

See https://logstash.wikimedia.org/goto/e6182add27a0f979277fc8c403feed94

Impact

TBD.

I suspect this means EventBus events sent to Kafka (e.g. api.log) will be missing some entries. I don't see any obvious sign that these were recoverable.

Notes

It looks like, while this was seen from time to time before, it started happening much more consistently on wmf.23. Even though we are only on group1, more than 97% of the errors in the past 5 days are from wmf.23:

Screenshot 2020-03-12 at 01.50.15.png (89 KB)

Event Timeline

Restricted Application added subscribers: Liuxinyu970226, Aklapper.

How is the top source enwiki, which is running wmf.22?

This might be related to @Joe's and @akosiaris's recent work to enable the TLS envoy proxy for MW HTTPS requests to eventgate. This was first attempted on Monday but reverted, and then finally enabled yesterday. See T244843: Create a service-to-service proxy for handling HTTP calls from services to other entities.

This is only enabled for eventgate-analytics, to which API requests are sent. It is not yet enabled for eventgate-main.

Yes, @Ottomata is probably right; it looks like envoy records these errors as "UC", which means the upstream service closed the connection.

Interestingly, about 40% of the ongoing errors come from monitoring, i.e. from requests made by pybal to the server. This doesn't make much sense to me.

It also happens in codfw, where things should not be overloaded at all.
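
For reference, "UC" is one of envoy's access-log response flags, rendered by the %RESPONSE_FLAGS% command operator: UC means upstream connection termination and UF means upstream connection failure. A minimal, illustrative access-log configuration that surfaces these flags (the file path and format string here are just an example, not our production config):

access_log:
  - name: envoy.file_access_log
    config:
      path: /var/log/envoy/access.log        # illustrative path
      # %RESPONSE_FLAGS% renders codes such as UC (upstream connection
      # termination) or UF (upstream connection failure).
      format: "[%START_TIME%] \"%REQ(:METHOD)% %REQ(:PATH)%\" %RESPONSE_CODE% %RESPONSE_FLAGS% %UPSTREAM_HOST%\n"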

akosiaris added a project: serviceops.
WDoranWMF added a subscriber: WDoranWMF.

Removing CPT for now; please add us back if we're needed!

How is the top source enwiki, which is running wmf.22?

I've misread the pie chart. I suspect this was caused by a lack of pie on my end. Sorry about that.

Interestingly, about 40% of the ongoing errors come from monitoring, i.e. from requests made by pybal to the server. This doesn't make much sense to me.

To clarify, this is pybal monitoring the MW API, not eventgate itself. I was confused by this. :)

Change 579323 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/deployment-charts@master] eventgate-analytics: Bump envoy to latest version

https://gerrit.wikimedia.org/r/579323

Change 579323 merged by Alexandros Kosiaris:
[operations/deployment-charts@master] eventgate-analytics: Bump envoy to latest version

https://gerrit.wikimedia.org/r/579323

Change 579324 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/deployment-charts@master] eventstreams: Drop canaries for a while

https://gerrit.wikimedia.org/r/579324

Change 579324 merged by jenkins-bot:
[operations/deployment-charts@master] eventstreams: Drop canaries for a while

https://gerrit.wikimedia.org/r/579324

akosiaris lowered the priority of this task from High to Medium. Mar 12 2020, 5:16 PM

The changes above (upgrading to the latest envoy and dropping canaries) seem to have solved this. But please don't resolve yet; we'd like to first revert the canary disabling and make sure the culprit was the old envoy version.

The changes above (upgrading to the latest envoy and dropping canaries) seem to have solved this. But please don't resolve yet; we'd like to first revert the canary disabling and make sure the culprit was the old envoy version.

Actually, we only seem to have shaved off an order of magnitude or so.

Change 579343 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] envoy: use non TLS enabled eventgate-analytics

https://gerrit.wikimedia.org/r/579343

Change 579345 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/puppet@production] services_proxy: use http endpoint for eventgate-analytics

https://gerrit.wikimedia.org/r/579345

Change 579343 merged by Alexandros Kosiaris:
[operations/puppet@production] envoy: use non TLS enabled eventgate-analytics

https://gerrit.wikimedia.org/r/579343

Change 579345 abandoned by Alexandros Kosiaris:
services_proxy: use http endpoint for eventgate-analytics

Reason:
done in https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/579343/ in a slightly less efficient way

https://gerrit.wikimedia.org/r/579345

akosiaris raised the priority of this task from Medium to High. Mar 12 2020, 6:09 PM

By now, it seems neither of the two actions above has had any effect. We are back to the levels from before the envoy upgrade and the disabling of canaries.

Change 579345 restored by Giuseppe Lavagetto:
services_proxy: use http endpoint for eventgate-analytics

Reason:
Let's fix this one.

https://gerrit.wikimedia.org/r/579345

I think I've nailed down the issue. In the cluster definition in envoy, I commented out:

max_requests_per_connection: 1000

on one appserver, and I already saw no errors in 12 minutes, which seems like a good result. I'll leave the experiment running for some time on that machine, and if the results are good tomorrow morning I'll apply that change more widely. The error rate is low enough (under 0.05%) that I think we can tolerate spending one night without a fix. If not, the solution is to roll back my change switching eventgate-analytics to use envoy. But I'd prefer acquiring data for one night.
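
For context, this is roughly where that knob sits in an envoy (v2 API) cluster definition; a minimal sketch, with the cluster name, address, and port as illustrative placeholders rather than our production values:

clusters:
  - name: eventgate-analytics                      # illustrative name
    connect_timeout: 1s
    type: STRICT_DNS
    lb_policy: ROUND_ROBIN
    # Commented out for the experiment: when set, envoy recycles each upstream
    # connection after 1000 requests instead of keeping it alive indefinitely.
    # max_requests_per_connection: 1000
    load_assignment:
      cluster_name: eventgate-analytics
      endpoints:
        - lb_endpoints:
            - endpoint:
                address:
                  socket_address:
                    address: eventgate-analytics.svc.example   # placeholder
                    port_value: 8192                           # placeholder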

This should not be a train blocker; it doesn't have anything to do with MW but with some backend configuration. The same issues are happening for all wikis now, including those on wmf.22 from last week. The timing of the backend change just correlates with the wmf.23 deploy, but it is not related.

Change 579345 merged by Giuseppe Lavagetto:
[operations/puppet@production] services_proxy: use http endpoint for eventgate-analytics

https://gerrit.wikimedia.org/r/579345

Mentioned in SAL (#wikimedia-operations) [2020-03-13T09:55:14Z] <_joe_> running puppet across appservers to switch to http for eventgate-analytics T247484

Change 579531 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/mediawiki-config@master] Switch eventgate-analytics back to direct connection in eqiad

https://gerrit.wikimedia.org/r/579531

Change 579531 merged by jenkins-bot:
[operations/mediawiki-config@master] Switch eventgate-analytics back to direct connection in eqiad

https://gerrit.wikimedia.org/r/579531

At 12:16 https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/579531/ was merged, rolling back the switch to envoy-based eventgate-analytics access for eqiad. codfw hosts still reach their local eventgate-analytics via envoy (albeit without HTTPS). The errors seem to have disappeared, with only one being logged (on mw2253) since then.

akosiaris lowered the priority of this task from High to Medium. Mar 13 2020, 3:57 PM

We are going to leave it like this for the weekend: codfw MW hosts talking to the local eventgate-analytics via the local envoy+TLS, and eqiad MW hosts talking to eventgate-analytics directly and without TLS. We have very few errors now (2-3 per hour); that's acceptable for the weekend.

The number of errors has gone up during the weekend, making this even more absurd.

This is the debug log for a session that fails:

Mar 16 06:01:27 mw2231 envoyproxy-hot-restarter[30138]: [2020-03-16 06:01:27.669][23936][debug][http] [source/common/http/conn_manager_impl.cc:259] [C3429] new stream
Mar 16 06:01:27 mw2231 envoyproxy-hot-restarter[30138]: [2020-03-16 06:01:27.670][23936][debug][http] [source/common/http/conn_manager_impl.cc:708] [C3429][S13627992005098956010] request headers complete (end_stream=false):
Mar 16 06:01:27 mw2231 envoyproxy-hot-restarter[30138]: ':authority', 'localhost:6004'
Mar 16 06:01:27 mw2231 envoyproxy-hot-restarter[30138]: ':path', '/v1/events?hasty=true'
Mar 16 06:01:27 mw2231 envoyproxy-hot-restarter[30138]: ':method', 'POST'
Mar 16 06:01:27 mw2231 envoyproxy-hot-restarter[30138]: 'accept', '*/*'
Mar 16 06:01:27 mw2231 envoyproxy-hot-restarter[30138]: 'content-type', 'application/json'
Mar 16 06:01:27 mw2231 envoyproxy-hot-restarter[30138]: 'user-agent', 'wikimedia/multi-http-client v1.0'
Mar 16 06:01:27 mw2231 envoyproxy-hot-restarter[30138]: 'content-length', '502'
Mar 16 06:01:27 mw2231 envoyproxy-hot-restarter[30138]: [2020-03-16 06:01:27.670][23936][debug][http] [source/common/http/conn_manager_impl.cc:1257] [C3429][S13627992005098956010] request end stream
Mar 16 06:01:27 mw2231 envoyproxy-hot-restarter[30138]: [2020-03-16 06:01:27.670][23936][debug][http] [source/common/http/conn_manager_impl.cc:1354] [C3429][S13627992005098956010] Sending local reply with details upstream_reset_before_response_started{connection termination}
Mar 16 06:01:27 mw2231 envoyproxy-hot-restarter[30138]: [2020-03-16 06:01:27.670][23936][debug][http] [source/common/http/conn_manager_impl.cc:1552] [C3429][S13627992005098956010] encoding headers via codec (end_stream=false):
Mar 16 06:01:27 mw2231 envoyproxy-hot-restarter[30138]: ':status', '503'
Mar 16 06:01:27 mw2231 envoyproxy-hot-restarter[30138]: 'content-length', '95'
Mar 16 06:01:27 mw2231 envoyproxy-hot-restarter[30138]: 'content-type', 'text/plain'
Mar 16 06:01:27 mw2231 envoyproxy-hot-restarter[30138]: 'date', 'Mon, 16 Mar 2020 06:01:27 GMT'
Mar 16 06:01:27 mw2231 envoyproxy-hot-restarter[30138]: 'server', 'envoy'

This seems related to eventgate and envoy not having the same expectations in terms of connection keepalive and socket timeouts.

Mentioned in SAL (#wikimedia-operations) [2020-03-16T06:38:20Z] <_joe_> restart envoy with 10 requests per connection on mw2231, T247484

This seems to be a recurring issue with envoy and some upstream applications; see for instance https://github.com/envoyproxy/envoy/issues/8639

As we already expected, the solution they use is disabling persistent connections to that specific backend.

I just tested this on mw2231 and am waiting a couple of hours before declaring that it solves the issue.

At the same time, I'm thinking it would be best to accept (for now) that we sometimes have connection problems and enable retrying on envoy's side. @akosiaris, thoughts?
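
Roughly, that would mean attaching a retry policy to the route envoy uses for this backend; a minimal sketch (the route match, cluster name, and timeout values are illustrative — the actual knobs are what the puppet changes below expose):

routes:
  - match:
      prefix: "/"
    route:
      cluster: eventgate-analytics        # illustrative cluster name
      timeout: 5s                         # illustrative request timeout
      retry_policy:
        # Retry only when the upstream resets the connection before responding.
        retry_on: reset
        num_retries: 1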

Change 579903 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/puppet@production] profile::services_proxy: allow defining a retry policy

https://gerrit.wikimedia.org/r/579903

Change 579904 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/puppet@production] eventgate-analytics: allow retries on connection reset

https://gerrit.wikimedia.org/r/579904

This seems to be a recurring issue with envoy and some upstream applications; see for instance https://github.com/envoyproxy/envoy/issues/8639

As we already expected, the solution they use is disabling persistent connections to that specific backend.

I just tested this on mw2231 and am waiting a couple of hours before declaring that it solves the issue.

OK. At least we seem to have a solution.

At the same time, I'm thinking it would be best to accept (for now) that we sometimes have connection problems and enable retrying on envoy's side. @akosiaris, thoughts?

If that works, even better. +1 for going forward with this and then falling back on disabling persistent connections if it doesn't work.

Change 579903 merged by Giuseppe Lavagetto:
[operations/puppet@production] profile::services_proxy: allow defining a retry policy

https://gerrit.wikimedia.org/r/579903

Change 579904 merged by Giuseppe Lavagetto:
[operations/puppet@production] eventgate-analytics: allow retries on connection reset

https://gerrit.wikimedia.org/r/579904

Change 580286 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/mediawiki-config@master] Switch eventgate-analytics to go through envoy everywhere

https://gerrit.wikimedia.org/r/580286

Change 580286 merged by jenkins-bot:
[operations/mediawiki-config@master] Switch eventgate-analytics to go through envoy everywhere

https://gerrit.wikimedia.org/r/580286

With the retries we're down to a background noise of ~10 events/hour, which should be acceptable. Tentatively resolving for now.

The numbers are higher than 10/hour, more like 60/hour.

I did some more digging and I think the problem is that the keepalive timeout in envoy is much higher (1 hour) than on the nodejs side.

This means that we're probably seeing this happen from connections that remain open on the client side because a FIN packet was lost. So I am going to lower the idle timeout to 30 seconds in envoy, and see if this solves the problem or not.

This would also explain why most of these errors happen in codfw, where we make only a handful of requests per minute to each server, and hence why this happens so often there.

https://nodejs.org/api/http.html#http_server_keepalivetimeout suggests the keepalive timeout is 5 seconds by default for nodejs servers.

My empirical tests with eventgate seem to confirm that value. So I'll set the same value in the upstream envoy.
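
Concretely, that means putting an idle timeout on the upstream cluster that matches node's 5-second default keepAliveTimeout; a minimal sketch of the relevant envoy (v2) cluster fragment, with the cluster name as an illustrative placeholder:

clusters:
  - name: eventgate-analytics             # illustrative name
    # Close upstream connections that have been idle for 5s, matching node's
    # default keepAliveTimeout, so envoy never reuses a connection that
    # eventgate is about to drop.
    common_http_protocol_options:
      idle_timeout: 5s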

Mentioned in SAL (#wikimedia-operations) [2020-03-18T10:52:18Z] <_joe_> setting num_retries=0, idle_timeout=5s on mw2223 for eventgate-analytics in envoy (T247484)

Mentioned in SAL (#wikimedia-operations) [2020-03-18T10:58:33Z] <_joe_> setting num_retries=0 on mw2224 for eventgate-analytics in envoy (T247484)

Change 580893 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/puppet@production] services_proxy: allow setting a keepalive timeout

https://gerrit.wikimedia.org/r/580893

Change 580894 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/puppet@production] services_proxy: switch from retries to shorter keepalive timeouts

https://gerrit.wikimedia.org/r/580894

Change 580906 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/docker-images/production-images@master] Update envoy, add ability to define an idle timeout

https://gerrit.wikimedia.org/r/580906

Change 580893 merged by Giuseppe Lavagetto:
[operations/puppet@production] services_proxy: allow setting a keepalive timeout

https://gerrit.wikimedia.org/r/580893

Change 580894 merged by Giuseppe Lavagetto:
[operations/puppet@production] services_proxy: switch from retries to shorter keepalive timeouts

https://gerrit.wikimedia.org/r/580894

After one day, what we found is that the errors we see now are of the "UF" (upstream connection failure) type, which seems to point to the upstream having some issue or slowness. I'll re-add a retry policy and all problems should go away.

Change 581415 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/puppet@production] services_proxy: re-add retries for eventgate-analytics

https://gerrit.wikimedia.org/r/581415

Change 581415 merged by Giuseppe Lavagetto:
[operations/puppet@production] services_proxy: re-add retries for eventgate-analytics

https://gerrit.wikimedia.org/r/581415

Checking in, how goes?

We're now down to well under one error per minute, all of them caused by the upstream timing out or disconnecting. The problem is essentially solved in terms of limiting the error rate to acceptable levels (fewer than 1 error per million calls), so I think we can close the task.

I will still work on the kubernetes sidecar side, in order to add an idle_timeout there too.