
> ~1 request/second to intake-logging.wikimedia.org times out at the traffic/service interface
Closed, Resolved · Public

Description

Logstash shows, per the log of 5xx responses served by Varnish, that we serve about one 5xx per minute for intake-logging.wm.o requests.

They have a TTFB of 60 seconds, or very close to it, which is about right for some sort of timeout while contacting the backend service.

According to EventGate-exported metrics, it is not serving any 5xx responses.

So the issue must lie at a layer in between those two, within the onion of production.

I thought that a small amount of CPU throttling on the eventgate-logging-external-tls-proxy k8s job might be the issue, but I have mostly fixed that and the issue still persists. (My theory was that the effect of CPU throttling is hard to mentally model, and that it sits at approximately the 'right' location within the onion.)
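
As a rough way to sanity-check the throttling theory, something like the following works against the Prometheus HTTP API using cAdvisor's CFS counters. This is only a sketch: the Prometheus endpoint URL and the container label value below are placeholders, not the actual production names.

```
# Hypothetical sketch: estimate how often the TLS proxy container is CPU
# throttled, via cAdvisor's CFS metrics exposed through Prometheus.
import requests

PROMETHEUS = "http://prometheus.example.org/api/v1/query"  # placeholder endpoint

# Fraction of CFS scheduling periods in which the container was throttled,
# averaged over the last hour.
query = (
    'sum(rate(container_cpu_cfs_throttled_periods_total'
    '{container="tls-proxy"}[1h]))'
    ' / '
    'sum(rate(container_cpu_cfs_periods_total{container="tls-proxy"}[1h]))'
)

resp = requests.get(PROMETHEUS, params={"query": query}, timeout=10)
resp.raise_for_status()
for result in resp.json()["data"]["result"]:
    print(result["metric"], result["value"])
```

A sustained throttled fraction well above zero would support the theory; a value near zero after the limit bump would point elsewhere.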

Event Timeline

CDanis created this task.

Clients will retry automatically, so this isn't a huge deal, but it does merit investigation at some point.

Idea: Could missing-revisions (T215001) be related to this?

fdans subscribed.

Just pinging @Ottomata for when he's back from vacation.

Ottomata added a project: Analytics-Kanban.

Interesting. So we don't know exactly where the timeout is occurring? Assigning to me to remember to look into this.

BBlack subscribed.

Removing Traffic for now, although it could get added back if further investigation indicates our infra is the cause (and that it's fixable and worth fixing).

Just a quick update that, after a year, this is still happening on both intake-analytics (50x/minute =~ 0.8x/sec) and intake-logging (1-2x/minute).

Change 790289 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] Increase the connect_timeout for eventgate based services

https://gerrit.wikimedia.org/r/790289

Change 790289 abandoned by Btullis:

[operations/deployment-charts@master] Increase the connect_timeout for eventgate based services

Reason:

We don't think that it is this connect_timeout after all. Rethinking.

https://gerrit.wikimedia.org/r/790289

Are we sure that this is a service-side issue? This sounds a lot like a FetchError triggered by the client going away or the connection being interrupted before Varnish gets the whole POST body (which triggers a 503 issued by Varnish). ATS seems to believe that eventgate-logging-external.discovery.wmnet is rather healthy: https://grafana.wikimedia.org/goto/QKFq4aD4k?orgId=1
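
As an aside, that failure mode can be sketched with a client that declares a large POST body and then disconnects before sending it. This is only an illustration of the behaviour described above; the /v1/events path and the payload are assumptions, and it should be pointed at a test endpoint rather than production.

```
# Illustrative sketch of the client behaviour described above: start a POST,
# declare a large Content-Length, send only part of the body, then disconnect.
# The host is from this task; the path and payload are placeholders.
import socket
import ssl

HOST = "intake-logging.wikimedia.org"

ctx = ssl.create_default_context()
with socket.create_connection((HOST, 443), timeout=10) as raw_sock:
    with ctx.wrap_socket(raw_sock, server_hostname=HOST) as tls:
        tls.sendall(
            b"POST /v1/events HTTP/1.1\r\n"
            b"Host: intake-logging.wikimedia.org\r\n"
            b"Content-Type: application/json\r\n"
            b"Content-Length: 1048576\r\n"   # promise 1 MiB ...
            b"\r\n"
            b'{"partial": true'              # ... but send only a fragment
        )
# Leaving the `with` blocks closes the socket mid-body, the same as a client
# that goes away before the whole POST body reaches the edge cache.
```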

CDanis renamed this task from "~1 request/minute to intake-logging.wikimedia.org times out at the traffic/service interface" to "> ~1 request/second to intake-logging.wikimedia.org times out at the traffic/service interface". Feb 22 2023, 3:13 PM
CDanis claimed this task.

I think you are right @Vgutierrez, thanks.