~1 request/minute to intake-logging.wikimedia.org times out at the traffic/service interface
Open, Low, Public

Description

According to the Logstash log of 5xx responses served by Varnish, we serve about one 5xx response per minute for intake-logging.wm.o requests.

These responses have a TTFB of 60 seconds or very close to it, which is consistent with some sort of timeout contacting the backend service.

According to EventGate-exported metrics, EventGate itself is not serving any 5xx responses.

So the issue must be at a layer in between those two within the onion of production.
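One way to confirm the pattern described above is to filter the 5xx log for responses whose TTFB sits near the 60-second mark and count them per minute. The following is only a minimal sketch: the record fields (`timestamp`, `status`, `ttfb`), the function name, and the sample data are all hypothetical and do not reflect the actual Logstash schema.

```python
from collections import Counter
from datetime import datetime


def timeout_like_5xx_per_minute(records, timeout_s=60.0, tolerance_s=1.0):
    """Count 5xx records whose TTFB is within tolerance_s of timeout_s,
    bucketed by the minute in which they were logged.

    Field names are assumptions, not the real log schema.
    """
    counts = Counter()
    for rec in records:
        is_5xx = 500 <= rec["status"] <= 599
        near_timeout = abs(rec["ttfb"] - timeout_s) <= tolerance_s
        if is_5xx and near_timeout:
            ts = datetime.fromisoformat(rec["timestamp"])
            counts[ts.replace(second=0, microsecond=0)] += 1
    return counts


# Synthetic sample records, for illustration only.
sample = [
    {"timestamp": "2020-01-01T00:00:05", "status": 503, "ttfb": 60.01},
    {"timestamp": "2020-01-01T00:00:40", "status": 200, "ttfb": 0.02},
    {"timestamp": "2020-01-01T00:01:10", "status": 503, "ttfb": 59.98},
    {"timestamp": "2020-01-01T00:01:30", "status": 502, "ttfb": 0.5},
]

result = timeout_like_5xx_per_minute(sample)
for minute, n in sorted(result.items()):
    print(minute.isoformat(), n)
```

If the per-minute counts stay near 1 while the backend's own 5xx counter stays at zero, that would support the conclusion that the timeout happens somewhere between the two.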

I thought that a small level of CPU throttling on the eventgate-logging-external-tls-proxy k8s job might be the issue, but I mostly fixed that and yet the issue persists. (My theory was that it's hard to mentally model the effect of CPU throttling, and it was an issue at approximately the 'right' location within the onion.)

Event Timeline

CDanis created this task.

Clients will retry automatically so this isn't a huge deal, but it does merit investigation at some point.

Idea: Could missing-revisions (T215001) be related to this?

fdans added a subscriber: fdans.

Just pinging @Ottomata for when he's back from vacation.

Ottomata added a project: Analytics-Kanban.

Interesting. So we don't know exactly where the timeout is occurring? Assigning to me to remember to look into this.

BBlack added a subscriber: BBlack.

Removing Traffic for now, although it could be added back if further investigation indicates our infra is the cause (and that it's fixable and worth fixing).

Just a quick update that, after a year, this is still happening, on both intake-analytics (about 50 per minute, or roughly 0.8 per second) and intake-logging (1-2 per minute).