Increased latency in appservers - 22 Nov 2019
Closed, Resolved (Public)

Description

Since around 13:14 UTC today, we have observed increased latency in both the average and the 95th percentile:

https://grafana.wikimedia.org/d/5E7tdiGWz/xxxx-effie?orgId=1&refresh=30s&panelId=13&fullscreen

https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&panelId=9&fullscreen

response.png (825×1 px, 157 KB)

varnish.png (828×1 px, 110 KB)

  • Deployments around that time seem unrelated
  • Failed fetches increased only on cp1083, but it is doubtful that the problem lies at this layer

Event Timeline

jijiki updated the task description.

Hi, shouldn't this task be in Unbreak now! priority?

Probably, given I'm investigating on Saturday. But I think there is hope this can be mitigated. I'll keep the task posted.

Mathis_Benguigui triaged this task as Unbreak Now! priority. Nov 23 2019, 12:08 PM

Just to clarify - the situation got worrisome only this morning, when latencies skyrocketed and the issue became user-visible. I'm not sure the two issues are the same, but for convenience I'm going to keep using this task.

Joe claimed this task.

Restarting php-fpm on the affected servers did solve the issue. I decided against doing deeper debugging before restarting the fleet because of the urgency of the fix (the problem became user-visible).