Increased latency in appservers - 22 Nov 2019
Closed, Resolved · Public

Description

Since around 13:14 UTC today, we have observed increased latency on the appservers, in both the average and the 95th percentile:

https://grafana.wikimedia.org/d/5E7tdiGWz/xxxx-effie?orgId=1&refresh=30s&panelId=13&fullscreen

https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&panelId=9&fullscreen

  • Deployments around that time seem unrelated
  • Failed fetches increased only on cp1083, but it is doubtful that the issue lies at this layer
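
For readers less familiar with the two metrics mentioned above, here is a minimal sketch of why the average and the 95th percentile are tracked separately: a handful of slow requests moves the percentile far more than the mean. The sample latencies and the helper name `percentile_nearest_rank` are made up for illustration and are not taken from the dashboards; the nearest-rank method is also an assumption, since the dashboards may compute quantiles differently.

```python
import math
from statistics import mean


def percentile_nearest_rank(samples, pct):
    """Return the pct-th percentile of samples using the nearest-rank method."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    # 1-based rank of the smallest value that covers pct percent of the samples.
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


# Hypothetical request latencies in milliseconds: mostly fast, a few very slow.
latencies_ms = [100] * 18 + [2000] * 2

avg_ms = mean(latencies_ms)                          # 290 ms
p95_ms = percentile_nearest_rank(latencies_ms, 95)   # 2000 ms

print(f"avg={avg_ms} ms, p95={p95_ms} ms")
```

In this made-up sample, only 10% of requests are slow, yet the 95th percentile sits at the slow value while the average stays comparatively low, which is why a long-tail regression can be visible in one panel before the other.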

Event Timeline

jijiki created this task. Nov 22 2019, 6:03 PM
Restricted Application added a subscriber: Aklapper. Nov 22 2019, 6:03 PM
CDanis added a subscriber: CDanis. Nov 22 2019, 7:30 PM

At ~18:36 there was another spike in long-tail latency, after which latency seemed to return to 'normal':
https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&from=1574443410296&to=1574450056679

jijiki updated the task description. Nov 22 2019, 10:10 PM
jijiki updated the task description.

Hi, shouldn't this task be in Unbreak now! priority?

Joe added a subscriber: Joe. Nov 23 2019, 12:05 PM

Hi, shouldn't this task be in Unbreak now! priority?

Probably, given that I'm investigating on a Saturday. But I think there is hope this can be mitigated. I'll keep the task posted.

Mathis_Benguigui triaged this task as Unbreak Now! priority. Nov 23 2019, 12:08 PM
Restricted Application added a subscriber: Liuxinyu970226. Nov 23 2019, 12:08 PM
Joe added a comment. Nov 23 2019, 12:11 PM

Just to clarify: the situation only became worrisome this morning, when latencies skyrocketed and the issue became user-visible. I'm not sure the two issues are the same, but for convenience I'm going to keep using this task.

Joe closed this task as Resolved. Nov 23 2019, 12:24 PM
Joe claimed this task.

Restarting php-fpm on the affected servers solved the issue. I decided against deeper debugging before restarting the fleet because of the urgency of the fix (the problem had become user-visible).