Appservers rising GET latency might have triggered LVS pages
Open, Medium, Public

Description

Received a page that recovered almost immediately; there were no signs of network trouble, though:

Service: LVS HTTPS IPv4 #page
Host: text-lb.ulsfo.wikimedia.org
Address: 198.35.26.96
State: CRITICAL

Date/Time: Sat Nov 23 09:02:16 UTC 2019
Notification Type: RECOVERY

Service: LVS HTTPS IPv4 #page
Host: text-lb.ulsfo.wikimedia.org
Address: 198.35.26.96
State: OK

Date/Time: Sat Nov 23 09:03:57 UTC 2019

Upon further investigation, it looks like the LVS paging checks for text-lb have been alerting in SOFT state for the last few hours: https://logstash.wikimedia.org/goto/34bbf6a96cf59e2f4951dfc4474ec00e (icinga AND text-lb AND "service alert"). I suspect the page was an (un)lucky coincidence: the check happened to fail three times in a row.
The failing checks seem to correlate with a general spike in GET latency for appservers, which jumped on the 23rd at ~00:00: https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&from=1574329628748&to=1574502428749&var-datasource=eqiad%20prometheus%2Fops&var-cluster=appserver&var-method=GET&panelId=10&fullscreen&var-code=200
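
To illustrate the SOFT/HARD reasoning above, here is a minimal sketch of how consecutive failures turn into a page. It assumes the check pages on the third consecutive failure (taken from "failed three times in a row"), and it ignores retry intervals and re-notifications. The function name `pages_for` and the parameter `max_check_attempts` are made up for this sketch and are not taken from the actual Icinga configuration.

```python
def pages_for(results, max_check_attempts=3):
    """Return (index, kind) notification events for a sequence of check results.

    results: iterable of booleans, True meaning the check passed.
    max_check_attempts: assumed number of consecutive failures needed to
    reach HARD state (3 here, based on the task description).
    """
    events = []
    failures = 0  # consecutive failures seen so far
    for i, ok in enumerate(results):
        if ok:
            # A recovery is only notified if the problem had reached HARD state.
            if failures >= max_check_attempts:
                events.append((i, "RECOVERY"))
            failures = 0
        else:
            failures += 1
            # Failures below the threshold stay in SOFT state: logged, no page.
            if failures == max_check_attempts:
                events.append((i, "PROBLEM"))
    return events


# Two failures followed by a success never page; three in a row do.
flapping_then_page = [False, False, True, False, False, False, True]
print(pages_for(flapping_then_page))
```

Under these assumptions, a check that keeps flapping in SOFT state can run for hours without paging, and a single extra failure at the wrong moment is enough to page.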

Event Timeline

Restricted Application added a subscriber: Aklapper. Nov 23 2019, 9:49 AM
fgiunchedi updated the task description. Nov 23 2019, 9:52 AM

The cause was indeed appserver latency; resolving in favor of T238939.

Joe added a subscriber: Joe. Nov 25 2019, 8:55 AM

I find it hard to believe this is the case. Text-lb checks request a cached url, so the backend latency should not matter.

fgiunchedi reopened this task as Open. Nov 25 2019, 9:45 AM

> I find it hard to believe this is the case. Text-lb checks request a cached url, so the backend latency should not matter.

That's fair, I'm reopening the task. So far here's what I discovered:

Icinga LVS HTTPS checks for all sites were flapping in SOFT state, and stopped once the restart happened. I suspect one of these flapping checks eventually reached HARD state and paged, although the issue doesn't seem to affect any site in particular.

https://logstash.wikimedia.org/goto/c0a32e43261f513d04b79e9a2bfcbe00 ("Socket timeout after 10 seconds" AND text-lb AND "SERVICE ALERT")
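
One way to check whether the flapping is spread across sites is to count SOFT alerts per site from the alert lines. The sketch below assumes the lines follow the Nagios-style `SERVICE ALERT: host;service;state;state_type;attempt;output` layout and that the site is the second label of the hostname. The function name `soft_alerts_per_site` is hypothetical, and the sample lines are made up for illustration; they are not real log data.

```python
import re
from collections import Counter

# Assumed Nagios-style layout: host;service;state;state_type;attempt;output
ALERT_RE = re.compile(
    r"SERVICE ALERT: (?P<host>[^;]+);(?P<service>[^;]+);"
    r"(?P<state>[^;]+);(?P<state_type>SOFT|HARD);(?P<attempt>\d+);(?P<output>.*)"
)


def soft_alerts_per_site(lines):
    """Count non-OK SOFT alerts per site, e.g. 'ulsfo' for text-lb.ulsfo.wikimedia.org."""
    counts = Counter()
    for line in lines:
        m = ALERT_RE.search(line)
        if not m or m["state_type"] != "SOFT" or m["state"] == "OK":
            continue
        labels = m["host"].split(".")
        # Assumption: the site name is the second label of the hostname.
        site = labels[1] if len(labels) > 1 else m["host"]
        counts[site] += 1
    return counts


# Made-up sample lines, for illustration only.
sample = [
    "SERVICE ALERT: text-lb.ulsfo.wikimedia.org;LVS HTTPS IPv4 #page;CRITICAL;SOFT;1;CRITICAL - Socket timeout after 10 seconds",
    "SERVICE ALERT: text-lb.eqiad.wikimedia.org;LVS HTTPS IPv4 #page;CRITICAL;SOFT;1;CRITICAL - Socket timeout after 10 seconds",
    "SERVICE ALERT: text-lb.ulsfo.wikimedia.org;LVS HTTPS IPv4 #page;CRITICAL;HARD;3;CRITICAL - Socket timeout after 10 seconds",
]
print(soft_alerts_per_site(sample))
```

A roughly even spread across sites would support the suspicion that no single site is affected.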

jijiki added a subscriber: jijiki. Nov 25 2019, 5:09 PM
jbond triaged this task as Medium priority. Nov 26 2019, 11:52 AM
jijiki moved this task from Incoming 🐫 to Unsorted on the serviceops board. Aug 17 2020, 11:46 PM