Appservers rising GET latency might have triggered LVS pages
Closed, ResolvedPublic

Description

Received a page which almost immediately recovered, no signs of network troubles though:

Service: LVS HTTPS IPv4 #page
Host: text-lb.ulsfo.wikimedia.org
Address: 198.35.26.96
State: CRITICAL

Date/Time: Sat Nov 23 09:02:16 UTC 2019
Notification Type: RECOVERY

Service: LVS HTTPS IPv4 #page
Host: text-lb.ulsfo.wikimedia.org
Address: 198.35.26.96
State: OK

Date/Time: Sat Nov 23 09:03:57 UTC 2019

Upon further investigation, it looks like the LVS paging checks for text-lb have been alerting in SOFT state for the last few hours: https://logstash.wikimedia.org/goto/34bbf6a96cf59e2f4951dfc4474ec00e (icinga AND text-lb AND "service alert"). I suspect the page was an unlucky coincidence in which the check happened to fail three times in a row.
The failing checks seem to correlate with a general spike in GET latency on the appservers, which jumped on the 23rd at ~00:00: https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&from=1574329628748&to=1574502428749&var-datasource=eqiad%20prometheus%2Fops&var-cluster=appserver&var-method=GET&panelId=10&fullscreen&var-code=200
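
For context on why a check can flap in SOFT state for hours without paging, here is a minimal sketch of the SOFT/HARD logic. It is an illustration only, not the actual Icinga implementation or the production configuration: the class and function names are hypothetical, and the threshold of three attempts is an assumption taken from the "three times in a row" remark above.

```python
from dataclasses import dataclass


@dataclass
class CheckState:
    """Simplified SOFT/HARD tracker for one service check (hypothetical model)."""
    max_attempts: int = 3      # assumption: failures in a row needed to go HARD
    attempt: int = 0
    state: str = "OK"
    state_type: str = "HARD"

    def process(self, ok: bool) -> bool:
        """Feed one check result; return True if a notification would be sent."""
        if ok:
            # A recovery is only notified if the problem had reached HARD state.
            notify = self.state == "CRITICAL" and self.state_type == "HARD"
            self.state, self.state_type, self.attempt = "OK", "HARD", 0
            return notify
        if self.state == "CRITICAL" and self.state_type == "HARD":
            # Already paged for this problem; this sketch does not re-notify.
            return False
        self.attempt += 1
        self.state = "CRITICAL"
        if self.attempt >= self.max_attempts:
            self.state_type = "HARD"
            return True
        self.state_type = "SOFT"
        return False


def notifications(results, max_attempts=3):
    """Return the indices of the check results that would trigger a notification."""
    check = CheckState(max_attempts=max_attempts)
    return [i for i, ok in enumerate(results) if check.process(ok)]


# One and two failures in a row stay SOFT; only the run of three pages,
# followed by the recovery notification on the next successful check.
history = [True, False, True, False, False, True, False, False, False, True]
print(notifications(history))
```

In this model, intermittent timeouts never notify unless enough of them line up consecutively, which would match a page that recovers almost immediately.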

Event Timeline

The cause was indeed appserver latency; resolving in favor of T238939.

I find it hard to believe this is the case. Text-lb checks request a cached url, so the backend latency should not matter.

That's fair, I'm reopening the task. So far here's what I discovered:

Icinga LVS HTTPS checks for all sites were flapping in SOFT state, and stopped once the restart happened. I suspect one of these flapping checks eventually reached HARD state and paged, although the issue doesn't seem to affect any site in particular.

https://logstash.wikimedia.org/goto/c0a32e43261f513d04b79e9a2bfcbe00 ("Socket timeout after 10 seconds" AND text-lb AND "SERVICE ALERT")

jbond triaged this task as Medium priority. Nov 26 2019, 11:52 AM
jijiki claimed this task.

Please reopen if needed