
Search is currently too busy
Closed, ResolvedPublic

Description

Report of frequent "Search is currently too busy." errors on English Wikipedia: https://en.wikipedia.org/wiki/Wikipedia:Village_pump_(technical)#Search_busy

Event Timeline

Restricted Application added a subscriber: Aklapper.

Looking over some things:

16:44 Pool counter rejections for full text search start.
16:45 Full text QPS increases from 590 to 1100.
16:47 Pool counter rejections peak at 3.1k/min (50/sec).
16:47-17:04 System stays relatively consistent, serving queries at heightened latency and rejecting some requests.
17:08 Pool counter rejections and search thread pool queues return to normal.

At first glance, it looks like fulltext search traffic roughly doubled for a period of ~15 minutes. Traffic then slowly tapered off and returned to normal operations. Being more specific about what that increase actually was would require analysis of the web logs generated during this period.
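To make that concrete, here is a minimal sketch of what such an analysis could look like, assuming a newline-delimited JSON dump of web request logs is available. The field names ("dt", "uri_query") and the fulltext marker are illustrative assumptions, not the actual webrequest schema:

```
# Sketch of the kind of web log analysis suggested above. Field names and the
# fulltext marker are assumptions for illustration only.
import json
from collections import Counter

def fulltext_requests_per_minute(path):
    """Count requests per minute whose query string looks like a fulltext search."""
    per_minute = Counter()
    with open(path) as f:
        for line in f:
            rec = json.loads(line)
            if "fulltext" not in rec.get("uri_query", ""):
                continue
            # "dt" assumed to be an ISO-8601 timestamp, e.g. 2020-11-23T16:45:12Z
            per_minute[rec["dt"][11:16]] += 1
    return per_minute

if __name__ == "__main__":
    counts = fulltext_requests_per_minute("webrequest_sample.ndjson")
    # Look for the jump around 16:45 and the taper back toward 17:08.
    for minute in sorted(counts):
        print(minute, counts[minute])
```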

Open questions:

  • Pool counter rejection rate was ~50/sec. The search cluster isn't reported as rejecting any requests (none of the thread pool queues filled up and rejected work). Overall query rate was 800-1100/sec. But the report is that 9 out of 10 searches failed; are we missing this failure type in our tracking somehow, or are pool counter rejections under-reported? (A rough consistency check on these numbers follows this list.)
  • This overload was slightly different from previous overloads. Specifically, none of the search thread pools appears to have hit capacity, and p95 latency only increased to ~1s. Probably not actionable, but slightly different. Anything to do with currently running out of codfw?
  • This didn't trigger our typical alerting, likely because p95 latency only straddled 1s; maybe it wasn't elevated long enough to alert? Or is there something DC-switchover-related that isn't checking codfw?
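As a rough consistency check on the first question (my own arithmetic using only the numbers already in this task): if the rejection graph really were per minute, the 3.1k peak would be about 50 rejections/sec against 800-1100 queries/sec served, i.e. only a few percent of attempts failing, which doesn't match "9 out of 10". Read as per second, the same peak implies roughly three quarters of attempts being rejected, which is much closer to the report:

```
# Rough consistency check on the graph's units, using numbers from this task.
served_qps = 1100          # peak fulltext QPS actually served
peak_rejections = 3100     # peak value read off the pool counter rejection graph

for unit, rejects_per_sec in [("per minute", peak_rejections / 60),
                              ("per second", peak_rejections)]:
    attempted_qps = served_qps + rejects_per_sec
    print(f"if graph is {unit}: ~{rejects_per_sec:.0f} rejections/sec "
          f"-> ~{rejects_per_sec / attempted_qps:.0%} of attempts rejected")

# if graph is per minute: ~52 rejections/sec -> ~4% of attempts rejected
# if graph is per second: ~3100 rejections/sec -> ~74% of attempts rejected
```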

Following up here:

Erik and I dug in here, and we believe the graph of pool counter rejections was actually reporting per-second values even though it was labeled as per minute.

That means we actually hit a peak of about 3.1k rejections per second at 16:47. We've edited the Grafana graph to address the confusion.

Furthermore, we don't seem to have an alert on the number of pool counter rejections, so we should consider adding one. Perhaps warn when rejections average 1k/sec for 5 minutes straight, or something along those lines.

Moving forward, we've decided to add the above proposed alert before closing this ticket.

Change 643362 had a related patch set uploaded (by Ryan Kemper; owner: Ryan Kemper):
[operations/puppet@production] cirrus: alert on pool counter reject spike

https://gerrit.wikimedia.org/r/643362

The current patch will warn on 500 cirrus pool counter rejections per second, averaged over a 5-minute moving window. There's no critical threshold set, and intentionally so.
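For illustration, the threshold logic described above amounts to something like the following. This is a sketch of the moving-average check only, not the actual puppet/Icinga code in the patch:

```
# Warn when the 5-minute moving average of pool counter rejections/sec
# exceeds 500; no critical threshold, matching the patch description.
from collections import deque

WARN_THRESHOLD = 500     # rejections/sec
WINDOW_SECONDS = 5 * 60  # 5-minute moving average

def should_warn(samples_per_sec):
    """samples_per_sec: iterable of per-second rejection counts, oldest first."""
    window = deque(maxlen=WINDOW_SECONDS)
    for sample in samples_per_sec:
        window.append(sample)
        if len(window) == WINDOW_SECONDS and sum(window) / WINDOW_SECONDS > WARN_THRESHOLD:
            return True
    return False

# A sustained spike like the 3.1k/sec peak at 16:47 would trip the warning
# once it has lasted the full five minutes; normal traffic would not.
print(should_warn([3100] * WINDOW_SECONDS))  # True
print(should_warn([50] * WINDOW_SECONDS))    # False
```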

Change 643362 merged by Ryan Kemper:
[operations/puppet@production] cirrus: alert on pool counter reject spike

https://gerrit.wikimedia.org/r/643362