Page MenuHomePhabricator

Search is currently too busy
Open, Needs TriagePublic

Description

Report of frequent "Search is currently too busy." errors on English Wikipedia: https://en.wikipedia.org/wiki/Wikipedia:Village_pump_(technical)#Search_busy

Event Timeline

kaldari created this task.Sep 11 2020, 5:08 PM
Restricted Application added a project: Discovery-Search. · View Herald TranscriptSep 11 2020, 5:08 PM
Restricted Application added a subscriber: Aklapper. · View Herald Transcript

possible duplicate of T262691

EBernhardson added a subscriber: EBernhardson.EditedSep 11 2020, 5:41 PM

Looking over some things:

16:44 Pool counter rejections for full text search start
16:45 full text QPS increased from 590 to 1100.
16:47 Pool counter rejections peak at 3.1k/min (50/sec).
16:47-17:04 System stays relatively consistent, serving queries at heightened latency and rejecting some
17:08 Pool counter rejections and search thread pool queues return to normal

At first glance, it looks like fulltext search traffic doubled for a period of ~15 minutes. Traffic slowly tapered off and returned to normal operations. Analysis of web logs generated during this period would be required to be more specific about what that increase was.

Open questions:

  • Pool counter rejection rate was ~50/sec. Search cluster isn't reported as rejecting any requests (none of the thread pool queues filled up and rejected requests). Overall query rate was 800-1100/sec. But the report is that 9 out of 10 searches failed, are we missing this failure type in our tracking somehow, maybe pool counter rejections are under reported?
  • This overload was slightly different than previous overloads. Specifically none of the search thread pools appears to have hit capacity, and p95 latency only increased to ~1s. Probably not actionable, but slightly different. Anything to do with running out of codfw currently?
  • This didn't trigger our typical alerting, likely because the p95 latency only straddled 1s, maybe it wasn't over long enough to alert? Or is there something dc switchover related that isn't checking codfw?
RKemper added a subscriber: RKemper.EditedSep 11 2020, 10:17 PM

Following up here:

Erik and I dug in here and we believe that the graph of pool counter rejections was actually reporting per second whereas it was listed as per minute.

That means we actually hit a peak of about 3.1k rejections per second at 16:47. We've edited the Grafana graph to address the confusion.

Furthermore, we don't seem to have an alert on the number of pool counter rejections, so we should consider adding one. Perhaps warn at the threshold of an average of 1k/sec for 5 minutes straight or something.

CBogen moved this task from Incoming to In Progress on the Discovery-Search (Current work) board.

Moving forward we've decided to add the above proposed alert before closing this ticket.

CBogen assigned this task to RKemper.Sep 14 2020, 3:58 PM