Looking over some things:
16:44 Pool counter rejections for full text search start
16:45 Full-text QPS increased from 590 to 1100.
16:47 Pool counter rejections peak at 3.1k/min (50/sec).
16:47-17:04 System holds relatively steady, serving queries at elevated latency while rejecting some.
17:08 Pool counter rejections and search thread pool queues return to normal.
At first glance, it looks like full-text search traffic roughly doubled for a period of ~15 minutes, then slowly tapered off and returned to normal. Analysis of the web logs generated during this period would be needed to say more precisely what that increase was.
- Pool counter rejection rate was ~50/sec. The search cluster isn't reported as rejecting any requests (none of the thread pool queues filled up and rejected work). Overall query rate was 800-1100/sec. But the report is that 9 out of 10 searches failed. Are we missing this failure type in our tracking somehow? Maybe pool counter rejections are under-reported?
- This overload was slightly different from previous overloads. Specifically, none of the search thread pools appears to have hit capacity, and p95 latency only increased to ~1s. Probably not actionable, but notable. Could it be related to currently serving out of codfw?
- This didn't trigger our typical alerting, likely because p95 latency only straddled 1s. Maybe it wasn't over the threshold long enough to alert? Or is there something DC-switchover related that isn't checking codfw?
Following up here:
Erik and I dug in here, and we believe the pool counter rejections graph was actually reporting per-second values even though it was labeled as per minute.
That means we actually hit a peak of about 3.1k rejections per second at 16:47. We've edited the Grafana graph to address the confusion.
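A quick back-of-envelope check shows why the per-second reading fits the user report much better than the per-minute one. This is an illustrative estimate only; the served QPS and the 3.1k peak are taken from the timeline above, and "fraction rejected" here simply compares rejections against served queries.

```python
# Which unit reading of the rejection graph is consistent with the
# report that roughly 9 out of 10 searches failed?
SERVED_QPS = 1100        # peak full-text QPS served (from the timeline)
PEAK_REJECTIONS = 3100   # graph peak, unit in question

# Reading 1: graph is per minute -> ~52 rejections/sec
rej_per_sec_if_per_min = PEAK_REJECTIONS / 60
frac_if_per_min = rej_per_sec_if_per_min / (rej_per_sec_if_per_min + SERVED_QPS)

# Reading 2: graph is per second -> 3100 rejections/sec
frac_if_per_sec = PEAK_REJECTIONS / (PEAK_REJECTIONS + SERVED_QPS)

print(f"per-minute reading: {frac_if_per_min:.1%} of attempts rejected")
print(f"per-second reading: {frac_if_per_sec:.1%} of attempts rejected")
```

The per-minute reading implies only ~4.5% of attempts were rejected, nowhere near the reported failure rate; the per-second reading implies ~74%, which is at least in the right ballpark.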
Furthermore, we don't seem to have an alert on the number of pool counter rejections, so we should consider adding one. Perhaps warn when the rate averages 1k/sec for 5 minutes straight.
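As a starting point, the proposed threshold could look something like the following Prometheus-style alerting rule. This is only a sketch: the metric name `poolcounter_rejections_total` and its labels are assumptions, and would need to be replaced with whatever our monitoring actually exports for pool counter rejections.

```yaml
groups:
  - name: poolcounter
    rules:
      - alert: PoolCounterRejectionsHigh
        # Hypothetical metric name; substitute the real rejection counter.
        # Fires when rejections average >1k/sec over a 5-minute window.
        expr: rate(poolcounter_rejections_total[5m]) > 1000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Pool counter rejecting >1k req/sec for 5 minutes"
```

One design question worth settling before merging: whether to alert on the aggregate rate or per-pool (e.g. full-text search only), since this incident was confined to a single pool.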