Looking over some things:
16:44 Pool counter rejections for full text search start
16:45 Full-text QPS increased from 590 to 1100.
16:47 Pool counter rejections peak at 3.1k/min (50/sec).
16:47-17:04 System holds relatively steady, serving queries at elevated latency while rejecting some.
17:08 Pool counter rejections and search thread pool queues return to normal.
At first glance, it looks like full-text search traffic roughly doubled for a period of ~15 minutes, then slowly tapered off and returned to normal. Analysis of the web logs generated during this period would be needed to say more precisely what that increase was.
- Pool counter rejection rate was ~50/sec. The search cluster isn't reported as rejecting any requests (none of the thread pool queues filled up and rejected work). Overall query rate was 800-1100/sec. But the report is that 9 out of 10 searches failed. Are we missing this failure type in our tracking somehow? Maybe pool counter rejections are under-reported?
- This overload was slightly different from previous overloads. Specifically, none of the search thread pools appears to have hit capacity, and p95 latency only increased to ~1s. Probably not actionable, but notable. Could it be related to currently serving out of codfw?
- This didn't trigger our typical alerting, likely because p95 latency only straddled 1s. Maybe it wasn't over the threshold long enough to alert? Or is there something DC-switchover related that isn't checking codfw?
Following up here:
Erik and I dug in here, and we believe the pool counter rejections graph was actually reporting per-second values even though it was labeled as per minute.
That means we actually hit a peak of about 3.1k rejections per second at 16:47. We've edited the Grafana graph to address the confusion.
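A quick back-of-envelope check shows why the per-second reading fits the user report much better than the per-minute one. This is an illustrative estimate only; the served QPS and the 3.1k peak are taken from the timeline above, and "fraction rejected" here simply compares rejections against served queries.

```python
# Which unit reading of the rejection graph is consistent with the
# report that roughly 9 out of 10 searches failed?
SERVED_QPS = 1100        # peak full-text QPS served (from the timeline)
PEAK_REJECTIONS = 3100   # graph peak, unit in question

# Reading 1: graph is per minute -> ~52 rejections/sec
rej_per_sec_if_per_min = PEAK_REJECTIONS / 60
frac_if_per_min = rej_per_sec_if_per_min / (rej_per_sec_if_per_min + SERVED_QPS)

# Reading 2: graph is per second -> 3100 rejections/sec
frac_if_per_sec = PEAK_REJECTIONS / (PEAK_REJECTIONS + SERVED_QPS)

print(f"per-minute reading: {frac_if_per_min:.1%} of attempts rejected")
print(f"per-second reading: {frac_if_per_sec:.1%} of attempts rejected")
```

The per-minute reading implies only ~4.5% of attempts were rejected, nowhere near the reported failure rate; the per-second reading implies ~74%, which is at least in the right ballpark.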
Furthermore, we don't seem to have an alert on the number of pool counter rejections, so we should consider adding one. Perhaps warn when the rate averages 1k/sec for 5 minutes straight.
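As a starting point, the proposed threshold could look something like the following Prometheus-style alerting rule. This is only a sketch: the metric name `poolcounter_rejections_total` and its labels are assumptions, and would need to be replaced with whatever our monitoring actually exports for pool counter rejections.

```yaml
groups:
  - name: poolcounter
    rules:
      - alert: PoolCounterRejectionsHigh
        # Hypothetical metric name; substitute the real rejection counter.
        # Fires when rejections average >1k/sec over a 5-minute window.
        expr: rate(poolcounter_rejections_total[5m]) > 1000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Pool counter rejecting >1k req/sec for 5 minutes"
```

One design question worth settling before merging: whether to alert on the aggregate rate or per-pool (e.g. full-text search only), since this incident was confined to a single pool.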