
Elasticsearch instabilities on eqiad search cluster
Closed, Resolved · Public

Description

On Saturday, August 6, from 7:55 to 8:15am UTC, we experienced a significant slowdown on the elasticsearch eqiad cluster. The issue started with curl pool rejections on the mediawiki API servers before 7:55am and peaked with threadpool rejections on elasticsearch. The issue resolved itself without external intervention. In the following days, multiple spikes in curl rejections from the API servers were observed.

We observed higher than usual system load on all but 3 elasticsearch servers during the issue. Those 3 servers are the only ones not hosting any enwiki shards, so we suspect the issue is related to more expensive enwiki queries being run at that time.

Investigation is ongoing with the help of @dcausse.

Event Timeline

Gehel created this task. · Aug 9 2016, 5:51 PM
Restricted Application added a project: Discovery-Search. · Aug 9 2016, 5:51 PM
Restricted Application added a subscriber: Aklapper.
debt triaged this task as High priority. · Aug 9 2016, 5:59 PM
debt edited projects, added Discovery-Search (Current work); removed Discovery-Search.

The hot threads log reports some phraseFreq activity, which is an indication that a costly phrase query was probably running. We could check CirrusSearchRequestSet in Hive for suspicious enwiki queries containing a " during that period.
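
A sketch of what that Hive check might look like (not from the original thread; it assumes the wiki id is stored in a wikiid column and that the per-request query text lives in requests.query, as in the queries below):

-- sketch: slowest enwiki requests containing a quoted phrase during the incident window
select from_unixtime(ts) as time, tookms, req.querytype, req.query
  from wmf_raw.cirrussearchrequestset
       lateral view explode(requests) r as req
 where year=2016 and month=8 and day=6 and hour between 7 and 8
   and wikiid = 'enwiki'
   and req.query like '%"%'
 order by tookms desc
 limit 100;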

EBernhardson added a comment. (Edited) · Aug 15 2016, 5:29 PM

Ran a few exploratory queries; first, verify we collected the relevant info in Hive:

This first query checks the number of queries and the time spent on those queries. Outside of the problem area it reports 50-60k queries/minute and around 8e6 ms/minute spent on search requests. Entering the problem area, the request rate drops to 10-15k queries/minute and the time spent jumps to 1.5e8 ms/minute. This lets me know we certainly recorded the problem area and what data to zoom in on.
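
(Taking rough midpoints of those ranges, that works out to about 8e6 / 55k ≈ 150 ms per request outside the problem window versus 1.5e8 / 12.5k ≈ 12,000 ms per request inside it.)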

Results in: P3830

select minute, count(1) as count, sum(tookms) as sum, sum(tookms)/count(1) as ms_per_req
  from (select from_unixtime(60*floor(ts/60)) as minute, tookms
          from wmf_raw.cirrussearchrequestset
          where year=2016 and month=8 and day=6 and hour between 7 and 8
       ) x
 group by minute
 order by minute asc
 limit 9999;

Next, a look at the top 100 requests by time spent during the problem period. This doesn't end up being too interesting: lots of write requests, and a variety of namespace lookups that would probably have issued another request had they finished. Results in P3831 (contains raw user queries, so it falls under the WMF-NDA policy and I don't have to sanitize it). The interesting thing here is the timing: we have requests that issued two namespace queries to elasticsearch and took up to 5 minutes to complete. With some digging I've found that we don't appropriately set any timeouts on namespace requests. I highly doubt this is the source of our problem, but it's certainly something we should fix (see the follow-up sketch after the query below).

select from_unixtime(ts), tookms, size(requests), requests.querytype, requests.query
  from wmf_raw.cirrussearchrequestset
 where year=2016 and month=8 and day=6 and hour between 7 and 8
 order by tookms desc
 limit 100;
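
As a follow-up to the namespace-timeout observation above, a possible query to isolate those slow namespace lookups could look like this (a sketch only: it assumes such lookups are tagged with a querytype value of 'namespace', which would need to be confirmed against the actual values in the data):

-- sketch: requests containing a namespace lookup that took longer than a minute
select distinct from_unixtime(ts) as time, tookms, size(requests) as num_requests
  from wmf_raw.cirrussearchrequestset
       lateral view explode(requests) r as req
 where year=2016 and month=8 and day=6 and hour between 7 and 8
   and req.querytype = 'namespace'   -- assumed label for namespace lookups
   and tookms > 60000                -- slower than one minute
 order by tookms desc
 limit 100;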

So...nothing conclusive found except a bug that might exacerbate the problem by not timing out early enough. Will ponder some more.

We don't seem to be making much progress here; I'll move this to the done column unless anyone has ideas.

debt closed this task as Resolved. · Sep 9 2016, 8:02 PM
debt claimed this task.