
Sustained periods (2-4h) of bad latency on production-search eqiad
Closed, Resolved · Public

Description

There have been at least three occasions in the last week, each on a different day, in which the 95th percentile rose above 1 second (the 50th percentile increases too, from ~10ms to ~30ms, but that is less worrying). The last one started at 15:47 (Dec 24).

https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&from=1576647991218&to=1577229020000

Screenshot_20200103_110932.png (127 KB)

This could be due to traffic (sorry, I could not find traffic statistics to confirm that), in which case there are not many actionables here, but I am reporting it for awareness, to evaluate impact/importance, and in case something internal (e.g. indexing, configuration) could be causing it or could mitigate the slowdown.

Event Timeline

Restricted Application added a subscriber: Aklapper.

My bet would be media indexing, as an alert on image uploads coincided with it, with some time shift (which could be explained by the asynchronous nature of indexing), but I don't have hard proof of this other than the matching times.

Edit: one event showed correlation, but I am no longer convinced, as I didn't see it in the following instances.

I believe this is caused by a bot sending a large number of requests of the form:
/w/api.php?format=json&action=query&prop=revisions&list=search&srsearch=search+query
using the UA: wikipedia (https://github.com/goldsmith/Wikipedia/)

from 811 IP addresses, which seem to belong to Amazon (I have not checked all of them).

Filtering for these requests, I can match the pattern we see in Grafana:

wik_ama.png (528×704 px, 55 KB)
(Y axis is requests/hour)
QPS in Grafana for the same period:
qps.png (596×838 px, 74 KB)

Removing these requests for the same period, I see the usual pattern:
qps_normal.png (528×704 px, 60 KB)

I have a notebook with the IPs on notebook1004 named abuse_goldsmith_ua_from_amazon.
The search cluster cannot support such traffic, as it doubles the number of fulltext searches we have to serve.
Pinging Traffic for suggestions on how to proceed.
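
For reference, a minimal sketch (not the actual notebook on notebook1004) of the kind of query that isolates this traffic, assuming a PySpark session with access to the wmf.webrequest Hive table; the field names, partition values, and date are assumptions rather than copied from the real analysis:

```python
# Hypothetical sketch of the kind of query behind the notebook mentioned above.
# Table layout, partition values, and the chosen day are assumptions.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("goldsmith-ua-analysis")
         .enableHiveSupport()
         .getOrCreate())

abusive = spark.sql("""
    SELECT ip, dt, uri_query
    FROM wmf.webrequest
    WHERE webrequest_source = 'text'
      AND year = 2019 AND month = 12 AND day = 24      -- one of the affected days
      AND uri_path = '/w/api.php'
      AND uri_query LIKE '%list=search%'
      AND uri_query LIKE '%srsearch=%'
      AND user_agent LIKE 'wikipedia (https://github.com/goldsmith/Wikipedia/%'
""")

# Count distinct source IPs and total requests, to compare against the
# QPS graphs in Grafana.
abusive.selectExpr("count(distinct ip) AS ips", "count(*) AS requests").show()
```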

Change 561300 had a related patch set uploaded (by CDanis; owner: CDanis):
[operations/puppet@production] block one U-A running on AWS

https://gerrit.wikimedia.org/r/561300

Change 561300 merged by CDanis:
[operations/puppet@production] block search API traffic from one U-A running on AWS

https://gerrit.wikimedia.org/r/561300

Change 561304 had a related patch set uploaded (by CDanis; owner: CDanis):
[operations/puppet@production] anchor regex to start

https://gerrit.wikimedia.org/r/561304

Change 561304 merged by CDanis:
[operations/puppet@production] AWS search block: anchor regex to start

https://gerrit.wikimedia.org/r/561304

Change 561322 had a related patch set uploaded (by CDanis; owner: CDanis):
[operations/puppet@production] *properly* block search traffic of just one UA from AWS

https://gerrit.wikimedia.org/r/561322
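
As a toy illustration of what the "anchor regex to start" follow-up is about in general (the actual change is VCL, and exactly which regex it anchors is not shown here, so treat this purely as a sketch): an unanchored pattern matches the string anywhere in the header, while anchoring with `^` only matches requests whose User-Agent actually begins with the library's identifier.

```python
# Toy illustration only; the real change is a VCL regex, and the "other_ua"
# value below is invented to show the difference anchoring makes.
import re

ua_pattern = r"wikipedia \(https://github\.com/goldsmith/Wikipedia/\)"

unanchored = re.compile(ua_pattern)
anchored = re.compile(r"^" + ua_pattern)

library_ua = "wikipedia (https://github.com/goldsmith/Wikipedia/)"
other_ua = "SomeCrawler/2.0 (mentions wikipedia (https://github.com/goldsmith/Wikipedia/) in passing)"

# The unanchored pattern matches both values; the anchored one only matches
# User-Agents that start with the library's identifier.
print(bool(unanchored.search(library_ua)), bool(anchored.search(library_ua)))  # True True
print(bool(unanchored.search(other_ua)), bool(anchored.search(other_ua)))      # True False
```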

ema triaged this task as High priority. · Jan 3 2020, 10:04 AM
ema moved this task from Backlog to Caching on the Traffic board.

I believe this is caused by a bot sending a large number of requests of the form:
/w/api.php?format=json&action=query&prop=revisions&list=search&srsearch=search+query
using the UA: wikipedia (https://github.com/goldsmith/Wikipedia/)

Rather than being one specific bot, the UA in question is a Python library, so it is likely used by various different bots.

The search cluster cannot support such traffic, as it doubles the number of fulltext searches we have to serve.
Pinging Traffic for suggestions on how to proceed.

srsearch traffic from that User-Agent seems to have gone back lately to levels that do not cause trouble: https://w.wiki/EqB

@CDanis and I have worked on a VCL patch to block this specific type of traffic from that User-Agent, which can be merged if the abusive traffic starts again: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/561322/

Clearly, however, this can easily turn into a cat-and-mouse game. What is the per-IP requests-per-second rate that the search cluster can support? With this information, we can enforce specific throttling at the edge, different from what we use for the MediaWiki API in general (currently set to 100/s).

Clearly, however, this can easily turn into a cat-and-mouse game. What is the per-IP requests-per-second rate that the search cluster can support? With this information, we can enforce specific throttling at the edge, different from what we use for the MediaWiki API in general (currently set to 100/s).

I'm not convinced that per-IP limits are going to help much. During this incident, the per-IP request rate was reasonable, but the traffic came from a number of different AWS instances. @dcausse is still doing some number crunching and might come up with a better answer, but this seems unlikely.
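
To make the scale concrete, a back-of-the-envelope sketch with hypothetical numbers (only the ~811 source IPs and the 100 req/s per-IP API limit come from this task; the baseline QPS is assumed): traffic that doubles full-text search load but is spread across ~800 instances stays far below any plausible per-IP limit.

```python
# Back-of-the-envelope check with HYPOTHETICAL numbers; only the ~811 IPs and
# the 100 req/s per-IP API limit come from this task.
baseline_fulltext_qps = 500        # assumed baseline of full-text searches/s
extra_qps = baseline_fulltext_qps  # "doubles the number of fulltext searches"
source_ips = 811

per_ip_qps = extra_qps / source_ips
print(f"~{per_ip_qps:.2f} req/s per IP")  # ~0.62 req/s, nowhere near a 100 req/s limit
```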

TJones claimed this task.