
insource (regex) search on en.wp's Special:Search does not work
Closed, ResolvedPublic

Description

The following search does not work: https://en.wikipedia.org/w/index.php?title=Special:Search&profile=advanced&profile=advanced&fulltext=Search&search=insource%3A%2F%23switch%2F&ns0=1 . A fellow editor had issues with this Special:Search query; the link produced no results for him or for me, even after minutes of "searching."

Event Timeline

Krenair subscribed.

This is more likely an issue with the backend than with the special page itself.

Aklapper renamed this task from Special:Search Issue to Specific search on en.wp's Special:Search does not work.Dec 12 2016, 12:01 PM
Aklapper updated the task description.

The query takes more than 40s here, then displays: "An error has occurred while searching: We could not complete your search due to a temporary problem. Please try again later."

Probably related to T134157. I wonder if the curl timeout is reached before the shard timeout, resulting in an error rather than a warning.
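The suspected interaction between the two timeouts can be sketched with a toy model (invented names and numbers, not MediaWiki or CirrusSearch code): if the HTTP client gives up before Elasticsearch's shard timeout fires, the caller sees a hard error, whereas if the shard timeout fires first, Elasticsearch returns partial results with timed_out=true, which surfaces as a mere warning.

```python
def outcome(client_timeout_s: float, shard_timeout_s: float, query_runtime_s: float) -> str:
    """Classify what the caller observes for a given query runtime (toy model)."""
    if query_runtime_s <= min(client_timeout_s, shard_timeout_s):
        return "full results"
    if shard_timeout_s < client_timeout_s:
        # Shard timeout fires first: partial results + timed_out flag -> warning.
        return "partial results (warning)"
    # The client (curl) gives up first: the connection is aborted -> hard error.
    return "error"

# With both timeouts at 40s, a 60s regex query aborts on the client side:
print(outcome(client_timeout_s=40, shard_timeout_s=40, query_runtime_s=60))  # error
# Raising the client timeout above the shard timeout turns that into a warning:
print(outcome(client_timeout_s=80, shard_timeout_s=40, query_runtime_s=60))  # partial results (warning)
```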

dcausse renamed this task from Specific search on en.wp's Special:Search does not work to insource (regex) search on en.wp's Special:Search does not work.Dec 12 2016, 4:26 PM

Change 326464 had a related patch set uploaded (by DCausse):
Use CirrusSearchClientSideSearchTimeout for connection timeout

https://gerrit.wikimedia.org/r/326464

Change 326464 merged by jenkins-bot:
Use CirrusSearchClientSideSearchTimeout for connection timeout

https://gerrit.wikimedia.org/r/326464

Change 326909 had a related patch set uploaded (by DCausse):
Use CirrusSearchClientSideSearchTimeout for connection timeout

https://gerrit.wikimedia.org/r/326909

Change 326909 merged by jenkins-bot:
Use CirrusSearchClientSideSearchTimeout for connection timeout

https://gerrit.wikimedia.org/r/326909

Change 326948 had a related patch set uploaded (by Hashar):
Bump CirrusSearch

https://gerrit.wikimedia.org/r/326948

I've tried hitting a Varnish backend in eqiad directly with curl, as follows:

curl -v -H 'X-Forwarded-Proto: https' -H 'Host: en.wikipedia.org' 'http://localhost:3128/w/index.php?title=Special:Search&profile=advanced&profile=advanced&fulltext=Search&search=insource%3A%2F%23switch%2F&ns0=1'

The 200 response came back after 40.239s, which is roughly the same amount of time a request takes to go through the whole caching stack (nginx, a few Varnishes). That suggests the problem lies at the application level.

@ema, yes, there's a bug in the application right now: the timeout between MediaWiki and Elasticsearch is wrongly set to 40s.
The problem appeared when I tried to deploy a fix to mwdebug that actually uses an 80s timeout on the backend: nginx started to return a 504.

The fix https://gerrit.wikimedia.org/r/#/c/326464/ does not seem to solve the issue.
I fear that the shard-level timeout is not very accurate. Looking at the TimeLimitingCollector implementation in Lucene, I understand that the timeout check is done only when a doc is collected or when we start collecting a new shard. It's very likely that if the regex does not match many documents and is currently scanning a very large segment, the timeout check will be done far too late.
Currently we set the shard-level timeout to 40s with a client timeout of 80s. Doubling seems generous, but the regex in this ticket still hits the client timeout rather than the shard timeout.
I'm not sure we can do much on the Cirrus side; I'll have a look at the plugin to see if I can do something.
Still unsure whether we need to re-add the max_inspect circuit breaker in the meantime.
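The collector behavior described above can be illustrated with a toy model (not actual Lucene code): a scan that checks its deadline only when a document matches will blow far past its budget whenever matches are rare, which is exactly the sparse-regex case.

```python
def scan(docs, matcher, deadline_ticks):
    """Scan docs; check the 'timeout' only when a doc is collected (toy model)."""
    collected, ticks = [], 0
    for doc in docs:
        ticks += 1                      # work happens on every doc...
        if matcher(doc):
            if ticks > deadline_ticks:  # ...but the deadline is checked only here
                raise TimeoutError(f"timed out after {ticks} ticks")
            collected.append(doc)
    return collected, ticks

docs = range(1_000_000)
# A selective "regex" that matches nothing: the deadline is never checked,
# so the scan runs to completion despite the 1000-tick budget.
hits, ticks = scan(docs, lambda d: False, deadline_ticks=1_000)
print(ticks)  # 1000000 -- 1000x over budget, and no timeout was ever raised
```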

Change 327204 had a related patch set uploaded (by Ema):
debug_proxy: bump proxy_read_timeout

https://gerrit.wikimedia.org/r/327204

Change 327204 merged by Ema:
debug_proxy: bump proxy_read_timeout

https://gerrit.wikimedia.org/r/327204

Change 326989 had a related patch set uploaded (by DCausse):
[cirrus] Reduce regex/default timeouts

https://gerrit.wikimedia.org/r/326989

Change 326989 merged by jenkins-bot:
[cirrus] Reduce regex/default timeouts

https://gerrit.wikimedia.org/r/326989

Mentioned in SAL (#wikimedia-operations) [2016-12-15T14:45:07Z] <dcausse@tin> Synchronized wmf-config/CirrusSearch-common.php: T152895 - [cirrus] Reduce regex/default timeouts (duration: 00m 40s)

This is still not fixed, even with 4x more time allowed for the client timeout.
I checked the threads, and yes, the regex is still running a few seconds after the client timeout is hit.
I'm afraid the shard timeout is not reliable for such queries. I'll ask for guidance on the Lucene mailing list and see if we can do something at the query level.
If not, we'll have to rely on max_inspect again to avoid any big issues on the cluster.
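For reference, a max-inspect-style circuit breaker of the kind mentioned works roughly like this (a hedged sketch with invented names and numbers, not the actual Cirrus implementation): instead of trusting a wall-clock shard timeout whose checks depend on matches, cap the number of documents the regex is allowed to inspect, checking on every document.

```python
MAX_INSPECT = 10_000  # hypothetical budget

def scan_with_budget(docs, matcher, max_inspect=MAX_INSPECT):
    """Abort deterministically after inspecting max_inspect docs, matches or not."""
    collected = []
    for inspected, doc in enumerate(docs, start=1):
        if inspected > max_inspect:
            raise RuntimeError(f"inspected more than {max_inspect} docs")
        if matcher(doc):
            collected.append(doc)
    return collected

# Even a regex that never matches is cut off once the budget is spent:
try:
    scan_with_budget(range(1_000_000), lambda d: False)
except RuntimeError as e:
    print(e)  # inspected more than 10000 docs
```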

Change 328644 had a related patch set uploaded (by DCausse):
Workaround timeout limitation

https://gerrit.wikimedia.org/r/328644

Smalyshev triaged this task as Medium priority.Jan 4 2017, 6:47 PM

Change 330955 had a related patch set uploaded (by DCausse):
Add support for extra regex timeout

https://gerrit.wikimedia.org/r/330955

Change 330955 merged by jenkins-bot:
Add support for extra regex timeout

https://gerrit.wikimedia.org/r/330955

Change 328644 merged by jenkins-bot:
Workaround timeout limitation

https://gerrit.wikimedia.org/r/328644

Do we want to release and deploy this for 2.x, or declare it done and ship it with the 5.x update? It doesn't seem super critical to ship early, but I'm open to disagreement.

I agree with you; I'm OK with declaring it done and activating the new, more accurate timeout when deploying Elasticsearch 5.

Deskana assigned this task to dcausse.

@dcausse It looks like the timeouts may not be working as expected again. I was testing something on wikitech: issuing the query insource:/@covers/ times out as expected and returns partial results, but searching insource:/@covers/ covers doesn't time out appropriately and instead returns a query failure. Monitoring the task through the Elasticsearch tasks API shows it takes ~20 minutes to finish.

The actual regex is expensive because @ apparently matches "any string" in the Lucene regex syntax and should have been escaped in my query. Still, the regex's inability to abort seems concerning.
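For illustration, escaping Lucene's regexp operator characters would look something like this (a hypothetical helper, not CirrusSearch code; the exact set of reserved characters depends on which optional Lucene regexp operators are enabled):

```python
# Characters Lucene's RegExp class can treat as operators, including the
# optional ones ("@" = any string, "#" = empty language, "~" = complement,
# "&" = intersection, "<>" = numeric interval). This set is an assumption
# for illustration; consult the Lucene RegExp docs for the authoritative list.
LUCENE_REGEX_SPECIALS = set('.?+*|{}[]()"\\#@&<>~')

def escape_lucene_regex(text: str) -> str:
    """Backslash-escape any character Lucene regexp syntax may interpret."""
    return ''.join('\\' + c if c in LUCENE_REGEX_SPECIALS else c for c in text)

print(escape_lucene_regex("@covers"))  # \@covers
print(escape_lucene_regex("#switch"))  # \#switch
```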

I started to investigate, and yes, the 20s shard timeout is not reached before the curl request times out.
Looking at the stack traces, I failed to see one using source_regex that stayed around for long.
Asking Elasticsearch to return 0 results (to disable the fetch phase), the time spent is reasonable (20-30s) and the timeout is properly triggered:

dcausse@elastic1041:~$ time curl -XPOST 'localhost:9200/labswiki_content/page/_search?timeout=20s' -d @q.json 
{"took":33967,"timed_out":true,"_shards":{"total":1,"successful":1,"failed":0},"hits":{"total":11,"max_score":0.0,"hits":[]},"suggest":{"suggest":[{"text":"covers","offset":0,"length":6,"options":[]}]}}
real	0m33.989s
user	0m0.012s
sys	0m0.004s

Removing the highlight query and re-enabling the fetch phase by asking for 20 docs, I get reasonable times as well:

dcausse@elastic1041:~$ time curl -XPOST 'localhost:9200/labswiki_content/page/_search?timeout=20s' -d @q_no_highlight.json 
{"took":35727,"timed_out":true,"_shards":{"total":1,"successful":1,"failed":0},"hits":{"total":11,"max_score":18.531656,"hits":[{"_index":"labswiki_content_1521968540","_type":"page","_id":"7559","_score":18.531656,"_source":{"redirect":[{"namespace":0,"title":"Packaging"}],"namespace_text":"","wiki":"labswiki"," [...] estamp":"2016-11-04T01:08:22Z"},"fields":{"text.word_count":[97744]}}]},"suggest":{"suggest":[{"text":"covers","offset":0,"length":6,"options":[]}]}}
real	0m35.750s
user	0m0.008s
sys	0m0.008s

So the problem is definitely in the highlighter.

OK, so it sounds like while this is semi-related to that patch, it's not the same issue. I'll open a new ticket linked to this one for the highlighting.