
insource (regex) search on en.wp's Special:Search does not work
Closed, ResolvedPublic

Description

The following search does not work: https://en.wikipedia.org/w/index.php?title=Special:Search&profile=advanced&profile=advanced&fulltext=Search&search=insource%3A%2F%23switch%2F&ns0=1 . A fellow editor had issues with this Special:Search query; the link produced no results for him or for me, even after minutes of "searching."

Event Timeline

Krenair subscribed.

This is more likely an issue with the backend than with the special page itself.

Aklapper renamed this task from Special:Search Issue to Specific search on en.wp's Special:Search does not work.Dec 12 2016, 12:01 PM
Aklapper updated the task description.

The query takes more than 40s here, then displays: "An error has occurred while searching: We could not complete your search due to a temporary problem. Please try again later."

Probably related to T134157. I wonder if the curl timeout is reached before the shard timeout, resulting in an error rather than a warning.
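The suspected interaction between the two timeouts can be sketched with a toy model (invented names and numbers, not MediaWiki or CirrusSearch code): if the HTTP client gives up before Elasticsearch's shard timeout fires, the caller sees a hard error, whereas if the shard timeout fires first, Elasticsearch returns partial results with timed_out=true, which surfaces as a mere warning.

```python
def outcome(client_timeout_s: float, shard_timeout_s: float, query_runtime_s: float) -> str:
    """Classify what the caller observes for a given query runtime (toy model)."""
    if query_runtime_s <= min(client_timeout_s, shard_timeout_s):
        return "full results"
    if shard_timeout_s < client_timeout_s:
        # Shard timeout fires first: partial results + timed_out flag -> warning.
        return "partial results (warning)"
    # The client (curl) gives up first: the connection is aborted -> hard error.
    return "error"

# With both timeouts at 40s, a 60s regex query aborts on the client side:
print(outcome(client_timeout_s=40, shard_timeout_s=40, query_runtime_s=60))  # error
# Raising the client timeout above the shard timeout turns that into a warning:
print(outcome(client_timeout_s=80, shard_timeout_s=40, query_runtime_s=60))  # partial results (warning)
```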

dcausse renamed this task from Specific search on en.wp's Special:Search does not work to insource (regex) search on en.wp's Special:Search does not work.Dec 12 2016, 4:26 PM

Change 326464 had a related patch set uploaded (by DCausse):
Use CirrusSearchClientSideSearchTimeout for connection timeout

https://gerrit.wikimedia.org/r/326464

Change 326464 merged by jenkins-bot:
Use CirrusSearchClientSideSearchTimeout for connection timeout

https://gerrit.wikimedia.org/r/326464

Change 326909 had a related patch set uploaded (by DCausse):
Use CirrusSearchClientSideSearchTimeout for connection timeout

https://gerrit.wikimedia.org/r/326909

Change 326909 merged by jenkins-bot:
Use CirrusSearchClientSideSearchTimeout for connection timeout

https://gerrit.wikimedia.org/r/326909

Change 326948 had a related patch set uploaded (by Hashar):
Bump CirrusSearch

https://gerrit.wikimedia.org/r/326948

I've tried hitting a Varnish backend in eqiad directly with curl, as follows:

curl -v -H 'X-Forwarded-Proto: https' -H 'Host: en.wikipedia.org' 'http://localhost:3128/w/index.php?title=Special:Search&profile=advanced&profile=advanced&fulltext=Search&search=insource%3A%2F%23switch%2F&ns0=1'

The 200 response came back after 40.239s, which is roughly the same amount of time a request takes to go through the whole caching stack (nginx, a few Varnishes). That suggests the problem lies at the application level.

@ema, yes, there's a bug in the application right now: the timeout between MediaWiki and Elasticsearch is wrongly set to 40s.
The problem appeared when I tried to deploy a fix to mwdebug that actually uses an 80s timeout on the backend: nginx started to return a 504.

The fix https://gerrit.wikimedia.org/r/#/c/326464/ does not seem to solve the issue.
I fear that the shard-level timeout is not very accurate. Looking at the TimeLimitingCollector implementation in Lucene, I understand that the timeout check is done only when a doc is collected or when we start collecting a new shard. It's very likely that if the regex does not match many documents and is currently scanning a very large segment, the timeout check will be done far too late.
Currently we set the shard-level timeout to 40s with a client timeout of 80s. Doubling seems generous, but the regex in this ticket still hits the client timeout rather than the shard timeout.
I'm not sure we can do much on the Cirrus side; I'll have a look at the plugin to see if I can do something.
Still unsure whether we need to re-add the max_inspect circuit breaker in the meantime.
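The collector behavior described above can be illustrated with a toy model (not actual Lucene code): a scan that checks its deadline only when a document matches will blow far past its budget whenever matches are rare, which is exactly the sparse-regex case.

```python
def scan(docs, matcher, deadline_ticks):
    """Scan docs; check the 'timeout' only when a doc is collected (toy model)."""
    collected, ticks = [], 0
    for doc in docs:
        ticks += 1                      # work happens on every doc...
        if matcher(doc):
            if ticks > deadline_ticks:  # ...but the deadline is checked only here
                raise TimeoutError(f"timed out after {ticks} ticks")
            collected.append(doc)
    return collected, ticks

docs = range(1_000_000)
# A selective "regex" that matches nothing: the deadline is never checked,
# so the scan runs to completion despite the 1000-tick budget.
hits, ticks = scan(docs, lambda d: False, deadline_ticks=1_000)
print(ticks)  # 1000000 -- 1000x over budget, and no timeout was ever raised
```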

Change 327204 had a related patch set uploaded (by Ema):
debug_proxy: bump proxy_read_timeout

https://gerrit.wikimedia.org/r/327204

Change 327204 merged by Ema:
debug_proxy: bump proxy_read_timeout

https://gerrit.wikimedia.org/r/327204

Change 326989 had a related patch set uploaded (by DCausse):
[cirrus] Reduce regex/default timeouts

https://gerrit.wikimedia.org/r/326989

Change 326989 merged by jenkins-bot:
[cirrus] Reduce regex/default timeouts

https://gerrit.wikimedia.org/r/326989

Mentioned in SAL (#wikimedia-operations) [2016-12-15T14:45:07Z] <dcausse@tin> Synchronized wmf-config/CirrusSearch-common.php: T152895 - [cirrus] Reduce regex/default timeouts (duration: 00m 40s)

This is still not fixed, even with 4x more time allowed for the client timeout.
I checked the threads, and yes, the regex is still running a few seconds after the client timeout is hit.
I'm afraid the shard timeout is not reliable for such queries. I'll ask for guidance on the Lucene mailing list and see if we can do something at the query level.
If not, we'll have to rely on max_inspect again to avoid any big issues on the cluster.
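For reference, a max-inspect-style circuit breaker of the kind mentioned works roughly like this (a hedged sketch with invented names and numbers, not the actual Cirrus implementation): instead of trusting a wall-clock shard timeout whose checks depend on matches, cap the number of documents the regex is allowed to inspect, checking on every document.

```python
MAX_INSPECT = 10_000  # hypothetical budget

def scan_with_budget(docs, matcher, max_inspect=MAX_INSPECT):
    """Abort deterministically after inspecting max_inspect docs, matches or not."""
    collected = []
    for inspected, doc in enumerate(docs, start=1):
        if inspected > max_inspect:
            raise RuntimeError(f"inspected more than {max_inspect} docs")
        if matcher(doc):
            collected.append(doc)
    return collected

# Even a regex that never matches is cut off once the budget is spent:
try:
    scan_with_budget(range(1_000_000), lambda d: False)
except RuntimeError as e:
    print(e)  # inspected more than 10000 docs
```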

Change 328644 had a related patch set uploaded (by DCausse):
Workaround timeout limitation

https://gerrit.wikimedia.org/r/328644

Smalyshev triaged this task as Medium priority.Jan 4 2017, 6:47 PM

Change 330955 had a related patch set uploaded (by DCausse):
Add support for extra regex timeout

https://gerrit.wikimedia.org/r/330955

Change 330955 merged by jenkins-bot:
Add support for extra regex timeout

https://gerrit.wikimedia.org/r/330955

Change 328644 merged by jenkins-bot:
Workaround timeout limitation

https://gerrit.wikimedia.org/r/328644

Do we want to release and deploy this for 2.x, or declare it done and ship it with the 5.x update? It doesn't seem super critical to ship early, but I'm open to disagreement.

I agree with you; I'm OK with declaring it done and activating the new, more accurate timeout when deploying Elasticsearch 5.

Deskana assigned this task to dcausse.

@dcausse It looks like the timeouts may not be working as expected again. I was testing something on wikitech: issuing the query insource:/@covers/ times out as expected and returns partial results, but searching insource:/@covers/ covers doesn't time out appropriately and instead returns a query failure. Monitoring the task through the Elasticsearch tasks API shows it takes ~20 minutes to finish.

The actual regex is expensive because @ apparently matches "any string" in the Lucene regex syntax and should have been escaped in my query. Still, the regex's inability to abort seems concerning.
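For illustration, escaping Lucene's regexp operator characters would look something like this (a hypothetical helper, not CirrusSearch code; the exact set of reserved characters depends on which optional Lucene regexp operators are enabled):

```python
# Characters Lucene's RegExp class can treat as operators, including the
# optional ones ("@" = any string, "#" = empty language, "~" = complement,
# "&" = intersection, "<>" = numeric interval). This set is an assumption
# for illustration; consult the Lucene RegExp docs for the authoritative list.
LUCENE_REGEX_SPECIALS = set('.?+*|{}[]()"\\#@&<>~')

def escape_lucene_regex(text: str) -> str:
    """Backslash-escape any character Lucene regexp syntax may interpret."""
    return ''.join('\\' + c if c in LUCENE_REGEX_SPECIALS else c for c in text)

print(escape_lucene_regex("@covers"))  # \@covers
print(escape_lucene_regex("#switch"))  # \#switch
```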

I started to investigate, and yes, the 20s shard timeout is not reached before the curl request times out.
Looking at the stack traces, I failed to see one using source_regex that stayed around for long.
Asking Elasticsearch to return 0 results (to disable the fetch phase), the time spent is reasonable (20-30s) and the timeout is properly triggered:

dcausse@elastic1041:~$ time curl -XPOST 'localhost:9200/labswiki_content/page/_search?timeout=20s' -d @q.json 
{"took":33967,"timed_out":true,"_shards":{"total":1,"successful":1,"failed":0},"hits":{"total":11,"max_score":0.0,"hits":[]},"suggest":{"suggest":[{"text":"covers","offset":0,"length":6,"options":[]}]}}
real	0m33.989s
user	0m0.012s
sys	0m0.004s

Removing the highlight query and re-enabling the fetch phase by asking for 20 docs, I get reasonable times as well:

dcausse@elastic1041:~$ time curl -XPOST 'localhost:9200/labswiki_content/page/_search?timeout=20s' -d @q_no_highlight.json 
{"took":35727,"timed_out":true,"_shards":{"total":1,"successful":1,"failed":0},"hits":{"total":11,"max_score":18.531656,"hits":[{"_index":"labswiki_content_1521968540","_type":"page","_id":"7559","_score":18.531656,"_source":{"redirect":[{"namespace":0,"title":"Packaging"}],"namespace_text":"","wiki":"labswiki"," [...] estamp":"2016-11-04T01:08:22Z"},"fields":{"text.word_count":[97744]}}]},"suggest":{"suggest":[{"text":"covers","offset":0,"length":6,"options":[]}]}}
real	0m35.750s
user	0m0.008s
sys	0m0.008s

So the problem is definitely in the highlighter.

OK, so it sounds like while this is semi-related to that patch, it's not the same issue. I'll open a new ticket linked to this one for the highlighting.