Regexes in search queries can sometimes return fewer search results than they should
Closed, Resolved · Public

Description

In de.wikipedia, searching for insource:/[0-9]°C/ gives only 3 results, while searching for insource:/0°C/ gives 442.

Event Timeline

FriedhelmW raised the priority of this task from to Needs Triage.
FriedhelmW updated the task description. (Show Details)
FriedhelmW added a project: CirrusSearch.
FriedhelmW added a subscriber: FriedhelmW.
Restricted Application added a project: Discovery. · Jul 23 2015, 2:13 PM
Restricted Application added a subscriber: Aklapper.
Restricted Application added a subscriber: Luke081515. · Jul 24 2015, 9:20 AM
Ironholds moved this task from Needs triage to Search on the Discovery board. · Aug 4 2015, 8:18 AM
dcausse added a subscriber: dcausse. · Edited · Sep 8 2015, 1:01 PM

I'm just adding a technical note to record the current state of my investigations concerning this issue.

The insource regex search uses a two-pass technique. The first pass tries to speed up the query by using a 3-gram index to find candidate documents. The second pass runs the regex on those candidates (the recheck phase).
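
To make the two passes concrete, here is a minimal Python sketch. It is only an illustration of the idea, not the CirrusSearch implementation: the names `trigrams`, `build_index` and `two_pass_search` are hypothetical, and the sketch assumes the caller already knows a literal string that every match must contain.

```python
import re
from collections import defaultdict

def trigrams(text):
    """Return the set of all 3-character substrings of text."""
    return {text[i:i + 3] for i in range(len(text) - 2)}

def build_index(docs):
    """Map each trigram to the ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for gram in trigrams(text):
            index[gram].add(doc_id)
    return index

def two_pass_search(docs, index, literal, pattern):
    """First pass: narrow the candidates using trigrams of a required literal.
    Second pass (recheck): run the real regex on the candidates only."""
    grams = trigrams(literal)
    if grams:
        candidates = set.intersection(*(index.get(g, set()) for g in grams))
    else:
        # Fewer than 3 characters: no trigram to look up, so every
        # document has to be rechecked.
        candidates = set(docs)
    return sorted(d for d in candidates if re.search(pattern, docs[d]))

docs = {
    1: "Wasser gefriert bei 0°C.",
    2: "Es waren 25°C im Schatten.",
    3: "Kein Treffer hier.",
}
index = build_index(docs)
print(two_pass_search(docs, index, "0°C", r"0°C"))
```

In this toy model a literal shorter than 3 characters yields no trigrams, so the first pass cannot narrow anything and the recheck has to look at every document.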

I think this query triggers multiple insource limitations in the first pass:

  1. the regular expression must contain at least 3 contiguous characters in order to run over the 3-gram index
  2. a character range won't be expanded if it matches more than 4 characters; here [0-9] would have to be expanded to 10 characters, so the limit prevents the expansion.
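
The interaction of these two limitations can be modelled roughly as follows. This is a simplified sketch under stated assumptions: it only understands literal characters and ranges of the form `[x-y]`, it assumes that a range wider than the limit breaks the run of usable characters, and `parse_simple` and `can_use_trigram_index` are hypothetical names rather than anything from the actual plugin.

```python
MAX_EXPAND = 4  # mirrors the maxExpand default mentioned in this comment

def parse_simple(pattern):
    """Parse a tiny regex subset: literal characters and [x-y] ranges only.
    Each token is the list of characters that position may match."""
    tokens = []
    i = 0
    while i < len(pattern):
        if pattern[i] == "[":
            end = pattern.index("]", i)
            lo, hi = pattern[i + 1], pattern[i + 3]  # assumes the form [x-y]
            tokens.append([chr(c) for c in range(ord(lo), ord(hi) + 1)])
            i = end + 1
        else:
            tokens.append([pattern[i]])
            i += 1
    return tokens

def can_use_trigram_index(pattern, max_expand=MAX_EXPAND):
    """True if the pattern has 3 contiguous positions that are each either
    a literal or a range small enough to be expanded."""
    run = 0
    for choices in parse_simple(pattern):
        if len(choices) <= max_expand:
            run += 1
            if run >= 3:
                return True
        else:
            run = 0  # assumption: an unexpandable range breaks the run
    return False

for p in ["0°C", "[0-9]°C", "[0-9]°C ", "[0-3]°C"]:
    print(repr(p), can_use_trigram_index(p))
```

Under this model `0°C` can use the index, while `[0-9]°C` cannot: the range is too wide to expand and the remaining `°C` is only 2 characters long.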

There are two workarounds, but neither of them fully satisfies the original query.

For example:

  • we can run insource:/[0-9]°C / with a trailing space, which works around the first limitation but only returns results where C is followed by a space.
  • we can limit the expansion to 4 chars by running multiple queries:
    1. /[0-3]°C/
    2. /[4-7]°C/
    3. /[8-9]°C/
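
The second workaround can be generated mechanically. The helper below is a hypothetical sketch (`split_range` is not part of CirrusSearch) that cuts a character range into sub-ranges of at most 4 characters and builds one query per sub-range:

```python
def split_range(lo, hi, max_expand=4):
    """Split the range lo-hi into [x-y] classes of at most max_expand characters."""
    chunks = []
    start = ord(lo)
    while start <= ord(hi):
        end = min(start + max_expand - 1, ord(hi))
        chunks.append(f"[{chr(start)}-{chr(end)}]")
        start = end + 1
    return chunks

queries = [f"insource:/{cls}°C/" for cls in split_range("0", "9")]
for q in queries:
    print(q)
```

The results of the individual queries then have to be merged by hand, which is why this does not really replace the original single query.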

To fix this issue properly we would have to increase the limits (the maxExpand param, which defaults to 4). Unfortunately these limits exist to guarantee that a single query cannot hurt the global performance of the system.
If the insource query cannot run with the accelerated first pass, it inspects only 10000 pages and then stops. This should explain why this query returns only 2 or 3 results: only 2 or 3 pages matched the regex within the first 10000 pages inspected.
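
The effect of that cap can be illustrated with a small sketch. It is an assumption-laden model, not the real code path: `unaccelerated_search` and its `max_inspect` parameter are hypothetical names, and the order in which pages are inspected is simply the order of the input list.

```python
import re

def unaccelerated_search(pages, pattern, max_inspect=10000):
    """Run the regex on at most max_inspect pages, then stop,
    returning whatever matched so far."""
    regex = re.compile(pattern)
    hits = []
    for inspected, (title, text) in enumerate(pages):
        if inspected >= max_inspect:
            break  # cap reached: later pages are never looked at
        if regex.search(text):
            hits.append(title)
    return hits

pages = [("A", "5°C"), ("B", "nothing"), ("C", "7°C"), ("D", "9°C")]
# With a cap of 2 only pages A and B are inspected, so only "A" is found.
print(unaccelerated_search(pages, r"[0-9]°C", max_inspect=2))
```

Pages C and D also match, but they are never reached, so the result list is silently incomplete.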

ksmith set Security to None.
Deskana renamed this task from Too few search results to Regexes in search queries can sometimes return fewer search results than they should. · Dec 4 2015, 5:24 AM
Deskana triaged this task as Low priority.
Deskana added a subscriber: Deskana.

Although this issue has been rated as "low priority", I just want to mention that CirrusSearch is important for many people who want to improve wikitext by identifying and working on systematic errors. This is especially needed on Wikisource and Commons, because only a few people are working on a huge amount of wikitext. Therefore, it would be nice to see some activity on this issue.

Thanks for enquiring. Discovery (the primary maintainer of CirrusSearch) is actually working on CirrusSearch quite a lot right now, but our primary focus is on readers as they have historically been served incredibly poorly by our search. Due to resource constraints, we are unlikely to shift this priority any time soon. Sorry for the inconvenience.

Restricted Application added a project: Discovery-Search. · May 2 2016, 9:16 AM
Jonesey95 added a comment. · Edited · Nov 3 2016, 5:36 AM

Just another example, if it helps to find a solution to this problem.

Currently, searching for this regex on en.WP, in the article namespace, returns 9 articles:

insource:/cite [^\}]*\| *city *= *[a-z]+/i

Searching for this string, which should be a subset of the above search, finds 301 articles:

insource:/cite [^\}]*\|city= *[a-z]+/i
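
The subset claim can be checked locally with Python's `re` module, as a rough sanity check. The sample strings below are made up for illustration, and Python's regex dialect is not identical to the one used by insource, so this only shows the relationship between the two patterns:

```python
import re

broad = re.compile(r"cite [^\}]*\| *city *= *[a-z]+", re.IGNORECASE)
narrow = re.compile(r"cite [^\}]*\|city= *[a-z]+", re.IGNORECASE)

def classify(text):
    """Return (matches_broad, matches_narrow) for one piece of wikitext."""
    return bool(broad.search(text)), bool(narrow.search(text))

samples = [
    "{{cite book |city= London}}",   # no spaces around "city": both match
    "{{cite book | city = Paris}}",  # spaces around "city": only the broad one
    "{{cite book |title= Example}}", # no city parameter: neither
]
for s in samples:
    print(classify(s), s)
```

Anything the narrower pattern matches is also matched by the broader one, so a correct search with the first regex should return at least as many articles as the second.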

The commenter above is correct that when insource searches work correctly, a few editors are able to make quick improvements to thousands of articles, which improves readers' experience.

Can we at least have an error message stating when an insource search is showing incomplete results, with a link to a Help page? Saying that there are only 9 results for the first search, without returning any sort of error, is misleading. Returning an error message is suggested in T134157.

Elitre added a subscriber: Elitre. · Jan 19 2017, 5:58 PM
Kotz added a subscriber: Kotz. · Jan 22 2017, 5:22 PM
He7d3r added a subscriber: He7d3r. · Feb 12 2017, 7:52 PM
dcausse closed this task as Resolved. · Apr 4 2017, 12:53 PM
dcausse claimed this task.

Marking as resolved because we now display a warning when the search times out.
We now have more control over the time allowed for insource (20 sec per server); we could raise it in the future if the underlying infrastructure can support it.