In de.wikipedia, searching for insource:/[0-9]°C/ gives only 3 results, while searching for insource:/0°C/ gives 442.
|Resolved||dcausse||T106685 Regexes in search queries can sometimes return fewer search results than they should|
|Resolved||EBernhardson||T134157 Use builtin timeout support instead of max_inspect for insource queries|
|Resolved||EBernhardson||T149142 Create messages for users when their search using advanced syntax doesn't return stuff|
|Resolved||Deskana||T103289 CirrusSearch: More search results when narrowing down search term|
Mentioned In
- T73098: Search using insource and regex returns irregular and different set of articles each time
- T127788: mwgrep and "insource:" search is missing lots of pages in its index
- T103289: CirrusSearch: More search results when narrowing down search term
Mentioned Here
- T134157: Use builtin timeout support instead of max_inspect for insource queries
I'm just adding a technical note to record the current state of my investigations concerning this issue.
The insource regex search uses a two-pass technique. The first pass tries to speed up the query by using a 3-gram index to find candidate documents. The second pass runs the regex on the matching docs (the recheck phase).
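To make the two passes concrete, here is a minimal sketch in Python. It is not the CirrusSearch implementation: the function names, the in-memory index, and the idea of passing the required literal separately from the regex are all assumptions made for illustration.

```python
import re
from collections import defaultdict


def trigrams(text):
    """Return the set of all 3-character substrings of text."""
    return {text[i:i + 3] for i in range(len(text) - 2)}


def build_index(docs):
    """Map each trigram to the ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for gram in trigrams(text):
            index[gram].add(doc_id)
    return index


def two_pass_search(docs, index, literal, pattern):
    """Hypothetical two-pass search.

    First pass: keep only documents containing every trigram of a literal
    that the regex requires. Second pass (recheck): run the real regex on
    the surviving candidates.
    """
    candidates = set(docs)
    for gram in trigrams(literal):
        candidates &= index.get(gram, set())
    regex = re.compile(pattern)
    return sorted(d for d in candidates if regex.search(docs[d]))


docs = {
    1: "Water freezes at 0°C.",
    2: "It was 25°C today.",
    3: "No temperature here.",
}
index = build_index(docs)
hits = two_pass_search(docs, index, "0°C", r"0°C")
```

Note that a literal shorter than 3 characters yields no trigrams, so in this sketch the first pass filters nothing and every document has to be rechecked.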
I think this query triggers multiple insource limitations in the first pass:
- the regular expression must contain at least 3 contiguous characters in order to run over the 3-gram index
- a character range won't be expanded if it matches more than 4 characters; here [0-9] would expand to 10 characters, so the limit prevents the expansion.
There are two workarounds, but neither fully satisfies the original query.
For example:
- we can run insource:/[0-9]°C / with a trailing space, which works around the first limitation but only returns results where C is followed by a space.
- we can limit the expansion to 4 chars by running multiple queries:
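One way such a split could be generated is sketched below, assuming that a character class of up to 4 characters is still expanded. The helper name split_range is hypothetical, and the results of the individual queries would have to be merged by hand.

```python
def split_range(chars, max_expand=4):
    """Split a string of characters into classes of at most max_expand characters."""
    chunks = [chars[i:i + max_expand] for i in range(0, len(chars), max_expand)]
    return ["[" + chunk + "]" for chunk in chunks]


# Build one insource query per character class for the digits 0-9.
queries = ["insource:/" + cls + "°C/" for cls in split_range("0123456789")]

for query in queries:
    print(query)
```

With the default of 4 this produces three queries, covering the digits 0 to 3, 4 to 7, and 8 to 9.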
To fix this issue properly, we would have to increase the limits (the maxExpand parameter, which defaults to 4). Unfortunately, these limits exist to guarantee that a single query cannot hurt the global performance of the system.
If insource cannot run the accelerated first pass, it inspects only 10000 pages and then stops. This would explain why this query returns only 2 or 3 results: only 2 or 3 pages matched the regex within the first 10000 pages inspected.
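The effect of that cap can be illustrated with a small Python sketch. This is a simplified model, not the actual CirrusSearch code; the function name capped_scan and the truncation flag are assumptions, with max_inspect borrowed from the title of T134157.

```python
import re


def capped_scan(pages, pattern, max_inspect=10000):
    """Run the regex over at most max_inspect pages.

    Returns the matching titles and a flag that is True when the scan
    stopped before all pages were inspected.
    """
    regex = re.compile(pattern)
    matches = []
    inspected = 0
    for title, text in pages:
        if inspected >= max_inspect:
            # The cap was reached with pages still left: results are incomplete.
            return matches, True
        inspected += 1
        if regex.search(text):
            matches.append(title)
    return matches, False


pages = [("A", "1°C"), ("B", "no match"), ("C", "2°C")]
partial = capped_scan(pages, r"[0-9]°C", max_inspect=2)
complete = capped_scan(pages, r"[0-9]°C", max_inspect=10)
```

In this model the number of results depends on where the matching pages happen to fall among the inspected ones, which is why the flag matters: without it, a truncated scan is indistinguishable from a complete one.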
Although this issue has been rated as "low priority", I just want to mention that CirrusSearch is important for many people who want to improve wikitext by identifying and working on systematic errors. This is especially needed on Wikisource and Commons, because there are only a few people working on a huge amount of wikitext. Therefore, it would be nice to see some activity on this issue.
Thanks for enquiring. Discovery (the primary maintainer of CirrusSearch) is actually working on CirrusSearch quite a lot right now, but our primary focus is on readers as they have historically been served incredibly poorly by our search. Due to resource constraints, we are unlikely to shift this priority any time soon. Sorry for the inconvenience.
Just another example, if it helps to find a solution to this problem.
Currently, searching for this regex on en.WP, in the article namespace, returns 9 articles:
insource:/cite [^\}]*\| *city *= *[a-z]+/i
Searching for this string, which should be a subset of the above search, finds 301 articles:
insource:/cite [^\}]*\|city= *[a-z]+/i
The commenter above is correct that when insource searches work correctly, a few editors are able to make quick improvements to thousands of articles, which improves readers' experience.
Can we at least have an error message stating when an insource search is showing incomplete results, with a link to a Help page? Saying that there are only 9 results for the first search, without returning any sort of error, is misleading. Returning an error message is suggested in T134157.
Marking as resolved because we now display a warning when the search times out.
We now have more control over the time allowed for insource queries (20 seconds per server), and we could raise it in the future if the underlying infrastructure can support it.