
insource cannot find repeating words
Closed, Resolved · Public

Description

There is no way to filter insource:/"<big></big>"/, for example. Filters for regex are very important to have.

The search "big big" will not look insource, so cannot act as a filter. (But it can find repeating words.)

The search insource:"big" and insource:"big big big" are equivalent, and a weak filter.

Unlike "big big big", which uses proximity zero, insource turns ''off'' proximity.

If insource at least had proximity zero, it could find phrases, which would go a long way toward meeting the filtering needs of regex.

As it is, the quotes are misleading, as insource cannot even find two words next to each other. Insource:"big big" is the same as insource:big insource:big. The result is the inability to find repeating words.
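
The difference described above can be illustrated with a toy model. This is only a sketch in Python, not CirrusSearch code: the tokenizer and the two function names (matches_all_terms, matches_phrase) are assumptions made for illustration. It shows why a query split into separate terms collapses repeated words, while a proximity-zero phrase does not.

```python
import re

def tokenize(text):
    """Lowercase word tokens; a stand-in for the real analyzer (assumption)."""
    return re.findall(r"\w+", text.lower())

def matches_all_terms(text, query):
    """Model of the split behaviour: every query word must occur somewhere."""
    words = set(tokenize(text))
    return all(term in words for term in tokenize(query))

def matches_phrase(text, query):
    """Model of proximity zero: query words must occur adjacent and in order."""
    words, terms = tokenize(text), tokenize(query)
    n = len(terms)
    return any(words[i:i + n] == terms for i in range(len(words) - n + 1))

page = "a <big>headline</big> here"
print(matches_all_terms(page, "big big"))  # same answer as for "big" alone
print(matches_phrase(page, "big big"))     # the two "big" tokens are not adjacent
```

In this model a page containing <big></big> satisfies matches_phrase for "big big", while a page with text between the tags satisfies only matches_all_terms.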

Maybe it's just a feature request, but we're talking about supporting a rare public tool, regex. These things would run much much faster if they had the ability to filter word phrases.

Insource should at least set proximity to zero.

Event Timeline

Cpiral created this task. · Sep 3 2015, 12:37 AM
Cpiral updated the task description.
Cpiral raised the priority of this task from to Needs Triage.
Cpiral added a subscriber: Cpiral.
Restricted Application added a subscriber: Aklapper. · Sep 3 2015, 12:37 AM
Cpiral updated the task description. · Sep 3 2015, 1:43 AM
Cpiral set Security to None.
Krinkle added a subscriber: Krinkle. · Sep 3 2015, 2:57 AM

I'm not sure if this is useful, but it seems insource:/"<big><big>"/ does correctly return results where two <big> tags were used next to each other.

insource:/"<big><big>"/ – Wikipedia search

  1. Romanian language .. -size:95%;" |+<big><big>'''..
  2. Torino F.C. ..rowspan=4|<big><big>'''1º'''</big></big> || ''' ..

I think you can re-activate the proximity by escaping the quotes with insource:"\"big big\""

insource:"\"big big\"" (enwiki)
If you are looking for the pattern <big></big>, then the full filtered query should be: insource:/\<big\>\<\/big\>/ insource:"\"big big\""
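
A small helper can produce the escaped regex shown above. This is a hedged sketch: the function names and the set of characters treated as special are assumptions for illustration, not taken from CirrusSearch documentation, so the set should be checked against the regex syntax the search backend actually accepts.

```python
# Characters escaped by this sketch (an assumed set; it includes "/" because
# slashes delimit the insource regex).
SPECIAL_CHARS = set('.?+*|{}[]()"\\#@&<>~/')

def escape_for_insource_regex(literal):
    """Backslash-escape each assumed-special character in `literal`."""
    return "".join("\\" + ch if ch in SPECIAL_CHARS else ch for ch in literal)

def workaround_query(literal, filter_words):
    """Regex clause plus the escaped-quote phrase filter from this comment."""
    regex = escape_for_insource_regex(literal)
    return f'insource:/{regex}/ insource:"\\"{filter_words}\\""'

print(workaround_query("<big></big>", "big big"))
```

For the <big></big> pattern this prints the same query as the one written out by hand above.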

Cpiral added a comment. · Sep 3 2015, 5:37 PM

You're right.

Just saw T110855.

Cpiral added a comment. · Edited · Sep 3 2015, 5:41 PM

I've been operating on the principle that there is no comprehensive, authoritative reference manual for Cirrus Search. Do I really need to go to Phabricator to get answers like this? Thanks.

Krenair added a subscriber: Krenair. · Sep 4 2015, 8:16 PM

I consider https://www.mediawiki.org/wiki/Help:CirrusSearch to be the authoritative manual, though maybe it isn't comprehensive.

But I don't think this is working as designed, so authoritative documentation wouldn't cover this. It started behaving this way just recently.

Cpiral added a comment. · Edited · Sep 7 2015, 1:06 AM

I'll watch here for a resolution, thank you.

I thought it was a feature. I maintain search links and search templates and search documentation mostly on Wikipedia, and already updated MediaWiki Help:Search, and some Wikipedia search links. For now the documentation describes escaping the inner quotes.

I'd like to reinstate things as soon as they are resolved. Thanks.

Thanks for your feedback.

I bisected the code and it looks like a patch that breaks the default insource behavior was committed on Aug 20 (https://gerrit.wikimedia.org/r/#/c/207747/).
I have to dig into more details to see if we can revert to the old behavior without introducing nasty side effects, mainly because the old behavior was unreliable in some cases.

Change 236544 had a related patch set uploaded (by Thiemo Mättig (WMDE)):
Fix incategory, insource and intitle with double quoted values

https://gerrit.wikimedia.org/r/236544

Cpiral added a comment. · Sep 9 2015, 8:27 PM

https://gerrit.wikimedia.org/r/236544 says

The problem is that some functions require the double quotes to be present.
In my opinion all these cases should be unified to be similar:
They should accept multi-word queries with spaces and not split these into separate queries.

E.g. hastemplate:"search link" does not "split these queries" into hastemplate:search and hastemplate:link,
and incategory:"History by period" does not "split these queries".

The same behavior should be "unified to be similar" for insource, and intitle.

Today regex queries are very inefficient because queries like

insource:"search link" insource:/search link/

are filtered as if the phrase had been split into

insource:search insource:link

Today the /regex/ has to crawl character by character through those myriad unwanted search results.
The workaround currently documented will now be removed from the documentation before queries like

insource:"\"search link\""

become entrenched in many places, only to be redone later when the fix arrives.

Regexp searches are essentially shut down on Wikipedia. Effects of this bug are

  • 690 thousand pages versus 1445 pages for insource "search link"
  • 73 thousand versus 75 pages for insource "help desk searches"
  • 117 thousand versus 22 pages for insource "search deletion discussions"
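
Why these candidate counts matter can be sketched with a toy two-stage search: a cheap word-level filter picks candidate pages, and the regex then scans only those candidates. The corpus, the function names, and the use of Python's re module are all assumptions for illustration; this is not how CirrusSearch is implemented.

```python
import re

PAGES = {  # hypothetical mini-corpus
    "A": "use the search link template",
    "B": "a link to the search page",
    "C": "search for a broken link here",
    "D": "nothing relevant",
}

def words(text):
    return re.findall(r"\w+", text.lower())

def split_filter(text, query):
    """Bugged behaviour: each word may appear anywhere on the page."""
    return set(words(query)) <= set(words(text))

def phrase_filter(text, query):
    """Intended behaviour: the words must be adjacent and in order."""
    w, q = words(text), words(query)
    return any(w[i:i + len(q)] == q for i in range(len(w) - len(q) + 1))

def regex_search(pattern, prefilter, query):
    """Return (number of pages the regex must scan, titles that match)."""
    candidates = [t for t, text in PAGES.items() if prefilter(text, query)]
    hits = [t for t in candidates if re.search(pattern, PAGES[t])]
    return len(candidates), hits

print(regex_search(r"search link", split_filter, "search link"))
print(regex_search(r"search link", phrase_filter, "search link"))
```

Both calls find the same page, but the split filter makes the regex scan three pages where the phrase filter makes it scan one.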

Regexp searches have to crawl character-by-character through each page.
Documenting the bug might be necessary. Can we get a time frame on the code review?

@Deskana can we raise the priority of this bug?

The patch looks good, I just have to make sure that there is no side effect.

Ya can expect users to search for things like /2+122/ or /cat (and dog)/.
Ya can even expect them to follow the simple formula:
:insource:"2+122" insource:/"2+122"/
:insource:"cat (and dog)" insource:/"cat (and dog)"/
Which always works to perfectly and instantly filter.
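
The simple formula can be generated mechanically. The sketch below is an illustration only: the function name is hypothetical, and it deliberately rejects strings containing a double quote or a slash, since those need escaping that the formula does not cover.

```python
def exact_string_query(literal):
    """Build insource:"X" insource:/"X"/ for a literal string X."""
    if '"' in literal or "/" in literal:
        raise ValueError("double quotes and slashes need escaping; not handled here")
    return f'insource:"{literal}" insource:/"{literal}"/'

print(exact_string_query("2+122"))
print(exact_string_query("cat (and dog)"))
```

The first clause is the word-level filter and the second is the exact-string regex, so the same literal is typed twice and nothing in it needs escaping.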

But we can expect most of them to stop reading the formula for instant and courteous results when they get to
:insource:"\"2+122\"" insource:/"2+122"/
:insource:"\"cat (and dog)\"" insource:/"cat (and dog)"/
escaping inner quotes. Nothing else they learn needs escaping because of the following reason.

There are two entirely different reasons for using regex.
One is for "exact string" match, which turns off metacharacters.
These folks will not bother to cozy up to escaped quotes for a few weeks or months.

The other type of regex use is for the metacharacters.
Far fewer will use metacharacters.
They will be power users,
looking for obsolete template usage and obsolete HTML
so as to use AWB (auto wiki browser) to fix things.
They will find that there is no way to filter their regexp
because only insource can match what insource can match.

Oh they can be creative and use hastemplate and wikitext matches and the like.
But they are also the ones building search links, the template calls that stay there.
These search links will be halted in both production and use.
Production because all produced search links will be obsolete when the bug is fixed.
Use because the old search links are so slow, and even timeout.

And who wants to document Cirrus Search regexp and insource when it doesn't work?
It's been two years and no one has documented it until very recently: myself in an almost-completed draft.

I'm working on a draft for CirrusSearch on Wikipedia and developed templates to help regex development,
such as finding template usage for specific parameters that are wondered about.
It's all on hold until I know what to expect from this triage.

Please raise the priority. Thanks.

If we apply the patch, the behavior will be as follows:

  • insource:"2+122" : perform a proximity search (this case was broken)
  • insource:"\"2+122\"": perform a proximity search (escaped double quotes will be ignored)
  • insource:/"2+122"/: perform a regex search matching: 2+122 (double quotes will disable regular expression syntax). This case was not broken.
  • insource:/2\+122/: perform a regex search matching: 2+122 (without double quotes you have to escape all special chars). This case was not broken.
  • insource:/\""2+122"\"/: perform a regex search matching: "2+122" (if you want to match double quotes you have to escape them). This case was not broken.
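
The quoted-literal and hand-escaped regex forms in the list target the same string. As a rough analogy only, using Python's re module rather than the search backend's regex engine: re.escape plays the role of the double quotes that disable regex syntax, and the unescaped pattern shows what goes wrong without either.

```python
import re

text = "the sum 2+122 appears here, and so does 22122"

quoted_like = re.escape("2+122")   # analogous to insource:/"2+122"/
hand_escaped = r"2\+122"           # analogous to insource:/2\+122/
unescaped = r"2+122"               # here "+" is a repetition operator

print(re.findall(quoted_like, text))   # finds the literal 2+122
print(re.findall(hand_escaped, text))  # same match as above
print(re.findall(unescaped, text))     # matches 22122, not the literal
```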

Concerning intitle the problem was the same but because title is a short string it was maybe not noticed:

  • intitle:"foo bar": perform a proximity search (this case was broken)
  • intitle:"\"foo bar\"": perform a proximity search (escaped double quotes will be ignored)

Concerning hastemplate and incategory, it's different because we do not tokenize any text. If you want to find pages with the template Template:The Golden Lion you must write hastemplate:"The Golden Lion"; hastemplate:"Golden Lion" won't match. There is no problem regarding proximity/non-proximity search: it's always an exact match (case insensitive).
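
The untokenized behaviour can be sketched as follows. This is an illustrative model with a hypothetical function name, assuming that "exact match (case insensitive)" means comparing whole names after case folding.

```python
def hastemplate_matches(template_name, query):
    """Whole-name comparison, ignoring case; no tokenization."""
    return template_name.casefold() == query.casefold()

print(hastemplate_matches("The Golden Lion", "the golden lion"))  # whole name
print(hastemplate_matches("The Golden Lion", "Golden Lion"))      # partial name
```

Only the whole-name query matches in this model; the partial name does not, because no words are extracted from the template name.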

Applying the patch seems to be sane as it reverts insource and intitle with quotes to the previous behavior. The only change will be with escaped double quotes insource:"\"foo bar\"" where before proximity search was disabled but now it will still perform a proximity search.

The only way to perform a non-proximity search with insource and intitle will be to use the following syntax: insource:foo insource:bar.

Studying... ... ... right, all right.

@Deskana can we raise the priority of this bug?

@dcausse Thanks for looking into this. We can increase the priority, but it will have to sit lower in priority than other tasks that have been bumped several times that relate directly to our current goals.

Deskana triaged this task as Normal priority. · Sep 15 2015, 2:37 AM

@Deskana thanks, moving to "Needs review" (the patch is ready).

Change 236544 merged by jenkins-bot:
Fix incategory, insource and intitle with double quoted values

https://gerrit.wikimedia.org/r/236544

Deskana closed this task as Resolved. · Sep 24 2015, 4:09 AM
Deskana claimed this task.
Deskana moved this task from Done to Resolved on the Discovery-Search (Current work) board.