
Find out why "insource:" ignores greyspace
Closed, Resolved | Public | 3 Estimated Story Points


In the Advanced Search form, people should be able to search with insource.
Ideally we would want to allow people to write


in the "Source code contains" field, without having to know anything more about search

Currently, you can use insource: either with a regex or with text [1]. The text version ignores all greyspace, i.e. it only considers letters and numbers; all other characters are ignored. This means that it is not possible to search specifically for parts of the wikitext syntax, such as a comment or a link that starts in a specific way.

Find out why that is so.


Event Timeline

Please expand the description, I don't know what "normalization" means in this context.

I assume the idea is to guess whether what the user typed is a regular expression or not. This is partially possible, but will misbehave in edge cases.

Lea_WMDE renamed this task from Search prototype: normalize in_source: input to normalize in_source: input for advanced search. May 30 2017, 4:24 PM
Lea_WMDE updated the task description.

Examples for "breaking" insource:

  • [[Wikipedia:Technische
  • Wunschliste/Umfrage
Lea_WMDE renamed this task from normalize in_source: input for advanced search to Find out why "insource:" ignores greyspace. Aug 2 2017, 7:16 AM
Lea_WMDE updated the task description.

@EBernhardson could you help us with the question of why all greyspace is ignored?

insource:word and insource:// behave completely differently in spite of sharing the same keyword.

insource:word is a keyword to access an inverted index on the source text, and if I understand the question properly I could reformulate it as:
Why does insource:"[[Bernoulli-Gleichung#Bernoulli" return the same set of results as insource:"Bernoulli Gleichung Bernoulli"?
This is because insource:word works like a regular search query, where we tokenize the input query to match it against tokenized terms in the index. Changing the way we tokenize insource:word to make it aware of punctuation characters (greyspace) is probably not a good solution: it would make insource:word quite annoying to use and no different from insource://.
In short, when punctuation chars need to be discriminating, insource:// must be used; for other, simpler searches insource:word can be used.
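
To illustrate the tokenization point in the comment above, here is a minimal sketch. The `tokenize` function is a hypothetical, simplified stand-in (keep runs of letters and digits, lowercase them); it is not the actual analyzer used by the search backend, which is more involved. It only shows why two queries that differ in punctuation can end up as the same list of terms.

```python
import re


def tokenize(text: str) -> list[str]:
    """Hypothetical simplified analyzer: keep runs of letters/digits, lowercased.

    Punctuation such as [, -, and # only separates tokens and is then discarded.
    """
    return [token.lower() for token in re.findall(r"[^\W_]+", text)]


with_punctuation = tokenize("[[Bernoulli-Gleichung#Bernoulli")
without_punctuation = tokenize("Bernoulli Gleichung Bernoulli")

# Both queries reduce to the same terms, so they would match the same documents.
print(with_punctuation)
print(with_punctuation == without_punctuation)
```

Under this assumption, the punctuation is gone before the query ever reaches the index, which is why only the regex form can discriminate on it.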

@dcausse thanks for your comment!
The reason we asked the question is that we assumed that people who search in source often want to find specific things that are not visible in the text, and a lot of that is expressed with punctuation characters (such as comments, links, headings...).

Using insource:// requires users to know how regexes work and would force them (or us) to escape everything, which is not the greatest option for an advanced search form field. Thus we were wondering if there is anything speaking against including greyspace in the search. From your comment, my feeling now is that there would be objections to it ;) as I understand now that this would mean reimplementing the way insource:word works generally. Too bad, but thanks for the info!

@Lea_WMDE assuming that the input form escapes all regex chars, there's no reason not to use insource:// for simple string matching.
The input form could maybe transform an input text [[Bernoulli-Gleichung#Anwendung into insource:/\[\[Bernoulli-Gleichung\#Anwendung/ which should bring the results the user expects?
I have no idea if it's reasonable to do this on your side.
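
A rough sketch of the transformation suggested above, for illustration only. The function name `to_insource_regex` and the exact set of escaped characters are assumptions: the set below is modelled on characters commonly reserved in Lucene-style regular expressions, plus the `/` delimiter, and would need to be checked against what the search backend actually treats as special.

```python
# Assumed set of characters to escape: Lucene-style regex operators plus
# the "/" delimiter of insource:/.../ (hypothetical, verify before use).
SPECIAL_CHARS = set('.?+*|{}[]()"\\#@&<>~/')


def to_insource_regex(text: str) -> str:
    """Escape every special character so the input is matched literally."""
    escaped = "".join("\\" + ch if ch in SPECIAL_CHARS else ch for ch in text)
    return f"insource:/{escaped}/"


# Reproduces the example from the comment above.
print(to_insource_regex("[[Bernoulli-Gleichung#Anwendung"))
```

With escaping handled by the form, the user types plain text and never needs to know that a regex is being sent.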

@dcausse yes, that's what we are discussing :)

Lea_WMDE claimed this task.