Analyzer treats an unspaced colon as a letter
Closed, ResolvedPublic

Description

Because CirrusSearch does not always index words separated solely by a colon, such words are hard to search for. An exact phrase search or an insource search cannot find the words at all; they can only find the single token made up of the two words plus the colon. Because the analyzer treats an unspaced colon as an alphanumeric character, the following common use cases cannot in general be searched for in wikitext: file links, template usage, namespace or interwiki linkage, parser function usage, and category usage. Anything with a colon after its name is not indexed unless the optional space is put after the colon.
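The effect described above can be illustrated with a toy tokenizer. This is only a sketch: the two regular expressions below are assumptions chosen to mimic the reported behaviour, not the actual CirrusSearch or Elasticsearch analyzer, and the names `COLON_AS_LETTER`, `COLON_AS_SEPARATOR`, and `tokenize` are hypothetical.

```python
import re

# Assumption: a colon sitting directly between two word characters is kept
# inside the token, which mimics the behaviour reported in this task.
COLON_AS_LETTER = re.compile(r"\w+(?::\w+)*")

# For comparison: a tokenizer where a colon always separates words.
COLON_AS_SEPARATOR = re.compile(r"\w+")


def tokenize(text, pattern):
    """Return the lowercase tokens that the given pattern finds in text."""
    return [token.lower() for token in pattern.findall(text)]


wikitext = "[[special:preferences]] and {{namespace: pagename}}"

glued = tokenize(wikitext, COLON_AS_LETTER)
split = tokenize(wikitext, COLON_AS_SEPARATOR)

# With the colon glued in, neither "special" nor "preferences" exists as a
# token on its own, so a search for either word has nothing to match.
print(glued)
print(split)
```

In this sketch the spaced form `{{namespace: pagename}}` still yields two separate tokens under both patterns, which matches the observation that adding the optional space makes the words findable.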

For example, the wikitext [[special:preferences]] will not index special or preferences, and so those words cannot be found in this sandbox:

Neither insource searches nor searches with quotation marks (exact phrase searches) ever found the words.

When the option of omitting the space is taken, as it usually is, for example {{namespace:pagename}} (instead of the perfectly valid {{namespace: pagename}}), those two words become lost because of the analyzer. For example, File: is lost in File:siamese cat, which to any general search query becomes equivalent to filesiamese. So the following class of questions is unanswerable: Where are any file links? Namespace links? Interwiki links? External links built by parser functions (T121379)? Any parser function usage is in the dark, for example "Where is urlencode used?" Insource cannot say because the unspaced colon morphs the word away.

Existing usage on the wiki of files, templates, namespaces, and parser functions is in the dark unless 1) we run bare regular expressions (bare meaning no regex filter, that is, no insource filter to narrow the search domain through the index), 2) we ask for new parameters like hasfile: and hasparserfunction:, or 3) we use external tools. IMHO, none of these is advisable, but we must advise the workaround "run bare regexp".
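To show what the "bare regexp" workaround amounts to, here is a minimal sketch. It assumes a small in-memory dictionary of hypothetical pages in place of the wiki, and the pattern and the helper `pages_using` are illustrative only; this is not how CirrusSearch runs regex queries. The point is that a regular expression works on the raw wikitext rather than on indexed tokens, so the unspaced colon does no harm, but without an indexed filter every page has to be examined.

```python
import re

# Hypothetical pages standing in for wiki content.
pages = {
    "Sandbox A": "See {{urlencode:some text}} and [[File:siamese cat.jpg]].",
    "Sandbox B": "Plain text that mentions urlencode only in prose.",
    "Sandbox C": "[[Category:Dogs]] {{ urlencode: other text }}",
}

# Matches the parser function call with or without spaces around the colon.
PARSER_FUNCTION_USE = re.compile(r"\{\{\s*urlencode\s*:", re.IGNORECASE)


def pages_using(pattern, pages):
    """Scan every page's raw wikitext and return the titles that match."""
    return sorted(title for title, text in pages.items() if pattern.search(text))


hits = pages_using(PARSER_FUNCTION_USE, pages)
print(hits)
```

The scan finds both the unspaced and the spaced usage and skips the page that only mentions the word in prose, but it touches every page to do so, which is the cost the task description is objecting to.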

Insource is especially missed:

  • No way to find counts or existence of usage of file:pagename, urlencode:url, category:pagename, namespace:pagename, or template:pagename.
  • Regex searches are missing their ideal companion filter, so most Search magic is snuffed out.
  • One may not answer simple questions like "Do any incategory:dogs lack insource:[[files?"
  • One cannot find external links that use parser functions.

Event Timeline

Cpiral raised the priority of this task from to Needs Triage.
Cpiral updated the task description.
Cpiral added a project: CirrusSearch.
Cpiral subscribed.
Restricted Application added subscribers: StudiesWorld, Aklapper.

Actually, there's an insource-wildcard workaround

Unless there is a good reason for it as an insource- and "exact search"-only feature, this colon behaviour seems to me like an odd bug with a significant drag on Search functionality.

Deskana claimed this task.
Deskana subscribed.

I retried some of the queries you listed and they seem to work now, so I'm closing this as resolved. Most of the fixes should be available as soon as BM25 is ready.

Regarding the other functionality, such as * insource: "special", we'll have a think about that. It's arguably intended behaviour that it returns nothing.