
Searching for insource:tag finds <tag> but not {{#tag:tag}}
Closed, Resolved · Public

Description

Compare two searches: insource:somebug vs. insource:tag:somebug (https://www.mediawiki.org/w/index.php?title=Special:Search&profile=all&fulltext=Search&search=insource%3Atag%3Asomebug). One catches <somebug> while the other catches {{#tag:somebug}}; intuitively, however, the former should catch both cases. Workaround: insource:/somebug/, which has its own problems.

Event Timeline

MaxSem created this task. Sep 7 2016, 10:13 PM
Restricted Application added a project: Discovery. Sep 7 2016, 10:13 PM
Restricted Application added a subscriber: Aklapper.
debt triaged this task as Low priority. Sep 8 2016, 10:22 PM
debt moved this task from needs triage to Up Next on the Discovery-Search board.
debt added a subscriber: dcausse.

It looks like the task title says ; (semi-colon) but the description says : (colon).

MaxSem renamed this task from Searching for insource:tag finds <tag> but not {{#tag;tag}} to Searching for insource:tag finds <tag> but not {{#tag:tag}}. Sep 8 2016, 10:33 PM

Basically, what's happening here is that the insource: query runs against the source_text.plain field. This field uses the "plain_search" analyzer, defined as follows:

"plain_search": {
  "tokenizer": "standard",
  "filter": [
    "standard",
    "icu_normalizer"
  ],
  "char_filter": [
    "word_break_helper"
  ],
  "type": "custom"
},

This analysis chain does not split on :, so {{#tag:foo}} gets indexed as the single term tag:foo rather than as two separate terms, tag and foo. One option to fix this would be to add the aggressive_splitting filter to plain_search, but I'm not sure what the cascading effects of that would be.
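To make the effect concrete, here is a rough Python sketch. It is not Lucene's standard tokenizer: `tokenize` and `matches` are hypothetical stand-ins that mimic only the one behaviour that matters here, namely that a colon between two word characters does not end a token, and that a plain insource: term is looked up as a whole token.

```python
import re

# Hypothetical stand-in for the analysis chain: runs of word characters,
# where a ':' directly between two word characters does NOT break the token.
# The real chain (standard tokenizer, icu_normalizer, ...) does much more.
TOKEN = re.compile(r"\w+(?::\w+)*")


def tokenize(text):
    """Return the lowercased tokens this simplified chain would index."""
    return [t.lower() for t in TOKEN.findall(text)]


def matches(term, text):
    """Simplified insource:term check: the term must equal an indexed token."""
    return term.lower() in tokenize(text)


print(tokenize("<somebug>"))                       # ['somebug']
print(tokenize("{{#tag:somebug}}"))                # ['tag:somebug']
print(matches("somebug", "<somebug>"))             # True
print(matches("somebug", "{{#tag:somebug}}"))      # False
print(matches("tag:somebug", "{{#tag:somebug}}"))  # True
```

Under these assumptions the sketch reproduces the behaviour from the description: the bare term only finds the angle-bracket form, and the colon-joined term only finds the parser-function form.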

@dcausse any thoughts?

Examples:

Finding the value, using the regex source filter:
https://it.wikivoyage.org/w/index.php?search=insource%3A%2Fslippymap%2F+prefix%3ATemplate%3A

Not finding the term, using the standard source filter:
https://it.wikivoyage.org/w/index.php?search=insource%3Aslippymap+prefix%3ATemplate%3A

The relevant terms elasticsearch turned that field into can be queried from inside the cluster using:

curl 'localhost:9200/itwikivoyage_general/page/7852/_termvectors?fields=source_text.plain' | jq '.term_vectors["source_text.plain"].terms | to_entries | map(.key)' | grep slippymap

This results in only one relevant token:

tag:slippymap
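For anyone who prefers to do the filtering in a script, the jq/grep step can be sketched in Python as below. The function name `matching_terms` is made up for illustration, and the sample response is a hypothetical, heavily trimmed body: only the `term_vectors` / field / `terms` nesting is taken from the jq expression above, and the terms other than tag:slippymap are invented.

```python
import json


def matching_terms(response, field, needle):
    """Return the sorted terms of `field` whose text contains `needle`."""
    terms = response["term_vectors"][field]["terms"]
    return sorted(term for term in terms if needle in term)


# Hypothetical, trimmed response body; a real one carries statistics
# for every term in the field.
sample = json.loads("""
{
  "term_vectors": {
    "source_text.plain": {
      "terms": {
        "tag:slippymap": {"term_freq": 1},
        "template": {"term_freq": 2},
        "zoom": {"term_freq": 1}
      }
    }
  }
}
""")

print(matching_terms(sample, "source_text.plain", "slippymap"))
# ['tag:slippymap']
```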

@EBernhardson I was not aware of the aggressive_splitting filter (based on the word delimiter filter) and imo it looks very promising (even for other use cases like acronyms).

In the case of insource we really should analyze the text in a way that's intuitive for insource users. One of the drawbacks of more aggressive splitting is that we need to rely on term positions (phrase query) if the user wants to search for a token that has been split.
For example, aggressive_splitting seems to split camelCase words, so when searching for a JavaScript method helloWorld we will have to use auto_generate_phrase_queries from QueryString (and ideally I'd like to avoid everything that's dependent on QueryString).

One solution could be to try preserve_original which would emit the unmodified token at index time.

I'd suggest to evaluate all the options of the word delimiter token filter and figure out if they are relevant for insource:

  • split_on_case_change (default true): could cause an issue with camel-cased function names
  • split_on_numerics (default true): could cause an issue when searching for variable names?
  • stem_english_possessive (default true): I don't think we should enable anything that's language-specific in insource

So maybe the word delimiter token filter is too aggressive for insource search? If that's the case, or if we don't want to evaluate all the consequences, we can also use a simple word break helper (mapping char filter) on ':' that would specifically fix this issue.
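The narrower alternative can be sketched the same way. In the Python below, `analyze` is a hypothetical stand-in for "char filter, then tokenizer": the `char_map` argument plays the role of a mapping char filter, and the tokenizer regex is the same simplification as before (a colon between word characters does not break a token). None of these names come from CirrusSearch.

```python
import re

# Simplified tokenizer: ':' between word characters does not end a token.
TOKEN = re.compile(r"\w+(?::\w+)*")


def analyze(text, char_map=None):
    """Apply character mappings first, then tokenize and lowercase."""
    for src, dst in (char_map or {}).items():
        text = text.replace(src, dst)
    return [t.lower() for t in TOKEN.findall(text)]


without_fix = {}            # ':' reaches the tokenizer untouched
with_fix = {":": " "}       # ':' is rewritten to a space before tokenizing

print(analyze("{{#tag:somebug}}", without_fix))   # ['tag:somebug']
print(analyze("{{#tag:somebug}}", with_fix))      # ['tag', 'somebug']
print(analyze("helloWorld", with_fix))            # ['helloworld']
```

Unlike the word delimiter approach, nothing else changes here: camelCase names and identifiers containing digits stay single tokens, and only the colon case is affected.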

I suppose, taking into account the amount of work involved and the fact that we would like to get this in before the bm25 reindex so it doesn't need its own reindex, adding : to the word break helper is easy and solves this specific ticket. Sounds good to me.

OK, I'll add another analyzer for this field to make sure it does not affect normal searches.
Also, I wonder whether we should tweak source_text to index only source_text.plain: we seem to use only this one and never the one analyzed by text.
This might save some space.

dcausse claimed this task. Sep 12 2016, 11:48 AM
dcausse moved this task from Up Next to Current work on the Discovery-Search board.

Change 309974 had a related patch set uploaded (by DCausse):
Break words on semicolon for source_text.plain

https://gerrit.wikimedia.org/r/309974

Change 309974 merged by jenkins-bot:
Break words on semicolon for source_text.plain

https://gerrit.wikimedia.org/r/309974

debt closed this task as Resolved. Sep 23 2016, 9:04 PM