Compare two searches: insource:somebug vs. insource:tag:somebug (https://www.mediawiki.org/w/index.php?title=Special:Search&profile=all&fulltext=Search&search=insource%3Atag%3Asomebug). The first catches <somebug> while the second catches {{#tag:somebug}}; intuitively, however, the first should catch both cases. Workaround: insource:/somebug/, which has its own problems.
Description
Details
Subject | Repo | Branch | Lines +/-
---|---|---|---
Break words on semicolon for source_text.plain | mediawiki/extensions/CirrusSearch | master | +143 -28
Status | Subtype | Assigned | Task
---|---|---|---
Resolved | | EBernhardson | T127788 mwgrep and "insource:" search is missing lots of pages in its index
Resolved | | dcausse | T145023 Searching for insource:tag finds <tag> but not {{#tag:tag}}
Event Timeline
Basically, what's happening here is that the insource: query runs against the source_text.plain field. This field uses the "plain_search" analyzer, defined as follows:
    "plain_search": {
        "tokenizer": "standard",
        "filter": [ "standard", "icu_normalizer" ],
        "char_filter": [ "word_break_helper" ],
        "type": "custom"
    },
This analysis chain does not split on :, so {{#tag:foo}} gets indexed as the single term tag:foo and not as two separate terms, tag and foo. One option to fix this would be to add the aggressive_splitting filter to plain_search, but I'm not sure what the cascading effects of that would be.
@dcausse any thoughts?
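To make the mismatch concrete, here is a minimal Python sketch of the behaviour described above. It is a toy stand-in, not the real analyzer: the regex only assumes that a colon between two word characters does not end a token, and the names toy_plain_terms and matches are made up for illustration.

```python
import re

# Toy tokenizer (assumption): a ':' between word characters stays inside
# the token, mirroring how {{#tag:foo}} ends up as the term "tag:foo".
TOKEN_RE = re.compile(r"\w+(?::\w+)*")


def toy_plain_terms(text):
    """Lowercase the text and return the set of terms the toy tokenizer emits."""
    return set(TOKEN_RE.findall(text.lower()))


def matches(term, text):
    """A plain insource:term lookup is modelled as an exact term match."""
    return term in toy_plain_terms(text)


doc_a = "<somebug>content</somebug>"
doc_b = "{{#tag:somebug|content}}"

print(matches("somebug", doc_a))      # the <somebug> form is found
print(matches("somebug", doc_b))      # the {{#tag:somebug}} form is missed
print(matches("tag:somebug", doc_b))  # only the fused term matches
```

Under these assumptions the second document contains no standalone somebug term, which is why only insource:tag:somebug finds it.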
Examples:
Finding the value, using the regex source filter:
https://it.wikivoyage.org/w/index.php?search=insource%3A%2Fslippymap%2F+prefix%3ATemplate%3A
Not finding the term, using the standard source filter:
https://it.wikivoyage.org/w/index.php?search=insource%3Aslippymap+prefix%3ATemplate%3A
The terms that Elasticsearch turned that field into can be queried from inside the cluster using (the URL is quoted so the shell does not interpret the ?):
curl 'localhost:9200/itwikivoyage_general/page/7852/_termvectors?fields=source_text.plain' | jq '.term_vectors["source_text.plain"].terms | to_entries | map(.key)' | grep slippymap
This results in only one relevant token:
tag:slippymap
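The same filtering can be sketched in Python for anyone who prefers it to the jq pipeline. The sample response below is hand-written for illustration (the term frequencies and the second term are made up); only the term_vectors / field / terms nesting is taken from the jq expression above. The function name matching_terms is hypothetical.

```python
import json


def matching_terms(response_text, field, needle):
    """Return the sorted indexed terms of `field` that contain `needle`."""
    response = json.loads(response_text)
    terms = response["term_vectors"][field]["terms"]
    return sorted(term for term in terms if needle in term)


# Hand-written sample shaped like a _termvectors response (values are invented).
sample_response = json.dumps({
    "term_vectors": {
        "source_text.plain": {
            "terms": {
                "tag:slippymap": {"term_freq": 1},
                "mapframe": {"term_freq": 2},
            }
        }
    }
})

print(matching_terms(sample_response, "source_text.plain", "slippymap"))
```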
@EBernhardson I was not aware of the aggressive_splitting filter (based on the word delimiter filter), and IMO it looks very promising (even for other use cases like acronyms).
In the case of insource we really should analyze the text in a way that's intuitive for insource users. One drawback of more aggressive splitting is that we need to rely on term positions (phrase query) if the user wants to search for a token that has been split.
For example, aggressive_splitting seems to split camelCase words, so when searching for a JavaScript method helloWorld we would have to use auto_generate_phrase_queries from QueryString (and ideally I'd like to avoid everything that depends on QueryString).
One solution could be to try preserve_original, which would emit the unmodified token at index time alongside the split parts.
I'd suggest evaluating all the options of the word delimiter token filter and figuring out whether they are relevant for insource:
- split_on_case_change (default true): could cause an issue with camel-cased function names
- split_on_numerics (default true): could cause an issue when searching for variable names?
- stem_english_possessive (default true): I don't think we should enable anything that's language-specific in insource
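To illustrate what the options above would do to identifiers, here is a rough Python sketch. It is a toy model and not the Lucene word delimiter filter: it only imitates split_on_case_change, split_on_numerics and preserve_original on a single ASCII token, and it ignores stem_english_possessive and term positions. The function name toy_word_delimiter is made up.

```python
import re


def toy_word_delimiter(token, split_on_case_change=True,
                       split_on_numerics=True, preserve_original=False):
    """Toy imitation of three word delimiter options on one ASCII token."""
    parts = [token]
    if split_on_case_change:
        # Break at lower-to-upper transitions: helloWorld -> hello, World
        parts = [p for part in parts
                 for p in re.split(r"(?<=[a-z])(?=[A-Z])", part)]
    if split_on_numerics:
        # Break at letter/digit transitions: var2name -> var, 2, name
        parts = [p for part in parts
                 for p in re.split(r"(?<=[A-Za-z])(?=\d)|(?<=\d)(?=[A-Za-z])", part)]
    if preserve_original and parts != [token]:
        # Also keep the unmodified token so an exact search still matches
        parts = [token] + parts
    return parts


print(toy_word_delimiter("helloWorld"))
print(toy_word_delimiter("helloWorld", preserve_original=True))
print(toy_word_delimiter("var2name"))
```

With the defaults, helloWorld is no longer present as a single term in this model, which is the phrase-query concern raised above; with preserve_original the original token is kept next to its parts.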
So maybe the word delimiter token filter is too aggressive for insource search? If that's the case, or if we don't want to evaluate all the consequences, we can instead use a simple word break helper (a mapping char filter) on ':' that would fix this specific issue.
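A minimal sketch of that narrower option, written as a Python dict plus a toy simulation. The names colon_break_helper and source_plain_sketch are hypothetical and this is not the uploaded patch; only the "mapping" char filter type and its "from=>to" syntax are standard Elasticsearch, and the tokenizer regex is the same toy assumption that a colon between word characters does not end a token.

```python
import re

# Hypothetical settings: map ':' to a space before tokenizing.
SETTINGS_SKETCH = {
    "char_filter": {
        "colon_break_helper": {"type": "mapping", "mappings": [":=>\\u0020"]},
    },
    "analyzer": {
        "source_plain_sketch": {
            "type": "custom",
            "tokenizer": "standard",
            "char_filter": ["word_break_helper", "colon_break_helper"],
            "filter": ["standard", "icu_normalizer"],
        },
    },
}
COLON_MAPPINGS = SETTINGS_SKETCH["char_filter"]["colon_break_helper"]["mappings"]

# Toy tokenizer (assumption): ':' between word characters stays in the token.
TOKEN_RE = re.compile(r"\w+(?::\w+)*")


def apply_mappings(text, mappings):
    """Apply "from=>to" rules the way a mapping char filter would."""
    for rule in mappings:
        source, target = rule.split("=>")
        # Decode the \u0020 escape into an actual space character
        target = target.encode("ascii").decode("unicode_escape")
        text = text.replace(source, target)
    return text


def toy_terms(text, mappings=()):
    """Run the char filter step, then the toy tokenizer, and return the terms."""
    return set(TOKEN_RE.findall(apply_mappings(text, mappings).lower()))


print(toy_terms("{{#tag:somebug}}"))                  # one fused term
print(toy_terms("{{#tag:somebug}}", COLON_MAPPINGS))  # tag and somebug
```

Because the replacement happens before tokenization, nothing else about the analysis chain changes in this sketch, which is what makes the option cheap compared with the word delimiter filter.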
I suppose, taking into account the amount of work involved and that we would like to get this in before the BM25 reindex so it doesn't need its own reindex, adding : to the word break helper is easy and solves this specific ticket. Sounds good to me.
OK, I'll add another analyzer for this field to make sure it does not affect normal searches.
Also, I wonder whether we should tweak source_text to index only source_text.plain; we seem to use only this one and never the one analyzed by text.
This might save some space.
Change 309974 had a related patch set uploaded (by DCausse):
Break words on semicolon for source_text.plain