
CirrusSearch: intitle:¢ returns no results despite there being a redirect at [[¢]]
Open, Low, Public

Description

What it says in the summary. :-)


Version: unspecified
Severity: normal
See Also:
https://bugzilla.wikimedia.org/show_bug.cgi?id=33824

Details

Reference
bz61080

Event Timeline

bzimport raised the priority of this task from to Normal.
bzimport set Reference to bz61080.
Deskana created this task. Feb 8 2014, 1:54 AM

Bleh. It looks like the standard analyzer treats that symbol as a text boundary, which isn't nice. I wonder if I should introduce another search against just the lowercased version of the title. That should help boost things like:
- Symbol page titles like this
- Exact, in-order title matches
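
A rough sketch of that idea, in Python rather than actual analyzer configuration. Everything here is a hypothetical stand-in: `analyze_words` only imitates a word-based analyzer, `analyze_near_match` imitates a field holding the whole lowercased title as one token, and the scoring weights are made up for illustration.

```python
import re

def analyze_words(title):
    """Stand-in for a word-based analyzer: lowercase, keep runs of word characters."""
    return re.findall(r"\w+", title.lower())

def analyze_near_match(title):
    """Hypothetical near-match field: the whole lowercased title as a single token."""
    return " ".join(title.lower().split())

def score(query, title):
    """Toy scoring: 1 point for any shared word token, 10 for a whole-title match."""
    points = 0
    if set(analyze_words(query)) & set(analyze_words(title)):
        points += 1
    if analyze_near_match(query) == analyze_near_match(title):
        points += 10
    return points
```

With this toy scoring, a symbol-only title such as "¢" produces no word tokens at all, so only the whole-title comparison can match it. An exact title match such as "New York" against "new york" gets both the word match and the whole-title boost.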

Change 112566 had a related patch set uploaded by Manybubbles:
Use near_match to also search pages

https://gerrit.wikimedia.org/r/112566

Did what I said about adding an extra analyzer. It helps. Note that intitle:¢ won't find ¢ because ¢ is a redirect.

Change 112566 merged by jenkins-bot:
Use near_match to also search pages

https://gerrit.wikimedia.org/r/112566

EBernhardson reopened this task as Open. Dec 7 2016, 4:31 PM
EBernhardson added a project: good first bug.
EBernhardson added a subscriber: EBernhardson.

intitle: currently only hits the title field; it should be updated to query redirect.title as well.
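
For illustration only, the change could look roughly like the query body below, written as a Python dict in Elasticsearch query DSL. The field names `title` and `redirect.title` come from the comment above; the `intitle_query` helper and the exact clause structure are assumptions, not what CirrusSearch actually generates.

```python
def intitle_query(text):
    """Build a query that matches the text in either the title or a redirect title."""
    return {
        "bool": {
            "should": [
                {"match": {"title": {"query": text}}},
                {"match": {"redirect.title": {"query": text}}},
            ],
            # At least one of the two fields has to match.
            "minimum_should_match": 1,
        }
    }
```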

Restricted Application added projects: Discovery, Discovery-Search. Dec 7 2016, 4:31 PM
Restricted Application added a subscriber: TerraCodes.
EBernhardson removed Manybubbles as the assignee of this task. Dec 7 2016, 4:31 PM
EBernhardson set Security to None.
Deskana lowered the priority of this task from Normal to Low.
demon removed a subscriber: demon. Feb 7 2017, 5:53 AM

intitle now queries the redirect titles, but this bug is still not fixed. It looks like the analyzers throw away this token:

GET /enwiki_content/_analyze
{ "text": "¢", "analyzer": "plain" }

The results: {"tokens":[]}
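
For anyone who wants to repeat this check against several analyzers, here is a minimal Python sketch. The base URL is a placeholder and the helper names are hypothetical. It sends the body with POST, which the `_analyze` endpoint accepts as well as GET, and setting `explain` asks Elasticsearch for the per-stage breakdown of the analysis chain.

```python
import json
import urllib.request

def build_analyze_request(base_url, index, text, analyzer, explain=False):
    """Prepare (but do not send) an _analyze request for one analyzer."""
    body = {"text": text, "analyzer": analyzer}
    if explain:
        body["explain"] = True  # ask for per-stage output of the analysis chain
    return urllib.request.Request(
        url=f"{base_url}/{index}/_analyze",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def token_texts(response_body):
    """Extract the token strings from a non-explain _analyze response."""
    return [t["token"] for t in json.loads(response_body).get("tokens", [])]
```

Passing the request to `urllib.request.urlopen` and the response body to `token_texts` would give an empty list for the result shown above.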

Same for text, short_text, plain, plain_search. Maybe others. @TJones We probably, at some point, need to look into the English analysis chain and see what other tokens are being thrown away in our plain search.

> Bleh. It looks like that symbol is turned into a text boundary by the standard analyzer which isn't nice.

There's your problem! This is also the problem that prompted T211824: Investigate a “rare-character” index. The tokenizer, probably as a too-clever shortcut, treats a lot of interesting non-text characters like they are punctuation and tosses them.

{
  "detail" : {
    "custom_analyzer" : true,
    "charfilters" : [
      {
        "name" : "word_break_helper",
        "filtered_text" : [
          "¢"
        ]
      },
      {
        "name" : "kana_map",
        "filtered_text" : [
          "¢"
        ]
      }
    ],
    "tokenizer" : {
      "name" : "standard",
      "tokens" : [ ]   <---------RIGHT THERE!!
    },
    "tokenfilters" : [
      {
        "name" : "aggressive_splitting",
        "tokens" : [ ]
      },
...
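
To make the effect concrete, here is a toy Python approximation. It is not the Lucene implementation: the standard tokenizer follows the Unicode Text Segmentation rules (UAX #29), and this sketch only imitates the relevant behavior by keeping runs of letters, digits and combining marks. The whitespace tokenizer is included purely as a contrast, not as a proposed replacement.

```python
import unicodedata

def wordlike_tokens(text):
    """Keep runs of letters, digits and marks; treat everything else as a boundary."""
    tokens, current = [], []
    for ch in text:
        # Major Unicode categories: L = letter, N = number, M = combining mark.
        if unicodedata.category(ch)[0] in ("L", "N", "M"):
            current.append(ch)
        else:
            if current:
                tokens.append("".join(current))
            current = []
    if current:
        tokens.append("".join(current))
    return tokens

def whitespace_tokens(text):
    """Contrast case: split on whitespace only, so symbols survive."""
    return text.split()
```

Since "¢" is in Unicode category Sc (currency symbol), `wordlike_tokens("¢")` returns an empty list and `wordlike_tokens("5¢ coin")` returns `["5", "coin"]`, while `whitespace_tokens("5¢ coin")` keeps `"5¢"` intact.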

We could look into using some other tokenizer (maybe the ICU tokenizer? I'd have to check), but it's not going to be trivial, and for other languages we'd have to unpack their analyzer before we could swap out the tokenizer, so it's a big mess.

Wow, I didn't realize we threw away so many interesting tokens. Unfortunate, but it seems this task can become a child of the other one, to be considered "some day".