
Search for unicode symbols like ★ is inconsistent and unpredictable
Open, Lowest · Public

Description

[Note: fixed phab syntax for search links, added intitle regex example.]

Event Timeline

TheDJ created this task. Apr 12 2015, 10:25 AM
TheDJ updated the task description.
TheDJ raised the priority of this task from to Needs Triage.
TheDJ added a project: CirrusSearch.
TheDJ added a subscriber: TheDJ.
Restricted Application added a subscriber: Aklapper. Apr 12 2015, 10:25 AM
Jdouglas triaged this task as Normal priority. Apr 14 2015, 6:38 PM
Jdouglas set Security to None.
Jdouglas claimed this task.

I wonder if this is caused by the filtering on the current analyzers.

Submitting match queries directly, I only get results when using the all_near_match field -- I get no results for all, title, text, etc.

{
  "query": {
    "multi_match" : {
      "query": "✭",
      "fields": [
        "title.plain",
        "all",
        "all.plain",
        "text.plain",
        "text"
      ] 
    }
  }
}

For reference, here's the mapping for enwiki:

{
  "enwiki_content_1415916568": {
    "mappings": {
      "namespace": {
        "dynamic": "false",
        "_all": {
          "enabled": false
        },
        "properties": {
          "name": {
            "type": "string",
            "norms": {
              "enabled": false
            },
            "index_options": "docs",
            "analyzer": "near_match_asciifolding",
            "ignore_above": 5000
          }
        }
      },
      "page": {
        "dynamic": "false",
        "_all": {
          "enabled": false
        },
        "properties": {
          "all": {
            "type": "string",
            "index_analyzer": "text",
            "search_analyzer": "text_search",
            "fields": {
              "plain": {
                "type": "string",
                "index_analyzer": "plain",
                "search_analyzer": "plain_search",
                "position_offset_gap": 10,
                "search_quote_analyzer": "plain_search"
              }
            },
            "position_offset_gap": 10,
            "search_quote_analyzer": "text_search"
          },
          "all_near_match": {
            "type": "string",
            "norms": {
              "enabled": false
            },
            "index_options": "freqs",
            "analyzer": "near_match",
            "fields": {
              "asciifolding": {
                "type": "string",
                "norms": {
                  "enabled": false
                },
                "index_options": "freqs",
                "analyzer": "near_match_asciifolding",
                "position_offset_gap": 10,
                "search_quote_analyzer": "near_match_asciifolding"
              }
            },
            "position_offset_gap": 10,
            "search_quote_analyzer": "near_match"
          },
          "auxiliary_text": {
            "type": "string",
            "index_options": "offsets",
            "index_analyzer": "text",
            "search_analyzer": "text_search",
            "fields": {
              "plain": {
                "type": "string",
                "index_options": "offsets",
                "index_analyzer": "plain",
                "search_analyzer": "plain_search",
                "position_offset_gap": 10,
                "search_quote_analyzer": "plain_search"
              }
            },
            "copy_to": [
              "all"
            ],
            "position_offset_gap": 10,
            "search_quote_analyzer": "text_search"
          },
          "category": {
            "type": "string",
            "norms": {
              "enabled": false
            },
            "index_options": "offsets",
            "index_analyzer": "text",
            "search_analyzer": "text_search",
            "fields": {
              "lowercase_keyword": {
                "type": "string",
                "norms": {
                  "enabled": false
                },
                "index_options": "docs",
                "analyzer": "lowercase_keyword",
                "position_offset_gap": 10,
                "search_quote_analyzer": "lowercase_keyword",
                "ignore_above": 5000
              },
              "plain": {
                "type": "string",
                "norms": {
                  "enabled": false
                },
                "index_options": "offsets",
                "index_analyzer": "plain",
                "search_analyzer": "plain_search",
                "position_offset_gap": 10,
                "search_quote_analyzer": "plain_search"
              }
            },
            "copy_to": [
              "all",
              "all",
              "all",
              "all",
              "all",
              "all",
              "all",
              "all"
            ],
            "position_offset_gap": 10,
            "search_quote_analyzer": "text_search"
          },
          "coordinates": {
            "type": "nested",
            "properties": {
              "coord": {
                "type": "geo_point",
                "lat_lon": true
              },
              "country": {
                "type": "string",
                "index": "not_analyzed"
              },
              "dim": {
                "type": "float"
              },
              "globe": {
                "type": "string",
                "index": "not_analyzed"
              },
              "name": {
                "type": "string",
                "index": "no"
              },
              "primary": {
                "type": "boolean"
              },
              "region": {
                "type": "string",
                "index": "not_analyzed"
              },
              "type": {
                "type": "string",
                "index": "not_analyzed"
              }
            }
          },
          "external_link": {
            "type": "string",
            "norms": {
              "enabled": false
            },
            "index_options": "docs",
            "analyzer": "keyword",
            "ignore_above": 5000
          },
          "file_text": {
            "type": "string",
            "index_options": "offsets",
            "index_analyzer": "text",
            "search_analyzer": "text_search",
            "fields": {
              "plain": {
                "type": "string",
                "index_options": "offsets",
                "index_analyzer": "plain",
                "search_analyzer": "plain_search",
                "position_offset_gap": 10,
                "search_quote_analyzer": "plain_search"
              }
            },
            "copy_to": [
              "all"
            ],
            "position_offset_gap": 10,
            "search_quote_analyzer": "text_search"
          },
          "heading": {
            "type": "string",
            "norms": {
              "enabled": false
            },
            "index_options": "offsets",
            "index_analyzer": "text",
            "search_analyzer": "text_search",
            "fields": {
              "plain": {
                "type": "string",
                "norms": {
                  "enabled": false
                },
                "index_options": "offsets",
                "index_analyzer": "plain",
                "search_analyzer": "plain_search",
                "position_offset_gap": 10,
                "search_quote_analyzer": "plain_search"
              }
            },
            "copy_to": [
              "all",
              "all",
              "all",
              "all",
              "all"
            ],
            "position_offset_gap": 10,
            "search_quote_analyzer": "text_search"
          },
          "incoming_links": {
            "type": "long"
          },
          "language": {
            "type": "string",
            "norms": {
              "enabled": false
            },
            "index_options": "docs",
            "analyzer": "keyword",
            "ignore_above": 5000
          },
          "local_sites_with_dupe": {
            "type": "string",
            "norms": {
              "enabled": false
            },
            "index_options": "docs",
            "analyzer": "lowercase_keyword",
            "ignore_above": 5000
          },
          "namespace": {
            "type": "long"
          },
          "namespace_text": {
            "type": "string",
            "norms": {
              "enabled": false
            },
            "index_options": "docs",
            "analyzer": "keyword",
            "ignore_above": 5000
          },
          "opening_text": {
            "type": "string",
            "index_analyzer": "text",
            "search_analyzer": "text_search",
            "fields": {
              "plain": {
                "type": "string",
                "index_analyzer": "plain",
                "search_analyzer": "plain_search",
                "position_offset_gap": 10,
                "search_quote_analyzer": "plain_search"
              }
            },
            "copy_to": [
              "all",
              "all",
              "all"
            ],
            "position_offset_gap": 10,
            "search_quote_analyzer": "text_search"
          },
          "outgoing_link": {
            "type": "string",
            "norms": {
              "enabled": false
            },
            "index_options": "docs",
            "analyzer": "keyword",
            "ignore_above": 5000
          },
          "redirect": {
            "dynamic": "false",
            "properties": {
              "namespace": {
                "type": "long"
              },
              "title": {
                "type": "string",
                "norms": {
                  "enabled": false
                },
                "index_options": "offsets",
                "index_analyzer": "text",
                "search_analyzer": "text_search",
                "fields": {
                  "keyword": {
                    "type": "string",
                    "norms": {
                      "enabled": false
                    },
                    "index_options": "docs",
                    "analyzer": "keyword",
                    "position_offset_gap": 10,
                    "search_quote_analyzer": "keyword"
                  },
                  "near_match": {
                    "type": "string",
                    "norms": {
                      "enabled": false
                    },
                    "index_options": "docs",
                    "analyzer": "near_match",
                    "position_offset_gap": 10,
                    "search_quote_analyzer": "near_match"
                  },
                  "near_match_asciifolding": {
                    "type": "string",
                    "norms": {
                      "enabled": false
                    },
                    "index_options": "docs",
                    "analyzer": "near_match_asciifolding",
                    "position_offset_gap": 10,
                    "search_quote_analyzer": "near_match_asciifolding"
                  },
                  "plain": {
                    "type": "string",
                    "norms": {
                      "enabled": false
                    },
                    "index_options": "offsets",
                    "index_analyzer": "plain",
                    "search_analyzer": "plain_search",
                    "position_offset_gap": 10,
                    "search_quote_analyzer": "plain_search"
                  },
                  "prefix": {
                    "type": "string",
                    "norms": {
                      "enabled": false
                    },
                    "index_options": "docs",
                    "index_analyzer": "prefix",
                    "search_analyzer": "near_match",
                    "position_offset_gap": 10,
                    "search_quote_analyzer": "near_match"
                  },
                  "prefix_asciifolding": {
                    "type": "string",
                    "norms": {
                      "enabled": false
                    },
                    "index_options": "docs",
                    "index_analyzer": "prefix_asciifolding",
                    "search_analyzer": "near_match_asciifolding",
                    "position_offset_gap": 10,
                    "search_quote_analyzer": "near_match_asciifolding"
                  },
                  "suggest": {
                    "type": "string",
                    "norms": {
                      "enabled": false
                    },
                    "analyzer": "suggest",
                    "position_offset_gap": 10,
                    "search_quote_analyzer": "suggest"
                  }
                },
                "copy_to": [
                  "suggest",
                  "all",
                  "all",
                  "all",
                  "all",
                  "all",
                  "all",
                  "all",
                  "all",
                  "all",
                  "all",
                  "all",
                  "all",
                  "all",
                  "all",
                  "all",
                  "all_near_match",
                  "all_near_match",
                  "all_near_match",
                  "all_near_match",
                  "all_near_match",
                  "all_near_match",
                  "all_near_match",
                  "all_near_match",
                  "all_near_match",
                  "all_near_match",
                  "all_near_match",
                  "all_near_match",
                  "all_near_match",
                  "all_near_match",
                  "all_near_match"
                ],
                "position_offset_gap": 10,
                "search_quote_analyzer": "text_search"
              }
            }
          },
          "source_text": {
            "type": "string",
            "norms": {
              "enabled": false
            },
            "index_analyzer": "text",
            "search_analyzer": "text_search",
            "fields": {
              "plain": {
                "type": "string",
                "norms": {
                  "enabled": false
                },
                "index_analyzer": "plain",
                "search_analyzer": "plain_search",
                "position_offset_gap": 10,
                "search_quote_analyzer": "plain_search"
              },
              "trigram": {
                "type": "string",
                "norms": {
                  "enabled": false
                },
                "index_options": "docs",
                "analyzer": "trigram",
                "position_offset_gap": 10,
                "search_quote_analyzer": "trigram"
              }
            },
            "position_offset_gap": 10,
            "search_quote_analyzer": "text_search"
          },
          "suggest": {
            "type": "string",
            "analyzer": "suggest"
          },
          "template": {
            "type": "string",
            "norms": {
              "enabled": false
            },
            "index_options": "docs",
            "analyzer": "lowercase_keyword",
            "ignore_above": 5000
          },
          "text": {
            "type": "string",
            "index_options": "offsets",
            "index_analyzer": "text",
            "search_analyzer": "text_search",
            "fields": {
              "plain": {
                "type": "string",
                "index_options": "offsets",
                "index_analyzer": "plain",
                "search_analyzer": "plain_search",
                "position_offset_gap": 10,
                "search_quote_analyzer": "plain_search"
              },
              "word_count": {
                "type": "token_count",
                "store": true,
                "analyzer": "plain"
              }
            },
            "copy_to": [
              "all"
            ],
            "position_offset_gap": 10,
            "search_quote_analyzer": "text_search"
          },
          "text_bytes": {
            "type": "long",
            "index": "no"
          },
          "timestamp": {
            "type": "date",
            "format": "dateOptionalTime"
          },
          "title": {
            "type": "string",
            "index_analyzer": "text",
            "search_analyzer": "text_search",
            "fields": {
              "keyword": {
                "type": "string",
                "index_options": "docs",
                "analyzer": "keyword",
                "position_offset_gap": 10,
                "search_quote_analyzer": "keyword"
              },
              "near_match": {
                "type": "string",
                "index_options": "docs",
                "analyzer": "near_match",
                "position_offset_gap": 10,
                "search_quote_analyzer": "near_match"
              },
              "near_match_asciifolding": {
                "type": "string",
                "index_options": "docs",
                "analyzer": "near_match_asciifolding",
                "position_offset_gap": 10,
                "search_quote_analyzer": "near_match_asciifolding"
              },
              "plain": {
                "type": "string",
                "index_analyzer": "plain",
                "search_analyzer": "plain_search",
                "position_offset_gap": 10,
                "search_quote_analyzer": "plain_search"
              },
              "prefix": {
                "type": "string",
                "index_options": "docs",
                "index_analyzer": "prefix",
                "search_analyzer": "near_match",
                "position_offset_gap": 10,
                "search_quote_analyzer": "near_match"
              },
              "prefix_asciifolding": {
                "type": "string",
                "index_options": "docs",
                "index_analyzer": "prefix_asciifolding",
                "search_analyzer": "near_match_asciifolding",
                "position_offset_gap": 10,
                "search_quote_analyzer": "near_match_asciifolding"
              },
              "suggest": {
                "type": "string",
                "analyzer": "suggest",
                "position_offset_gap": 10,
                "search_quote_analyzer": "suggest"
              }
            },
            "copy_to": [
              "suggest",
              "all",
              "all",
              "all",
              "all",
              "all",
              "all",
              "all",
              "all",
              "all",
              "all",
              "all",
              "all",
              "all",
              "all",
              "all",
              "all",
              "all",
              "all",
              "all",
              "all",
              "all_near_match",
              "all_near_match",
              "all_near_match",
              "all_near_match",
              "all_near_match",
              "all_near_match",
              "all_near_match",
              "all_near_match",
              "all_near_match",
              "all_near_match",
              "all_near_match",
              "all_near_match",
              "all_near_match",
              "all_near_match",
              "all_near_match",
              "all_near_match",
              "all_near_match",
              "all_near_match",
              "all_near_match",
              "all_near_match"
            ],
            "position_offset_gap": 10,
            "search_quote_analyzer": "text_search"
          },
          "wikibase_item": {
            "type": "string",
            "norms": {
              "enabled": false
            },
            "index_options": "docs",
            "analyzer": "keyword",
            "ignore_above": 5000
          }
        }
      }
    }
  }
}
Jdouglas added a comment. Edited Apr 17 2015, 10:10 PM

Maybe it's being character filtered:

$ curl -s -XGET localhost:9900/enwiki_content_1415916568/_analyze?analyzer=text -d '✭'
{"tokens":[]}
$ curl -s -XGET localhost:9900/enwiki_content_1415916568/_analyze?analyzer=text -d '✭NSYNC'
{"tokens":[{"token":"nsync","start_offset":1,"end_offset":6,"type":"<ALPHANUM>","position":1}]}

$ curl -s -XGET localhost:9900/enwiki_content_1415916568/_analyze?analyzer=near_match -d '✭'
{"tokens":[{"token":"✭","start_offset":0,"end_offset":1,"type":"word","position":1}]}

$ curl -s -XGET localhost:9900/enwiki_content_1415916568/_analyze?field=all -d '✭'
{"tokens":[]}
$ curl -s -XGET localhost:9900/enwiki_content_1415916568/_analyze?field=all_near_match -d '✭'
{"tokens":[{"token":"✭","start_offset":0,"end_offset":1,"type":"word","position":1}]}

The whitespace analyzer doesn't mind:

$ curl -s -XGET localhost:9900/enwiki_content_1415916568/_analyze?analyzer=whitespace -d '✭'
{"tokens":[{"token":"✭","start_offset":0,"end_offset":1,"type":"word","position":1}]}
Manybubbles moved this task from Needs triage to Search on the Discovery board. May 7 2015, 7:51 PM
ksmith removed Jdouglas as the assignee of this task. May 15 2015, 5:32 PM
Jdouglas added a comment. Edited May 28 2015, 4:31 PM

Blocked: the browser tests are super broken:

384 scenarios (378 failed, 6 skipped)
1219 steps (21 failed, 943 skipped, 255 passed)

Took 7326.30138746 seconds
Jdouglas removed Jdouglas as the assignee of this task. Jun 11 2015, 5:18 PM
Deskana lowered the priority of this task from Normal to Lowest. Dec 3 2015, 5:45 PM
Deskana added a subscriber: Deskana.

Searching for unicode symbols is likely pretty rare, so we just can't prioritise this right now.

Restricted Application added a project: Discovery-Search. May 13 2016, 11:36 AM
Restricted Application added a subscriber: Luke081515.
TJones updated the task description. Dec 17 2018, 2:59 PM
TJones added a subscriber: TJones. Dec 17 2018, 3:06 PM

It is indeed inconsistent, but it is predictable if you have spent waaaaaay too much time digging into all this. The short version is that the standard tokenizer, which breaks text into words and is used by most analyzers (for languages with spaces) for regular search, treats "symbols" like ★ as non-word characters. They are the stuff between words, like whitespace and punctuation.
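
As a rough illustration of that tokenizer behavior, here is a small Python sketch that keeps only runs of letters and digits. It is not the Lucene standard tokenizer, which follows the Unicode word-break rules. The helper names rough_word_tokens and whitespace_tokens are hypothetical, and the letter/number test is a simplifying assumption. It does reproduce the pattern in the _analyze output earlier in this task: ✭ alone gives no tokens, ✭NSYNC gives nsync, and a plain whitespace split keeps the symbol.

```python
import unicodedata


def rough_word_tokens(text):
    """Approximate a word tokenizer: keep lowercased runs of letters and digits."""
    tokens, current = [], []
    for ch in text:
        # Assumption: general categories L* (letters) and N* (numbers) are "word" characters;
        # everything else, including symbols such as ✭ (category So), is a separator.
        if unicodedata.category(ch)[0] in ("L", "N"):
            current.append(ch.lower())
        elif current:
            tokens.append("".join(current))
            current = []
    if current:
        tokens.append("".join(current))
    return tokens


def whitespace_tokens(text):
    """Split on whitespace only, so symbols survive as tokens."""
    return text.split()


print(rough_word_tokens("✭"))       # []
print(rough_word_tokens("✭NSYNC"))  # ['nsync']
print(whitespace_tokens("✭"))       # ['✭']
```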

The simple ★ search gets one result because we also check for exact title matches. This is how we get matches on punctuation and other symbols.
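
A minimal sketch of why the plain search still finds one page, assuming the regular path matches on analyzed words while a separate near-match path compares the whole lowercased string against the title. The functions word_tokens, near_match_key, and search, and the page titles, are hypothetical; the real query built by CirrusSearch is more involved.

```python
import re


def word_tokens(text):
    # Assumption: \w+ stands in for the analyzed text fields; ★ is not a word character.
    return re.findall(r"\w+", text.lower())


def near_match_key(text):
    # Assumption: the near-match path keeps the whole string as a single lowercased token.
    return text.strip().lower()


def search(query, titles):
    query_words = word_tokens(query)
    hits = []
    for title in titles:
        title_words = word_tokens(title)
        # Word matching needs at least one surviving query token.
        word_hit = bool(query_words) and all(w in title_words for w in query_words)
        # Exact title matching works even when analysis leaves no tokens.
        exact_hit = near_match_key(query) == near_match_key(title)
        if word_hit or exact_hit:
            hits.append(title)
    return hits


titles = ["★", "Star", "★ (album)"]  # hypothetical page titles
print(search("★", titles))     # ['★']
print(search("star", titles))  # ['Star']
```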

The insource and intitle searches don't match anything because they analyze the query text, and once the symbol has been stripped out nothing is left to match.

The regex searches don't get all the results (or consistent results) because they time out before being able to do a full scan of the index looking for the one character. The regex acceleration needs trigrams to work with in order to find anything.
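
To see why a one-character pattern gives the acceleration nothing to use, here is a minimal sketch assuming the acceleration looks up three-character substrings of the pattern's literal text in a trigram index (the mapping above has a source_text.trigram field). The trigrams helper is hypothetical and is not the plugin's actual code.

```python
def trigrams(literal):
    """Return every three-character substring of a literal string."""
    return [literal[i:i + 3] for i in range(len(literal) - 2)]


print(trigrams("NSYNC"))  # ['NSY', 'SYN', 'YNC']
print(trigrams("★"))      # [] -> nothing to look up, so every document must be scanned
```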

See more at T211824: Investigate a “rare-character” index.