Search for unicode symbols like ★ is inconsistent and unpredictable
Open, Lowest, Public

Description

[Note: fixed phab syntax for search links, added intitle regex example.]

Event Timeline

TheDJ raised the priority of this task from to Needs Triage.
TheDJ updated the task description.
TheDJ added a project: CirrusSearch.
TheDJ added a subscriber: TheDJ.
Jdouglas set Security to None.

I wonder if this is caused by the filtering on the current analyzers.

Submitting match queries directly, I only get results when using the all_near_match field -- I get no results for all, title, text, etc.

{
  "query": {
    "multi_match" : {
      "query": "✭",
      "fields": [
        "title.plain",
        "all",
        "all.plain",
        "text.plain",
        "text"
      ] 
    }
  }
}

For reference, here's the mapping for enwiki:

{
  "enwiki_content_1415916568": {
    "mappings": {
      "namespace": {
        "dynamic": "false",
        "_all": {
          "enabled": false
        },
        "properties": {
          "name": {
            "type": "string",
            "norms": {
              "enabled": false
            },
            "index_options": "docs",
            "analyzer": "near_match_asciifolding",
            "ignore_above": 5000
          }
        }
      },
      "page": {
        "dynamic": "false",
        "_all": {
          "enabled": false
        },
        "properties": {
          "all": {
            "type": "string",
            "index_analyzer": "text",
            "search_analyzer": "text_search",
            "fields": {
              "plain": {
                "type": "string",
                "index_analyzer": "plain",
                "search_analyzer": "plain_search",
                "position_offset_gap": 10,
                "search_quote_analyzer": "plain_search"
              }
            },
            "position_offset_gap": 10,
            "search_quote_analyzer": "text_search"
          },
          "all_near_match": {
            "type": "string",
            "norms": {
              "enabled": false
            },
            "index_options": "freqs",
            "analyzer": "near_match",
            "fields": {
              "asciifolding": {
                "type": "string",
                "norms": {
                  "enabled": false
                },
                "index_options": "freqs",
                "analyzer": "near_match_asciifolding",
                "position_offset_gap": 10,
                "search_quote_analyzer": "near_match_asciifolding"
              }
            },
            "position_offset_gap": 10,
            "search_quote_analyzer": "near_match"
          },
          "auxiliary_text": {
            "type": "string",
            "index_options": "offsets",
            "index_analyzer": "text",
            "search_analyzer": "text_search",
            "fields": {
              "plain": {
                "type": "string",
                "index_options": "offsets",
                "index_analyzer": "plain",
                "search_analyzer": "plain_search",
                "position_offset_gap": 10,
                "search_quote_analyzer": "plain_search"
              }
            },
            "copy_to": [
              "all"
            ],
            "position_offset_gap": 10,
            "search_quote_analyzer": "text_search"
          },
          "category": {
            "type": "string",
            "norms": {
              "enabled": false
            },
            "index_options": "offsets",
            "index_analyzer": "text",
            "search_analyzer": "text_search",
            "fields": {
              "lowercase_keyword": {
                "type": "string",
                "norms": {
                  "enabled": false
                },
                "index_options": "docs",
                "analyzer": "lowercase_keyword",
                "position_offset_gap": 10,
                "search_quote_analyzer": "lowercase_keyword",
                "ignore_above": 5000
              },
              "plain": {
                "type": "string",
                "norms": {
                  "enabled": false
                },
                "index_options": "offsets",
                "index_analyzer": "plain",
                "search_analyzer": "plain_search",
                "position_offset_gap": 10,
                "search_quote_analyzer": "plain_search"
              }
            },
            "copy_to": [
              "all",
              "all",
              "all",
              "all",
              "all",
              "all",
              "all",
              "all"
            ],
            "position_offset_gap": 10,
            "search_quote_analyzer": "text_search"
          },
          "coordinates": {
            "type": "nested",
            "properties": {
              "coord": {
                "type": "geo_point",
                "lat_lon": true
              },
              "country": {
                "type": "string",
                "index": "not_analyzed"
              },
              "dim": {
                "type": "float"
              },
              "globe": {
                "type": "string",
                "index": "not_analyzed"
              },
              "name": {
                "type": "string",
                "index": "no"
              },
              "primary": {
                "type": "boolean"
              },
              "region": {
                "type": "string",
                "index": "not_analyzed"
              },
              "type": {
                "type": "string",
                "index": "not_analyzed"
              }
            }
          },
          "external_link": {
            "type": "string",
            "norms": {
              "enabled": false
            },
            "index_options": "docs",
            "analyzer": "keyword",
            "ignore_above": 5000
          },
          "file_text": {
            "type": "string",
            "index_options": "offsets",
            "index_analyzer": "text",
            "search_analyzer": "text_search",
            "fields": {
              "plain": {
                "type": "string",
                "index_options": "offsets",
                "index_analyzer": "plain",
                "search_analyzer": "plain_search",
                "position_offset_gap": 10,
                "search_quote_analyzer": "plain_search"
              }
            },
            "copy_to": [
              "all"
            ],
            "position_offset_gap": 10,
            "search_quote_analyzer": "text_search"
          },
          "heading": {
            "type": "string",
            "norms": {
              "enabled": false
            },
            "index_options": "offsets",
            "index_analyzer": "text",
            "search_analyzer": "text_search",
            "fields": {
              "plain": {
                "type": "string",
                "norms": {
                  "enabled": false
                },
                "index_options": "offsets",
                "index_analyzer": "plain",
                "search_analyzer": "plain_search",
                "position_offset_gap": 10,
                "search_quote_analyzer": "plain_search"
              }
            },
            "copy_to": [
              "all",
              "all",
              "all",
              "all",
              "all"
            ],
            "position_offset_gap": 10,
            "search_quote_analyzer": "text_search"
          },
          "incoming_links": {
            "type": "long"
          },
          "language": {
            "type": "string",
            "norms": {
              "enabled": false
            },
            "index_options": "docs",
            "analyzer": "keyword",
            "ignore_above": 5000
          },
          "local_sites_with_dupe": {
            "type": "string",
            "norms": {
              "enabled": false
            },
            "index_options": "docs",
            "analyzer": "lowercase_keyword",
            "ignore_above": 5000
          },
          "namespace": {
            "type": "long"
          },
          "namespace_text": {
            "type": "string",
            "norms": {
              "enabled": false
            },
            "index_options": "docs",
            "analyzer": "keyword",
            "ignore_above": 5000
          },
          "opening_text": {
            "type": "string",
            "index_analyzer": "text",
            "search_analyzer": "text_search",
            "fields": {
              "plain": {
                "type": "string",
                "index_analyzer": "plain",
                "search_analyzer": "plain_search",
                "position_offset_gap": 10,
                "search_quote_analyzer": "plain_search"
              }
            },
            "copy_to": [
              "all",
              "all",
              "all"
            ],
            "position_offset_gap": 10,
            "search_quote_analyzer": "text_search"
          },
          "outgoing_link": {
            "type": "string",
            "norms": {
              "enabled": false
            },
            "index_options": "docs",
            "analyzer": "keyword",
            "ignore_above": 5000
          },
          "redirect": {
            "dynamic": "false",
            "properties": {
              "namespace": {
                "type": "long"
              },
              "title": {
                "type": "string",
                "norms": {
                  "enabled": false
                },
                "index_options": "offsets",
                "index_analyzer": "text",
                "search_analyzer": "text_search",
                "fields": {
                  "keyword": {
                    "type": "string",
                    "norms": {
                      "enabled": false
                    },
                    "index_options": "docs",
                    "analyzer": "keyword",
                    "position_offset_gap": 10,
                    "search_quote_analyzer": "keyword"
                  },
                  "near_match": {
                    "type": "string",
                    "norms": {
                      "enabled": false
                    },
                    "index_options": "docs",
                    "analyzer": "near_match",
                    "position_offset_gap": 10,
                    "search_quote_analyzer": "near_match"
                  },
                  "near_match_asciifolding": {
                    "type": "string",
                    "norms": {
                      "enabled": false
                    },
                    "index_options": "docs",
                    "analyzer": "near_match_asciifolding",
                    "position_offset_gap": 10,
                    "search_quote_analyzer": "near_match_asciifolding"
                  },
                  "plain": {
                    "type": "string",
                    "norms": {
                      "enabled": false
                    },
                    "index_options": "offsets",
                    "index_analyzer": "plain",
                    "search_analyzer": "plain_search",
                    "position_offset_gap": 10,
                    "search_quote_analyzer": "plain_search"
                  },
                  "prefix": {
                    "type": "string",
                    "norms": {
                      "enabled": false
                    },
                    "index_options": "docs",
                    "index_analyzer": "prefix",
                    "search_analyzer": "near_match",
                    "position_offset_gap": 10,
                    "search_quote_analyzer": "near_match"
                  },
                  "prefix_asciifolding": {
                    "type": "string",
                    "norms": {
                      "enabled": false
                    },
                    "index_options": "docs",
                    "index_analyzer": "prefix_asciifolding",
                    "search_analyzer": "near_match_asciifolding",
                    "position_offset_gap": 10,
                    "search_quote_analyzer": "near_match_asciifolding"
                  },
                  "suggest": {
                    "type": "string",
                    "norms": {
                      "enabled": false
                    },
                    "analyzer": "suggest",
                    "position_offset_gap": 10,
                    "search_quote_analyzer": "suggest"
                  }
                },
                "copy_to": [
                  "suggest",
                  "all",
                  "all",
                  "all",
                  "all",
                  "all",
                  "all",
                  "all",
                  "all",
                  "all",
                  "all",
                  "all",
                  "all",
                  "all",
                  "all",
                  "all",
                  "all_near_match",
                  "all_near_match",
                  "all_near_match",
                  "all_near_match",
                  "all_near_match",
                  "all_near_match",
                  "all_near_match",
                  "all_near_match",
                  "all_near_match",
                  "all_near_match",
                  "all_near_match",
                  "all_near_match",
                  "all_near_match",
                  "all_near_match",
                  "all_near_match"
                ],
                "position_offset_gap": 10,
                "search_quote_analyzer": "text_search"
              }
            }
          },
          "source_text": {
            "type": "string",
            "norms": {
              "enabled": false
            },
            "index_analyzer": "text",
            "search_analyzer": "text_search",
            "fields": {
              "plain": {
                "type": "string",
                "norms": {
                  "enabled": false
                },
                "index_analyzer": "plain",
                "search_analyzer": "plain_search",
                "position_offset_gap": 10,
                "search_quote_analyzer": "plain_search"
              },
              "trigram": {
                "type": "string",
                "norms": {
                  "enabled": false
                },
                "index_options": "docs",
                "analyzer": "trigram",
                "position_offset_gap": 10,
                "search_quote_analyzer": "trigram"
              }
            },
            "position_offset_gap": 10,
            "search_quote_analyzer": "text_search"
          },
          "suggest": {
            "type": "string",
            "analyzer": "suggest"
          },
          "template": {
            "type": "string",
            "norms": {
              "enabled": false
            },
            "index_options": "docs",
            "analyzer": "lowercase_keyword",
            "ignore_above": 5000
          },
          "text": {
            "type": "string",
            "index_options": "offsets",
            "index_analyzer": "text",
            "search_analyzer": "text_search",
            "fields": {
              "plain": {
                "type": "string",
                "index_options": "offsets",
                "index_analyzer": "plain",
                "search_analyzer": "plain_search",
                "position_offset_gap": 10,
                "search_quote_analyzer": "plain_search"
              },
              "word_count": {
                "type": "token_count",
                "store": true,
                "analyzer": "plain"
              }
            },
            "copy_to": [
              "all"
            ],
            "position_offset_gap": 10,
            "search_quote_analyzer": "text_search"
          },
          "text_bytes": {
            "type": "long",
            "index": "no"
          },
          "timestamp": {
            "type": "date",
            "format": "dateOptionalTime"
          },
          "title": {
            "type": "string",
            "index_analyzer": "text",
            "search_analyzer": "text_search",
            "fields": {
              "keyword": {
                "type": "string",
                "index_options": "docs",
                "analyzer": "keyword",
                "position_offset_gap": 10,
                "search_quote_analyzer": "keyword"
              },
              "near_match": {
                "type": "string",
                "index_options": "docs",
                "analyzer": "near_match",
                "position_offset_gap": 10,
                "search_quote_analyzer": "near_match"
              },
              "near_match_asciifolding": {
                "type": "string",
                "index_options": "docs",
                "analyzer": "near_match_asciifolding",
                "position_offset_gap": 10,
                "search_quote_analyzer": "near_match_asciifolding"
              },
              "plain": {
                "type": "string",
                "index_analyzer": "plain",
                "search_analyzer": "plain_search",
                "position_offset_gap": 10,
                "search_quote_analyzer": "plain_search"
              },
              "prefix": {
                "type": "string",
                "index_options": "docs",
                "index_analyzer": "prefix",
                "search_analyzer": "near_match",
                "position_offset_gap": 10,
                "search_quote_analyzer": "near_match"
              },
              "prefix_asciifolding": {
                "type": "string",
                "index_options": "docs",
                "index_analyzer": "prefix_asciifolding",
                "search_analyzer": "near_match_asciifolding",
                "position_offset_gap": 10,
                "search_quote_analyzer": "near_match_asciifolding"
              },
              "suggest": {
                "type": "string",
                "analyzer": "suggest",
                "position_offset_gap": 10,
                "search_quote_analyzer": "suggest"
              }
            },
            "copy_to": [
              "suggest",
              "all",
              "all",
              "all",
              "all",
              "all",
              "all",
              "all",
              "all",
              "all",
              "all",
              "all",
              "all",
              "all",
              "all",
              "all",
              "all",
              "all",
              "all",
              "all",
              "all",
              "all_near_match",
              "all_near_match",
              "all_near_match",
              "all_near_match",
              "all_near_match",
              "all_near_match",
              "all_near_match",
              "all_near_match",
              "all_near_match",
              "all_near_match",
              "all_near_match",
              "all_near_match",
              "all_near_match",
              "all_near_match",
              "all_near_match",
              "all_near_match",
              "all_near_match",
              "all_near_match",
              "all_near_match",
              "all_near_match"
            ],
            "position_offset_gap": 10,
            "search_quote_analyzer": "text_search"
          },
          "wikibase_item": {
            "type": "string",
            "norms": {
              "enabled": false
            },
            "index_options": "docs",
            "analyzer": "keyword",
            "ignore_above": 5000
          }
        }
      }
    }
  }
}

Maybe it's being character filtered:

$ curl -s -XGET localhost:9900/enwiki_content_1415916568/_analyze?analyzer=text -d '✭'
{"tokens":[]}
$ curl -s -XGET localhost:9900/enwiki_content_1415916568/_analyze?analyzer=text -d '✭NSYNC'
{"tokens":[{"token":"nsync","start_offset":1,"end_offset":6,"type":"<ALPHANUM>","position":1}]}

$ curl -s -XGET localhost:9900/enwiki_content_1415916568/_analyze?analyzer=near_match -d '✭'
{"tokens":[{"token":"✭","start_offset":0,"end_offset":1,"type":"word","position":1}]}

$ curl -s -XGET localhost:9900/enwiki_content_1415916568/_analyze?field=all -d '✭'
{"tokens":[]}
$ curl -s -XGET localhost:9900/enwiki_content_1415916568/_analyze?field=all_near_match -d '✭'
{"tokens":[{"token":"✭","start_offset":0,"end_offset":1,"type":"word","position":1}]}

The whitespace analyzer doesn't mind:

$ curl -s -XGET localhost:9900/enwiki_content_1415916568/_analyze?analyzer=whitespace -d '✭'
{"tokens":[{"token":"✭","start_offset":0,"end_offset":1,"type":"word","position":1}]}

Blocked: the browser tests are super broken:

384 scenarios (378 failed, 6 skipped)
1219 steps (21 failed, 943 skipped, 255 passed)

Took 7326.30138746 seconds
Deskana lowered the priority of this task from Medium to Lowest. Dec 3 2015, 5:45 PM
Deskana added a subscriber: Deskana.

Searching for unicode symbols is likely pretty rare, so we just can't prioritise this right now.

Restricted Application added a subscriber: Luke081515.

It is indeed inconsistent, but it is predictable if you have spent waaaaaay too much time digging into all this. The short version is that the standard tokenizer, which breaks text into words and is used for regular search by most analyzers (for languages with spaces), treats "symbols" like ★ as non-word characters. To the tokenizer they are the stuff between words, like whitespace and punctuation, so they never become searchable tokens.
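To make that concrete, here is a rough Python sketch of the difference between a word-based tokenizer and a whitespace tokenizer. It is only an approximation: the `\w+` regex stands in for the standard tokenizer's word-break rules and is not the actual Lucene implementation, and the function names are made up for illustration.

```python
import re


def word_tokens(text):
    """Approximate a word-based tokenizer: keep runs of letters/digits, lowercased.

    Assumption: `\\w+` is only a stand-in for the real word-break rules.
    """
    return [token.lower() for token in re.findall(r"\w+", text)]


def whitespace_tokens(text):
    """Split on whitespace only, so symbols survive as tokens."""
    return text.split()


print(word_tokens("✭"))        # []
print(word_tokens("✭NSYNC"))   # ['nsync']
print(whitespace_tokens("✭"))  # ['✭']
```

The empty list for a bare symbol mirrors the empty `tokens` array that the `text` analyzer returned in the curl output earlier in this task.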

The simple ★ search gets one result because we also check for exact title matches. This is how we get matches on punctuation and other symbols.
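As a sketch of why the exact-title path still works: if the whole title is kept as a single token, a symbol-only title remains findable. The normalization below (lowercase, collapse whitespace) is an assumption about what a near-match style analyzer does, not the real `near_match` definition, and the titles are hypothetical.

```python
def near_match_token(title):
    """Keep the whole title as one token.

    Assumption: lowercasing and collapsing whitespace roughly imitates
    a near-match analyzer; the real one may normalize differently.
    """
    return " ".join(title.lower().split())


# Hypothetical titles, for illustration only.
titles = ["★", "Star", "Black Star"]

# Map each normalized title token to the titles that produced it.
title_index = {}
for title in titles:
    title_index.setdefault(near_match_token(title), []).append(title)

print(title_index.get(near_match_token("★"), []))     # ['★']
print(title_index.get(near_match_token("  STAR "), []))  # ['Star']
```

Only a query that matches the entire title hits this index, which is consistent with the simple ★ search returning a single result.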

The insource and intitle searches don't match anything because they analyze the query text, and once the analyzer has dropped the symbol, nothing is left to search for.

The regex searches don't get all the results (or consistent results) because they time out before being able to do a full scan of the index looking for the one character. The regex acceleration needs trigrams to work with in order to find anything.
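A simplified illustration of the trigram problem: a trigram index can only narrow the candidate set when the pattern contains at least three consecutive literal characters. The sketch below treats the pattern as a plain literal string, which is a simplification; real regex acceleration has to extract trigrams from the regex itself, and the helper names here are hypothetical.

```python
def trigrams(literal):
    """Return the set of 3-character substrings of a literal string."""
    return {literal[i:i + 3] for i in range(len(literal) - 2)}


def can_use_trigram_index(literal):
    """A trigram index only helps if the literal yields at least one trigram."""
    return bool(trigrams(literal))


print(sorted(trigrams("NSYNC")))       # ['NSY', 'SYN', 'YNC']
print(trigrams("★"))                   # set()
print(can_use_trigram_index("★"))      # False
```

With no trigrams to look up, a search for a single character has nothing to filter on and falls back to scanning every document, which is where the timeouts come from.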

See more at T211824: Investigate a “rare-character” index.