
Search for unicode symbols like ★ is inconsistent and unpredictable
Closed, Resolved · Public

Description

[Note: fixed phab syntax for search links, added intitle regex example.]

Event Timeline

TheDJ raised the priority of this task from to Needs Triage.
TheDJ updated the task description.
TheDJ added a project: CirrusSearch.
TheDJ subscribed.
Jdouglas set Security to None.

I wonder if this is caused by the filtering on the current analyzers.

Submitting match queries directly, I only get results when using the all_near_match field -- I get no results for all, title, text, etc.

{
  "query": {
    "multi_match" : {
      "query": "✭",
      "fields": [
        "title.plain",
        "all",
        "all.plain",
        "text.plain",
        "text"
      ] 
    }
  }
}
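To see which field is responsible, the same term can be queried one field at a time. Below is a minimal Python sketch that only builds the request bodies and sends nothing. The helper name `per_field_queries` is hypothetical, and the field list is simply the one from the query above plus `all_near_match`.

```python
import json

# Fields taken from the multi_match query above, plus all_near_match.
FIELDS = ["title.plain", "all", "all.plain", "text.plain", "text", "all_near_match"]


def per_field_queries(term, fields=FIELDS):
    """Return {field: request body} with one single-field match query per field."""
    return {field: {"query": {"match": {field: term}}} for field in fields}


if __name__ == "__main__":
    # Print one JSON body per field; each can be sent to the index separately.
    for field, body in per_field_queries("✭").items():
        print(field, json.dumps(body, ensure_ascii=False))
```

Sending each body separately would show per field whether the symbol survives analysis, instead of one combined hit count.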

For reference, here's the mapping for enwiki:

{
  "enwiki_content_1415916568": {
    "mappings": {
      "namespace": {
        "dynamic": "false",
        "_all": {
          "enabled": false
        },
        "properties": {
          "name": {
            "type": "string",
            "norms": {
              "enabled": false
            },
            "index_options": "docs",
            "analyzer": "near_match_asciifolding",
            "ignore_above": 5000
          }
        }
      },
      "page": {
        "dynamic": "false",
        "_all": {
          "enabled": false
        },
        "properties": {
          "all": {
            "type": "string",
            "index_analyzer": "text",
            "search_analyzer": "text_search",
            "fields": {
              "plain": {
                "type": "string",
                "index_analyzer": "plain",
                "search_analyzer": "plain_search",
                "position_offset_gap": 10,
                "search_quote_analyzer": "plain_search"
              }
            },
            "position_offset_gap": 10,
            "search_quote_analyzer": "text_search"
          },
          "all_near_match": {
            "type": "string",
            "norms": {
              "enabled": false
            },
            "index_options": "freqs",
            "analyzer": "near_match",
            "fields": {
              "asciifolding": {
                "type": "string",
                "norms": {
                  "enabled": false
                },
                "index_options": "freqs",
                "analyzer": "near_match_asciifolding",
                "position_offset_gap": 10,
                "search_quote_analyzer": "near_match_asciifolding"
              }
            },
            "position_offset_gap": 10,
            "search_quote_analyzer": "near_match"
          },
          "auxiliary_text": {
            "type": "string",
            "index_options": "offsets",
            "index_analyzer": "text",
            "search_analyzer": "text_search",
            "fields": {
              "plain": {
                "type": "string",
                "index_options": "offsets",
                "index_analyzer": "plain",
                "search_analyzer": "plain_search",
                "position_offset_gap": 10,
                "search_quote_analyzer": "plain_search"
              }
            },
            "copy_to": [
              "all"
            ],
            "position_offset_gap": 10,
            "search_quote_analyzer": "text_search"
          },
          "category": {
            "type": "string",
            "norms": {
              "enabled": false
            },
            "index_options": "offsets",
            "index_analyzer": "text",
            "search_analyzer": "text_search",
            "fields": {
              "lowercase_keyword": {
                "type": "string",
                "norms": {
                  "enabled": false
                },
                "index_options": "docs",
                "analyzer": "lowercase_keyword",
                "position_offset_gap": 10,
                "search_quote_analyzer": "lowercase_keyword",
                "ignore_above": 5000
              },
              "plain": {
                "type": "string",
                "norms": {
                  "enabled": false
                },
                "index_options": "offsets",
                "index_analyzer": "plain",
                "search_analyzer": "plain_search",
                "position_offset_gap": 10,
                "search_quote_analyzer": "plain_search"
              }
            },
            "copy_to": [
              "all",
              "all",
              "all",
              "all",
              "all",
              "all",
              "all",
              "all"
            ],
            "position_offset_gap": 10,
            "search_quote_analyzer": "text_search"
          },
          "coordinates": {
            "type": "nested",
            "properties": {
              "coord": {
                "type": "geo_point",
                "lat_lon": true
              },
              "country": {
                "type": "string",
                "index": "not_analyzed"
              },
              "dim": {
                "type": "float"
              },
              "globe": {
                "type": "string",
                "index": "not_analyzed"
              },
              "name": {
                "type": "string",
                "index": "no"
              },
              "primary": {
                "type": "boolean"
              },
              "region": {
                "type": "string",
                "index": "not_analyzed"
              },
              "type": {
                "type": "string",
                "index": "not_analyzed"
              }
            }
          },
          "external_link": {
            "type": "string",
            "norms": {
              "enabled": false
            },
            "index_options": "docs",
            "analyzer": "keyword",
            "ignore_above": 5000
          },
          "file_text": {
            "type": "string",
            "index_options": "offsets",
            "index_analyzer": "text",
            "search_analyzer": "text_search",
            "fields": {
              "plain": {
                "type": "string",
                "index_options": "offsets",
                "index_analyzer": "plain",
                "search_analyzer": "plain_search",
                "position_offset_gap": 10,
                "search_quote_analyzer": "plain_search"
              }
            },
            "copy_to": [
              "all"
            ],
            "position_offset_gap": 10,
            "search_quote_analyzer": "text_search"
          },
          "heading": {
            "type": "string",
            "norms": {
              "enabled": false
            },
            "index_options": "offsets",
            "index_analyzer": "text",
            "search_analyzer": "text_search",
            "fields": {
              "plain": {
                "type": "string",
                "norms": {
                  "enabled": false
                },
                "index_options": "offsets",
                "index_analyzer": "plain",
                "search_analyzer": "plain_search",
                "position_offset_gap": 10,
                "search_quote_analyzer": "plain_search"
              }
            },
            "copy_to": [
              "all",
              "all",
              "all",
              "all",
              "all"
            ],
            "position_offset_gap": 10,
            "search_quote_analyzer": "text_search"
          },
          "incoming_links": {
            "type": "long"
          },
          "language": {
            "type": "string",
            "norms": {
              "enabled": false
            },
            "index_options": "docs",
            "analyzer": "keyword",
            "ignore_above": 5000
          },
          "local_sites_with_dupe": {
            "type": "string",
            "norms": {
              "enabled": false
            },
            "index_options": "docs",
            "analyzer": "lowercase_keyword",
            "ignore_above": 5000
          },
          "namespace": {
            "type": "long"
          },
          "namespace_text": {
            "type": "string",
            "norms": {
              "enabled": false
            },
            "index_options": "docs",
            "analyzer": "keyword",
            "ignore_above": 5000
          },
          "opening_text": {
            "type": "string",
            "index_analyzer": "text",
            "search_analyzer": "text_search",
            "fields": {
              "plain": {
                "type": "string",
                "index_analyzer": "plain",
                "search_analyzer": "plain_search",
                "position_offset_gap": 10,
                "search_quote_analyzer": "plain_search"
              }
            },
            "copy_to": [
              "all",
              "all",
              "all"
            ],
            "position_offset_gap": 10,
            "search_quote_analyzer": "text_search"
          },
          "outgoing_link": {
            "type": "string",
            "norms": {
              "enabled": false
            },
            "index_options": "docs",
            "analyzer": "keyword",
            "ignore_above": 5000
          },
          "redirect": {
            "dynamic": "false",
            "properties": {
              "namespace": {
                "type": "long"
              },
              "title": {
                "type": "string",
                "norms": {
                  "enabled": false
                },
                "index_options": "offsets",
                "index_analyzer": "text",
                "search_analyzer": "text_search",
                "fields": {
                  "keyword": {
                    "type": "string",
                    "norms": {
                      "enabled": false
                    },
                    "index_options": "docs",
                    "analyzer": "keyword",
                    "position_offset_gap": 10,
                    "search_quote_analyzer": "keyword"
                  },
                  "near_match": {
                    "type": "string",
                    "norms": {
                      "enabled": false
                    },
                    "index_options": "docs",
                    "analyzer": "near_match",
                    "position_offset_gap": 10,
                    "search_quote_analyzer": "near_match"
                  },
                  "near_match_asciifolding": {
                    "type": "string",
                    "norms": {
                      "enabled": false
                    },
                    "index_options": "docs",
                    "analyzer": "near_match_asciifolding",
                    "position_offset_gap": 10,
                    "search_quote_analyzer": "near_match_asciifolding"
                  },
                  "plain": {
                    "type": "string",
                    "norms": {
                      "enabled": false
                    },
                    "index_options": "offsets",
                    "index_analyzer": "plain",
                    "search_analyzer": "plain_search",
                    "position_offset_gap": 10,
                    "search_quote_analyzer": "plain_search"
                  },
                  "prefix": {
                    "type": "string",
                    "norms": {
                      "enabled": false
                    },
                    "index_options": "docs",
                    "index_analyzer": "prefix",
                    "search_analyzer": "near_match",
                    "position_offset_gap": 10,
                    "search_quote_analyzer": "near_match"
                  },
                  "prefix_asciifolding": {
                    "type": "string",
                    "norms": {
                      "enabled": false
                    },
                    "index_options": "docs",
                    "index_analyzer": "prefix_asciifolding",
                    "search_analyzer": "near_match_asciifolding",
                    "position_offset_gap": 10,
                    "search_quote_analyzer": "near_match_asciifolding"
                  },
                  "suggest": {
                    "type": "string",
                    "norms": {
                      "enabled": false
                    },
                    "analyzer": "suggest",
                    "position_offset_gap": 10,
                    "search_quote_analyzer": "suggest"
                  }
                },
                "copy_to": [
                  "suggest",
                  "all",
                  "all",
                  "all",
                  "all",
                  "all",
                  "all",
                  "all",
                  "all",
                  "all",
                  "all",
                  "all",
                  "all",
                  "all",
                  "all",
                  "all",
                  "all_near_match",
                  "all_near_match",
                  "all_near_match",
                  "all_near_match",
                  "all_near_match",
                  "all_near_match",
                  "all_near_match",
                  "all_near_match",
                  "all_near_match",
                  "all_near_match",
                  "all_near_match",
                  "all_near_match",
                  "all_near_match",
                  "all_near_match",
                  "all_near_match"
                ],
                "position_offset_gap": 10,
                "search_quote_analyzer": "text_search"
              }
            }
          },
          "source_text": {
            "type": "string",
            "norms": {
              "enabled": false
            },
            "index_analyzer": "text",
            "search_analyzer": "text_search",
            "fields": {
              "plain": {
                "type": "string",
                "norms": {
                  "enabled": false
                },
                "index_analyzer": "plain",
                "search_analyzer": "plain_search",
                "position_offset_gap": 10,
                "search_quote_analyzer": "plain_search"
              },
              "trigram": {
                "type": "string",
                "norms": {
                  "enabled": false
                },
                "index_options": "docs",
                "analyzer": "trigram",
                "position_offset_gap": 10,
                "search_quote_analyzer": "trigram"
              }
            },
            "position_offset_gap": 10,
            "search_quote_analyzer": "text_search"
          },
          "suggest": {
            "type": "string",
            "analyzer": "suggest"
          },
          "template": {
            "type": "string",
            "norms": {
              "enabled": false
            },
            "index_options": "docs",
            "analyzer": "lowercase_keyword",
            "ignore_above": 5000
          },
          "text": {
            "type": "string",
            "index_options": "offsets",
            "index_analyzer": "text",
            "search_analyzer": "text_search",
            "fields": {
              "plain": {
                "type": "string",
                "index_options": "offsets",
                "index_analyzer": "plain",
                "search_analyzer": "plain_search",
                "position_offset_gap": 10,
                "search_quote_analyzer": "plain_search"
              },
              "word_count": {
                "type": "token_count",
                "store": true,
                "analyzer": "plain"
              }
            },
            "copy_to": [
              "all"
            ],
            "position_offset_gap": 10,
            "search_quote_analyzer": "text_search"
          },
          "text_bytes": {
            "type": "long",
            "index": "no"
          },
          "timestamp": {
            "type": "date",
            "format": "dateOptionalTime"
          },
          "title": {
            "type": "string",
            "index_analyzer": "text",
            "search_analyzer": "text_search",
            "fields": {
              "keyword": {
                "type": "string",
                "index_options": "docs",
                "analyzer": "keyword",
                "position_offset_gap": 10,
                "search_quote_analyzer": "keyword"
              },
              "near_match": {
                "type": "string",
                "index_options": "docs",
                "analyzer": "near_match",
                "position_offset_gap": 10,
                "search_quote_analyzer": "near_match"
              },
              "near_match_asciifolding": {
                "type": "string",
                "index_options": "docs",
                "analyzer": "near_match_asciifolding",
                "position_offset_gap": 10,
                "search_quote_analyzer": "near_match_asciifolding"
              },
              "plain": {
                "type": "string",
                "index_analyzer": "plain",
                "search_analyzer": "plain_search",
                "position_offset_gap": 10,
                "search_quote_analyzer": "plain_search"
              },
              "prefix": {
                "type": "string",
                "index_options": "docs",
                "index_analyzer": "prefix",
                "search_analyzer": "near_match",
                "position_offset_gap": 10,
                "search_quote_analyzer": "near_match"
              },
              "prefix_asciifolding": {
                "type": "string",
                "index_options": "docs",
                "index_analyzer": "prefix_asciifolding",
                "search_analyzer": "near_match_asciifolding",
                "position_offset_gap": 10,
                "search_quote_analyzer": "near_match_asciifolding"
              },
              "suggest": {
                "type": "string",
                "analyzer": "suggest",
                "position_offset_gap": 10,
                "search_quote_analyzer": "suggest"
              }
            },
            "copy_to": [
              "suggest",
              "all",
              "all",
              "all",
              "all",
              "all",
              "all",
              "all",
              "all",
              "all",
              "all",
              "all",
              "all",
              "all",
              "all",
              "all",
              "all",
              "all",
              "all",
              "all",
              "all",
              "all_near_match",
              "all_near_match",
              "all_near_match",
              "all_near_match",
              "all_near_match",
              "all_near_match",
              "all_near_match",
              "all_near_match",
              "all_near_match",
              "all_near_match",
              "all_near_match",
              "all_near_match",
              "all_near_match",
              "all_near_match",
              "all_near_match",
              "all_near_match",
              "all_near_match",
              "all_near_match",
              "all_near_match",
              "all_near_match"
            ],
            "position_offset_gap": 10,
            "search_quote_analyzer": "text_search"
          },
          "wikibase_item": {
            "type": "string",
            "norms": {
              "enabled": false
            },
            "index_options": "docs",
            "analyzer": "keyword",
            "ignore_above": 5000
          }
        }
      }
    }
  }
}

Maybe it's being character filtered:

$ curl -s -XGET localhost:9900/enwiki_content_1415916568/_analyze?analyzer=text -d '✭'
{"tokens":[]}
$ curl -s -XGET localhost:9900/enwiki_content_1415916568/_analyze?analyzer=text -d '✭NSYNC'
{"tokens":[{"token":"nsync","start_offset":1,"end_offset":6,"type":"<ALPHANUM>","position":1}]}

$ curl -s -XGET localhost:9900/enwiki_content_1415916568/_analyze?analyzer=near_match -d '✭'
{"tokens":[{"token":"✭","start_offset":0,"end_offset":1,"type":"word","position":1}]}

$ curl -s -XGET localhost:9900/enwiki_content_1415916568/_analyze?field=all -d '✭'
{"tokens":[]}
$ curl -s -XGET localhost:9900/enwiki_content_1415916568/_analyze?field=all_near_match -d '✭'
{"tokens":[{"token":"✭","start_offset":0,"end_offset":1,"type":"word","position":1}]}

The whitespace analyzer doesn't mind:

$ curl -s -XGET localhost:9900/enwiki_content_1415916568/_analyze?analyzer=whitespace -d '✭'
{"tokens":[{"token":"✭","start_offset":0,"end_offset":1,"type":"word","position":1}]}
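The difference between the outputs above can be imitated outside Elasticsearch. This is a rough Python analogy, not the real analyzers: it assumes a word-based tokenizer behaves roughly like the regex `\w+` plus lowercasing, and a whitespace tokenizer like `str.split()`. The function names are made up for the illustration.

```python
import re


def word_tokens(text):
    """Stand-in for a word-based analyzer: keep runs of word characters, lowercased."""
    return [
        {"token": m.group().lower(), "start_offset": m.start(), "end_offset": m.end()}
        for m in re.finditer(r"\w+", text)  # \w covers letters, digits and underscore
    ]


def whitespace_tokens(text):
    """Stand-in for a whitespace analyzer: split on whitespace only, keep everything else."""
    return text.split()


print(word_tokens("✭"))        # the symbol alone leaves no tokens
print(word_tokens("✭NSYNC"))   # only the letters survive, with offsets 1..6
print(whitespace_tokens("✭"))  # the symbol is kept as a token
```

Under these assumptions the symbol disappears in the word-based path and survives in the whitespace path, which is the same pattern the `text` and `whitespace` analyzers show above.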

Blocked: the browser tests are super broken:

384 scenarios (378 failed, 6 skipped)
1219 steps (21 failed, 943 skipped, 255 passed)

Took 7326.30138746 seconds
Deskana lowered the priority of this task from Medium to Lowest. Dec 3 2015, 5:45 PM
Deskana subscribed.

Searching for unicode symbols is likely pretty rare, so we just can't prioritise this right now.

Restricted Application added a subscriber: Luke081515.

It is indeed inconsistent, but it is predictable if you have spent waaaaaay too much time digging into all this. The short version is that the standard tokenizer, which breaks text into words and is used by most analyzers (for languages with spaces) for regular search, treats "symbols" like ★ as non-word characters. To the tokenizer, a symbol is just part of the stuff between words, like whitespace and punctuation.

The simple ★ search gets one result because we also check for exact title matches. This is how we get matches on punctuation and other symbols.
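As a sketch of why the exact-title path still finds ★: if the whole title is kept as a single lowercased key rather than split into words, a symbol-only title remains searchable. The Python below is an in-memory illustration under that assumption. `near_match_key` and `build_title_index` are hypothetical names, and the sample titles are invented.

```python
def near_match_key(title):
    # Assumption: the whole title becomes one token, trimmed and lowercased,
    # with no word splitting, so symbols are not thrown away.
    return title.strip().lower()


def build_title_index(titles):
    """Map each normalized title key to the original titles that produce it."""
    index = {}
    for title in titles:
        index.setdefault(near_match_key(title), []).append(title)
    return index


index = build_title_index(["★", "NSYNC", "Star"])
print(index.get(near_match_key("★"), []))      # symbol-only title is found
print(index.get(near_match_key("nsync"), []))  # case-insensitive title match
```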

The insource and intitle searches don't match anything because they run the query text through an analyzer, and once analysis is done no tokens are left to search for.

The regex searches don't get all the results (or consistent results) because they time out before being able to do a full scan of the index looking for the one character. The regex acceleration needs trigrams to work with in order to find anything.
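The trigram point can be made concrete with a small Python sketch. It assumes the acceleration works by extracting three-character substrings from the literal parts of a regex and using them to narrow the candidate documents. The `trigrams` helper is hypothetical, not the actual implementation.

```python
def trigrams(literal):
    """Return the set of three-character substrings of a literal string."""
    return {literal[i:i + 3] for i in range(len(literal) - 2)}


print(trigrams("nsync"))  # three trigrams that could narrow the candidate set
print(trigrams("★"))      # empty set: nothing to narrow with
```

With an empty trigram set there is no way to pre-filter, so every document has to be scanned, which is where the timeouts come from.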

See more at T211824: Investigate a “rare-character” index.

TJones claimed this task.

I'm going to close this ticket because all of the example queries now do reasonable things (or, the unreasonable parts have to do with regex searches timing out, not Unicode characters).

I checked on a selection of characters mentioned here, in T211824, and in discussions linked from T211824: ☃ ☁ ☀ ☂ 😁 💃 💀 ☥ 〃 〆 £ € ® 🤪 ★ 🌀 🙏 🚀 🛳 ☀ ☄ ☇ ♬ ♰ ✒ ✙ ➿ ☎

Using the ICU tokenizer (which is now standard everywhere), all of those are indexed except 〃 £ € ® ✙

The standard tokenizer (only used in a few places) indexes all of them except 〃 £ € ✙ (i.e., ® makes it through)

I'm not sure about ✙, but the others I would consider either punctuation or "punctuation-adjacent".
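For context, the Unicode general categories of the skipped characters can be listed with Python's standard `unicodedata` module. This is only a sketch for orientation: tokenizers follow word-break rules rather than general categories directly, so the categories are a hint and not an explanation. `general_categories` is a hypothetical helper name.

```python
import unicodedata

# Characters reported above as not indexed by the ICU tokenizer, plus ★ for comparison.
CHARS = ["〃", "£", "€", "®", "✙", "★"]


def general_categories(chars):
    """Map each character to its Unicode general category (e.g. Po, Sc, So)."""
    return {ch: unicodedata.category(ch) for ch in chars}


for ch, cat in general_categories(CHARS).items():
    print(f"U+{ord(ch):04X} {ch} {cat}")
```

〃 is other punctuation (Po) and £ and € are currency symbols (Sc), which fits "punctuation-adjacent". ® and ✙ are both "other symbols" (So), the same category as ★, so the category alone does not explain why ✙ is skipped.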

This is a big enough improvement in general and specifically for ★ and similar characters that I think this is no longer the same kind of issue it once was.