The term search index currently uses on-the-fly conversion to utf8 (and then lower case) to perform comparisons. That means a full table scan followed by a file sort on a table that is likely to contain several dozen million rows. That's likely to kill the DB server.
To avoid this, there should be a dedicated search key column holding the normalized key (similar to the way a search key column is used for category sorting and finding external links). The same normalization shall apply to the index term when inserted and the search term when generating the query. In particular, the following normalization shall apply:
- unicode normalization (NFC)
- trim leading and trailing whitespace (ideally, all unicode whitespace chars)
- lower case (ideally, using the implementation from the appropriate Language class).
- optionally, apply a configurable regular expression for stripping separators (e.g. per default stripping all internal whitespace and hyphens, so "foobar" will match "foo-bar" and "foo bar").
This will provide case-insensitive matches with some flexibility regarding whitespace, etc. If only exact matches are desired, the "soft" result could be filtered programmatically before returning it to the caller.
Version: unspecified
Severity: critical