TL;DR: Look into how expensive it would be to create a document-level index of individual "rare" characters, which would let users find documents containing characters that are usually ignored at indexing time. (n.b.: Normal punctuation characters would still not be easily searchable.)
As discussed in this Extension:CirrusSearch discussion topic, certain characters are extremely hard to find on-wiki, for example ☥ (Ankh), 〃 (ditto mark), and 〆 (ideographic closing mark). Only articles with exact title or redirect matches are shown.
The problem is that these and similar characters are often treated as punctuation by the tokenizer, or converted into something else by the normalization process. When the tokenizer consumes the characters (☥ is treated as punctuation for some reason, for example), even searching for them in quotes (i.e., using the plain field) or using regular insource won’t find them.
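As a quick illustration of the tokenizer behavior (a sketch, assuming a local Elasticsearch and the elasticsearch-py client; the analyzers CirrusSearch actually uses differ, but behave similarly for these characters):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")
# Ask the standard analyzer to tokenize a phrase containing ☥.
# (Keyword-argument style is for recent elasticsearch-py versions; older
# versions take a body dict instead.)
resp = es.indices.analyze(analyzer="standard", text="the ☥ symbol")
print([t["token"] for t in resp["tokens"]])
# Expected output: ['the', 'symbol']; the ☥ never becomes a token, so neither
# quoted searches nor plain insource searches can ever match it.
```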
Searching with insource regexes (e.g., insource:/☥/) will find some instances, but it is slow and CPU-intensive, and single-character queries are guaranteed to time out on larger wikis (we need trigrams for the regex acceleration to work).
A solution proposed in the discussion page was an analyzer that did whitespace tokenization only. This would work for some use cases, but might be confusing and would create a very large index that would not be used by many people.
An alternate proposal that @dcausse and I discussed involves a custom tokenizer that would emit a token for each individual character, ignore the most common characters, and index the remaining characters only at the document level. There are several details to work out:
- Do we index the raw source of the document, or the version readers see?
- Do we index just the text of the document, or also the auxiliary text and other transcluded text?
- The exact set of characters to ignore is unclear, but for English, French, or Russian, a first pass would be to ignore any character with its own key on a standard keyboard for the language: all the letters, numbers, the space, and common punctuation.
- For English, this is still ambiguous because £ is on a standard UK keyboard, but not a standard US keyboard—do we do the intersection or the union of major keyboard layouts?
- It is possible (even desirable) that some documents would not be in this index because they have nothing but “boring” characters in them.
- If we index the document source, we’d need to make sure common wiki markup makes it into the set of ignored characters.
Rather than write such a tokenizer, I think we can do a two-step approximation process (the goal of this task):
- Step 1: write a program to simulate the indexing, run it on samples of 1K to 100K random documents, and extrapolate the number of unique characters and the percentage of documents indexed per character for, say, English Wikipedia (a rough sketch of such a simulation appears right after this list).
- Step 2: if the numbers from Step 1 aren’t unreasonable, we can quickly (but inefficiently) approximate the desired behavior in Elasticsearch with character filters to (i) drop the ignored characters and (ii) add a space after every remaining character, and then use the whitespace tokenizer to generate tokens (a sketch of this analysis chain is at the end of this description). This may also drop non-standard whitespace characters (thin space, hair space, three-per-em space, etc.), but that only affects a handful of characters. We could build an index on RelForge and see how big the resulting index is, and whether it is tenable in production.
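Below is a minimal sketch of what the Step 1 simulation could look like, assuming Python and a first-pass "ignored" set of printable ASCII; the document sample, the ignored set, and all names here are placeholders:

```python
from collections import Counter
import string

# First-pass "boring" set: ASCII letters, digits, whitespace, and common
# punctuation, i.e. roughly what has its own key on a standard US keyboard.
IGNORED = set(string.ascii_letters + string.digits +
              string.punctuation + string.whitespace)

def indexable_chars(text):
    """Characters in a document that the proposed index would keep."""
    return {ch for ch in text if ch not in IGNORED}

def simulate(docs):
    """docs: an iterable of document texts (e.g., 1K-100K random articles)."""
    doc_freq = Counter()   # number of documents each rare character appears in
    total = docs_with_rare = 0
    for text in docs:
        total += 1
        chars = indexable_chars(text)
        if chars:
            docs_with_rare += 1
            doc_freq.update(chars)
    print(f"{total} docs sampled; {docs_with_rare} contain at least one indexable character")
    print(f"{len(doc_freq)} unique indexable characters")
    for ch, n in doc_freq.most_common(20):
        print(f"U+{ord(ch):04X} {ch!r}: in {n / total:.2%} of documents")
    return doc_freq
```

Running it on progressively larger samples (1K, 10K, 100K) gives a curve from which to extrapolate the number of unique characters and the per-character document frequencies for the full wiki.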
If the index from Step 2 looks reasonable, we could then build a tokenizer that does the same thing efficiently and correctly, build the index, and create a new keyword for it (char:, perhaps).
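For reference, the quick-and-dirty Step 2 analysis chain that this tokenizer would eventually replace could be expressed in Elasticsearch index settings roughly like this (a sketch: the analyzer and filter names are illustrative, and the ASCII-only "ignored" class stands in for the real set discussed above):

```python
# Elasticsearch analysis settings for the Step 2 approximation, written as a
# Python dict so it can be passed to an indices.create() call.
RARE_CHAR_SETTINGS = {
    "settings": {
        "analysis": {
            "char_filter": {
                # (i) drop the "boring" characters (here: printable ASCII)
                "drop_common_chars": {
                    "type": "pattern_replace",
                    "pattern": "[\\x20-\\x7E]",
                    "replacement": "",
                },
                # (ii) add a space after every remaining character so the
                # whitespace tokenizer emits one token per character
                "split_remaining_chars": {
                    "type": "pattern_replace",
                    "pattern": "(.)",
                    "replacement": "$1 ",
                },
            },
            "analyzer": {
                "rare_chars": {
                    "char_filter": ["drop_common_chars", "split_remaining_chars"],
                    "tokenizer": "whitespace",
                },
            },
        }
    }
}
```

With something like this on RelForge we could measure the index size directly; the production tokenizer would then do the same work in a single pass, and queries would use the new keyword (e.g., char:☥).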