
Investigate a “rare-character” index
Open, Medium, Public

Description

TL;DR: Look into how expensive it would be to create a document-level index for individual "rare" characters, which would allow users to find documents that contain rarer characters that are usually ignored when indexing. (n.b.: Normal punctuation characters would still not be easily searchable.)


As discussed in this Extension:CirrusSearch discussion topic, certain characters are extremely hard to find on-wiki, for example ☥ (Ankh), 〃 (ditto mark), and 〆 (ideographic closing mark). Only articles with exact title or redirect matches are shown for them.

The problem is that these and similar characters are often treated as punctuation by the tokenizer, or converted into something else by the normalization process. When the tokenizer consumes the characters (some are treated as punctuation for no obvious reason), even searching for them in quotes (i.e., using the plain field) or using regular insource won’t find them.

Searching using insource regexes (e.g., insource:/☥/) will find some instances, but it is slow and CPU-intensive, and single-character queries are guaranteed to time out on larger wikis (the regex acceleration needs trigrams to work).
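To see the tokenizer behavior directly, you can ask any Elasticsearch cluster's _analyze endpoint what the standard analyzer does with such a character. A minimal sketch in Python; the localhost URL and the sample text are placeholders, not anything in production:

```
# Ask Elasticsearch how the standard analyzer tokenizes text containing an ankh.
# The URL is an assumed local test cluster; any cluster's _analyze endpoint works.
import requests

resp = requests.post(
    "http://localhost:9200/_analyze",
    json={"analyzer": "standard", "text": "the ankh \u2625 symbol"},
)
print(resp.json())
# Expect tokens for "the", "ankh", and "symbol" only: the ☥ (U+2625) is dropped,
# which is why quoted searches against the plain field can't find it.
```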

A solution proposed on the discussion page was an analyzer that does whitespace-only tokenization. This would work for some use cases, but might be confusing and would create a very large index that would not be used by many people.

An alternate proposal that @dcausse and I discussed would involve creating a custom tokenizer that creates a token for each individual character, but ignores the most common characters, and only indexes the characters at the document level.
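As a rough illustration of the intended document-level behavior (a sketch only, not the eventual tokenizer; the IGNORED set is a stand-in for the keyboard-character question in the list below):

```
# Reduce a document to the set of "rare" characters it contains. IGNORED is a
# placeholder for "characters with their own key on a standard US keyboard".
import string

IGNORED = frozenset(string.ascii_letters + string.digits + string.punctuation + " \t\n\r")

def rare_chars(text: str) -> set:
    """Return the set of characters that would be indexed for this document."""
    return {ch for ch in text if ch not in IGNORED}

print(rare_chars("Se\u00f1or Smith paid \u00a35 for an ankh (\u2625)."))
# -> {'ñ', '£', '☥'}; a document with only "boring" characters yields an empty
# set and simply wouldn't appear in the rare-character index.
```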

Unresolved questions:

  • Do we index the raw source of the document, or the version readers see?
  • Do we index just the text of the document, or also the auxiliary text and other transcluded text?
  • The exact set of characters to ignore is unclear, but for English, French, or Russian, a first pass would be that any character with its own key on a standard keyboard for the language would be ignored: that’s all the letters, numbers, space, and common punctuation.
    • For English, this is still ambiguous because £ is on a standard UK keyboard, but not a standard US keyboard—do we do the intersection or the union of major keyboard layouts?
    • It is possible (even desirable) that some documents would not be in this index because they have nothing but “boring” characters in them.
    • If we index the document source, we’d need to make sure common wiki markup makes it into the set of ignored characters.

Rather than write such a tokenizer, I think we can do a two-step approximation process (the goal of this task):

  • Step 1: write a program to simulate the indexing, run it on samples of 1K to 100K random documents, and extrapolate the number of unique characters and the percentage of documents indexed per character for, say, English Wikipedia (a counting sketch follows after this list).
  • Step 2: if the numbers from Step 1 aren’t unreasonable, we can quickly (but inefficiently) approximate the desired behavior in Elasticsearch with character filters to (i) drop the ignored characters and (ii) add spaces after all other characters, and then use the whitespace tokenizer to generate tokens (a settings sketch also follows below). This may also drop non-standard whitespace characters (thin space, hair space, three-per-em space, etc.), but only a handful of characters are affected. We could build an index on RelForge and see how big the resulting index is, and whether it is tenable in production.
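For Step 1, the counting could look something like this (a sketch; load_sample_docs() is a hypothetical helper that yields the text of the sampled documents, and the ASCII-only ignored set is the same placeholder as above):

```
# Step 1 sketch: per-character document frequency over a random sample.
# load_sample_docs() is hypothetical; it should yield the text of 1K-100K
# random documents (reader-visible text or wikitext, per the open questions).
import string
from collections import Counter

ASCII_IGNORED = frozenset(string.printable)   # placeholder for the real ignored set

def simulate_index(docs, ignored=ASCII_IGNORED):
    doc_freq = Counter()      # character -> number of sampled docs containing it
    total = boring = 0
    for text in docs:
        total += 1
        chars = {ch for ch in text if ch not in ignored}
        if not chars:
            boring += 1       # docs with no rare characters at all
        doc_freq.update(chars)
    return total, boring, doc_freq

total, boring, doc_freq = simulate_index(load_sample_docs())
print(f"{len(doc_freq)} unique rare characters; {boring}/{total} docs have none")
for ch, n in doc_freq.most_common(25):
    print(f"U+{ord(ch):04X} {ch!r}: {n / total:.2%} of docs")
```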
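For Step 2, the quick-but-inefficient Elasticsearch approximation could be assembled from stock pieces along these lines (a sketch; the index name, the localhost URL, and the ASCII-only "ignored" pattern are all assumptions):

```
# Step 2 sketch: approximate the desired analyzer with stock Elasticsearch
# pieces: two pattern_replace char filters plus the whitespace tokenizer.
import requests

settings = {
    "settings": {
        "analysis": {
            "char_filter": {
                # (i) drop the ignored (here: all ASCII) characters
                "drop_common": {
                    "type": "pattern_replace",
                    "pattern": "[\\u0000-\\u007F]+",
                    "replacement": "",
                },
                # (ii) put a space after every remaining character so the
                # whitespace tokenizer emits one token per character
                "split_chars": {
                    "type": "pattern_replace",
                    "pattern": "(.)",
                    "replacement": "$1 ",
                },
            },
            "analyzer": {
                "rare_char": {
                    "type": "custom",
                    "char_filter": ["drop_common", "split_chars"],
                    "tokenizer": "whitespace",
                }
            },
        }
    },
    "mappings": {"properties": {"text": {"type": "text", "analyzer": "rare_char"}}},
}

resp = requests.put("http://localhost:9200/rare-char-test", json=settings)
print(resp.json())
# Then POST /rare-char-test/_analyze with {"analyzer": "rare_char", "text": ...}
# to check that only the rare characters come back as tokens.
```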

If the index from Step 2 looks reasonable, we could then build a tokenizer that does the same thing, but efficiently and correctly, and then build the index, and create a new keyword for it (like char: perhaps).

Event Timeline

Additional notes from the on-wiki discussion:

  • probably prefer the reader's version of the document (insource handles the document source)
  • auxiliary text and other transcluded text preferred, if feasible

I've opened conversations on English Wikipedia (moved) and Wiktionary, and on Commons.

An interesting set of use cases came up on Wiktionary: searching for control characters, private use area characters, or whitespace characters. This brings up a few ideas about a new keyword and its implementation:

  • It would be nice to be able to specify characters by number (e.g., \u2002 or U+2002 for an 'en space').
  • It would be helpful to be able to specify a range of characters with an implicit OR (e.g., \u2002-\u200D for common whitespace characters).
  • Support for ranges has to be limited because a search for the whole Supplementary Private Use Area-A (\uF0000-\uFFFFF) would kick off ~65K term searches on the back end.

We could discuss searching by Unicode block, but that's probably not a feature for the first round of implementation. However, it wouldn't be terrible to look into the commonly-supported named Unicode regex patterns—see the section Unicode Blocks here, for example—while doing the initial simulated indexing investigation.
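To make the range-with-a-cap idea from the list above concrete, here is a sketch of expanding a hypothetical char: argument into an explicit OR-list of characters; the escape syntax and the 64-character cap are assumptions, not a spec:

```
# Parse a hypothetical char: argument like "\u2002" or "\u2002-\u200D" into a
# capped list of characters to OR together at query time.
import re

MAX_RANGE = 64  # keep a PUA-sized range from exploding into ~65K term queries

ESCAPE = r"(?:\\u|U\+)([0-9A-Fa-f]{4,6})"
RANGE_RE = re.compile(rf"^{ESCAPE}(?:-{ESCAPE})?$")

def expand(arg: str) -> list:
    m = RANGE_RE.match(arg)
    if not m:
        return [arg]                      # a literal character, e.g. "☥"
    lo = int(m.group(1), 16)
    hi = int(m.group(2), 16) if m.group(2) else lo
    if hi - lo + 1 > MAX_RANGE:
        raise ValueError(f"range {arg} expands to {hi - lo + 1} characters")
    return [chr(cp) for cp in range(lo, hi + 1)]

print(expand(r"\u2002"))          # ['\u2002']  (en space)
print(expand(r"\u2002-\u200D"))   # 12 characters, ORed together at query time
# expand(r"\uF0000-\uFFFFF")      # would raise: 65,536 characters is too many
```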

Another thought from Wiktionary: searching for rare characters in titles (especially zero-width non-joiners, directionality markers, soft hyphens, punctuation/whitespace outside of the Basic Latin Block, combining diacritics, etc.) would be useful. So maybe a titles-only index would be nice, too.

[Edit: explicitly mention soft hyphens, which came up in the continuing discussion.]

You could also define synonyms for many of these:

  • € -> euro
  • ® -> registered
  • 🤪 -> goofy face

I've been thinking for a while that having synonyms would be handy, also for abbreviations, for instance. The problem is how to curate such a list, and how to scale that to multiple languages. For English symbols you could easily source it from the Unicode character names, of course, but for other languages? Maybe generate mappings by feeding symbols through Wikidata and getting their labels? Hmm, Wikidata also has an abbreviation property, of course...

Yep, it's the same thing for both of those characters: the tokenizer, which breaks text up into words, is discarding them. Both match the Unicode regex for "Other Symbols", and so the standard Elasticsearch tokenizer discards them (along with punctuation) as "stuff between real words". More comments on specific results over at T95849.
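For the record, the relevant Unicode general categories are easy to check; a quick sketch (the specific example characters here are mine, not from the original report):

```
# Check why the standard tokenizer treats these as "stuff between real words":
# their Unicode general categories. "So" = Symbol, other.
import unicodedata

for ch in "\u2625\u20ac\U0001f92a.a":   # ☥, €, 🤪, a period, a letter
    print(f"U+{ord(ch):04X} {unicodedata.name(ch, '?')}: "
          f"{unicodedata.category(ch)}")
# ☥ and 🤪 are So (Other Symbol), € is Sc (Currency Symbol), '.' is Po
# (punctuation); only the letter survives standard tokenization.
```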

Yeah, I think synonyms are an interesting way to address this in part, but I think they are out of scope for this particular ticket, which is about finding a way to search for the exact "rare" character someone typed. Finding the appropriate "translation" for symbols would be a challenge, though the Wikidata angle is a good one, too.

We haven't really explored the thesaurus capabilities of Elasticsearch, though it is on our long list of "some day" potential projects (I added this idea to my version of the list, too.)

Abbreviations and acronyms are another completely different topic I'd like to address, given infinite time and resources. For English, we currently break on periods, so ''NASA'' and ''N.A.S.A.'' don't match. I'd like to try to find likely acronyms and map them to period-less versions. Abbreviations (like ''abbrev.'') are another place where a thesaurus could help, but we've never dug into it. The implications for scoring (especially with our machine learning–based scoring for the top ~20 languages) are complex.
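Purely as a tangent, the acronym mapping could be as crude as stripping periods from likely acronyms; a heuristic sketch, not what CirrusSearch does today:

```
# Map likely acronyms ("N.A.S.A.") to period-less forms ("NASA") so the two
# variants can match. A real version would need exceptions: this heuristic
# also turns 'e.g.' into 'eg'.
import re

ACRONYM = re.compile(r"\b(?:[A-Za-z]\.){2,}")

def strip_acronym_periods(text: str) -> str:
    return ACRONYM.sub(lambda m: m.group(0).replace(".", ""), text)

print(strip_acronym_periods("N.A.S.A. was founded in 1958."))
# -> "NASA was founded in 1958."
```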

Another good use case from the Wiktionary discussion is regex searches that have no trigrams for the regex search acceleration to latch on to. As a use case, suppose you have a complex regex with no easy trigrams, but centered on finding specific cases of zero-width non-joiners (ZWNJs). Adding a char:[ZWNJ] clause to the search vastly limits the universe of documents to be scanned with the much more complex regex, down to something that might finish before it times out.

This is also another argument for searching by ranges or character classes. My favorite Latin/Cyrillic homoglyph search would have a chance of completing on enwiki if it were limited to documents with at least one Cyrillic character in them (rather than trying to scan all 5M+ documents).
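The query shape this would enable is roughly the following (every name here is hypothetical: the char: keyword doesn't exist, rare_chars is an imagined field, and the regexp clause is just a stand-in for the real insource regex machinery):

```
# Shape of the combined query: a cheap terms filter on the (hypothetical)
# rare_chars field narrows the candidate set before the expensive regex clause
# runs over the source text.
query = {
    "query": {
        "bool": {
            "filter": [
                {"term": {"rare_chars": "\u200c"}}   # ZWNJ, from the doc-level index
            ],
            "must": [
                # stand-in for the real insource:// regex implementation
                {"regexp": {"source_text": ".*\u200c[\u0900-\u097f]+.*"}}
            ],
        }
    }
}
```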

Another idea from the Extension talk page: something equivalent to char:emoji to help people find weird editing errors and vandalism. See T59884, T126047, and T129310 for cases of weird editing bugs generating emoji.

Unfortunately, a comprehensive emoji regex might be expensive. Simpler range-based ones, like [🌀-🙏🚀-🛳☀-☄☇-♬♰-✒✙-➿☎], might be more plausible and sufficiently useful for document-level indexing.
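As a rough sanity check of those ranges (copied from the comment above; deliberately not comprehensive):

```
# Range-based "emoji-ish" character class; newer emoji fall outside it.
import re

EMOJI_ISH = re.compile(
    "[\U0001F300-\U0001F64F"      # 🌀-🙏
    "\U0001F680-\U0001F6F3"       # 🚀-🛳
    "\u2600-\u2604\u2607-\u266C"  # ☀-☄ ☇-♬
    "\u2670-\u2712\u2719-\u27BF"  # ♰-✒ ✙-➿
    "\u260E]"                     # ☎
)

print(bool(EMOJI_ISH.search("plain old text")))            # False
print(bool(EMOJI_ISH.search("oops \U0001F642 snuck in")))  # True (🙂, U+1F642)
print(bool(EMOJI_ISH.search("\U0001F92A")))                # False: newer emoji
                                                           # are outside these ranges
```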

Another angle (from the Extension talk page): it might be useful to find articles with no rare characters (still looking for a concrete use case), so it makes sense in this initial investigation to track how many articles have no rare characters to see how well such a theoretical search conjunct would limit the scope of the more expensive part of a query.