
Some characters are lost in title and search snippet highlights
Open, Low, Public

Description

Certain characters[1] are lost when highlighted in titles and text snippets.

To reproduce, search for [[ https://en.wiktionary.org/w/index.php?search=intitle%3A%2F%5B%F0%94%90%80-%F0%94%99%86%5D%2F+anatolian&title=Special%3ASearch&profile=default&fulltext=1&searchengineselect=mediawiki | intitle:/[𔐀-𔙆]/ anatolian ]] on English Wiktionary.
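As a quick check of what the query actually asks for, the percent-encoded `search` parameter from the link above can be decoded with the Python standard library. This is only an illustrative sketch; the helper name `search_query` is made up for this example and is not part of MediaWiki or CirrusSearch.

```python
from urllib.parse import urlsplit, parse_qs

# The reproduction URL from the task description, copied verbatim.
URL = (
    "https://en.wiktionary.org/w/index.php"
    "?search=intitle%3A%2F%5B%F0%94%90%80-%F0%94%99%86%5D%2F+anatolian"
    "&title=Special%3ASearch&profile=default&fulltext=1"
    "&searchengineselect=mediawiki"
)

def search_query(url: str) -> str:
    """Return the decoded value of the `search` query parameter."""
    return parse_qs(urlsplit(url).query)["search"][0]

query = search_query(URL)

# Code points above U+FFFF lie outside the Basic Multilingual Plane.
astral = [f"U+{ord(ch):04X}" for ch in query if ord(ch) > 0xFFFF]
print(astral)
```

The two code points reported are the endpoints of the regex character class, U+14400 and U+14646, both in the Anatolian Hieroglyphs block.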

The three results are 𔐱𔕬𔗬𔑰𔖱, 𔖪𔖱𔖪, and 𔑮๐”“𔗵𔗬. However, they are displayed as 𔖱, 𔖪, and 𔗬; see screenshot:

Screen Shot 2020-01-08 at 3.50.49 PM.png (1×1 px, 187 KB)

Looking at the underlying HTML, the title of the first result (𔖱) contains several empty searchmatch spans: <span class="searchmatch"></span><span class="searchmatch"></span><span class="searchmatch"></span><span class="searchmatch"></span><span class="searchmatch">𔖱</span>
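A small script can flag this symptom in saved result HTML. The following is a minimal sketch that assumes the spans look exactly like the markup quoted above; `count_empty_searchmatch` is a hypothetical helper, not an existing CirrusSearch function.

```python
import re

# Matches a searchmatch span whose body is empty or whitespace-only.
EMPTY_MATCH = re.compile(r'<span class="searchmatch">\s*</span>')

def count_empty_searchmatch(html: str) -> int:
    """Count highlight spans that wrap no visible text."""
    return len(EMPTY_MATCH.findall(html))

# Title markup of the first result, as quoted in the description:
# four empty spans followed by one span holding U+145B1.
title_html = (
    '<span class="searchmatch"></span>' * 4
    + '<span class="searchmatch">\U000145B1</span>'
)

print(count_empty_searchmatch(title_html))
```

Any non-zero count suggests the highlighter matched something but emitted no text for it.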

I think this may have something to do with the characters being lost during tokenization (or being the kinds of characters that are lost during tokenization; maybe they are treated as punctuation?). If you search for 𔑮๐”“𔗵𔗬 (no quotes), the only hit is the exact title match. Searching for "𔑮๐”“𔗵𔗬" (with quotes) gives zero results. I verified that the English text analyzer returns no tokens for the string 𔑮๐”“𔗵𔗬.
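To help narrow down the hypothesis, the properties of the affected characters can be inspected locally. The sketch below does not reproduce the Elasticsearch analyzer or the CirrusSearch highlighter. It only reports, from Python's bundled Unicode database, each character's general category and how many UTF-16 code units and UTF-8 bytes it needs. The function name `describe` is made up for this example, and the sample uses the two range endpoints from the query.

```python
import unicodedata

def describe(text: str) -> list[dict]:
    """Report basic Unicode properties for each character in text."""
    rows = []
    for ch in text:
        rows.append({
            "code_point": f"U+{ord(ch):04X}",
            "name": unicodedata.name(ch, "<unnamed>"),
            # General category, e.g. "Lo" (letter) or "Po" (punctuation).
            "category": unicodedata.category(ch),
            # Characters above U+FFFF need a surrogate pair in UTF-16.
            "utf16_units": len(ch.encode("utf-16-le")) // 2,
            "utf8_bytes": len(ch.encode("utf-8")),
        })
    return rows

# Endpoints of the regex range used in the search: U+14400 and U+14646.
rows = describe("\U00014400\U00014646")
for row in rows:
    print(row)
```

If the reported category is a letter category rather than a punctuation one, the loss is less likely to come from Unicode-aware punctuation stripping. It would then be worth checking whether some component handles characters outside the Basic Multilingual Plane differently, though that remains a guess until it is confirmed against the actual analysis chain.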

Another example: [[ https://en.wiktionary.org/w/index.php?search=insource%3A%2F%5B%F0%94%90%80-%F0%94%99%86%5D%2F+anatolian&title=Special:Search&profile=advanced&fulltext=1&searchengineselect=mediawiki&ns828=1 | insource:/[𔐀-𔙆]/ anatolian ]] restricted to the Module namespace gives a snippet with this:

canonicalName = "Anatolian Hieroglyphs", characters = "-",

characters = "-" is characters = "𔐀-𔙆" in the original. The underlying HTML is &quot;<span class="searchmatch"></span>-<span class="searchmatch"></span>&quot;, again with empty searchmatch spans.

__ __ __
[1] I first discovered this when looking into T237332, so the examples so far are Anatolian Hieroglyphs, though other characters may be affected.

Event Timeline

Restricted Application added a subscriber: Aklapper. · Jan 8 2020, 9:06 PM

Assuming this task is about CirrusSearch, hence adding project tag so others can find this task when searching for tasks under that code project.