
Regex in CirrusSearch can't find Anatolian Hieroglyphs
Closed, Resolved. Public. Bug Report.

Description

Searching for Anatolian Hieroglyphs in wikitext in the mainspace in English Wiktionary doesn't yield any results, even though there are pages that contain these characters. These are outside the BMP, so perhaps there is some weird bug. However, not all non-BMP characters cause problems.

Searching for Gothic letters (: insource:/[𐌰-𐍊]/) and Egyptian hieroglyphs (: insource:/[𓀀-𓐮]/), which are also outside the BMP, works. I considered the possibility that it might be a Unicode versioning issue, but I'm not sure why regex would refer to the UCD when searching for ranges of code points. (In any case, the Anatolian Hieroglyphs block was added in version 8.0, Egyptian Hieroglyphs in 5.2, and Gothic in 3.1.) So I'm mystified as to why some non-BMP characters could be searched for and others couldn't. I wonder if I'm overlooking something obvious here.
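The point about ranges of code points can be checked outside CirrusSearch. The following is a minimal sketch in Python, whose `re` module compares character-class ranges by code point and needs no Unicode Character Database lookup for them. It says nothing about how the CirrusSearch regex engine behaves; the names `RANGES`, `range_pattern`, and `blocks_found` are made up for this illustration, and the range endpoints are the ones quoted in this report.

```python
import re

# Range endpoints quoted in the report, written as escapes so the
# source stays ASCII-only. The dictionary name is illustrative.
RANGES = {
    "Gothic": ("\U00010330", "\U0001034A"),
    "Egyptian Hieroglyphs": ("\U00013000", "\U0001342E"),
    "Anatolian Hieroglyphs": ("\U00014400", "\U00014646"),
}


def range_pattern(first, last):
    """Compile a character class matching every code point in first..last."""
    return re.compile(f"[{first}-{last}]")


PATTERNS = {name: range_pattern(lo, hi) for name, (lo, hi) in RANGES.items()}


def blocks_found(text):
    """Return the names of the ranges that have at least one match in text."""
    return [name for name, pattern in PATTERNS.items() if pattern.search(text)]
```

In an engine that works on code points, all three ranges behave the same way regardless of the Unicode version in which the block was introduced.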

Steps to Reproduce:
Enter : insource:/[𔐀-𔙆]/ or : intitle:/[𔐀-𔙆]/ in the search box in English Wiktionary and submit.

Actual Results:
No search results.

Expected Results:
Both regexes should match the code points in the range U+14400-U+14646 (ANATOLIAN HIEROGLYPH A001 to ANATOLIAN HIEROGLYPH A530) in English Wiktionary. As of the 2019-11-01 dump, the wikitext of two entries contained them: ๐’บ๐’Œ“๐’‹ป๐’Š‘๐’„ฟ๐’€€๐’‹พ๐’…– and ๐’‹ผ๐’‚Š๐’ƒท. As of the 2019-10-20 dump, three entry titles contained them: ๐”ฑ๐”•ฌ๐”—ฌ๐”‘ฐ๐”–ฑ, ๐”‘ฎ๐”“๐”—ต๐”—ฌ, ๐”–ช๐”–ฑ๐”–ช.

Postscript:
I do get one result for insource:/[𔐀-𔙆]/ in all namespaces: Module:scripts/data. That module contains the literal string 𔐀-𔙆. It's as if the regex engine fails to parse [𔐀-𔙆] correctly and instead searches as if the query were insource:/𔐀-𔙆/. That doesn't make sense to me though.

Event Timeline

Restricted Application added a subscriber: Aklapper. Nov 4 2019, 10:12 PM
Erutuon updated the task description.

It seems the reason is that the search engine doesn't search every page. Probably it times out at a certain point when searching through mainspace pages, and if no Anatolian hieroglyphs have been found by that point, no results are shown.

I was cleaning up C1 Controls in entries in the English Wiktionary and had cleaned up all that the search engine found, but many still remain (which I found by searching the dump).

It would be nice if there were an indication that the search timed out. I've seen that sometimes (when the regex is complex), but not in this case, and I'm not sure what triggers it.

One proposal that would prevent this type of search from timing out would be a rare character index (T211824), if the index tracked non-BMP characters, or Unicode blocks, or Unicode scripts.
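To illustrate why such an index would help, here is a toy sketch of the idea: record at indexing time which Unicode blocks each page contains, then run the regex only on pages listed for the relevant block. This is not how T211824 or CirrusSearch is implemented; the `BLOCKS` table is a hand-picked subset of the Unicode block list, and the function names and in-memory `pages` dictionary are assumptions made for the example.

```python
import re
from collections import defaultdict

# Hand-picked subset of Unicode blocks (name, first code point, last code point).
# A real index would cover every block, or scripts, or all non-BMP characters.
BLOCKS = [
    ("Gothic", 0x10330, 0x1034F),
    ("Egyptian Hieroglyphs", 0x13000, 0x1342F),
    ("Anatolian Hieroglyphs", 0x14400, 0x1467F),
]


def blocks_of(text):
    """Return the set of tracked block names that occur in text."""
    found = set()
    for ch in text:
        cp = ord(ch)
        for name, first, last in BLOCKS:
            if first <= cp <= last:
                found.add(name)
    return found


def build_index(pages):
    """Map each block name to the set of page titles containing it."""
    index = defaultdict(set)
    for title, text in pages.items():
        for name in blocks_of(text):
            index[name].add(title)
    return index


def search(pages, index, block, pattern):
    """Run the regex only on pages the index lists for the given block."""
    candidates = index.get(block, set())
    return sorted(t for t in candidates if pattern.search(pages[t]))
```

With an index like this, the number of pages that need a regex scan depends on how rare the characters are, not on the size of the wiki.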

Indeed, we used to display a timeout warning, and it seems to have disappeared in this case. Thanks for the report.
As for your particular use case, I'm afraid that unless you give the search engine more hints (filters), it has to scan all pages sequentially and will be as slow as running grep on a dump. Once the timeout issue is resolved, I suggest that we merge this task into T211824: this is definitely a performance issue, and I'm afraid the regex search mechanism does not have enough information to avoid a full scan.
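For comparison, the "grep on a dump" fallback can be sketched as a streaming scan over a MediaWiki XML export. This is a rough sketch, not a tool used in this task. It assumes the usual export layout (`page` elements containing `title` and, inside `revision`, `text`), and the helper names and the dump filename in the usage comment are hypothetical.

```python
import bz2
import re
import xml.etree.ElementTree as ET

# Anatolian hieroglyph range from the report, as ASCII-safe escapes.
ANATOLIAN = re.compile("[\U00014400-\U00014646]")


def local_name(tag):
    """Strip the XML namespace, e.g. '{http://...}page' -> 'page'."""
    return tag.rsplit("}", 1)[-1]


def pages_matching(xml_stream, pattern):
    """Yield titles of pages whose wikitext matches pattern."""
    title, text = None, ""
    for _event, elem in ET.iterparse(xml_stream, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text or ""
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            if pattern.search(text):
                yield title
            title, text = None, ""
            elem.clear()  # release the finished page's children


def open_dump(path):
    """Open a bz2-compressed dump as a binary stream."""
    return bz2.open(path, "rb")


# Usage (hypothetical filename):
#   with open_dump("enwiktionary-pages-articles.xml.bz2") as stream:
#       for title in pages_matching(stream, ANATOLIAN):
#           print(title)
```

Like the unfiltered regex search, this reads every page once, which is why it is slow but complete.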

EBernhardson moved this task from needs triage to elastic / cirrus on the Discovery-Search board.
TJones claimed this task.
TJones added a subscriber: TJones.

@Erutuon, it looks like this issue is resolved, since the underlying problem was the lack of timeout message.

As a practical matter, if you can add regular search terms to your query to limit the number of documents that need a regex scan, your query is less likely to time out. insource:/[𔐀-𔙆]/ anatolian and intitle:/[𔐀-𔙆]/ anatolian are both very quick to return, but depend on anatolian being present in the relevant entries. Depending on your use case this might suffice, or it at least might allow you to work on the "easy" cases with a snappy search before having to fall back to dumps.
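For anyone scripting such searches, the same filtered query can be built programmatically. The sketch below only constructs the query string and a request URL for the standard MediaWiki search API (`action=query`, `list=search`, `srsearch`); it makes no network call. The helper names are invented for the example, and the endpoint is assumed to be English Wiktionary's usual `api.php` path.

```python
from urllib.parse import urlencode

# Assumed endpoint: the usual api.php location for English Wiktionary.
API = "https://en.wiktionary.org/w/api.php"

# Anatolian hieroglyph character class from the report, as ASCII-safe escapes.
ANATOLIAN_CLASS = "[\U00014400-\U00014646]"


def regex_query(keyword, char_class, hint=""):
    """Build e.g. 'insource:/[...]/ anatolian'; hint is an ordinary search term."""
    query = f"{keyword}:/{char_class}/"
    return f"{query} {hint}".strip()


def search_url(query, limit=20):
    """Return a search API URL for the query (no request is sent)."""
    params = {
        "action": "query",
        "list": "search",
        "srsearch": query,
        "srlimit": limit,
        "format": "json",
    }
    return API + "?" + urlencode(params)
```

The plain term narrows the candidate set through the ordinary index before the regex runs, which is why the hinted queries return quickly.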

I've found a problem with the highlighting of the Anatolian Hieroglyphs, which seems to delete certain characters. I will document it in a new ticket if I don't find one that already exists.