CirrusSearch: Problems on the Gujarati wikipedia that look like unicode normalization issues
Open, Medium, Public

Description

Hi Nik,

Thanks for deploying CirrusSearch on gu.wiki. I have been testing it and have generally found it more useful than the normal search, but today I ran into an issue. Please see the four search results below, two with CirrusSearch and two without. I think the results I get with Cirrus enabled are unexpected. The term I searched for is સૌરાષ્ટ્ર પ્રાંત

With Cirrus:
https://gu.wikipedia.org/w/index.php?search=%E0%AA%B8%E0%AB%8C%E0%AA%B0%E0%AA%BE%E0%AA%B7%E0%AB%8D%E0%AA%9F%E0%AB%8D%E0%AA%B0+%E0%AA%AA%E0%AB%8D%E0%AA%B0%E0%AA%BE%E0%AA%82%E0%AA%A4&button=&title=%E0%AA%B5%E0%AA%BF%E0%AA%B6%E0%AB%87%E0%AA%B7%3A%E0%AA%B6%E0%AB%8B%E0%AA%A7&srbackend=CirrusSearch

Without Cirrus:
https://gu.wikipedia.org/w/index.php?search=%E0%AA%B8%E0%AB%8C%E0%AA%B0%E0%AA%BE%E0%AA%B7%E0%AB%8D%E0%AA%9F%E0%AB%8D%E0%AA%B0+%E0%AA%AA%E0%AB%8D%E0%AA%B0%E0%AA%BE%E0%AA%82%E0%AA%A4&button=&title=%E0%AA%B5%E0%AA%BF%E0%AA%B6%E0%AB%87%E0%AA%B7%3A%E0%AA%B6%E0%AB%8B%E0%AA%A7

The next two searches are for an exact match, with the term in quotation marks: "સૌરાષ્ટ્ર પ્રાંત"

Without Cirrus:
https://gu.wikipedia.org/w/index.php?search=%22%E0%AA%B8%E0%AB%8C%E0%AA%B0%E0%AA%BE%E0%AA%B7%E0%AB%8D%E0%AA%9F%E0%AB%8D%E0%AA%B0+%E0%AA%AA%E0%AB%8D%E0%AA%B0%E0%AA%BE%E0%AA%82%E0%AA%A4%22&title=%E0%AA%B5%E0%AA%BF%E0%AA%B6%E0%AB%87%E0%AA%B7%3A%E0%AA%B6%E0%AB%8B%E0%AA%A7&fulltext=1

With Cirrus:
https://gu.wikipedia.org/w/index.php?search=%22%E0%AA%B8%E0%AB%8C%E0%AA%B0%E0%AA%BE%E0%AA%B7%E0%AB%8D%E0%AA%9F%E0%AB%8D%E0%AA%B0+%E0%AA%AA%E0%AB%8D%E0%AA%B0%E0%AA%BE%E0%AA%82%E0%AA%A4%22&title=%E0%AA%B5%E0%AA%BF%E0%AA%B6%E0%AB%87%E0%AA%B7%3A%E0%AA%B6%E0%AB%8B%E0%AA%A7&fulltext=1&srbackend=CirrusSearch


Version: unspecified
Severity: normal
See Also:
T41501

Details

Reference
bz57242

Event Timeline

bzimport raised the priority of this task from to Medium. Nov 22 2014, 2:23 AM
bzimport added a project: CirrusSearch.
bzimport set Reference to bz57242.
bzimport added a subscriber: Unknown Object (MLST).

This took me forever to pick up on, but I see this:
[[સૌરાષ્ટ્ર| સૌરાષ્ટ્ર પ્રાંત]]માં
in the page source of one of the pages that lsearchd finds and Cirrus doesn't. Cirrus sees the words સૌરાષ્ટ્ર પ્રાંતમાં, while lsearchd sees સૌરાષ્ટ્ર પ્રાંત માં because it inserts a space after every link.
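
To make the difference concrete, here is a minimal Python sketch of the two behaviours described above. It is not the actual CirrusSearch or lsearchd code: the regex, the `strip_links` and `tokens` helpers, and the `pad` flag are all hypothetical, and whitespace splitting stands in for the real analyzers.

```python
import re

# Hypothetical, simplified matcher for [[target|label]] and [[label]] links.
# Real wikitext parsing is far more involved; this is only an illustration.
LINK = re.compile(r"\[\[(?:[^\]|]*\|)?([^\]]*)\]\]")


def strip_links(wikitext: str, pad: bool) -> str:
    """Replace each link with its label; if pad is True, add a space after it."""
    replacement = r"\1 " if pad else r"\1"
    return LINK.sub(replacement, wikitext)


def tokens(text: str) -> list[str]:
    """Whitespace tokenization, standing in for the real search analyzers."""
    return text.split()


region = "સૌરાષ્ટ્ર"
word = "પ્રાંત"
suffix = "માં"

# Same shape as the page source quoted above: a suffix glued to the link.
source = f"[[{region}| {region} {word}]]{suffix}"

no_space = tokens(strip_links(source, pad=False))   # suffix fuses with the label
with_space = tokens(strip_links(source, pad=True))  # suffix becomes its own token

print(word in no_space)    # False: the query word is not a token by itself
print(word in with_space)  # True: the query word is a standalone token
```

Under these assumptions, a query for the bare word matches only when a space is inserted after the link, which is consistent with lsearchd finding the page and Cirrus missing it.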

Restricted Application added a subscriber: Aklapper.