CirrusSearch: Problems on the Gujarati wikipedia that look like unicode normalization issues
Closed, Resolved, Public

Description

Hi Nik,

Thanks for deploying it on gu.wiki. I have been testing it and have always found it more useful than the normal search, but today I ran into an issue. Below are 4 searches, 2 with CirrusSearch and 2 without; I think the results I get with Cirrus enabled are a bit unexpected. The search term is સૌરાષ્ટ્ર પ્રાંત

With Cirrus:
https://gu.wikipedia.org/w/index.php?search=%E0%AA%B8%E0%AB%8C%E0%AA%B0%E0%AA%BE%E0%AA%B7%E0%AB%8D%E0%AA%9F%E0%AB%8D%E0%AA%B0+%E0%AA%AA%E0%AB%8D%E0%AA%B0%E0%AA%BE%E0%AA%82%E0%AA%A4&button=&title=%E0%AA%B5%E0%AA%BF%E0%AA%B6%E0%AB%87%E0%AA%B7%3A%E0%AA%B6%E0%AB%8B%E0%AA%A7&srbackend=CirrusSearch

Without Cirrus:
https://gu.wikipedia.org/w/index.php?search=%E0%AA%B8%E0%AB%8C%E0%AA%B0%E0%AA%BE%E0%AA%B7%E0%AB%8D%E0%AA%9F%E0%AB%8D%E0%AA%B0+%E0%AA%AA%E0%AB%8D%E0%AA%B0%E0%AA%BE%E0%AA%82%E0%AA%A4&button=&title=%E0%AA%B5%E0%AA%BF%E0%AA%B6%E0%AB%87%E0%AA%B7%3A%E0%AA%B6%E0%AB%8B%E0%AA%A7

The next two searches are for an exact match, with the term in quotation marks: "સૌરાષ્ટ્ર પ્રાંત"

Without Cirrus:
https://gu.wikipedia.org/w/index.php?search=%22%E0%AA%B8%E0%AB%8C%E0%AA%B0%E0%AA%BE%E0%AA%B7%E0%AB%8D%E0%AA%9F%E0%AB%8D%E0%AA%B0+%E0%AA%AA%E0%AB%8D%E0%AA%B0%E0%AA%BE%E0%AA%82%E0%AA%A4%22&title=%E0%AA%B5%E0%AA%BF%E0%AA%B6%E0%AB%87%E0%AA%B7%3A%E0%AA%B6%E0%AB%8B%E0%AA%A7&fulltext=1

With Cirrus:
https://gu.wikipedia.org/w/index.php?search=%22%E0%AA%B8%E0%AB%8C%E0%AA%B0%E0%AA%BE%E0%AA%B7%E0%AB%8D%E0%AA%9F%E0%AB%8D%E0%AA%B0+%E0%AA%AA%E0%AB%8D%E0%AA%B0%E0%AA%BE%E0%AA%82%E0%AA%A4%22&title=%E0%AA%B5%E0%AA%BF%E0%AA%B6%E0%AB%87%E0%AA%B7%3A%E0%AA%B6%E0%AB%8B%E0%AA%A7&fulltext=1&srbackend=CirrusSearch


Version: unspecified
Severity: normal
See Also:
T41501

Details

Reference
bz57242

Event Timeline

bzimport raised the priority of this task to Medium. Nov 22 2014, 2:23 AM
bzimport added a project: CirrusSearch.
bzimport set Reference to bz57242.
bzimport added a subscriber: Unknown Object (MLST).

This took me forever to pick up on, but I see this:
[[સૌરાષ્ટ્ર| સૌરાષ્ટ્ર પ્રાંત]]માં
in the page source of one of the pages that lsearchd finds and Cirrus doesn't. Cirrus sees the words સૌરાષ્ટ્ર પ્રાંતમાં, while lsearchd sees સૌરાષ્ટ્ર પ્રાંત માં because it inserts a space after every link.
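
To illustrate the difference, here is a rough Python sketch. It is not the actual code of either search backend. The regex, the function names, and the whitespace-only tokenization are all simplifying assumptions made up for this example. It only shows why adding a space after a link changes whether the quoted phrase matches.

```python
import re

# Simplified pattern for [[target|label]] and [[target]] links.
# Assumption: no nested markup inside the link.
LINK = re.compile(r"\[\[(?:[^\]|]*\|)?([^\]]*)\]\]")

def strip_links_joined(wikitext: str) -> str:
    """Keep the link label; letters after the link stay attached to it."""
    return LINK.sub(lambda m: m.group(1).strip(), wikitext)

def strip_links_spaced(wikitext: str) -> str:
    """Keep the link label and insert a space after every link."""
    return LINK.sub(lambda m: m.group(1).strip() + " ", wikitext)

def phrase_matches(phrase: str, text: str) -> bool:
    """Exact phrase match over whitespace-separated tokens (hypothetical tokenizer)."""
    words, tokens = phrase.split(), text.split()
    n = len(words)
    return any(tokens[i:i + n] == words for i in range(len(tokens) - n + 1))

source = "[[સૌરાષ્ટ્ર| સૌરાષ્ટ્ર પ્રાંત]]માં"
phrase = "સૌરાષ્ટ્ર પ્રાંત"

# Joined text ends in a single longer word, so the phrase does not match.
print(phrase_matches(phrase, strip_links_joined(source)))
# Spaced text keeps the label's last word as its own token, so it matches.
print(phrase_matches(phrase, strip_links_spaced(source)))
```

In this sketch the joined form tokenizes to two words, the second being the label's last word with the suffix attached, so the quoted phrase is not found. The spaced form tokenizes to three words, and the phrase is found.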

Restricted Application added a subscriber: Aklapper.
TJones claimed this task.
TJones subscribed.

This ticket seems to be about how lsearchd parses wikitext, which isn't really relevant anymore. I can't reproduce the unwanted parsing behavior with current on-wiki search, so I'm closing the ticket. If there is still a problem, please re-open with new examples, or open a new ticket.