
can't use incategory: or intitle: when category or title contains double quotes
Closed, Resolved (Public)

Description

in hebrew, double quotes are often used to denote acronyms, e.g., [[he:Category:כדורגלני בית"ר ירושלים]].
afaicr, the "old" search allowed for this by interpreting double-double-quotes ("") as "escaped" double quotes, so you would search for
incategory:"כדורגלני בית""ר ירושלים"

cirrus search does not allow this.

i may be mistaken WRT old search (is there a way to test old search on hewiki?), but i know it *is* broken with cirrus.

peace.

Event Timeline

bzimport raised the priority of this task to Low. Nov 22 2014, 3:45 AM
bzimport added a project: MediaWiki-Search.
bzimport set Reference to bz71123.

You can test the old search by adding &srbackend=LuceneSearch to the URL. It seems to blow up with an HTTP error at the moment though. Sad :(

Anyway, you are right that there is no way to search with that quote in there in Cirrus. I'll fix it now but it'll take until next Thursday (really late GMT) for it to be deployed.

In other places in Cirrus you can escape it by placing a " before it so that is what I will implement. Let me know if that works for you.

gerritadmin wrote:

Change 161986 had a related patch set uploaded by Manybubbles:
Support escaped quotes in filters

https://gerrit.wikimedia.org/r/161986

sorry for not being clear - we need the same logic for intitle:
(quotes can be part of an article name, just as they appear in cat names)

as far as i could understand the patch, it should solve this also, but the description (aka "commit message") talks specifically about incategory, so i wanted to make sure...

i think the same "escape mechanism" already works for the search string itself, no?

peace.

It should work with all the filters like intitle, incategory, and linksto. I've updated the commit message to be more clear.
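For illustration only, here is a minimal Python sketch of what "escaped quotes in filters" could look like. It assumes the escape character is a backslash (as in the later test query intitle:"אומ\"ץ"); the pattern and the parse_filters helper are hypothetical and are not taken from the CirrusSearch patch.

```python
import re

# Hypothetical pattern: a filter keyword, a colon, then a quoted value in
# which a backslash escapes the following character (so \" is a literal quote).
FILTER_RE = re.compile(
    r'(?P<key>intitle|incategory|linksto):"(?P<value>(?:\\.|[^"\\])*)"'
)


def parse_filters(query):
    """Return (keyword, value) pairs with backslash escapes removed."""
    pairs = []
    for match in FILTER_RE.finditer(query):
        # Drop the escaping backslash, keep the escaped character.
        value = re.sub(r'\\(.)', r'\1', match.group('value'))
        pairs.append((match.group('key'), value))
    return pairs


if __name__ == '__main__':
    print(parse_filters('incategory:"כדורגלני בית\\"ר ירושלים"'))
```

The same pattern handles every listed keyword, which mirrors the point that the escaping is not specific to incategory.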

gerritadmin wrote:

Change 161986 merged by jenkins-bot:
Support escaped quotes in filters

https://gerrit.wikimedia.org/r/161986

All patches mentioned in this report were merged or abandoned - is there more work left to do here (if yes: please reset the bug report status to NEW or ASSIGNED), or can you close this ticket as RESOLVED FIXED?

if the patched code is deployed, then the answer is "no".

for instance, hewiki contains two articles starting with אומ"ץ (https://he.wikipedia.org/wiki/%D7%90%D7%95%D7%9E%22%D7%A5 ,
https://he.wikipedia.org/w/index.php?title=%D7%90%D7%95%D7%9E%22%D7%A5_%D7%99%D7%A9%D7%A8%D7%90%D7%9C%D7%99&redirect=no )
and possibly a few more containing it.
i could not find it using intitle. tried:
intitle:אומ"ץ
intitle:אומ""ץ
intitle:אומ"""ץ
intitle:"אומ"ץ"
intitle:"אומ""ץ"
intitle:"אומ"""ץ"

none of them found the article.

peace.

Does intitle:"אומ"\ץ" work?

it doesn't work (tested, but don't take my word for it - you can easily test it yourself) and i don't think it is expected to work anyway. ttbomk, backslash was never an "escape character" for wiki search.

peace.

[ Resetting assignee as assignee account is not active anymore ]

I took a quick look at this, and it is still a problem.

I think Manybubbles' patch allowed the quote to make it through one layer of the software, but the quote seems to be stripped somewhere else.

intitle: queries go through the same analysis chain as regular queries, so lots of punctuation and other stuff is ignored. On English WP, for example, searching for " \"quantum\" , \": '?! () leap" is the same as searching for "quantum leap" and searching for intitle:" \"quantum\" , \": '?! () leap" is the same as intitle:"quantum leap". All the extra punctuation is ignored.

Hebrew wikis are currently using the "default" analyzer, which is Unicode aware, and it actually seems to know that " can be meaningful in Hebrew, and it leaves it in. אומ"ץ is analyzed as אומ\"ץ, but abc"xyz is broken up into abc and xyz by the same analyzer. So, I'm not sure where the double quote is getting stripped.
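The behaviour described in the last two paragraphs can be imitated with a toy tokenizer. This is a simplified stand-in written in Python, not the real Elasticsearch analyzer: the tokenize function and its regular expression are assumptions made for illustration. It drops punctuation, but keeps a double quote when it sits between two Hebrew letters.

```python
import re

# Basic Hebrew letters (alef through tav, including final forms).
HEBREW = r'\u05D0-\u05EA'

# A token is a run of word characters; a double quote is kept only when it
# is directly between two Hebrew letters, as in an acronym.
TOKEN_RE = re.compile(rf'(?:\w|(?<=[{HEBREW}])"(?=[{HEBREW}]))+')


def tokenize(text):
    """Split text into tokens, ignoring punctuation (toy approximation)."""
    return TOKEN_RE.findall(text)


if __name__ == '__main__':
    print(tokenize(' \\"quantum\\" , \\": \'?! () leap'))
    print(tokenize('abc"xyz'))
    print(tokenize('אומ"ץ'))
```

With this sketch the punctuation-heavy query and plain "quantum leap" produce the same tokens, abc"xyz splits in two, and אומ"ץ stays a single token.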

While it doesn't look like the language analyzer is causing the problem, whoever works on this should keep in mind that a new Hebrew language analyzer is coming soonish (deployment is complicated, so it has been delayed). Deployment and re-indexing are tracked in these two tickets:

debt raised the priority of this task from Low to Medium. Jul 11 2017, 4:10 PM
debt added a project: Discovery-Search.
debt edited subscribers, added: EBernhardson, dcausse, debt; removed: Manybubbles.

Let's take a look at this again, after T167057 and T167058 are deployed.

The new language analyzer has been in place for a while, and it doesn't make any difference for this issue.

Ran a new test, and at least on Hebrew the analysis chain doesn't seem to be getting in the way. I issued a simplified version of our production query (P7534) and got back two results:

{
  "title": [
    "תנועת אומ\"ץ"
  ]
}
{
  "redirect.title": [
    "אומ\"ץ ישראלי"
  ]
}

Example query showing the problem (the query string is escaped differently in the must and filter clauses of the primary query): https://he.wikipedia.org/w/index.php?search=intitle%3A%22%D7%90%D7%95%D7%9E%5C%22%D7%A5%22&fulltext=1&cirrusDumpQuery
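To see exactly what search string that URL carries, the percent-encoding can be decoded with Python's standard library. This only decodes the URL; it does not contact the wiki.

```python
from urllib.parse import parse_qs, urlsplit

url = (
    "https://he.wikipedia.org/w/index.php"
    "?search=intitle%3A%22%D7%90%D7%95%D7%9E%5C%22%D7%A5%22"
    "&fulltext=1&cirrusDumpQuery"
)

# keep_blank_values=True preserves the value-less cirrusDumpQuery flag.
params = parse_qs(urlsplit(url).query, keep_blank_values=True)
search = params["search"][0]

if __name__ == "__main__":
    print(search)
```

The decoded value is the intitle: filter with the title in quotes and a backslash in front of the inner double quote.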

This implies the issue is somewhere in CirrusSearch query building. Will dig deeper and add some tests.

Change 459848 had a related patch set uploaded (by EBernhardson; owner: EBernhardson):
[mediawiki/extensions/CirrusSearch@master] Fix gershayim double escaping in quoted string

https://gerrit.wikimedia.org/r/459848

Change 459848 merged by jenkins-bot:
[mediawiki/extensions/CirrusSearch@master] Fix gershayim double escaping in quoted string

https://gerrit.wikimedia.org/r/459848
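
As a general illustration of what "double escaping" means, the Python sketch below escapes a title once and then twice. The escape_quotes and unescape helpers are hypothetical and are not the CirrusSearch code; the sketch only shows why a string escaped twice no longer round-trips to the original title after a single unescape.

```python
import re


def escape_quotes(value):
    """Backslash-escape backslashes and double quotes (hypothetical helper)."""
    return value.replace("\\", "\\\\").replace('"', '\\"')


def unescape(value):
    """Remove one level of backslash escaping."""
    return re.sub(r'\\(.)', r'\1', value)


title = 'אומ"ץ'
once = escape_quotes(title)   # one backslash before the quote
twice = escape_quotes(once)   # three backslashes before the quote

if __name__ == "__main__":
    print(unescape(once) == title)   # single escaping round-trips
    print(unescape(twice) == title)  # double escaping does not
```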