CirrusSearch: intitle does not work properly
Closed, Declined (Public)

Description

Steps to reproduce:

  1. Search [[ https://cs.wikipedia.org/w/index.php?title=Speci%C3%A1ln%C3%AD:Hled%C3%A1n%C3%AD&profile=advanced&profile=advanced&fulltext=Search&search=hastemplate%3A%22archiv+diskuse%22+-intitle%3AArchiv&ns0=1&ns1=1&ns2=1&ns3=1&ns4=1&ns5=1&ns7=1&ns8=1&ns9=1&ns10=1&ns11=1&ns12=1&ns13=1&ns14=1&ns15=1&ns100=1&ns101=1&ns102=1&ns103=1&ns446=1&ns447=1&ns828=1&ns829=1&ns2300=1&ns2301=1&ns2302=1&ns2303=1&searchToken=6bgk8fp6fi8an5utcbyx80mmb | hastemplate:"archiv diskuse" -intitle:Archiv ]] in all namespaces on cswiki

Expected results:
The search should find all pages that use the talk archive template, excluding those with "Archiv" in the title.

Current results:
Only half of them are excluded. The results still contain titles with "Archiv" in them.

Event Timeline

Dvorapa created this task. Jan 27 2017, 9:50 AM
Dvorapa updated the task description. Jan 27 2017, 10:02 AM

Maybe a wildcard could help here: hastemplate:"archiv diskuse" -intitle:Archiv*. The problem is probably that the tokenizer on titles does not split Archiv10 into two words, so such a title never matches the single word Archiv.
I think it can help a bit, but wildcards also have their limitations...
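To make the tokenizer point concrete, here is a minimal Python sketch. It is a toy model, not CirrusSearch's actual analysis chain: `tokenize` and `title_matches` are hypothetical helpers and the titles are made up for illustration.

```python
import re

def tokenize(title):
    """Toy tokenizer: lowercase, then split on anything that is not a letter or digit.

    Letters and digits stay glued together, so "Archiv10" remains one token.
    """
    return [t for t in re.split(r"[\W_]+", title.lower()) if t]

def title_matches(title, word):
    """True if `word` is one of the whole tokens of `title` (no prefix matching)."""
    return word.lower() in tokenize(title)

# Hypothetical titles, for illustration only.
titles = ["Diskuse:Praha/Archiv 1", "Diskuse:Praha/Archiv10"]

# Emulates -intitle:Archiv: drop every title that has the token "archiv".
kept = [t for t in titles if not title_matches(t, "Archiv")]
print(kept)
```

Under these assumptions only the title where "Archiv" stands as a separate word is excluded; the one with "Archiv10" survives, which is the behavior reported above.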

Dvorapa added a comment. Edited Jan 27 2017, 10:18 AM

@dcausse It works, thank you. Should I classify this wildcard as a correct method or a workaround in this case?

@Dvorapa unfortunately it depends :(

This workaround is not perfect because:

  • intitle:Archiv* will also match titles containing a word like Archivation, which may not be what you want. Sadly, you won't be able to see this, because those titles are excluded from the results.
  • For performance reasons the wildcard expands to at most 1024 words, so if the number of distinct words in the index matching the pattern Archiv[NUMBER] is huge, some of them may still appear in the results. That does not seem to be the case here.
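Both limitations can be sketched in a few lines of Python. This is only an illustration: `expand_prefix` is a hypothetical helper, the index terms are invented, and picking terms in sorted order is an assumption, not a statement about how the search backend chooses which terms to keep.

```python
def expand_prefix(prefix, index_terms, limit=1024):
    """Return at most `limit` indexed terms that start with `prefix`.

    Sorted order is an assumption made for this sketch.
    """
    prefix = prefix.lower()
    matches = sorted(t for t in index_terms if t.startswith(prefix))
    return matches[:limit]

# Invented index vocabulary, for illustration only.
index_terms = {"archiv", "archiv1", "archiv10", "archivation", "praha"}

# Limitation 1: the wildcard also expands to unrelated words such as "archivation".
expanded = expand_prefix("Archiv", index_terms)

# Limitation 2: with a cap smaller than the number of matching terms,
# some terms are never expanded, so titles containing them are not excluded.
capped = expand_prefix("Archiv", index_terms, limit=2)
missed = sorted(t for t in index_terms if t.startswith("archiv") and t not in capped)
print(expanded, capped, missed)
```

The cap of 2 is chosen only to make the effect visible on a tiny vocabulary; the limit mentioned above is 1024.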

It probably works for this specific use case, but I cannot guarantee that using a wildcard will always fix similar situations.

I'd say that we can mark this particular task as resolved but please feel free to open a new one if you think you encounter a similar problem, so we can talk about possible workarounds.

@dcausse thank you for your explanation and sure, you can close this too. In this case Archivation could be excluded too, but I'll keep this issue in mind next time.

This comment was removed by Dvorapa.
dcausse closed this task as Declined. Jan 27 2017, 11:20 AM

Declining, as a workaround is available for this specific use case. The proper fix would be to change the analysis chain to split words on mixed letters and digits: Archiv10 => Archiv, 10. Changing the analysis chain that way may cause unexpected behavior, and reaching consensus sounds hard. It's safer (imo) to talk about possible workarounds if this kind of situation happens again in the future.
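As a rough illustration of the split described above, here is a small Python sketch. `split_letters_digits` is a hypothetical helper written for this example; it is not the analysis chain itself, and the last call only shows one possible kind of side effect.

```python
import re

def split_letters_digits(token):
    """Split a token at every boundary between a run of letters and a run of digits."""
    # \d+ matches a run of digits; [^\W\d_]+ matches a run of (Unicode) letters.
    return re.findall(r"\d+|[^\W\d_]+", token)

print(split_letters_digits("Archiv10"))  # the case from this task
print(split_letters_digits("Archiv"))    # plain words are left alone
print(split_letters_digits("T156474"))   # mixed identifiers get split as well
```

With such a split, a title containing Archiv10 would produce the token Archiv and could be matched by a plain intitle:Archiv, but every other mixed letter/digit word would be split too.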

@dcausse btw another possibility would be intitle:/regex/i, but I understand this could slow down the whole operation. Maybe intitle could filter only the results of the rest of the search pattern, if it does not already? (-)intitle:whatever would first search for all pages matching the rest of the query and then, as a second step, filter the results by including or excluding whatever. That could help with performance.
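The two-step idea can be sketched in Python as a post-filter over titles that an earlier search step has already returned. `filter_titles` and the candidate list are hypothetical; nothing here reflects how CirrusSearch actually executes queries.

```python
import re

def filter_titles(titles, pattern, exclude=False):
    """Keep titles matching `pattern` (case-insensitive), or drop them if `exclude` is True."""
    rx = re.compile(pattern, re.IGNORECASE)
    return [t for t in titles if bool(rx.search(t)) != exclude]

# Step 1 (assumed): titles returned by the rest of the query, e.g. hastemplate:"archiv diskuse".
candidates = ["Diskuse:Praha/Archiv10", "Diskuse:Praha/archiv 2", "Diskuse:Praha"]

# Step 2: emulate -intitle:/archiv/i on that smaller candidate set.
remaining = filter_titles(candidates, r"archiv", exclude=True)
print(remaining)
```

Because the regex runs only over the candidates from step one, its cost depends on the size of that set rather than on the whole index.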

@Dvorapa indeed we could think of adding the possibility to do regex on intitle://. I'll have to think about all the details but it sounds feasible. We would have to prioritize that work but I agree with you on the principle. I'm very inclined to support it because currently insource:// works only on the wikitext content (which does not include the title).

@dcausse If there is anything I can help with, please contact me (e.g. on cswiki). I currently use PetScan for regex searches in titles, or pywikibot's listpages script (other possible workarounds), but PetScan has some bugs in this area, and searching through pywikibot is too complicated and indirect (I cannot click on a page in the result list).

@Dvorapa thank you for your feedback, I've created T156474, I hope we can evaluate the feasibility of this feature in a couple of months.