Add the possibility to do regex search on titles
Closed, Resolved (Public)

Description

Currently insource:// allows regular expression searches only on the wikitext content. This content does not include the title string. We could consider adding the possibility to do regex searches on the title as well.
It would allow users to do more advanced searches on the title text.
It can be useful in cases where the wiki uses certain kinds of naming conventions that are hard to find with the way we tokenize the title text. Allowing regex on titles would help to work around these kinds of limitations.
Of course, we would have to evaluate the performance implications regarding the overhead of adding a new trigram field, but I think it's worth investigating this possibility while working on more advanced syntax support in Q4 goals (2017).

See T156460 for an example use case where it could have been useful.

Event Timeline

dcausse created this task. Jan 27 2017, 11:44 AM
Restricted Application added a project: Discovery. Jan 27 2017, 11:44 AM
Restricted Application added a subscriber: Aklapper.
Deskana triaged this task as Normal priority. Feb 2 2017, 11:24 PM
Deskana moved this task from needs triage to later on... on the Discovery-Search board.

Adding a note that a few English Wikipedians have discovered this limitation and would like to see this improved.

https://en.wikipedia.org/wiki/Wikipedia:Village_pump_(technical)#Search_intitle:_doesn.27t_work_right_with_quoted_strings_that_include_a_space

Izno added a subscriber: Izno. Oct 26 2017, 7:23 PM

I don't think this would be particularly hard to implement; all the functionality already exists. We need to add the appropriate sub-fields to title and adjust the intitle: keyword to swap between term matching and regex matching, the same as insource: does today.
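
To illustrate the swap between term matching and regex matching, here is a minimal Python sketch of how a keyword value might be dispatched. It is not the CirrusSearch parser: the names TitleQuery and parse_intitle are hypothetical, and the /regex/ and /regex/i forms are assumed to mirror what insource: accepts.

```python
import re
from dataclasses import dataclass


@dataclass
class TitleQuery:
    """Hypothetical parsed form of an intitle: keyword."""
    kind: str                     # "regex" or "term"
    value: str                    # regex body or plain search term
    case_insensitive: bool = False


# intitle:/body/ or intitle:/body/i selects regex matching (assumed syntax).
_REGEX_FORM = re.compile(r"^intitle:/(?P<body>.+)/(?P<flag>i?)$")
# Anything else after the prefix is treated as a plain term.
_TERM_FORM = re.compile(r"^intitle:(?P<term>.+)$")


def parse_intitle(keyword: str) -> TitleQuery:
    """Decide whether an intitle: keyword asks for regex or term matching."""
    match = _REGEX_FORM.match(keyword)
    if match:
        return TitleQuery("regex", match.group("body"), match.group("flag") == "i")
    match = _TERM_FORM.match(keyword)
    if match:
        return TitleQuery("term", match.group("term"))
    raise ValueError(f"not an intitle: keyword: {keyword!r}")
```

In this sketch a value that is not wrapped in a closing slash falls back to term matching, so existing intitle: queries would keep their current meaning.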

Izno added a comment. Oct 26 2017, 8:00 PM

I don't think this would be particularly hard to implement; all the functionality already exists. We need to add the appropriate sub-fields to title and adjust the intitle: keyword to swap between term matching and regex matching, the same as insource: does today.

Just as a reminder, you left a comment in January at T156510 about this task:

Regex, as suggested above, might be a reasonable way forward there but would need some consideration as it seems possible our regex acceleration wouldn't be as effective on short fields such as title.

Certainly it is possible the shorter field will allow for significantly less filtering prior to running the regex. The acceleration phase basically extracts sets of trigrams (three sequential characters) that must be in the searched content from the regex and then looks for documents containing those trigrams as a first-pass filter. This generally reduces the number of articles we need to run the regex on significantly. I think it is worth keeping in mind, and evaluating.
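
The first-pass filter described above can be sketched roughly as follows. This is a simplified Python illustration, not the CirrusSearch implementation: the function names are hypothetical, and the literal that every match must contain is supplied by hand here rather than derived from the regex.

```python
import re
from collections import defaultdict


def trigrams(text):
    """Return the set of all three-character substrings of text."""
    return {text[i:i + 3] for i in range(len(text) - 2)}


def build_index(titles):
    """Map each trigram to the set of title ids that contain it."""
    index = defaultdict(set)
    for doc_id, title in enumerate(titles):
        for gram in trigrams(title):
            index[gram].add(doc_id)
    return index


def regex_search(titles, index, pattern, required_literal):
    """Run pattern only on titles that contain every trigram of required_literal.

    required_literal is an assumption of this sketch: a substring that any
    match of pattern must contain. Returns (matching titles, candidate count).
    """
    candidates = set(range(len(titles)))
    for gram in trigrams(required_literal):
        # Intersect posting lists: a candidate must contain every trigram.
        candidates &= index.get(gram, set())
    compiled = re.compile(pattern)
    hits = [titles[i] for i in sorted(candidates) if compiled.search(titles[i])]
    return hits, len(candidates)
```

When the required literal is shorter than three characters, it yields no trigrams, so every title remains a candidate and the regex runs on all of them. That is the kind of reduced filtering worth measuring on short fields such as title.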

And just as a side note for implementation, this certainly needs to be applied to both the title field and redirect.title.

Change 413896 had a related patch set uploaded (by EBernhardson; owner: EBernhardson):
[mediawiki/extensions/CirrusSearch@master] Support regex for intitle keyword

https://gerrit.wikimedia.org/r/413896

TJones added a subscriber: TJones. Feb 27 2018, 3:32 PM

Don't forget to update the documentation! Is that a separate task?

Yes, but since this keyword will fail before all wikis are reindexed, I think we should postpone any doc addition until the reindex is done. A task for adding it to the doc is a good idea.

Change 413896 merged by jenkins-bot:
[mediawiki/extensions/CirrusSearch@master] Support regex for intitle keyword

https://gerrit.wikimedia.org/r/413896

dcausse moved this task from This Quarter to Current work on the Discovery-Search board.
dcausse moved this task from in progress to Done on the Discovery-Search (Current work) board.
debt closed this task as Resolved. Mar 1 2018, 6:44 PM

Once we re-index, this will be in production (with documentation).

Nirmos added a subscriber: Nirmos. Mar 15 2018, 7:20 AM
Izno added a comment. Apr 3 2018, 4:01 PM

Yes, but since this keyword will fail before all wikis are reindexed, I think we should postpone any doc addition until the reindex is done. A task for adding it to the doc is a good idea.

@debt Since you closed the subtask, Help:CirrusSearch does not document it yet. :)

Ah yes, thanks for the reminder! I've created T191340 for this documentation task.