Allow ^ and $ in intitle regex search
Closed, ResolvedPublicFeature

Description

It would be nice to be able to use ^ for the beginning of the title and $ for the end of the title in regular expression searches of titles (intitle://).

At the moment there's no way to search for titles ending with gry as was recently brought up in a discussion on categories for words with suffixes that are not really suffixes on English Wiktionary. intitle:/gry$/ doesn't work. Years ago @Dixtosa created https://dixtosa.toolforge.org to do searches like this.

For prefix searches, Special:PrefixIndex works if you've got a literal prefix that narrows things down, but for anything more complicated you really need intitle:/^/.

My impression is that ^ and $ were disabled in insource:// searches because it's unclear whether they mean start of line and end of line, or start of text and end of text, and maybe for performance reasons, but neither thing would be a consideration in titles, which don't have newline characters and can only be 255 bytes long. So intitle:// should be able to use ^ and $.

Event Timeline

Restricted Application added a subscriber: Aklapper.
MPhamWMF moved this task from needs triage to Feature Requests on the Discovery-Search board.

I join this request. There is a workaround for the exact case you mentioned: intitle:/[a-z]gry/ -intitle:/[a-z]gry[a-z]/. But that won't work when you need to exclude rather than include some ending (so you start with -intitle:/[a-z]gry/). In my use case, I need to exclude translation pages like "API:Search and discovery/ja" from the search formatversion insource:/formatversion['"]?: ['"]?2/ on mediawiki.org. The best I can do is to use -intitle:/\/[a-z-]+/, which can have false positives.
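To make the limits of that workaround concrete, here is a minimal sketch that emulates both forms with Python's `re` module. Python is only a stand-in here; the function names are hypothetical and the real search uses Lucene regex syntax, so treat this as an illustration of the logic rather than of the search backend.

```python
import re

def ends_with_gry_workaround(title: str) -> bool:
    """Emulates: intitle:/[a-z]gry/ -intitle:/[a-z]gry[a-z]/ (unanchored)."""
    has_gry = re.search(r"[a-z]gry", title) is not None
    has_gry_then_letter = re.search(r"[a-z]gry[a-z]", title) is not None
    return has_gry and not has_gry_then_letter

def ends_with_gry_anchored(title: str) -> bool:
    """Emulates the requested form: intitle:/[a-z]gry$/."""
    return re.search(r"[a-z]gry$", title) is not None

for title in ["hungry", "hungrytown", "angry birds"]:
    print(title, ends_with_gry_workaround(title), ends_with_gry_anchored(title))
```

The two agree on single words, but the workaround also accepts a title such as "angry birds", where "gry" is followed by a space rather than by the end of the title.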

Gehel raised the priority of this task from Medium to High.Apr 14 2025, 3:39 PM
Gehel moved this task from Feature Requests to Next Projects on the Discovery-Search board.

This will likely be a partially limited implementation. By that I mean ^foobar will be valid and work, but something more complex like (^|foo)bar will not be supported. It plausibly could be with enough effort, but that would require significant additional complexity, which I suspect will not be worth the maintenance burden (likely having to fork the Lucene regexp parser).

I could add that it's not that ^ and $ were necessarily disabled; it's that the Lucene regexp that we depend on has no concept of ^ and $. It always matches from the beginning to the end, no matter what. My understanding is that this is because in native Lucene this regexp is intended to be applied to tokens (basically, words), and not the full text content of the field. In the support code we have that allows this to work on the full field content, we were wrapping the regexp in .*(regex).* to give partial matches.
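The whole-input matching and the .*(regex).* wrapper described above can be imitated in Python with `re.fullmatch`. This is only a sketch: the function names are hypothetical, Python's regex syntax differs from Lucene's, and `re.DOTALL` is used on the assumption that "." should also cover newlines.

```python
import re

def whole_input_match(pattern: str, text: str) -> bool:
    """Match the way the comment describes: implicitly from beginning to end."""
    return re.fullmatch(pattern, text, flags=re.DOTALL) is not None

def partial_match(pattern: str, text: str) -> bool:
    """Wrap the pattern in .*(regex).* so it can match anywhere in the text."""
    return whole_input_match(f".*({pattern}).*", text)

print(whole_input_match("gry", "hungry"))   # the pattern must cover the whole text
print(partial_match("gry", "hungry"))       # the wrapper allows a partial match
print(partial_match("gry", "hungrytown"))   # no way left to say "only at the end"
```

Because the wrapper always adds .* on both sides, a plain pattern cannot express "only at the start" or "only at the end", which is why anchors needed separate support.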

^foobar will be valid and work, but something more complex like (^|foo)bar will not be supported.

While this was the initial expectation, after working with the code for a bit I've found a way to support anchors in basically all the same ways that are supported by Java's regex Pattern class. Something like (^|foo)bar will work as expected.

The ticket is expanding a bit. I found that the way we are supporting anchors leads to unexpected results from character classes: something like [^a] ends up matching the pseudo-anchor chars we are injecting. That is being fixed, but as long as we are already going to be transforming character classes, it seemed reasonable to add support for shorthand character classes. It's currently expected that this patch will add support for the \d, \s and \w character classes.
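The [^a] problem can be reproduced with a small Python sketch of the pseudo-anchor idea. Everything here is an assumption for illustration: the marker characters, the function names, and the simple scanner are hypothetical and are not the code from the patches in this task.

```python
import re

# Hypothetical marker characters (Unicode noncharacters); the real ones may differ.
START, END = "\ufdd0", "\ufdd1"

def rewrite(pattern: str, fix_classes: bool = True) -> str:
    """Turn ^ and $ into literal marker characters; optionally patch [^...] classes."""
    out, i, in_class = [], 0, False
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern):
            out.append(pattern[i:i + 2])      # escaped character, copied unchanged
            i += 2
            continue
        if in_class:
            in_class = ch != "]"
            out.append(ch)
        elif ch == "[":
            in_class = True
            out.append(ch)
            if pattern[i + 1:i + 2] == "^":   # negated class, not an anchor
                out.append("^")
                if fix_classes:
                    out.append(START + END)   # also exclude the markers
                i += 1
        elif ch == "^":
            out.append(START)
        elif ch == "$":
            out.append(END)
        else:
            out.append(ch)
        i += 1
    return "".join(out)

def title_matches(pattern: str, title: str, fix_classes: bool = True) -> bool:
    """Match against the title wrapped in markers, using .*(regex).* as before."""
    wrapped = ".*(" + rewrite(pattern, fix_classes) + ").*"
    return re.fullmatch(wrapped, START + title + END, flags=re.DOTALL) is not None

print(title_matches("(^|foo)bar", "bar"))
print(title_matches("[^a]bar", "bar", fix_classes=False))  # [^a] swallows the marker
print(title_matches("[^a]bar", "bar"))
```

The sketch does not handle every case (for example, "." can still match a marker, and a "]" placed first in a class is not treated specially); it only shows why negated character classes had to be rewritten once markers were injected.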

Change #1140518 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[wmf-jvm-utils@master] Add lucene regex rewriting

https://gerrit.wikimedia.org/r/1140518

Change #1139557 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[search/extra@master] Support anchors in source_regex

https://gerrit.wikimedia.org/r/1139557

Change #1140518 merged by jenkins-bot:

[wmf-jvm-utils@master] Add lucene regex rewriting

https://gerrit.wikimedia.org/r/1140518

Change #1141951 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[search/highlighter@master] Highlight support for extended lucene regex

https://gerrit.wikimedia.org/r/1141951

Change #1139557 merged by jenkins-bot:

[search/extra@master] Support anchors and short char classes in source_regex

https://gerrit.wikimedia.org/r/1139557

Change #1142638 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[mediawiki/extensions/CirrusSearch@master] regex: Support extended syntax

https://gerrit.wikimedia.org/r/1142638

Once reviewed we will need to:

  • release new versions of the cirrus-highlighter and search-extra plugins
  • update the plugins .deb to contain the new plugin versions
  • update cirrussearch-opensearch-image to use the new .deb
  • update cindy to use the updated cirrussearch-opensearch-image
  • rolling restart prod clusters to load the new .deb
  • deploy the above CirrusSearch patch to use the functionality
  • reindex all wikis to change the trigram indexing options to support anchors

Change #1141951 merged by jenkins-bot:

[search/highlighter@master] Highlight support for extended lucene regex

https://gerrit.wikimedia.org/r/1141951

Change #1143156 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[operations/software/opensearch/plugins@master] Bump ltr plugin to 1.5.4-wmf1-os1.3.20

https://gerrit.wikimedia.org/r/1143156

Once deployed, this seems like it's worth a User-notice to me, as new search functionality that folks might be interested in knowing about/using.

Change #1143156 merged by Ryan Kemper:

[operations/software/opensearch/plugins@master] Update plugins for extended regex support

https://gerrit.wikimedia.org/r/1143156

Change #1142638 merged by jenkins-bot:

[mediawiki/extensions/CirrusSearch@master] regex: Support extended syntax

https://gerrit.wikimedia.org/r/1142638

Hello @EBernhardson,

For Tech News - What wording would you suggest as the content, and when should it be included? Thanks!

For wording, something like:

  • regex search queries now support additional features including start-of-line (^) and end-of-line ($) anchors for the intitle keyword, as well as shorthand character classes for digits (\d), whitespace (\s), and word characters (\w) in both insource and intitle.

It's going to be at least another two weeks, I imagine. The code above will ride the train and be deployed by July 17; then we have to reindex all the wikis, which will take another 4-ish days and may not start immediately. So sometime late July, hopefully.

Mentioned in SAL (#wikimedia-operations) [2025-08-04T14:00:18Z] <ebernhardson> T317599 start full-cluster reindex for eqiad/codfw/cloudelastic opensearch clusters

q: please could the documentation be updated with the new functionality? I'd update it myself, but I'm worried I might accidentally write something wrong in case I don't fully understand the changes :)

The reindexes have finally completed everywhere except jawiki. jawiki is blocked on T402220.

This should now be ready to document and announce.

This appears to be deployed in some part seeing as \s works, but ^ is not working like I would expect. This search for ^:\s*\<math does not find text like :<math>T = \frac{\hbar c^3}{8 \pi G M k},</math> in Stephen Hawking. In fact the search returns no results (when a fairly similar search does, which is how I picked out Hawking among many many others).

Did I do something wrong, or is something not quite right?

@Izno IIUC, support for start- and end-of-line regex anchors is limited to the intitle: keyword (and not insource:)

Indeed, the start and end anchors are only applied to the intitle matching and not the insource. We could consider how to support the anchors there as well, but it gets a little more complicated. For intitle we inject UTF-8 non-characters at the beginning and end of the string to then match against. For insource we would have to rewrite every \n in the input (at query and index time) to include the non-characters, but that would cause observable effects in other parts of the regex syntax. The only obvious solution is to keep a second indexed copy of the content that has the non-characters and flip between them, but doubling up on data doesn't seem great.

It's not impossible, but I don't currently have an idea of how to do it without breaking other parts of the regex.
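One way to picture the kind of side effect mentioned above is the following Python sketch. It assumes hypothetical marker characters and a hypothetical index-time rewrite that surrounds every newline with them; it is not how the search backend is implemented, and `re.DOTALL` is used on the assumption that "." may match a newline.

```python
import re

# Hypothetical marker characters; the real ones may differ.
START, END = "\ufdd0", "\ufdd1"

def index_with_line_markers(text: str) -> str:
    """Surround every line with markers, as a per-line anchor scheme would need."""
    return START + text.replace("\n", END + "\n" + START) + END

def contains(pattern: str, indexed: str) -> bool:
    return re.search(pattern, indexed, flags=re.DOTALL) is not None

plain = "foo\nbar"
marked = index_with_line_markers(plain)

print(contains("foo.bar", plain))    # "." spans the single newline
print(contains("foo.bar", marked))   # three characters now sit between foo and bar
print(contains("foo" + END, marked)) # the end-of-line marker itself is matchable
```

In this sketch the anchors work, but an unrelated query such as foo.bar changes behaviour, which is one example of an observable effect on other parts of the regex syntax.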

Since you've already solved \^ and [^charclass], and lucene doesn't support backreferences anyway, rewriting ^ to "[" + START_ANCHOR_MARKER + "\n\r\v\f]" at query time (and similarly for $) should be equivalent, right?

More conservatively, start- and end-of-string, as in pcre without /m, still have value for insource:// - they can't currently be matched and we can use the existing [^ -􏿽] workaround to match newlines.
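The query-time rewrite suggested above could look roughly like this in Python. The marker characters and function names are hypothetical, and the scanner is a simplified sketch; only the start and end of the whole text carry markers, and ^ and $ become character classes that also accept line-break characters.

```python
import re

# Hypothetical marker characters; the real ones may differ.
START, END = "\ufdd0", "\ufdd1"
LINE_BREAKS = "\n\r\v\f"

def rewrite_line_anchors(pattern: str) -> str:
    """Replace unescaped ^ and $ outside character classes with consuming classes."""
    out, i, in_class = [], 0, False
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern):
            out.append(pattern[i:i + 2])      # escaped character, copied unchanged
            i += 2
            continue
        if in_class:
            in_class = ch != "]"
            out.append(ch)
        elif ch == "[":
            in_class = True
            out.append(ch)
            if pattern[i + 1:i + 2] == "^":   # class negation, not an anchor
                out.append("^")
                i += 1
        elif ch == "^":
            out.append("[" + START + LINE_BREAKS + "]")
        elif ch == "$":
            out.append("[" + END + LINE_BREAKS + "]")
        else:
            out.append(ch)
        i += 1
    return "".join(out)

def insource_matches(pattern: str, text: str) -> bool:
    """Search the text wrapped in a single pair of start/end markers."""
    return re.search(rewrite_line_anchors(pattern), START + text + END) is not None

text = "Intro line\n:<math>T = 1</math>\nend"
print(insource_matches(r"^:\s*<math", text))
print(insource_matches(r"^<math", text))
```

With this rewrite no per-line markers are needed in the index, at the cost of the anchors consuming a character instead of being zero-width.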

Perhaps I'm overthinking it, but my perceived difficulty is that ^ and $ are zero-width matches in multiline pcre. They assert the position without consuming the character, but I have no way to force the same when rewriting the regex. But perhaps it doesn't need to be perfect? A regex like "foo$bar" would match "foo\nbar" in our search, while in a regular regex engine the "$b" creates a regex that cannot match anything.
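The difference between a zero-width anchor and a consuming rewrite can be checked directly in Python. The marker characters are hypothetical, and the rewritten pattern is written out by hand for this one query.

```python
import re

# Hypothetical marker characters; the real ones may differ.
START, END = "\ufdd0", "\ufdd1"

text = "foo\nbar"

# Zero-width: $ asserts the end of a line without consuming the newline,
# so the next character to match is "\n", not "b".
zero_width = re.search(r"foo$bar", text, flags=re.MULTILINE)

# Consuming: $ rewritten to a class that eats the end marker or a line break.
consuming = re.search("foo[" + END + "\n\r\v\f]bar", START + text + END)

print(zero_width is None)
print(consuming is not None)
```

So a query like foo$bar is the kind of edge case where the rewritten form and a multiline regex engine would disagree.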

my perceived difficulty is that ^ and $ are zero-width matches in multiline pcre.

Guess I'll note T403212: Support \r, \n, \t, and \uNNNN in insource and intitle queries now. :^)

The reindexes have finally completed everywhere except jawiki. jawiki is blocked on T402220.

Moving to the 'Announce in next Tech/News' column now that the reindexing in T402220 seems to be complete, and this new functionality is hopefully now available on all Wikimedia wikis :)

Thanks! I'll postpone including this until the following edition, partially in case T403212: Support \r, \n, \t, and \uNNNN in insource and intitle queries is completed by then and can be announced in the same entry (IIUC, that would be the clearest way to do so?). Plus, we'd need to update the proposed draft-entry to include both that other task and also link to any new documentation(?) about these new features. The current proposed draft-entry is still this:

  • regex search queries now support additional features including start-of-line (^) and end-of-line ($) anchors for the intitle keyword, as well as shorthand character classes for digits (\d), whitespace (\s), and word characters (\w) in both insource and intitle.

  • regex search queries now support additional features including start-of-line (^) and end-of-line ($) anchors for the intitle keyword, as well as shorthand character classes for digits (\d), whitespace (\s), and word characters (\w), along with escape codes for carriage return (\r), newline (\n), tab (\t), and Unicode code points (\uHHHH), in both intitle and insource.

Can maybe link https://www.mediawiki.org/wiki/Help:CirrusSearch#Character_Classes which documents most of the new functionality (the rest is also documented on that page, but in a different section).

These two tasks are incredible. What about a light version for insource:, that will help in 90% of search cases IMHO? The rules are very simple:

  1. A regex may start with at most one ^ character.
  2. A regex may end with at most one $ character.
  3. Neither character may appear anywhere else in the regex.

That means, the meta-regex for the regex is

insource:/<^>?[^\^\$]*<$>?/

Even if later it will be expanded, it's a very good start. Please!
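As a sketch of how those three rules could be checked before a query is accepted, the meta-regex can be written in Python as follows. The function name is hypothetical, and <^> and <$> in the meta-regex above are read as literal ^ and $ characters.

```python
import re

# Optional leading ^, then any characters except ^ and $, then optional trailing $.
LIGHT_ANCHOR_RULE = re.compile(r"\^?[^\^\$]*\$?", flags=re.DOTALL)

def is_light_anchored(pattern: str) -> bool:
    """Return True if the regex obeys the three 'light version' rules."""
    return LIGHT_ANCHOR_RULE.fullmatch(pattern) is not None

for query in ["^foo$", "foo", "fo^o", "^^foo", "[^a]b"]:
    print(query, is_light_anchored(query))
```

Taken literally, rule 3 also rejects negated character classes such as [^a] and escaped \^ or \$, so a real implementation would probably need to exempt those.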