Special:Search intitle search has weird redirect behavior
Closed, Resolved | Public | BUG REPORT

Description

Steps to replicate the issue (include links if applicable):

  • Navigate to Special:Search on Wikipedia.org and enter the query intitle:"Wikipedia"
  • Click on the "Redirect from" red links and note the page titles.

What happens?:
The namespace name is doubled. For example, upon clicking on the "Wikipedia:INTADMIN" redlink in the search results of the link above, we are met with the page title "Wikipedia:Wikipedia:INTADMIN".
Also, some redirects do not exist and have never existed, for example "Wikipedia:Wikipedia:REDIRECT" to "Minecraft". "Wikipedia:REDIRECT" does not redirect to "Minecraft" either.

What should have happened instead?:
The namespace name should not be doubled; we should never have been taken to "Wikipedia:Wikipedia:INTADMIN".
Additionally, these results should not be listed in the search results at all: the redirect pages do not exist, so the title does not match the condition.

Software version (on Special:Version page; skip for WMF-hosted wikis like Wikipedia): Bug occurs on WMF wiki

Other information (browser name/version, screenshots, etc.):

Screenshot 2024-08-13 at 4.27.12 PM.png (1×1 px, 685 KB)

Screenshot 2024-08-13 at 4.27.05 PM.png (914×1 px, 296 KB)

Details

Related Changes in Gerrit:
Related Changes in GitLab:
Title | Reference | Author | Source Branch | Dest Branch
Implement ndjson event source | repos/search-platform/cirrus-toolbox!1 | ebernhardson | work/ebernhardson/ndjson-source | main
Correctly update redirects of existing pages | repos/search-platform/cirrus-streaming-updater!157 | ebernhardson | work/ebernhardson/redirect-updates | main

Event Timeline

jeremyb-phone subscribed.

I tried changing search query to a different NS name, got the same red links and NS name doubling with "template" instead of "Wikipedia". https://en.wikipedia.org/w/index.php?search=intitle%3A%22template%22&title=Special:Search&profile=advanced&fulltext=1&ns0=1

Thanks for reporting this; it has been sitting on my list for a while. https://species.wikimedia.org/w/index.php?go=Go&search=intitle%3Atemplate&title=Special:Search&ns0=1 is the case I ran into, in case anybody needs another example.

Gehel triaged this task as Medium priority.Aug 19 2024, 3:51 PM
Gehel edited projects, added Discovery-Search (Current work); removed Discovery-Search.
Gehel subscribed.

Something does not make sense in those results. Let's investigate to see if we can understand the problem, and depending on the source of the issue and the complexity, let's re-discuss how to prioritize.

There are a few things going on here:

Redirects that shouldn't be indexed are being indexed

When building documents we have the following limitation on when redirects are indexed:

$redirect->getNamespace() === NS_MAIN || $redirect->getNamespace() === $title->getNamespace()

This does not appear to be respected by the in-place mutation of the redirect arrays when a new redirect is created.
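
The condition above can be read as a filter over candidate redirects. A minimal Python sketch of that reading follows; the production code is PHP, and `should_index_redirect`, `indexable_redirects`, and the example titles are hypothetical names chosen for illustration. The namespace numbers (0 for main, 4 for the project namespace) are MediaWiki's standard ones.

```python
NS_MAIN = 0       # MediaWiki main (article) namespace
NS_PROJECT = 4    # project namespace, named "Wikipedia" on Wikipedia wikis


def should_index_redirect(redirect_ns: int, target_ns: int) -> bool:
    """Mirror of the PHP condition: index a redirect only if it lives in the
    main namespace or in the same namespace as the page it points to."""
    return redirect_ns == NS_MAIN or redirect_ns == target_ns


def indexable_redirects(redirects, target_ns):
    """Keep only the (namespace, title) pairs that pass the condition."""
    return [(ns, title) for ns, title in redirects
            if should_index_redirect(ns, target_ns)]


# Hypothetical main-namespace article with one main-namespace redirect
# and one project-namespace redirect pointing at it.
candidates = [(NS_MAIN, "MC"), (NS_PROJECT, "REDIRECT")]
kept = indexable_redirects(candidates, target_ns=NS_MAIN)
print(kept)  # [(0, 'MC')]
```

Any code path that appends to the redirect array without going through a check like this would let cross-namespace redirects into the index.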

Redirects are not being cleared out on rebuild

Not sure why yet, but edits to the page that should be replacing the full document (including the redirects array) are not replacing it. Suspect we have something wrong with the noop handler parameters but more investigation is needed. Reminds me a bit of a long-standing concern with the noop handler that it should ideally either be nooping the whole document or none of it, but what it actually does is a per-field noop.
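
To make the per-field versus whole-document distinction concrete, here is a small Python sketch. It is not a diagnosis of the bug and does not model the real noop handler; the function names and the rebuild payload are hypothetical, and the case shown (an update that omits a field) is only one way a per-field strategy can leave old data behind.

```python
def per_field_noop_update(stored: dict, update: dict) -> dict:
    """Apply an update field by field: unchanged fields are skipped, and
    fields missing from the update are left as they were."""
    result = dict(stored)
    for field, value in update.items():
        if result.get(field) != value:
            result[field] = value
    return result


def whole_document_update(stored: dict, update: dict) -> dict:
    """All-or-nothing: noop if the documents are identical, else replace."""
    return stored if stored == update else dict(update)


stored = {
    "title": "Minecraft",
    "redirect": [{"namespace": 4, "title": "REDIRECT"}],
}
# Hypothetical rebuild whose payload does not carry the redirect field at all.
rebuild = {"title": "Minecraft"}

print(per_field_noop_update(stored, rebuild)["redirect"])    # old entry is still present
print("redirect" in whole_document_update(stored, rebuild))  # False
```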

Mixup of title and prefixed title

When the redirects from another namespace are indexed they should not have the namespace prefix. But clearly this is happening here. That shouldn't be a problem for the main namespace since we are not supposed to index redirects from other namespaces, but redirects within non-main namespaces need to be investigated and validated for correctness here.
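
One plausible way the doubling arises is sketched below in Python: if the index stores a redirect's namespace separately and a consumer prepends the namespace name to the stored title, a title that already carries the prefix comes out doubled. `NAMESPACE_NAMES` and `display_title` are hypothetical and stand in for whatever builds the link in the real code.

```python
# Hypothetical namespace-id -> name map; 4 is the project namespace,
# which is named "Wikipedia" on Wikipedia wikis.
NAMESPACE_NAMES = {0: "", 4: "Wikipedia", 10: "Template"}


def display_title(namespace: int, title: str) -> str:
    """Build link text as namespace name + ':' + unprefixed title."""
    prefix = NAMESPACE_NAMES[namespace]
    return f"{prefix}:{title}" if prefix else title


correct = {"namespace": 4, "title": "INTADMIN"}
mixed_up = {"namespace": 4, "title": "Wikipedia:INTADMIN"}  # prefixed title stored

print(display_title(**correct))   # Wikipedia:INTADMIN
print(display_title(**mixed_up))  # Wikipedia:Wikipedia:INTADMIN
```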

Attached patch is a partial fix: it addresses indexing redirects that shouldn't be indexed and the mixup of title and prefixed title. I'm still looking into why the redirects are not being correctly updated on document rebuild.

MR has been updated to also handle the problem where redirects were not correctly updated during a rebuild. Once that is deployed fixing the index will be two steps:

  1. We can use a regex query against redirect.title to find all pages with a : in the redirect array and issue a rerender. This should cover most of the bad data in the index
  2. The problem with the redirect array not always being updated is a little harder to fix directly. The in-place mutation should have covered most of it but clearly that doesn't work 100% of the time. We will depend on the Saneitizer process which rerenders all pages over a 16 week period. Those pages will also update from normal edits (including null-edits).
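
Step 1 amounts to selecting every document whose redirect array holds a title containing a colon. A minimal Python sketch over already-loaded documents is shown below; the document layout (`page_id`, a `redirect` list of `namespace`/`title` objects) and the sample titles are assumptions for illustration, and `pages_needing_rerender` is a hypothetical helper, not part of CirrusSearch.

```python
import re

COLON = re.compile(r":")


def pages_needing_rerender(docs):
    """Yield page ids of documents with any redirect title containing ':'."""
    for doc in docs:
        if any(COLON.search(r["title"]) for r in doc.get("redirect", [])):
            yield doc["page_id"]


docs = [
    {"page_id": 1, "redirect": [{"namespace": 0, "title": "Wikipedia:REDIRECT"}]},
    {"page_id": 2, "redirect": [{"namespace": 0, "title": "MC"}]},
    {"page_id": 3, "redirect": [{"namespace": 0, "title": "Star Trek: Voyager"}]},
    {"page_id": 4},  # no redirects at all
]
print(list(pages_needing_rerender(docs)))  # [1, 3]
```

Page 3 shows the trade-off: a colon can be a legitimate part of a title, so this selection over-matches. That is harmless here, since rerendering a correct page simply rewrites the same data.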

Change #1064482 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[mediawiki/extensions/CirrusSearch@master] Add support for natural title sorts

https://gerrit.wikimedia.org/r/1064482

Ignore above patch, it's on the wrong ticket.

kicked off rerenders for all pages that have a : in the redirect.title field, based on the latest search index loaded into hadoop on the 26th. That should rerender any page with bad redirect data, along with some number of pages that have a legitimate : in the redirect titles. Unfortunately the dump is from the 26th and the fix was deployed on the 27th, but pulling the data directly from elasticsearch is simply too expensive as we can only find that by running regex queries.

This is estimated to take ~3 hours to rerender, will check in after that.

rerenders are complete, example replication query looks to be fixed.

Is this supposed to be fixed?

image.png (662×1 px, 242 KB)

(Edit: link.)

Edit 2: apparently this is a different error since the pages did in fact exist.

All the other pages I was able to look up also existed at one point. The problem here is generally described as cross-namespace redirects being incorrectly indexed, and then not being cleaned up when the redirect was removed because it wasn't where it was expected to be in the search index.

So the first example there is of Hipps_Boson/sandbox -> Sarah Wayland. It looks like Hipps_Boson/sandbox was the original location of that page, and it was then moved to Sarah Wayland. I'll have to look closer to see what exactly happened there and how that managed to get indexed that way.

Change #1071267 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[mediawiki/extensions/CirrusSearch@master] [WIP] Drop matches from bad redirects, and count them

https://gerrit.wikimedia.org/r/1071267

It looks like User:Hipps Boson/sandbox was mistakenly moved to Hipps Boson/sandbox in the main namespace on the same day, which explains how the redirect might have gotten there in the first place. Doesn't explain why it's still there.

Two pieces to resolve:

  • I generated a list of potentially bad indexed redirects by comparing the latest cirrus index dump (from 2024-09-01) against the latest mw sql dump (also taken 2024-09-01). This generated a list of 68k pages. Spot checking a few, it seems maybe 50% are actually incorrect, but it's a short enough list that that's acceptable. Pushed rerenders for all of these pages. That appears to have resolved the example query above.
  • When a page is indexed into the search engine but doesn't actually exist we drop that from the results list. It seems like we should be doing the same for bad redirect matches. Attached patch does that, and additionally introduces a new metric we will need to add to our indexing dashboard that tracks the frequency of invalid results. I've run out of time to finish this today, but put up a WIP patch and re-opened this ticket to finish it out next week.
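
The dump comparison in the first bullet can be pictured as a set difference: every redirect the index claims for a page, minus the redirects the database knows about. The Python sketch below uses in-memory structures and made-up page ids. The real job ran against dumps and its code is not shown in this task, so `bad_indexed_redirects` and both input shapes are hypothetical.

```python
def bad_indexed_redirects(indexed, existing):
    """Return {page_id: [(ns, title), ...]} for indexed redirects that are
    absent from the set of redirects known to the database dump.

    indexed:  {page_id: [(ns, title), ...]} taken from the search index dump
    existing: {(target_page_id, (ns, title)), ...} taken from the SQL dump
    """
    bad = {}
    for page_id, redirects in indexed.items():
        missing = [r for r in redirects if (page_id, r) not in existing]
        if missing:
            bad[page_id] = missing
    return bad


# Hypothetical page ids; titles borrowed from examples in this task.
indexed = {
    101: [(0, "Hipps Boson/sandbox")],
    102: [(0, "MC")],
}
existing = {(102, (0, "MC"))}

to_rerender = bad_indexed_redirects(indexed, existing)
print(to_rerender)  # {101: [(0, 'Hipps Boson/sandbox')]}
```

Because the two dumps are snapshots, a comparison like this can flag pages that are actually fine, which is consistent with the roughly 50% hit rate seen when spot checking.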

The patch to count should be merged soon-ish. Need to decide where it goes, probably into the Elasticsearch Indexing dashboard
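
As a rough picture of what dropping and counting looks like, the Python sketch below removes results whose page no longer exists, strips redirect matches whose redirect no longer exists, and increments a counter for each case. It is not the merged patch: `clean_results`, the result layout, the existence callbacks, and the metric names are all assumptions made for illustration.

```python
from collections import Counter


def clean_results(results, page_exists, redirect_exists, metrics):
    """Drop results for missing pages and strip missing redirects,
    counting each kind of correction in `metrics`."""
    cleaned = []
    for result in results:
        if not page_exists(result["title"]):
            metrics["missing_page_dropped"] += 1
            continue
        redirect = result.get("redirect_from")
        if redirect is not None and not redirect_exists(redirect):
            metrics["bad_redirect_stripped"] += 1
            # Copy without the redirect so the caller's data is not mutated.
            result = {k: v for k, v in result.items() if k != "redirect_from"}
        cleaned.append(result)
    return cleaned


results = [
    {"title": "Sarah Wayland", "redirect_from": "Hipps Boson/sandbox"},
    {"title": "Minecraft", "redirect_from": "MC"},
    {"title": "Deleted page"},
]
existing_pages = {"Sarah Wayland", "Minecraft"}
existing_redirects = {"MC"}

metrics = Counter()
cleaned = clean_results(results, existing_pages.__contains__,
                        existing_redirects.__contains__, metrics)
print(cleaned)
print(dict(metrics))
```

Keeping the counts separate per correction type is what makes it possible to graph excluded results and stripped redirects independently.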

Change #1071267 merged by jenkins-bot:

[mediawiki/extensions/CirrusSearch@master] Count matches from bad redirects

https://gerrit.wikimedia.org/r/1071267

Graphed results are a little surprising, worse than I would have hoped for, but also not all that bad. We receive ~20-30 results each hour that get excluded from the results. Another 20-30 results per hour are now having the non-existent redirect stripped. Of those, ~5 results per hour might have only matched due to the redirect and are now a bit mysterious. Given that full text search does 300-500 req/s, which gives a lower bound of perhaps 20M pages inspected per hour, that is still an incredibly small fraction.
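
The size of that fraction can be checked with quick arithmetic. The Python sketch below takes the low end of the request rate and the high end of the affected counts from the comment above. The figure of 20 results per request is an assumption that is not stated in this task; it is simply the value that makes 300 req/s come out at roughly 20M pages per hour.

```python
SECONDS_PER_HOUR = 3600
requests_per_second = 300        # low end of the 300-500 req/s range
results_per_request = 20         # ASSUMPTION: not stated in the thread

pages_per_hour = requests_per_second * SECONDS_PER_HOUR * results_per_request
affected_per_hour = 30 + 30      # upper ends: excluded + redirect stripped

fraction = affected_per_hour / pages_per_hour
print(pages_per_hour)            # 21600000
print(f"{fraction:.2e}")         # 2.78e-06
```

Under these assumptions, fewer than 3 in every million inspected results are affected.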