
Index redirect content in categories
Open, Medium · Public

Description

Summary
Redirects that are placed in categories don't appear in search results for those categories (particularly when combined with "intitle:").

Description
As described on English Wikipedia:

If, for instance, I search for intitle:"Applied" with incategory:"Physics journals", I get the following. All is cool, that works.

https://en.wikipedia.org/w/index.php?search=intitle%3A%22Applied%22+AND+incategory%3A%22Physics+journals%22&title=Special:Search&profile=advanced&fulltext=1&ns0=1&ns10=1&searchToken=b6t28msp2y0zy79ysb63vpc54

However, if I search for intitle:"Appl." with incategory:"Redirects from ISO 4", I get this, which is no results at all.

https://en.wikipedia.org/w/index.php?search=intitle%3A%22Appl.%22+AND+incategory%3A%22Redirects+from+ISO+4%22&title=Special:Search&profile=advanced&fulltext=1&ns0=1&ns10=1&searchToken=4i52hwptx4fianpdm9cwg2wyb

How do I make this work? Can it be done?

https://en.wikipedia.org/wiki/Wikipedia:Village_pump_(technical)#How_to_find_intitle_matches_for_redirects_in_a_category?

Notes

Some initial thoughts from search engineers on #wikimedia-discovery IRC:

11:13 AM <•ebernhardson> CKoerner_WMF: problem isn't clear, the two certainly work together. intitle matches redirects, and redirects are the "same" document as the one it redirects to so categories match too
11:15 AM unless the user can present a counter case, my assumption would be search is working and there are no results
11:17 AM CKoerner_WMF: so, i looked at there is exactly 1 page in the category linked. And that page does not have 'Appl.' in the title or redirects
11:17 AM hmm, maybe they are elsewhere though...
11:20 AM so https://en.wikipedia.org/wiki/Category:Redirects_from_ISO_4 has 14k pages, and incategory:Redirects_from_ISO_4 has 1 result. no clue why though
11:20 AM <•dcausse> I think they expect the redirect page to be in the category
11:21 AM <•ebernhardson> dcausse: it is, from the perspective of cirrus, isn't it?
11:21 AM dcausse: the redirects are just the document they point at
11:21 AM oh, wait are the redirects *themselves* tagged? hmm
11:21 AM <•dcausse> but the redirects can have wikitext
11:21 AM <•ebernhardson> hmm, if that's the case then yea
11:21 AM <•dcausse> but here it's not really a category, it's {{R from ISO 4}}
11:22 AM https://en.wikipedia.org/w/index.php?title=Abstr._Appl._Anal.&redirect=no does not mention any category
11:23 AM <•ebernhardson> dcausse: you're right, i looked through and afaik the categories are all on the redirect page, not the page it links to, so we don't index it
11:23 AM not really sure how that would fit in the model. hmm.
11:24 AM <•dcausse> https://en.wikipedia.org/wiki/Wikipedia:Categorizing_redirects#How_to_categorize_a_redirect
11:25 AM seems complex :/
11:27 AM the category is marked with a template but this template is lost when populating the main doc with the redirect data :(
11:36 AM <•CKoerner_WMF> Should I file a bug?
11:36 AM <•dcausse> we should index redirect content at some point but that seems tricky to do right. perhaps with a redirect_doc : [ {id:XYZ, text:, version:.}, {...} ] and special noop handlers

Event Timeline

Restricted Application added a subscriber: Aklapper.

I'll point out that searching for

insource:"R from ISO 4"

https://en.wikipedia.org/w/index.php?search=insource%3A%22R+from+ISO+4%22&title=Special:Search&profile=default

or

hastemplate:"R from ISO 4"

https://en.wikipedia.org/w/index.php?search=hastemplate%3A%22R+from+ISO+4%22&title=Special:Search&profile=default

won't work either. The issue seems to be that redirects aren't indexed for searching anywhere, so you simply cannot look into them.

Redirects are indexed, but not as their own thing. Redirects in search are considered a property of the page linked to. Because of this the only information kept is the namespace and title of the redirect. Essentially the inability to find the categories or source text of a redirect is baked into the original design of CirrusSearch.
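As a rough illustration of that model, here is a minimal Python sketch using plain dicts. The field names (`title`, `category`, `redirect`), the target title and the target's category are assumptions made up for the example; this is not the actual CirrusSearch mapping or query code.

```python
# Hypothetical shape of one search document (field names are assumptions).
target_doc = {
    "namespace": 0,
    "title": "Abstract and Applied Analysis",  # illustrative target page
    "category": ["Mathematics journals"],      # categories of the target page only
    # Each redirect contributes nothing but its namespace and title.
    "redirect": [{"namespace": 0, "title": "Abstr. Appl. Anal."}],
}


def in_title(doc, text):
    """Match against the page title and the titles of its redirects."""
    titles = [doc["title"]] + [r["title"] for r in doc["redirect"]]
    return any(text.lower() in t.lower() for t in titles)


def in_category(doc, category):
    """Match against the categories stored on the document."""
    return category in doc["category"]


# The redirect title is found, but the category that lives on the
# redirect page itself was never stored, so the combined query fails.
title_hit = in_title(target_doc, "Appl.")
category_hit = in_category(target_doc, "Redirects from ISO 4")
print(title_hit, category_hit)
```

Under these assumptions the title clause matches through the redirect while the category clause cannot, which is the behaviour reported in the task description.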

We will have to think on appropriate solutions. I don't think separating redirects out into their own documents will be quite right. We will possibly want to find ways to store the appropriate information as part of the document being redirected to, much like the namespace+title. Simply adding the existing fields to the redirect representation unfortunately won't work, as the relationship between properties is lost.

More concretely, if we simply extended what we currently have, the theoretical query incategory:foo intitle:bar would be unable to match a single redirect in a page. Rather, it would see all categories from redirects as a single thing, and all redirect titles as a single thing, but have no information on the relationship between them.
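The difference can be sketched in a few lines of Python. Everything here is hypothetical: the two redirects, the field names `redirect_title` and `redirect_category`, and both matching functions are invented for illustration, and the per-redirect variant is only loosely modelled on the redirect_doc idea from the IRC notes.

```python
# Two hypothetical redirects pointing at the same page.
redirects = [
    {"title": "bar", "category": ["other"]},
    {"title": "baz", "category": ["foo"]},
]

# Flattened: extending the existing fields keeps the values but loses
# which title belongs with which category.
flattened = {
    "redirect_title": [r["title"] for r in redirects],
    "redirect_category": [c for r in redirects for c in r["category"]],
}


def match_flattened(doc, category, title):
    """Each clause is checked against the pooled values independently."""
    return category in doc["redirect_category"] and title in doc["redirect_title"]


def match_per_redirect(records, category, title):
    """Both clauses must hold for the same redirect record."""
    return any(category in r["category"] and title == r["title"] for r in records)


# No single redirect is titled "bar" and in category "foo", yet the
# flattened form reports a match.
flattened_hit = match_flattened(flattened, "foo", "bar")
per_redirect_hit = match_per_redirect(redirects, "foo", "bar")
print(flattened_hit, per_redirect_hit)
```

The flattened form gives a false positive here, whereas keeping one record per redirect preserves the pairing at the cost of a more complex document and update path.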

EBjune triaged this task as Medium priority. May 10 2018, 5:12 PM
EBjune moved this task from needs triage to search-icebox on the Discovery-Search board.

This is possibly related to T63080 and T143409

Vvjjkkii renamed this task from Index redirect content in categories to sgdaaaaaaa. Jul 1 2018, 1:11 AM
Vvjjkkii raised the priority of this task from Medium to High.
Vvjjkkii updated the task description.
Vvjjkkii removed a subscriber: Aklapper.
CommunityTechBot renamed this task from sgdaaaaaaa to Index redirect content in categories. Jul 2 2018, 3:16 PM
CommunityTechBot lowered the priority of this task from High to Medium.
CommunityTechBot updated the task description.
CommunityTechBot added a subscriber: Aklapper.