
Investigation: Searchability with category redirects
Closed, ResolvedPublic

Description

Gather facts on how articles in categories that have redirect aliases can be found, and on how redirect-alias categories could be used with Search / CirrusSearch.

  • If an article is categorized with [[Category:Ärztin]] and Ärztin is a category with a hard redirect to Arzt, how can we make sure the article is also found when someone searches for incategory:Arzt?
  • If someone searches on the hard-redirected category Ärztin with incategory:Ärztin, which articles would or could be found?

Event Timeline

Restricted Application added a subscriber: Aklapper. Aug 9 2019, 2:58 PM
WMDE-Fisch assigned this task to awight. Aug 9 2019, 2:58 PM
WMDE-Fisch moved this task from Sprint Backlog to Doing on the WMDE-QWERTY-Sprint-2019-08-08 board.

After talking to @WMDE-Fisch we timebox this to one day max.

Following the CirrusSearch code:

  • Updater#buildDocument calls ->
  • ContentHandler#getDataForSearchIndex ->
  • ParserOutputSearchDataExtractor#getCategories ->
  • ParserOutput#getCategoryLinks
  • comes from ParserOutput.mCategories

Our LinkUpdates hook won't be able to override this value, so it looks like we would have to do our work in the parser instead. This would be nicer because the parser is an upstream, authoritative source of information, and the modified categories list would propagate to other hooks. The downside is that we would be introducing additional database lookups in an already hot class, but there's plenty of precedent for that.

A safer alternative would be to implement a SearchDataForIndex hook, which lets us manipulate the categories after they're collected by ContentHandler#getDataForSearchIndex.
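
As a rough illustration of the hook-based approach, the sketch below models the post-processing step in Python rather than MediaWiki PHP. The redirect table, the function name, and the `category` field key are assumptions made for the example; they are not taken from the CirrusSearch code.

```python
from typing import Dict, List

# Hypothetical redirect table: alias category -> target category.
# In MediaWiki this information would come from the redirect data in the database.
CATEGORY_REDIRECTS: Dict[str, str] = {"Ärztin": "Arzt"}


def rewrite_indexed_categories(
    fields: Dict[str, List[str]], redirects: Dict[str, str]
) -> None:
    """Replace alias categories with their redirect targets, in place.

    `fields` stands in for the search document fields collected before
    the hook runs; the "category" key is an assumed field name.
    """
    resolved: List[str] = []
    for name in fields.get("category", []):
        # Follow a single hard redirect if one exists, else keep the name.
        target = redirects.get(name, name)
        # Skip duplicates so an article tagged with both alias and target
        # is indexed under the target only once.
        if target not in resolved:
            resolved.append(target)
    fields["category"] = resolved


doc_fields = {"category": ["Ärztin", "Frau", "Arzt"]}
rewrite_indexed_categories(doc_fields, CATEGORY_REDIRECTS)
```

With this kind of rewrite, an article tagged only with the alias would be indexed under the target category, so a search on the target would find it.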

Without any special code, a search for "incategory:Ärztin" will return nothing. It would be reasonable to alter CirrusSearch's InCategoryFeature to munge category titles and follow redirects; it already converts numeric category page IDs to category titles. We could add a hook to check for redirects here, but nothing is available at the moment.
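
The query-side idea can be sketched in the same spirit. This is a Python model, not the InCategoryFeature implementation: the function name and the redirect table are hypothetical, and the sketch assumes the keyword value may list several categories separated by `|`.

```python
from typing import Dict, List

# Hypothetical redirect table: alias category -> target category.
CATEGORY_REDIRECTS: Dict[str, str] = {"Ärztin": "Arzt"}


def rewrite_incategory_value(value: str, redirects: Dict[str, str]) -> str:
    """Map each category named in an incategory: value to its redirect target.

    Only a single redirect hop is followed; chains of redirects are not
    resolved in this sketch.
    """
    resolved: List[str] = []
    for name in value.split("|"):
        name = name.strip()
        if not name:
            continue  # ignore empty segments such as a trailing "|"
        target = redirects.get(name, name)
        if target not in resolved:
            resolved.append(target)
    return "|".join(resolved)


rewritten = rewrite_incategory_value("Ärztin", CATEGORY_REDIRECTS)
```

Here a search for incategory:Ärztin would be executed as a search for incategory:Arzt, which is the transformation a redirect-checking hook would need to perform.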

Thanks for the research @awight! One follow up question: incategory:Arzt would be returning everything in that category, including the ones labeled Ärztin?

Yes, this will be true as a consequence of categorizing [[Category:Ärztin]] articles under Arzt. This will be the only "real" categorization reflected in database links. Any magic we code for searching "incategory:Ärztin" is based on transforming the category name to "Arzt" before performing the search.

@Tobi_WMDE_SW, @awight: WMDE-QWERTY-Sprint-2019-08-08 was archived but this task is still open. Can you either associate an active project tag, or resolve this task? Thanks!

awight closed this task as Resolved. Mon, Sep 23, 8:06 AM
awight added a project: TCB-Team.