Investigate using blazegraph for deep category searching / returning of results
Closed, Resolved (Public)

Description

The good folks at WMDE have been working on a way to search for and return subcategories when a user requests them via special search.

We'd like to investigate whether Blazegraph would be a good way to implement this in production. It would probably be an extension to CirrusSearch (of some sort) that, in the advanced search functionality, searches (and returns results from) subcategories when the main category is searched.

We'd also need to look into how to update the categories automatically on a regular basis.

This was discussed at the Vienna Hackathon 2017.

Event Timeline

Searching in subcategories is very likely to expose strange and unintended results to users due to the counterintuitive structure of categories. I gave a more detailed explanation of why this is in T160234#3117759. It's worth spending some time thinking about whether investigating technical solutions is worth it, given that the results from such a search may be polluted with irrelevant results.

https://petscan.wmflabs.org/ offers a related service right now. The sweet spot for category recursion seems to be 8 levels deep. Shallower than that, results go missing; deeper, all sorts of strange results appear.
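To make the depth trade-off concrete, here is a minimal sketch of depth-limited category recursion as a breadth-first walk. The function name, the in-memory `children` mapping, and the toy category names are all hypothetical; this is not how PetScan or any production service is implemented.

```python
from collections import deque


def subcategories_within_depth(children, root, max_depth):
    """Return {category: depth} for root and all subcategories
    reachable within max_depth levels (breadth-first)."""
    depths = {root: 0}
    queue = deque([root])
    while queue:
        category = queue.popleft()
        depth = depths[category]
        if depth == max_depth:
            continue  # do not descend below the depth limit
        for child in children.get(category, ()):
            # Category graphs can contain cycles, so skip anything already seen.
            if child not in depths:
                depths[child] = depth + 1
                queue.append(child)
    return depths


# Hypothetical toy graph: parent category -> direct subcategories.
children = {
    "Telecommunications": ["Mobile telephony", "Internet"],
    "Mobile telephony": ["Mobile phones"],
    "Mobile phones": ["Telecommunications"],  # a cycle back to the root
    "Internet": [],
}

shallow = subcategories_within_depth(children, "Telecommunications", 1)
deep = subcategories_within_depth(children, "Telecommunications", 8)
```

With a limit of 1 the walk misses "Mobile phones"; with a limit of 8 it finds it and still terminates despite the cycle. A larger limit only widens the set, which is why deep limits pull in loosely related categories.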

Search has always worked with messy data, as opposed to querying, where you work with structured data. This is a solvable search problem.

Would it be available for non-search functionality as well? Filtering change lists by category is something patrollers want very much.

This is something that we'll need @Smalyshev to take a look at; moving it to Up Next on the board.

Would it be available for non-search functionality as well?

Probably yes, but I'm not sure exactly how this will work. More details will follow in the coming weeks.

Change 327862 had a related patch set uploaded (by EBernhardson; owner: Smalyshev):
[mediawiki/core@master] Produce RDF dump of all categories and subcategories in a wiki.

https://gerrit.wikimedia.org/r/327862

Stas recently announced that the category trees of a few wikis are now available as RDF dumps and in the Wikidata Query Service. More documentation is at:
https://www.mediawiki.org/wiki/Wikidata_query_service/Categories

deepcat: or deepcategory: are the keywords for CirrusSearch, now merged.

Great! Will it be deployed with the train next week then?

I think 1.31.0-wmf21 was cut just before this was merged, so this will go to production wikis next week, once the config https://gerrit.wikimedia.org/r/#/c/410242/ is also deployed. The earliest would be Friday 23, if we SWAT the config on Thursday evening.

We can deploy the config anytime; it doesn't do anything without the code. I'll put it into the next SWAT.

@Smalyshev I tried the keyword again today, and it is recognized, but the results seem to be exactly the same as the incategory search.
E.g. deepcategory:"Telekommunikation (Deutschland)" returns its 37 direct subpages, but none of the pages in the subcategories.
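A comparison like the one above can be reproduced through the MediaWiki search API. The sketch below only builds the two request URLs for the standard `action=query&list=search` module. The helper name `build_search_url` is hypothetical, and the choice of de.wikipedia.org as the target wiki is an assumption based on the German category name.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def build_search_url(api_url, keyword, category):
    """Build a MediaWiki search API URL for a category keyword query."""
    params = {
        "action": "query",
        "list": "search",
        "srsearch": f'{keyword}:"{category}"',  # e.g. deepcategory:"Some category"
        "format": "json",
    }
    return f"{api_url}?{urlencode(params)}"


# Assumption: the report refers to German Wikipedia.
API_URL = "https://de.wikipedia.org/w/api.php"
CATEGORY = "Telekommunikation (Deutschland)"

deep_url = build_search_url(API_URL, "deepcategory", CATEGORY)
direct_url = build_search_url(API_URL, "incategory", CATEGORY)
```

Fetching both URLs and comparing the total hit counts in the responses should show whether deepcategory returns more than incategory. Identical counts would match the behaviour reported here.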

@Lea_WMDE I think the category download is broken now (T188293: Categories download is broken), due to some URL changes that happened without my knowledge and broke some scripts. I'll fix it and re-test.