Allow more_like_this searches (return articles similar to query text)
Closed, Resolved (Public)

Description

Author: alf

Description:
Elasticsearch provides a "More Like This" query, which - given some initial text - extracts the key terms and uses those to build a new query, returning the documents that best match those terms.

If this were available in MediaWiki's search API, it would allow the index to be queried by example. This would be useful for finding the Wikipedia articles most similar to a starting document (e.g. "Wikipedia articles related to this page", shown alongside a news story), and also for automatically categorising documents (using the categories attached to the most similar Wikipedia articles).
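As a rough illustration of the categorisation idea, the sketch below counts how often each category appears among the most similar articles and suggests the most frequent ones. The shape of the hits (a list of dicts with a "categories" key), the function name and the sample data are all made up for the example; nothing here comes from the actual search API.

```python
from collections import Counter


def suggest_categories(similar_articles, top_n=3):
    """Return the categories that occur most often among the similar articles.

    `similar_articles` is assumed to be a list of dicts, each with a
    "categories" list. This structure is hypothetical.
    """
    counts = Counter()
    for article in similar_articles:
        # Use a set so one article cannot vote twice for the same category.
        counts.update(set(article["categories"]))
    return [category for category, _ in counts.most_common(top_n)]


# Made-up hits standing in for the results of a "more like this" search.
hits = [
    {"title": "A", "categories": ["Physics", "Optics"]},
    {"title": "B", "categories": ["Physics"]},
    {"title": "C", "categories": ["Optics", "Physics"]},
]
suggested = suggest_categories(hits, top_n=2)
```

A real implementation would probably weight each vote by the similarity score of the hit rather than counting every article equally.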

An example query: https://gist.github.com/hubgit/6365895

Most of those parameters (fields to query, fields to return, number of items to return, query text) can be passed through as query parameters, and the others (min_term_freq, max_query_terms, percent_terms_to_match) can be hard-coded to values appropriate for the index.

It might be appropriate to use POST for the query, as the query text can be a whole document.
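A minimal sketch of what that could look like, assuming the parameter names mentioned above (like_text and percent_terms_to_match are the names used by Elasticsearch releases of this era; later releases may name them differently). The tuning values, the field names and the endpoint URL are placeholders, not values taken from the gist or from any real index. The request is only constructed here, not sent.

```python
import json
from urllib.request import Request

# Hypothetical hard-coded tuning values; the right numbers depend on the index.
MIN_TERM_FREQ = 1
MAX_QUERY_TERMS = 25
PERCENT_TERMS_TO_MATCH = 0.3


def build_mlt_body(like_text, query_fields, return_fields, size):
    """Build a search body that wraps a more_like_this query."""
    return {
        "fields": return_fields,  # fields to return
        "size": size,             # number of items to return
        "query": {
            "more_like_this": {
                "fields": query_fields,  # fields to query
                "like_text": like_text,  # query text (may be a whole document)
                "min_term_freq": MIN_TERM_FREQ,
                "max_query_terms": MAX_QUERY_TERMS,
                "percent_terms_to_match": PERCENT_TERMS_TO_MATCH,
            }
        },
    }


def build_post_request(url, body):
    """Wrap the body in a POST request so long query text is not put in the URL."""
    data = json.dumps(body).encode("utf-8")
    return Request(
        url,
        data=data,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


body = build_mlt_body("Full text of a news story ...", ["text"], ["title"], 5)
# Placeholder endpoint; the index name "wiki" is an assumption.
request = build_post_request("http://localhost:9200/wiki/_search", body)
```

Passing `request` to `urllib.request.urlopen` would send it; that step is left out so the sketch runs without a server.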


Version: unspecified
Severity: enhancement

Details

Reference
bz53474

Event Timeline

bzimport set the priority of this task to Low (Nov 22 2014, 1:58 AM).
bzimport added a project: CirrusSearch.
bzimport set Reference to bz53474.
bzimport added a subscriber: Unknown Object (MLST).

Change 83819 had a related patch set uploaded by Manybubbles:
Tests for morelike:.

https://gerrit.wikimedia.org/r/83819

Gerrit Notification Bot just added my integration tests to the bug as well.

So what I've implemented is that users can search for:
morelike:<article name>
and we do an MLT search against that article's text.
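For illustration, here is one way a client might issue such a search through the standard MediaWiki search API (action=query, list=search). Whether morelike: is reachable this way on a given wiki is an assumption; the helper name, the example host and the article name are placeholders. The URL is only built, not fetched.

```python
from urllib.parse import urlencode


def morelike_search_url(api_url, article_name, limit=10):
    """Build a search API URL whose query is morelike:<article name>."""
    params = {
        "action": "query",
        "list": "search",
        "srsearch": "morelike:" + article_name,
        "srlimit": limit,
        "format": "json",
    }
    # urlencode percent-encodes the colon and any spaces in the article name.
    return api_url + "?" + urlencode(params)


# Placeholder wiki and article name.
url = morelike_search_url("https://en.wikisource.org/w/api.php", "Hamlet")
```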

Change 83819 merged by jenkins-bot:
Tests for morelike:.

https://gerrit.wikimedia.org/r/83819

Verified on enwikisource. It is slow! Searches take 12 seconds. It is fun though.