
Allow customization of morelike for experimentation
Closed, Resolved (Public)

Description

Right now Cirrus supports morelike:<pageName> for finding similar pages. It uses the more_like_this query from Elasticsearch (and in turn MoreLikeThis from Lucene). We're using pretty much default parameters for it, and readership would like to experiment with changing them. We should allow that. Here is what I wrote in an email about what we can configure:

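$wgCirrusSearchMoreLikeThisConfig = array(
    'min_doc_freq' => 2,              // Minimum number of documents (per shard) that need a term for it to be considered
    'max_query_terms' => 25,          // Maximum number of terms selected for the generated query
    'min_term_freq' => 2,             // Minimum number of times a term must occur in the source page to be considered
    'percent_terms_to_match' => 0.3,  // Fraction of the selected terms that a result must match
    'min_word_len' => 0,              // Words shorter than this are ignored; 0 means no minimum
    'max_word_len' => 0,              // Words longer than this are ignored; 0 means no maximum
);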

Here is the reference for what they mean and any more we might be able to set: https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-mlt-query.html

We only use the "text" field of the articles - no weighting based on, well, anything. See the text field in https://en.wikipedia.org/wiki/Barack_Obama?action=cirrusdump for example.
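As a rough illustration of how the options above and the "text" field could end up in a more_like_this query body, here is a minimal sketch. buildMoreLikeThisQuery() is a hypothetical helper, not CirrusSearch code, and the like_text key is an assumption about the Elasticsearch version in use (newer versions name it differently, so check the reference linked above).

```php
<?php
// Sketch only: buildMoreLikeThisQuery() is a hypothetical helper, not CirrusSearch code.

/**
 * Build the body of a more_like_this query.
 *
 * @param array $config Options such as min_doc_freq, max_query_terms, ...
 * @param string[] $fields Fields to compare; the task says only "text" is used today.
 * @param string $likeText Text of the page we want similar pages for.
 * @return array Structure that can be passed to json_encode().
 */
function buildMoreLikeThisQuery( array $config, array $fields, string $likeText ): array {
    return [
        'more_like_this' => array_merge(
            // 'like_text' is an assumption; the key depends on the Elasticsearch version.
            [ 'fields' => $fields, 'like_text' => $likeText ],
            $config
        ),
    ];
}

// Same values as the configuration shown in the task description.
$config = [
    'min_doc_freq' => 2,
    'max_query_terms' => 25,
    'min_term_freq' => 2,
    'percent_terms_to_match' => 0.3,
    'min_word_len' => 0,
    'max_word_len' => 0,
];

$query = buildMoreLikeThisQuery( $config, [ 'text' ], 'text of the source article' );
echo json_encode( $query, JSON_PRETTY_PRINT ), "\n";
```

Keeping the options in one flat array means an experiment only has to change values in $config; the query shape stays the same.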

Stuff we could do really, really easily:
1. Add URL parameters that override each of those options for easy experimenting.
2. Add URL parameters to use different fields like our weighted all field, the wikitext, or intro paragraphs (don't ask how we extract into paragraphs - it's a horrible hack), or the section headers, or the "secondary" text like the infoboxes and image subtitles.

This task is to do #1 and #2. #1 is harder because we need to come up with reasonable limits on the values of the parameters.
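One way to approach both items is to read the overrides from the request, clamp each numeric value to a fixed range, and accept only whitelisted field names. The sketch below is illustrative only: the mlt_ parameter names, the limits, the helper functions, and every field name other than "text" are assumptions, not what the merged patch does.

```php
<?php
// Sketch only: parameter names, limits, and field names are hypothetical.

// Assumed bounds for each numeric option.
const MLT_LIMITS = [
    'min_doc_freq'           => [ 'type' => 'int',   'min' => 1,   'max' => 100 ],
    'max_query_terms'        => [ 'type' => 'int',   'min' => 1,   'max' => 100 ],
    'min_term_freq'          => [ 'type' => 'int',   'min' => 1,   'max' => 100 ],
    'percent_terms_to_match' => [ 'type' => 'float', 'min' => 0.0, 'max' => 1.0 ],
    'min_word_len'           => [ 'type' => 'int',   'min' => 0,   'max' => 100 ],
    'max_word_len'           => [ 'type' => 'int',   'min' => 0,   'max' => 100 ],
];

// Assumed field names; only "text" is confirmed by the task description.
const MLT_ALLOWED_FIELDS = [ 'text', 'all', 'source_text', 'opening_text', 'heading', 'auxiliary_text' ];

// Override the defaults with request values, clamped to the assumed limits.
function applyMltOverrides( array $defaults, array $params ): array {
    $config = $defaults;
    foreach ( MLT_LIMITS as $option => $limit ) {
        $raw = $params['mlt_' . $option] ?? null;
        if ( !is_numeric( $raw ) ) {
            continue; // Missing or malformed: keep the configured default.
        }
        $value = $limit['type'] === 'int' ? (int)$raw : (float)$raw;
        $config[$option] = max( $limit['min'], min( $limit['max'], $value ) );
    }
    return $config;
}

// Pick the requested fields, dropping unknown names and falling back to "text".
function pickMltFields( array $params ): array {
    $requested = array_filter( explode( ',', $params['mlt_fields'] ?? '' ) );
    $fields = array_values( array_intersect( $requested, MLT_ALLOWED_FIELDS ) );
    return $fields ?: [ 'text' ];
}

$defaults = [
    'min_doc_freq' => 2,
    'max_query_terms' => 25,
    'min_term_freq' => 2,
    'percent_terms_to_match' => 0.3,
    'min_word_len' => 0,
    'max_word_len' => 0,
];
// Example request: one value too large, one valid, one malformed, one unknown field.
$params = [
    'mlt_max_query_terms' => '5000',
    'mlt_percent_terms_to_match' => '0.5',
    'mlt_min_term_freq' => 'abc',
    'mlt_fields' => 'opening_text,bogus',
];

$config = applyMltOverrides( $defaults, $params );
$fields = pickMltFields( $params );
```

Clamping instead of rejecting keeps experiment URLs working when someone passes an extreme value, at the cost of silently changing what was asked for; rejecting with an error would be the stricter alternative.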

  • Stakeholders: (1) Readership and (2) Editing
  • Benefits: (1) Allows readership to experiment with options to improve related article recommendations and (2) allows editing to experiment with options to improve next article to edit recommendations
  • Estimate: One or two days

Details

Related Gerrit Patches:
mediawiki/extensions/CirrusSearch (master): Add options to customize MoreLikeThis queries

Event Timeline

Manybubbles raised the priority of this task from to Needs Triage.
Manybubbles updated the task description.
Manybubbles added a project: Discovery.
Manybubbles moved this task to Search on the Discovery board.
Manybubbles added a subscriber: Manybubbles.
Restricted Application added a subscriber: Aklapper. Jun 2 2015, 3:35 PM
Tgr added a subscriber: Tgr. Jun 2 2015, 6:41 PM

The most obvious source for "more like this" referrals is the "See also" section, which is now ignored completely. Is there a way to take that into account?

> The most obvious source for "more like this" referrals is the "See also" section, which is now ignored completely. Is there a way to take that into account?

Sure, but it's much more work. We could index a new field for it. I'd love to be able to do it in a way that works across all wikis but I don't see that happening.

It's not fair to say it's completely ignored - it's just that Cirrus doesn't think of that section as any different from the rest of the article text.

Tgr added a comment. Jun 2 2015, 7:53 PM

Why not? Just tell tech ambassadors to list the possible names for "See also"-type sections on some MediaWiki page.

Manybubbles set Security to None.
Deskana updated the task description. Jun 11 2015, 4:50 PM
dcausse claimed this task. Jun 12 2015, 3:22 PM

Change 220825 had a related patch set uploaded (by DCausse):
WIP: Add options to customize MoreLikeThis queries

https://gerrit.wikimedia.org/r/220825

Change 220825 merged by jenkins-bot:
Add options to customize MoreLikeThis queries

https://gerrit.wikimedia.org/r/220825

Deskana closed this task as Resolved. Sep 12 2015, 2:44 AM
Deskana moved this task from Done to Resolved on the Discovery-Search (Current work) board.
Deskana added a subscriber: Deskana.