Cross-wiki search results: making sure the results are relevant from sister projects
Closed, Declined (Public)

Description

We want to be sure that the results we show in the cross-wiki search results side bar are actually relevant from each project.

For example, if I search for 'pants' on Wikipedia, we don't want to show some obscure results from Wikispecies that might have the word 'pants' somewhere in the description (like 'cats wearing pants' or 'beetle crawling up my pants')

If there aren't relevant search results to show from any number of the sister projects, we shouldn't show the non-relevant ones just to have that project represented.

We just need to figure out a great way to determine relevancy for the sister project search results. Easy, right? ;)

Event Timeline

Some extra notes / random ideas:

  • We could empirically determine the 90th percentile of results scores from queries on the wiki, and for each wiki specify a minimum threshold. This would need to be automated if we are going to do it for all languages.
  • We could modify the query we use on sister projects to be higher precision, without worrying about lower recall, because we need only one result. A simple example would be to require a match in the article title. This could be too restrictive.
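A minimal sketch of the title-match idea from the second bullet, assuming the sister-project lookup goes through the standard MediaWiki search API (`action=query`, `list=search`) and uses the `intitle:` keyword. The function name `build_sister_query` and the choice to prefix every term are illustrative assumptions, not how the sidebar is actually implemented.

```python
def build_sister_query(user_query: str, require_title_match: bool = True) -> dict:
    """Build search API parameters for a single sister-project lookup.

    Hypothetical helper: the name and the per-term intitle: strategy are
    assumptions made for illustration only.
    """
    terms = user_query.split()
    if require_title_match and terms:
        # Higher precision: every term must appear in the page title.
        search = " ".join(f"intitle:{term}" for term in terms)
    else:
        # Fall back to the unmodified query (higher recall, lower precision).
        search = user_query
    return {
        "action": "query",
        "list": "search",
        "srsearch": search,
        "srlimit": 1,  # only one result is needed for the side bar
        "format": "json",
    }


params = build_sister_query("cat pants")
print(params["srsearch"])
```

Requiring every term in the title is the strictest variant; as noted above, it could be too restrictive for multi-word queries.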

A/B tests on any of these "high value results" would be helpful, though we may have to think carefully about how/what we measure, since "better results" may result in fewer results—so cross-project ZRR is probably not a good metric.

Edit: After talking to David, I learned that 90th percentile scores probably won't work, but score/#searchTerms might. Of course we'd have to check into the details and feasibility of any proposed approach, but it's still good to brainstorm.

I stumbled upon an interesting example of a low-value result today on Italian WP. Searching for gato (Sp., etc., "cat") gives, as the third result from Italian Wiktionary, un (It., Sp., etc., "a"), because un gato ("a cat") is given as a usage example in Venetian.

Our plan for generic cross-wiki search is to only have one result, and of course the first result here is great (gato) and the second result is very good (gatto), but it's clear that weak results can score highly. I think this is more likely on Wiktionary, where there's likely to be little competition because there can be very few occurrences overall of the search terms in the wiki—including random usage examples like this.

An arbitrary heuristic like this is probably fine to apply to inter-wiki results since it's not a big deal if results are not shown. I'd be wary of applying it to a standard set of search results for the wiki you're on, though. I know we're not considering this presently, but I wanted to get that on record. :-)

Yeah, the goal with the interwiki results is just to cull the bottom 50% that's dreck.

For more details on the 90th percentile score, the problem (in outline) is that each term gets a score and they are summed, so a score of 1.0 could be one term that scores 1.0, or ten terms that each score 0.1. The former is much better than the latter. score/#terms would normalize that. There's also the problem that different wikis may have different scoring methods (at least until the BM25 roll out is complete), so that might not work everywhere.
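The normalization described here can be sketched as follows. This is only an illustration under stated assumptions: terms are split on whitespace, and the cut-off `min_per_term=0.5` is a made-up value, not a tuned threshold.

```python
def per_term_score(total_score: float, query: str) -> float:
    """Divide the summed score by the number of search terms."""
    terms = query.split()  # assumption: terms are whitespace-separated
    if not terms:
        return 0.0
    return total_score / len(terms)


def keep_result(total_score: float, query: str, min_per_term: float = 0.5) -> bool:
    """Keep a sister-project result only if its per-term score is high enough.

    min_per_term is a hypothetical threshold chosen for this example.
    """
    return per_term_score(total_score, query) >= min_per_term


# One term scoring 1.0 versus ten terms scoring 0.1 each: same total, very
# different per-term score.
one_term = "pants"
ten_terms = "a b c d e f g h i j"
print(keep_result(1.0, one_term), keep_result(1.0, ten_terms))
```

With the same total score of 1.0, the single-term query passes the cut-off and the ten-term query does not, which is the distinction the raw summed score hides.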

intitle: is looking better all the time. ;-)

This will be tricky. One thing to keep in mind is that a dedicated query such as intitle sounds like the easiest solution, but it has a few drawbacks:

  • if we have to display the number of results in the side box it may be inaccurate
  • when the user clicks on "Show more", we have to make sure that the result we displayed in the side box is the first result when the user lands on the target wiki

While looking at sistersearch a moment ago I realized that wikinews isn't getting the default 'prefer-recent' behaviour that we use when displaying results on wikinews itself. That will likely be needed to show the most relevant news items.

Yes it's an issue, I don't know if the prefer-recent is really important but I agree this is wrong.
We will likely rescore on pageviews as well which is probably not well tuned for sister wikis.

Then would we have to load the full config from every sister wiki?
Or maybe just a special hack, like an array:

$wgCirrusSearchInterwikiProfiles = [
    'wikinews' => [
        'wgCirrusSearchRescoreProfile' => 'xyz',
        'wgCirrusSearchPreferRecentXyz' => '',
        // A subset of well-known config vars that we know are useful and that we can
        // stuff into the config object, as done by UserTesting,
        // or directly into some objects if the Cirrus PHP API allows it.
    ],
];

Similarly, I think we should do something about namespace filtering. Currently we will likely filter on namespace:0, which is likely to exclude interesting results from e.g. wikisource with the author namespace.
What we could do is detect whether namespaces == contentNamespaces on the host wiki and, if so, simply remove the namespaces from the sister wiki query.
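A rough sketch of that check, assuming namespaces are passed around as lists of integer IDs and that omitting the filter lets the sister wiki fall back to its own defaults. The function name `sister_namespaces` and the use of `None` to mean "no filter" are illustrative assumptions.

```python
from typing import Iterable, List, Optional


def sister_namespaces(
    requested: Iterable[int],
    host_content_namespaces: Iterable[int],
) -> Optional[List[int]]:
    """Decide which namespace filter to send with the sister wiki query.

    Returns None (no filter) when the user is searching exactly the host
    wiki's content namespaces, so the sister wiki can apply its own defaults.
    Hypothetical helper for illustration only.
    """
    requested_set = set(requested)
    if requested_set == set(host_content_namespaces):
        return None
    # The user picked a custom set of namespaces, so pass it through as-is.
    return sorted(requested_set)


print(sister_namespaces([0], [0]))
print(sister_namespaces([0, 1], [0]))
```

The first call drops the filter because the request matches the host wiki's content namespaces; the second keeps the explicit selection.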

T156497 is one option forward here. The more hackish option is also possible, but I'm not sure how maintainable it will be.

I prefer your solution; we could also write a dedicated API for that purpose.
What we could do is use the search test fixtures with a custom SearchConfig object. This SearchConfig would record the list of needed globals, which we could use to build the list of vars that we have to export. Later we could make searcher tests fail when a required config value is not present in this list.
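The recording idea can be illustrated with a small language-neutral sketch. It is written in Python rather than against the real Cirrus `SearchConfig` class, so `RecordingConfig`, `missing_exports`, and the setting name `wgHypotheticalSetting` are all assumptions made for the example.

```python
class RecordingConfig:
    """Config wrapper that remembers which settings were read."""

    def __init__(self, values: dict):
        self._values = dict(values)
        self.accessed = set()

    def get(self, name: str):
        # Record the name before the lookup so that reads of missing
        # settings are tracked too.
        self.accessed.add(name)
        return self._values[name]


def missing_exports(config: RecordingConfig, exported: set) -> set:
    """Settings that were read during a test run but are not exported."""
    return config.accessed - set(exported)


config = RecordingConfig({
    "wgCirrusSearchRescoreProfile": "xyz",
    "wgHypotheticalSetting": 1,  # made-up name for illustration
})
config.get("wgCirrusSearchRescoreProfile")
config.get("wgHypotheticalSetting")

# A test could fail whenever this set is non-empty.
print(missing_exports(config, {"wgCirrusSearchRescoreProfile"}))
```

Running the existing fixtures through such a wrapper would yield the list of vars to export, and a later assertion on `missing_exports` would catch newly required values.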

debt removed dcausse as the assignee of this task. Mar 7 2017, 6:48 PM

We've moved this to the backlog for now - we can pick this up again later on, possibly even after this new feature (displaying search results from sister projects) would go into production.

The work and intent behind this task is important, but this task is not actionable and does not have clear acceptance criteria. I've declined it and replaced it with T156497: Change loading of cross-wiki configuration to use cirrus-dump-config api call which is meant to work along these lines.