
Add method for more like api query to not boost pages by popularity
Closed, ResolvedPublic

Description

As part of rolling out the related articles feature to mobile @JKatzWMF has relayed a request from the community to *not* boost popular pages in the search ranking. This feature, boosting by the number of incoming links, has been a default feature of wikimedia search both in the old lucene search engine and now in CirrusSearch. Additionally CirrusSearch has a feature not present in the old lucene search that boosts pages which contain templates specified on wiki (https://en.wikipedia.org/wiki/MediaWiki:Cirrussearch-boost-templates).

The incoming link boosting feature is already available in a hacked together manner, purely for testing purposes. Any request, api or web, can send the query parameter cirrusBoostLinks=no and incoming links will not be considered in final scoring. There is unfortunately no current way to disable template boosting for morelike queries (it can be disabled in normal full text search by including boost-templates:"" in the query) but we can come up with something to add. The ability to disable one or both needs to be exposed in a manner that is explicitly supported in the API rather than via unregistered query parameters.

  • related page query scoring does not take into consideration incoming links
  • related page query scoring does not take into consideration templates (this one sounds like it is harder to do)...open to other ideas to mitigate the overly strong preference on featured articles
  • a list of before and after sample articles' results are published on the related pages project page https://www.mediawiki.org/wiki/Reading/Web/Projects/Related_pages

Event Timeline


@Deskana Jon Katz stopped by my desk today and asked if we could work this out. They need this ability to be able to ship the related articles feature

In terms of MediaWiki / CirrusSearch internals, this would be best supported by using SearchEngine feature flag. Any arbitrary string can be used as a feature flag.

The main question, in my mind, before implementing this is how should it be exposed in the external API? Previously with the query rewriting feature we exposed an explicit API option enablerewrites. That seems reasonable because it's not something that is done to all queries, but is an optional feature that could be applied to some queries if the search backend thinks it would be useful.

With the feature in this ticket, popularity boosting happens for all queries by default and we want some way to turn it off. Ideally boosting links and boosting templates can be toggled independently, but it feels much too special cased to expose api flags for each one.

Options:

  • Add a hook to ApiQuerySearch::getAllowedParams(). Add CirrusSearch specific API parameters from a hook subscriber. I'm not sure how the parameter would then get passed on to CirrusSearch. Perhaps we need a second hook in ApiQuerySearch::run()?
  • Add a features multi-param to ApiQuerySearch. This would accept arbitrary input and somehow convert it into features to be applied on the SearchEngine instance. This is very generic, but i'm not sure i like it. arbitrary != discoverable.
  • Something else? @Anomie are there better options I don't know about?

As part of rolling out the related articles feature to mobile @JKatzWMF has relayed a request from the community to *not* boost popular pages in the search ranking.

@Deskana Jon Katz stopped by my desk today and asked if we could work this out. They need this ability to be able to ship the related articles feature

Let's slow down for a second here. This sounds incredibly odd to me. Why is this being requested? What problems is it causing? Can someone point me to the discussion?

The issue was discussed on the mobile mailing list, in the later parts of a thread named Similar articles feature performance in CirrusSearch for apps and mobile web. Linked concerns from users about the results are here and here.

Katz may be able to add more clarity to their purposes.

@EBernhardson thanks for moving this forward! As you suggested this afternoon, I think we can try this out and see how it goes in beta, before going beyond the hack. Is this something that @Jdlrobson and the funkybunch can do on their end? I would also caveat that this might not be a blocker, but we are definitely taking some hits for it. We are about to engage in an RFC with the community to get more feedback and will know in a couple of weeks if it is, indeed a blocker.

@Deskana the principle being invoked by community members is that we don't necessarily want to optimize for clicks: putting scandalous news stories at the bottom of the page could do that. The members of the community that have spoken would be more amenable to highlighting obscure but relevant pages rather than well-known pages.

Regardless of optimal balance, I agree that popularity is currently overweighted for this particular use case. According to @dcausse here: https://lists.wikimedia.org/pipermail/mobile-l/2016-February/010122.html, 'popularity' makes up two-thirds of the score. For search that makes a lot of sense: after all, the more popular something is, the more likely you are to be searching for it. However, for showing a user something they don't yet know about, popularity is arguably less valuable. JK Rowling appears on sooo many author pages and I think this might help.

We are currently reviewing all these rescore features; it appears that Cirrus does not allow fine-grained configuration. Due to the nature of the formula we use, it's nearly impossible to fine-tune the system (a bit more of incoming links, less of templates...).
The system is more or less binary: enable or disable the rescore feature.

I've added a profile that disables everything; you can experiment with it by adding this custom parameter: cirrusMoreLikeRescoreProfile=empty (example: Sarah_Mason_(novelist))
Without any rescore you'll certainly encounter other drawbacks of the system: it tends to rank small articles higher (but it's maybe what you want).

As pointed out by @EBernhardson this is not an official parameter, so we may want to either:

  • create a new API param that allows switching between profiles
  • create a new dedicated API endpoint for morelike

We are trying to resolve the core issues and allow fine-grained tuning but it will require a lot more work...

Options:

  • Add a hook to ApiQuerySearch::getAllowedParams(). Add CirrusSearch specific API parameters from a hook subscriber. I'm not sure how the parameter would then get passed on to CirrusSearch. Perhaps we need a second hook in ApiQuerySearch::run()?

The hook for getAllowedParams already exists as 'APIGetAllowedParams'.

I note that you're already getting your "cirrusBoostLinks=no" parameter into Cirrus somehow, this could work in exactly the same way. Or we could add an "ApiConfigureSearchEngine" hook that gets called from the appropriate place.

  • Add a features multi-param to ApiQuerySearch. This would accept arbitrary input and somehow convert it into features to be applied on the SearchEngine instance. This is very generic, but i'm not sure i like it. arbitrary != discoverable.

I agree, arbitrary != discoverable and that's no good.

But we can do it non-arbitrarily: have a "features" multi-param that in core accepts only "rewrite" by default, then Cirrus could use the APIGetAllowedParams hook to add additional values for the param. ApiQuerySearch would just call $search->setFeatureData( $value, true ) for each string in the parameter, or it could iterate over all the possible values and do something like $search->setFeatureData( $value, in_array( $value, $param['features'], true ) ) if SearchEngine would prefer to have them all explicitly set to true or false.

This would deprecate the existing 'srenablerewrites' parameter in favor of 'srfeatures=rewrite', BTW.
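
To make the two variants of the "features" multi-param concrete, here is a minimal sketch in Python rather than PHP. It is only a model of the idea, not MediaWiki code: FakeSearchEngine, allowed_features, apply_features and the feature names "noboostlinks" and "noboosttemplates" are all hypothetical, and the extension_values argument merely stands in for values an extension would add through the APIGetAllowedParams hook.

```python
# Hypothetical model of the proposal; none of these names are MediaWiki APIs.
CORE_FEATURES = ["rewrite"]  # core would accept only this value by default


class FakeSearchEngine:
    """Stand-in for SearchEngine that only records feature flags."""

    def __init__(self):
        self.features = {}

    def set_feature_data(self, name, value):
        self.features[name] = value


def allowed_features(extension_values=()):
    """Core values plus whatever an extension registers (models the hook)."""
    extra = [v for v in extension_values if v not in CORE_FEATURES]
    return CORE_FEATURES + extra


def apply_features(engine, requested, allowed, explicit=False):
    """Copy the requested features onto the engine, rejecting unknown ones."""
    unknown = [name for name in requested if name not in allowed]
    if unknown:
        raise ValueError(f"unrecognized feature(s): {unknown}")
    if explicit:
        # Second variant: every allowed feature is set to True or False.
        for name in allowed:
            engine.set_feature_data(name, name in requested)
    else:
        # First variant: only the requested features are switched on.
        for name in requested:
            engine.set_feature_data(name, True)
    return engine.features
```

Because the allowed values are enumerated up front, an unknown value is rejected instead of being silently passed through, which is what keeps the parameter discoverable rather than arbitrary.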

  • Something else? @Anomie are there better options I don't know about?

For anything that's not Cirrus-specific, there's also the option to add it to core like 'enablerewrites' is now. But that doesn't seem like the case here.

The patches needed to control query independent ranking algorithms have been merged and should be available on all wikis when wmf5 is deployed to group2 wikis.
You can discover the profiles currently available by using ApiSandbox, see the new param qiprofile in generator=search.
We added only 3 profiles but we'll certainly add more in the future :

  • classic: the default
  • classic_noboostlinks: classic without boost links, (some templates on enwiki are still boosted)
  • empty: disable all query independent factors, added mostly for debug purposes because it will disable some cirrus special syntax (prefer-recent, boost-templates). But for morelike queries it's perfectly OK to use it.

@dcausse excellent! Thanks!

@Jhernandez who on web might help run a comparison of how this impacts, say, ~30 results?

@JKatzWMF I can help if you tell me exactly what you need. Is it comparing 30 random titles and their morelike results?

Something like morelike:Sara Mason vs morelike:Sara Mason w/ cirrusMoreLikeRescoreProfile=empty?

  • How many results? First 3?
  • Random articles or a specific list of them?

@Jhernandez beware that the cirrusMoreLikeRescoreProfile param will be disabled when wmf5 is deployed to group2 wikis.

You should use the new qiprofile param when using the search api.
If you want to use Special Search for testing you will have to use cirrusRescoreProfile

Thanks @dcausse

@JKatzWMF here are the modes we can test:

classic

Ranking based on the number of incoming links, some templates, article language and recency (templates/language/recency may not be activated on this wiki).

classic_noboostlinks

Ranking based on some templates, article language and recency when activated on this wiki.

empty

Ranking based solely on query dependent features (for debug only).

I've spent some time making a script, and printed the results for 10 random articles: https://www.mediawiki.org/wiki/Extension:RelatedArticles/CirrusSearchComparison

From what I can see, classic differs quite a lot from both classic_noboostlinks and empty, and classic_noboostlinks and empty are very similar, with minor differences.
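
One way to put a number on "differs quite a lot" versus "very similar" is a set-overlap score between the result lists of each pair of profiles. The sketch below is not the script mentioned above; the function names are made up for illustration and the input is assumed to be a mapping from profile name to a list of result titles that has already been fetched.

```python
def overlap(a, b):
    """Jaccard similarity between two lists of result titles (order ignored)."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 1.0  # two empty result lists are treated as identical
    return len(set_a & set_b) / len(set_a | set_b)


def compare_profiles(results_by_profile):
    """Return the overlap score for every pair of profiles."""
    names = sorted(results_by_profile)
    return {
        (x, y): overlap(results_by_profile[x], results_by_profile[y])
        for i, x in enumerate(names)
        for y in names[i + 1:]
    }


# Placeholder titles only; these are not real search results.
sample = {
    "classic": ["A", "B", "C"],
    "classic_noboostlinks": ["C", "D", "E"],
    "empty": ["C", "D", "F"],
}
scores = compare_profiles(sample)
```

A score of 1.0 means two profiles returned the same set of titles and 0.0 means they share none. The score ignores ordering, so it says nothing about which result ranked first.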

I haven't analyzed the quality of the results yet, though.

If you want me to run it on more articles, either random or handpicked, tell me and I'll do so.

Any suggestions on wikitext format would be appreciated. Coloring cells somehow? I couldn't think of a good way of doing it without knowing what we're looking for.

I've been having a deeper look, and there are some really different cases:

For example, for Nicole_Moudaber, with the classic scoring there are things like Hillary Clinton and Hezbollah, which make no sense and have nothing to do with the article, and then a bunch of famous artists that may not have anything to do with this artist, but seem to be there because they are famous pages.

With the empty and noboostlinks the suggestions seem to be more accurate, other DJs and producers for example.

Gonna update the wikitext to link to the articles for easier research.

@Jhernandez Sorry I didn't get this to you earlier, I didn't realize this had been prioritized for work. I left in the commentary where I could so that we have some context to evaluate by. Here are two lists:

My favorites:
https://en.wikipedia.org/wiki/Will_Self
https://en.wikipedia.org/wiki/Kiss (lesbian?)
https://en.wikipedia.org/wiki/Amanda_Green (bernadette peters?)
https://en.wikipedia.org/wiki/Uttar_Pradesh (India...a little generic)
https://en.wikipedia.org/wiki/United_States_Senate_elections,_1922 (Minnesota...why?)
https://en.wikipedia.org/wiki/A_Summer_Bird-Cage (this one is detailed here: http://permalink.gmane.org/gmane.org.wikimedia.mobile/5024)
https://en.wikipedia.org/wiki/Don_Perata

Community mentioned:
https://en.wikipedia.org/wiki/Revolution_Muslim and https://en.wikipedia.org/wiki/Chesser (both give Chess as a suggestion)
https://en.wikipedia.org/wiki/List_of_serial_killers_before_1900 gives https://en.wikipedia.org/wiki/Peru_national_football_team as a suggestion.
https://en.wikipedia.org/wiki/Murder_of_Kelly_Anne_Bates suggests https://en.wikipedia.org/wiki/Batman.
https://en.wikipedia.org/wiki/Korur_language suggests Anus, Anal and Poop.
https://en.wikipedia.org/wiki/Sarah_Mason_(novelist)
https://en.wikipedia.org/wiki/Murke%27s_Collected_Silences (had jaws and michael jackson)
https://en.wikipedia.org/wiki/Gabriels,_New_York
https://en.wikipedia.org/wiki/Isabel_Fonseca https://en.wikipedia.org/wiki/Andrew_Michael_Hurley https://en.wikipedia.org/wiki/The_Queen_of_the_Tearling https://en.wikipedia.org/wiki/Did_You_Ever_Have_a_Family, https://en.wikipedia.org/wiki/Tell_The_Wolves_I%27m_Home https://en.wikipedia.org/wiki/Cathy_Marie_Buchanan https://en.wikipedia.org/wiki/John_Michael_Cummings
https://en.wikipedia.org/wiki/Blood_Ties_(Hinton_novel)
https://en.wikipedia.org/wiki/Jack_(Homes_novel)

https://en.wikipedia.org/wiki/Opernpassage (vienna seems lame)
https://en.wikipedia.org/wiki/Yazoo_and_Mississippi_Valley_Railroad Listed articles: Memphis, Tennessee (central in link network), W.C. Hardy (bizarre, possibly central to network?) and Alabama (central in link network)

https://en.wikipedia.org/wiki/Aim%C3%A9_Ngoy_Mukena Military of the Democratic Republic of Congo (linked on page), Democratic Republic of Congo (linked on page), Lubumbashi (bizarre, possibly central to network?)

https://en.wikipedia.org/wiki/Francis_Patrick_Donovan Gough Whitlam (useful), Stanley Bruce (useful), Australia (central to network, linked on page)

https://en.wikipedia.org/wiki/Thomas_Meehan_(writer) Musical theatre (central, and linked on page), Maury Yeston (Unexpected, interesting connection: useful), Hairspray (2007 film) (linked on page)

https://en.wikipedia.org/wiki/Michael_E._Smith Aztec (central), OCLC (central and bizarre), Nahuatl (central)

https://en.wikipedia.org/wiki/James_Morrill Minnesota (central), W.E.B. Du Bois (central), Michigan State University (central and/or tangential).

@JKatzWMF No problem, it hasn't been prioritized, but I took prof. development time to solve this since it seemed interesting.

Now it's very easy to run the tests since I have the script done, so here you go:

I haven't done an in-depth analysis, but there are many cases where the problems seem to disappear (like Chess appearing in Revolution_Muslim, for example).

@Jhernandez this is excellent!!!!!!! These results are really promising. I am excited to show this to those who expressed this specific concern.


Nice job, everyone!

@EBernhardson / @debt how do we use this in API? Could we have some sample API links to test with?
Not clear from subtask whether this task was actually resolved or closed by accident.

@Jdlrobson you have a new qiprofile param that accepts one of three values:

  • classic: default
  • classic_noboostlinks: Ranking based on some templates, article language and recency when activated on this wiki.
  • empty: Ranking based solely on query dependent features (for debug only). For morelike it's ok to use this one.

You can use the api sandbox to discover the profiles available:
https://en.wikipedia.org/wiki/Special:ApiSandbox#action=query&format=json&generator=search&gsrsearch=morelike%3AAlbert+Einstein&gsrqiprofile=classic_noboostlinks
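
For calling the API directly rather than through the sandbox, the same parameters can be assembled into a request URL. This is a minimal sketch: the parameter names are taken from the sandbox link above, while the api.php endpoint path and the morelike_url helper are assumptions made for illustration. It only builds the URL and does not send a request.

```python
from urllib.parse import urlencode

# Assumed endpoint; the thread only links to Special:ApiSandbox.
API_ENDPOINT = "https://en.wikipedia.org/w/api.php"

# The three profiles listed in this task.
PROFILES = ("classic", "classic_noboostlinks", "empty")


def morelike_url(title, profile="classic"):
    """Build a generator=search URL for a morelike query with a qiprofile."""
    if profile not in PROFILES:
        raise ValueError(f"unknown qiprofile: {profile!r}")
    params = {
        "action": "query",
        "format": "json",
        "generator": "search",
        "gsrsearch": f"morelike:{title}",
        "gsrqiprofile": profile,
    }
    # urlencode escapes ':' as %3A and spaces as '+', as in the sandbox link.
    return f"{API_ENDPOINT}?{urlencode(params)}"


url = morelike_url("Albert Einstein", "classic_noboostlinks")
```

The profile list is hard-coded here only to catch typos early; since more profiles may be added later, the sandbox remains the place to check which values a given wiki accepts.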