
Newcomer tasks: when selecting multiple topics, one topic should not dominate over the others
Open, Needs Triage, Public

Description

If you search for morelikethis:Art|Physics, you get all physics results: morelikethis gets a representative set of words from those articles and assigns some weight to them based on their frequency in the wiki, and there is no reason for those weights to be equal. If we want to show a somewhat diverse mix of tasks, we need a way to equalize the score contribution of different topics. (Not sure how many topic pairs are affected by this problem though. It is obviously language-dependent, and probably depends on the difficulty filter settings as well.)

ORES drafttopic search might have similar issues (the current plan is to implement it by using morelike search on the drafttopic parameters), so while fulltext-based morelike search might not be around for long, experimenting with solutions for this problem probably does have lasting value.

Details

Related Gerrit Patches:
mediawiki/extensions/GrowthExperiments@wmf/1.35.0-wmf.14: Newcomer tasks: Make a separate search query for every topic
mediawiki/extensions/GrowthExperiments@master: Newcomer tasks: Make a separate search query for every topic

Event Timeline

Tgr created this task. Jan 10 2020, 11:09 PM
Restricted Application added a subscriber: Aklapper. Jan 10 2020, 11:09 PM
Tgr added a comment. Jan 12 2020, 11:27 PM

Another option is to search for each topic separately and then interleave, as Kosta suggested elsewhere.

Change 563810 had a related patch set uploaded (by Gergő Tisza; owner: Gergő Tisza):
[mediawiki/extensions/GrowthExperiments@master] Newcomer tasks: Make a separate search query for every topic

https://gerrit.wikimedia.org/r/563810

Change 563810 merged by jenkins-bot:
[mediawiki/extensions/GrowthExperiments@master] Newcomer tasks: Make a separate search query for every topic

https://gerrit.wikimedia.org/r/563810

Tgr added a comment. Jan 13 2020, 8:35 PM

With the patch merged, we now search for each topic separately, and interleave the results (and then randomize the order of the whole thing), so this is not an issue. I think it would be good to return to doing everything in a limited number of searches in the future, in which case it will become an issue again.
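The merged patch's behavior (one search query per topic, round-robin interleave, then shuffle) can be sketched as follows. This is an illustrative sketch in Python, not the actual GrowthExperiments PHP code; the function and parameter names are made up.

```python
import random
from itertools import zip_longest

def interleave_and_shuffle(result_sets, limit=None, seed=None):
    """Round-robin interleave per-topic result lists, dedupe, then shuffle.

    result_sets: list of ranked result lists, one per selected topic.
    """
    merged, seen = [], set()
    # Take the i-th result from each topic in turn (round-robin),
    # so no single topic can dominate the head of the list.
    for tier in zip_longest(*result_sets):
        for item in tier:
            if item is not None and item not in seen:
                seen.add(item)
                merged.append(item)
    if limit is not None:
        merged = merged[:limit]
    # Finally randomize the order of the whole thing, as described above.
    random.Random(seed).shuffle(merged)
    return merged
```

For example, `interleave_and_shuffle([["A1", "A2"], ["P1", "P2", "P3"]])` guarantees both art and physics results appear near the top before the final shuffle, regardless of raw score differences between the two queries.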

Change 564162 had a related patch set uploaded (by Catrope; owner: Gergő Tisza):
[mediawiki/extensions/GrowthExperiments@wmf/1.35.0-wmf.14] Newcomer tasks: Make a separate search query for every topic

https://gerrit.wikimedia.org/r/564162

Change 564162 merged by jenkins-bot:
[mediawiki/extensions/GrowthExperiments@wmf/1.35.0-wmf.14] Newcomer tasks: Make a separate search query for every topic

https://gerrit.wikimedia.org/r/564162

There are two main options I can think of for merging what are essentially two separate morelikethis queries while giving them equal weight:

  1. Use a weight parameter per-query. This is plausible, but must be tuned on a per-query basis.
  2. Normalize each score to [0,1) with a sigmoid-like function and sum the two. By normalizing and adding we prefer results that score highly for both queries. For this task some testing suggests a low midpoint and a steep slope on the sigmoid will do best, as the high-performing results will generally score between [0.8,1), meaning our best results have to have scored well in both, and not just happen to match a word or two.

The sigmoid used is x^a / (x^a + m^a), where a = steepness of the sigmoid, and m = midpoint, i.e. the value of x where the result == 0.5. I highly suggest plugging values into a graphing calculator (Desmos or similar) to understand what this does to the scores. A few result sets for different parameters are in the table below. A result set like Mathematical beauty, La Femme au Cheval, Symmetry wouldn't be great, but better than what's returned today. I would additionally worry a bit that parameters which return plausible results for this particular query may not work well on other queries. We essentially need a larger set of test queries than this current one to evaluate against.

m  | a   | 1st result          | 2nd result               | 3rd result
50 | 0.8 | Branches of physics | Mechanics                | History of physics
50 | 3   | Mechanics           | History of physics       | Classical Mechanics
50 | 5   | Mechanics           | Classical mechanics      | History of physics
50 | 8   | Goethean science    | Light                    | Natural science
25 | 0.8 | History of physics  | Branches of physics      | Mechanics
25 | 3   | Mathematical beauty | La Femme au Cheval       | Symmetry
25 | 5   | Mathematical beauty | La Femme au Cheval       | Symmetry
25 | 8   | La Femme au Cheval  | Mathematical beauty      | Symmetry
15 | 0.8 | History of physics  | Mechanics                | Branches of physics
15 | 3   | La Femme au Cheval  | Mathematical beauty      | Symmetry
15 | 5   | Mathematical beauty | La Femme au Cheval       | Symmetry
15 | 8   | La Femme au Cheval  | Mathematical beauty      | Symmetry
5  | 0.8 | History of physics  | Natural science          | Mechanics
5  | 3   | La Femme au Cheval  | Symmetry                 | Mathematical beauty
5  | 5   | La Femme au Cheval  | Mathematical beauty      | Symmetry
5  | 8   | La Femme au Cheval  | Ernst Wilhelm von Brucke | Geometry
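The normalize-and-sum scheme from option 2 can be sketched in Python. This is a toy model of the scoring function only; the real computation runs inside the search engine, and the raw score values below are illustrative.

```python
def sigmoid(x, m, a):
    """x^a / (x^a + m^a): squashes a raw Lucene score into [0, 1).

    m = midpoint (the raw score that maps to 0.5), a = steepness.
    """
    return x ** a / (x ** a + m ** a)

def combined_score(scores, m=25, a=3):
    """Normalize each sub-query's raw score, then sum.

    Because each term is capped near 1, a result must score well on
    *both* sub-queries to beat one that matches only a single topic.
    """
    return sum(sigmoid(s, m, a) for s in scores)
```

For instance, with m=25 and a=3, a result scoring 60 on both sub-queries beats a result scoring 120 on physics but only 20 on art, even though its raw sum is lower; that is exactly the "must match both" preference described above.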
Tgr added a comment. Fri, Jan 24, 11:29 PM

Cool, that sounds a lot less effort to maintain than per-topic weights. If defining a custom function is easy, maybe we could use the max of the sigmoids instead of the sum? I think that's closer to how topic search is intended to work (if I check that I'm interested in art and physics, I would want articles that are about art or physics, not necessarily both).

@MMiller_WMF any thoughts about that? If you set arts + physics, which would you prefer? A mix of very typical art articles and very typical physics articles, or articles that are related to both?

Although I guess another consideration is that when we do something similar for ORES scores, we want to expose those as a search keyword, and there it would be pretty cool if we could return Symmetry for an abouttopic:art|physics query.

...or maybe it's possible to use max for abouttopic:art|physics but sum for abouttopic:art abouttopic:physics? That seems the most intuitive to me, and independent keywords are summed anyway, right?

@Tgr -- I think I prefer "a mix of very typical art articles and very typical physics articles". I think that trying to do the "AND" is more fancy than our users will realize or benefit from. If we somehow do it easily, that's a bonus -- but I wouldn't want us to try to actually be able to find articles that are related to both, say, physics and fashion.

@Halfak may have ideas or insight on how to tackle this whole situation.

> Cool, that sounds a lot less effort to maintain than per-topic weights. If defining a custom function is easy, maybe we could use the max of the sigmoids instead of the sum? I think that's closer to how topic search is intended to work (if I check that I'm interested in art and physics, I would want articles that are about art or physics, not necessarily both).

Getting to Art OR Physics, without allowing one to dominate the other is a bit difficult. I'm having trouble thinking of a math function we could apply to make the scores comparable without having prior knowledge such as the max score for each side. For more background see https://cwiki.apache.org/confluence/display/LUCENEJAVA/ScoresAsPercentages

If we don't want Art to influence Physics, or vice versa, the most direct route might be to issue two separate queries and merge the result sets ourselves. On the upside, we already calculate and cache morelike queries for individual pages, both at the edge caches (n.b. we need to make sure the requests match how article recommendations are requested, so they hit the same cache) and inside CirrusSearch. That means making both requests in parallel and interleaving the results in JavaScript should often return results faster than issuing a query that has to go all the way to the search cluster.

> @MMiller_WMF any thoughts about that? If you set arts + physics, which would you prefer? A mix of very typical art articles and very typical physics articles, or articles that are related to both?
> Although I guess another consideration is that when we do something similar for ORES scores, we want to expose those as a search keyword, and there it would be pretty cool if we could return Symmetry for an abouttopic:art|physics query.
> ...or maybe it's possible to use max for abouttopic:art|physics but sum for abouttopic:art abouttopic:physics? That seems the most intuitive to me, and independent keywords are summed anyway, right?

We can certainly define a variant that uses max(); the problem will be how the scores work out. If our top 3 results for morelikethis:art have scores 86, 85, 84 and the top 3 scores for morelikethis:physics are 156, 148, 140, then it doesn't matter that we selected the highest-scoring sub-query per document, because in the end dozens of physics articles score higher than the first art result. To go over the options:

  • morelikethis:A|B: Treat concatenation of A and B as a single document and determine the 25 "most important" words. Search for those words and require at least 30% of those words to exist in any result.
  • morelikethis:A morelikethis:B: Determine 25 words for each document separately. All result documents must match 30% of the words selected on both sides. Scores are the sum of each individual word that matched from all queries. The same word could be selected on both sides and will be counted twice.
  • proposed (syntax tbd) morelikethis_dismax:A|B: Determine 25 words for each document separately. Result documents only need to match a single sub-query. Final score is the sum of word-matching scores for one of the two provided source articles. As mentioned, if one of the sub-queries returns higher scores than the other, it will win. I would expect it to be rare that two different morelikethis queries return results in the same numerical range; it is heavily dependent on the words that are selected and how common they are in our corpus.
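The scoring difference between the sum and dismax variants can be illustrated numerically. The per-sub-query scores below are toy values in the ranges quoted above, not real Lucene output:

```python
def score_sum(art, physics):
    # morelikethis:A morelikethis:B -- sub-query scores are added,
    # and the document must match 30% of the words on both sides.
    return art + physics

def score_dismax(art, physics):
    # proposed morelikethis_dismax:A|B -- the best sub-query wins.
    return max(art, physics)

# Physics sub-query scores run much higher than art ones, so with
# dismax a mid-pack physics article still outranks the top art article.
best_art_result = score_dismax(86, 0)       # top art result: 86
mid_physics_result = score_dismax(0, 140)   # 3rd physics result: 140
```

Here `best_art_result < mid_physics_result`, which is exactly the dominance problem: picking the max per document does nothing to equalize the score ranges of the two sub-queries.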

> @Tgr -- I think I prefer "a mix of very typical art articles and very typical physics articles". I think that trying to do the "AND" is more fancy than our users will realize or benefit from. If we somehow do it easily, that's a bonus -- but I wouldn't want us to try to actually be able to find articles that are related to both, say, physics and fashion.

Unfortunately the fancy part is creating a balanced OR statement; the AND case seems more tractable. As mentioned above, it sounds like the best course of action for your desired results is to issue two separate search queries and display some subset of both result sets.

Notes on testing morelikethis variations:

Experimenting with this is a bit verbose, but here are some basic instructions. Feel free to set up some time and we can work through this together and find something reasonable.

  • take the __main__.query object from https://en.wikipedia.org/wiki/?search=morelikethis:Art|Physics&cirrusDumpQuery
    • The highlight object can be dropped, it's not necessary for current needs and is quite verbose.
    • The rescore object can be dropped, or could be retained (and possibly tuned) if we want to include page popularity adjustments in the final score.
  • Submit the query to our replica in cloud (only accessible from wmf cloud instances) by putting the query in a file (here named test.q) and running the following:
curl -XGET -H 'Content-Type: application/json' https://cloudelastic.wikimedia.org:8243/enwiki_content/page/_search -d @test.q | jq '.hits.hits | map(._source.title)'

A few example queries:

  • P10306: Currently deployed morelikethis:Art|Physics for enwiki prod. Produces a single set of words for both source documents and searches for them.
  • P10307: Currently deployed morelikethis:Art morelikethis:Physics. Produces two sets of words and separately adds the scores together. All results must match 30% of the selected words from each set.
  • P10308: max(morelikethis:Art, morelikethis:Physics). Scores each word set and chooses per-article the word set that has the highest score. Because the physics scores are so much higher, this is still basically just physics results.
  • P10309: max(sigmoid(morelikethis:Art), sigmoid(morelikethis:Physics)). Because the sigmoid is a monotonic function (it will not change the order), this returns the same results as above. At a high level, instead of max(120, 80) we get max(0.95, 0.91).
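The monotonicity point in the P10309 example can be checked directly: applying the sigmoid to each sub-query score before taking the max never changes which sub-query wins, so the ranking is identical. A sketch using the x^a / (x^a + m^a) sigmoid from earlier in the thread, with illustrative score values:

```python
def sigmoid(x, m=25, a=3):
    # x^a / (x^a + m^a) is strictly increasing in x for x, m, a > 0,
    # so it preserves the relative order of any two raw scores.
    return x ** a / (x ** a + m ** a)

raw = [(120, "physics article"), (80, "art article")]

raw_winner = max(raw)[1]
squashed_winner = max((sigmoid(score), title) for score, title in raw)[1]
# max(120, 80) and max(sigmoid(120), sigmoid(80)) pick the same document,
# which is why P10309 returns the same results as P10308.
```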
Tgr added a comment (edited). Thu, Feb 13, 1:15 AM

Thanks for the detailed writeup and instructions @EBernhardson!

> Unfortunately the fancy part is creating a balanced OR statement; the AND case seems more tractable. As mentioned above, it sounds like the best course of action for your desired results is to issue two separate search queries and display some subset of both result sets.

We do that currently (do multiple queries and interleave the results), but if the user selects a lot of topics it can result in a hundred separate queries, which blows up.

Luckily, I think this problem goes away now that we'll apply the threshold on the score upload side. Since for the Growth use case we only care about filtering to a set of tasks with a reasonably good match to the selected topics, and not so much about sorting up the best matches within those topics, we can just use a single similarity search with sort=random - no hundred separate queries (which I imagine would be problematic even if they would be sub-queries within a single ES query), no bias in the results, but still an acceptable match quality (assuming the thresholds discussed in T244297 work out well).