
Newcomer tasks: when selecting multiple topics, one topic should not dominate over the others
Open, Needs Triage, Public

Description

If you search for morelikethis:Art|Physics, you get all physics results: morelikethis gets a representative set of words from those articles and assigns some weight to them based on their frequency in the wiki, and there is no reason for those weights to be equal. If we want to show a somewhat diverse mix of tasks, we need a way to equalize the score contribution of different topics. (Not sure how many topic pairs are affected by this problem though. It is obviously language-dependent, and probably depends on the difficulty filter settings as well.)

ORES drafttopic search might have similar issues (the current plan is to implement it by using morelike search on the drafttopic parameters), so while fulltext-based morelike search might not be around for long, experimenting with solutions for this problem probably does have lasting value.

Details

Related Gerrit Patches:
mediawiki/extensions/GrowthExperiments@wmf/1.35.0-wmf.14: Newcomer tasks: Make a separate search query for every topic
mediawiki/extensions/GrowthExperiments@master: Newcomer tasks: Make a separate search query for every topic

Event Timeline

Tgr created this task. Jan 10 2020, 11:09 PM
Restricted Application added a subscriber: Aklapper. Jan 10 2020, 11:09 PM
Tgr added a comment. Jan 12 2020, 11:27 PM

Another option is to search for each topic separately and then interleave, as Kosta suggested elsewhere.

Change 563810 had a related patch set uploaded (by Gergő Tisza; owner: Gergő Tisza):
[mediawiki/extensions/GrowthExperiments@master] Newcomer tasks: Make a separate search query for every topic

https://gerrit.wikimedia.org/r/563810

Change 563810 merged by jenkins-bot:
[mediawiki/extensions/GrowthExperiments@master] Newcomer tasks: Make a separate search query for every topic

https://gerrit.wikimedia.org/r/563810

Tgr added a comment. Jan 13 2020, 8:35 PM

With the patch merged, we now search for each topic separately, and interleave the results (and then randomize the order of the whole thing), so this is not an issue. I think it would be good to return to doing everything in a limited number of searches in the future, in which case it will become an issue again.
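The merged patch's behavior (one search query per topic, round-robin interleave, then shuffle) can be sketched as follows. This is an illustrative sketch in Python, not the actual GrowthExperiments PHP code; the function and parameter names are made up.

```python
import random
from itertools import zip_longest

def interleave_and_shuffle(result_sets, limit=None, seed=None):
    """Round-robin interleave per-topic result lists, dedupe, then shuffle.

    result_sets: list of ranked result lists, one per selected topic.
    """
    merged, seen = [], set()
    # Take the i-th result from each topic in turn (round-robin),
    # so no single topic can dominate the head of the list.
    for tier in zip_longest(*result_sets):
        for item in tier:
            if item is not None and item not in seen:
                seen.add(item)
                merged.append(item)
    if limit is not None:
        merged = merged[:limit]
    # Finally randomize the order of the whole thing, as described above.
    random.Random(seed).shuffle(merged)
    return merged
```

For example, `interleave_and_shuffle([["A1", "A2"], ["P1", "P2", "P3"]])` guarantees both art and physics results appear near the top before the final shuffle, regardless of raw score differences between the two queries.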

Change 564162 had a related patch set uploaded (by Catrope; owner: Gergő Tisza):
[mediawiki/extensions/GrowthExperiments@wmf/1.35.0-wmf.14] Newcomer tasks: Make a separate search query for every topic

https://gerrit.wikimedia.org/r/564162

Change 564162 merged by jenkins-bot:
[mediawiki/extensions/GrowthExperiments@wmf/1.35.0-wmf.14] Newcomer tasks: Make a separate search query for every topic

https://gerrit.wikimedia.org/r/564162

There are two main options I can think of for merging what are essentially two separate morelikethis queries while giving them equal weight:

  1. Use a weight parameter per-query. This is plausible, but must be tuned on a per-query basis.
  2. Normalize each score to [0,1) with a sigmoid-like function and sum the two. By normalizing and adding we prefer results that score highly for both queries. For this task some testing suggests a low midpoint and a steep slope on the sigmoid will do best, as the high-performing results will generally score between [0.8,1), meaning our best results have to have scored well in both, and not just happen to match a word or two.

The sigmoid used is x^a / (x^a + m^a), where a = steepness of the sigmoid, and m = midpoint, i.e. the value of x where the result == 0.5. I highly suggest plugging values into a graphing calculator (Desmos or similar) to understand what this does to the scores. A few result sets for different parameters are in the table below. A result set like Mathematical beauty, La Femme au Cheval, Symmetry wouldn't be great, but better than what's returned today. I would additionally worry a bit that parameters which return plausible results for this particular query may not work well on other queries. We essentially need a larger set of test queries than this current one to evaluate against.

m  | a   | 1st result          | 2nd result               | 3rd result
50 | 0.8 | Branches of physics | Mechanics                | History of physics
50 | 3   | Mechanics           | History of physics       | Classical Mechanics
50 | 5   | Mechanics           | Classical mechanics      | History of physics
50 | 8   | Goethean science    | Light                    | Natural science
25 | 0.8 | History of physics  | Branches of physics      | Mechanics
25 | 3   | Mathematical beauty | La Femme au Cheval       | Symmetry
25 | 5   | Mathematical beauty | La Femme au Cheval       | Symmetry
25 | 8   | La Femme au Cheval  | Mathematical beauty      | Symmetry
15 | 0.8 | History of physics  | Mechanics                | Branches of physics
15 | 3   | La Femme au Cheval  | Mathematical beauty      | Symmetry
15 | 5   | Mathematical beauty | La Femme au Cheval       | Symmetry
15 | 8   | La Femme au Cheval  | Mathematical beauty      | Symmetry
5  | 0.8 | History of physics  | Natural science          | Mechanics
5  | 3   | La Femme au Cheval  | Symmetry                 | Mathematical beauty
5  | 5   | La Femme au Cheval  | Mathematical beauty      | Symmetry
5  | 8   | La Femme au Cheval  | Ernst Wilhelm von Brucke | Geometry
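The normalize-and-sum scheme from option 2 can be sketched in Python. This is a toy model of the scoring function only; the real computation runs inside the search engine, and the raw score values below are illustrative.

```python
def sigmoid(x, m, a):
    """x^a / (x^a + m^a): squashes a raw Lucene score into [0, 1).

    m = midpoint (the raw score that maps to 0.5), a = steepness.
    """
    return x ** a / (x ** a + m ** a)

def combined_score(scores, m=25, a=3):
    """Normalize each sub-query's raw score, then sum.

    Because each term is capped near 1, a result must score well on
    *both* sub-queries to beat one that matches only a single topic.
    """
    return sum(sigmoid(s, m, a) for s in scores)
```

For instance, with m=25 and a=3, a result scoring 60 on both sub-queries beats a result scoring 120 on physics but only 20 on art, even though its raw sum is lower; that is exactly the "must match both" preference described above.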
Tgr added a comment. Fri, Jan 24, 11:29 PM

Cool, that sounds a lot less effort to maintain than per-topic weights. If defining a custom function is easy, maybe we could use the max of the sigmoids instead of the sum? I think that's closer to how topic search is intended to work (if I check that I'm interested in art and physics, I would want articles that are about art or physics, not necessarily both).

@MMiller_WMF any thoughts about that? If you set arts + physics, which would you prefer? A mix of very typical art articles and very typical physics articles, or articles that are related to both?

Although I guess another consideration is that when we do something similar for ORES scores, we want to expose those as a search keyword, and there it would be pretty cool if we could return Symmetry for an abouttopic:art|physics query.

...or maybe it's possible to use max for abouttopic:art|physics but sum for abouttopic:art abouttopic:physics? That seems the most intuitive to me, and independent keywords are summed anyway, right?

@Tgr -- I think I prefer "a mix of very typical art articles and very typical physics articles". I think that trying to do the "AND" is more fancy than our users will realize or benefit from. If we somehow do it easily, that's a bonus -- but I wouldn't want us to try to actually be able to find articles that are related to both, say, physics and fashion.

@Halfak may have ideas or insight on how to tackle this whole situation.

> Cool, that sounds a lot less effort to maintain than per-topic weights. If defining a custom function is easy, maybe we could use the max of the sigmoids instead of the sum? I think that's closer to how topic search is intended to work (if I check that I'm interested in art and physics, I would want articles that are about art or physics, not necessarily both).

Getting to Art OR Physics, without allowing one to dominate the other is a bit difficult. I'm having trouble thinking of a math function we could apply to make the scores comparable without having prior knowledge such as the max score for each side. For more background see https://cwiki.apache.org/confluence/display/LUCENEJAVA/ScoresAsPercentages

If we don't want Art to influence Physics, or vice versa, the most direct route might be to issue two separate queries and merge the result sets ourselves. On the upside, we already calculate and cache morelike queries for individual pages, both at the edge caches (n.b. we need to make sure the requests match how article recommendations are requested, so they hit the same cache) and inside CirrusSearch. That means making both requests in parallel and interleaving the results in JavaScript should often return results faster than issuing a query that has to go all the way to the search cluster.

> @MMiller_WMF any thoughts about that? If you set arts + physics, which would you prefer? A mix of very typical art articles and very typical physics articles, or articles that are related to both?
> Although I guess another consideration is that when we do something similar for ORES scores, we want to expose those as a search keyword, and there it would be pretty cool if we could return Symmetry for an abouttopic:art|physics query.
> ...or maybe it's possible to use max for abouttopic:art|physics but sum for abouttopic:art abouttopic:physics? That seems the most intuitive to me, and independent keywords are summed anyway, right?

We can certainly define a variant that uses max(); the problem will be how the scores work out. If our top 3 results for morelikethis:art have scores 86, 85, 84 and the top 3 scores for morelikethis:physics are 156, 148, 140, then it doesn't matter that we selected the highest-scoring sub-query per document, because in the end dozens of physics articles score higher than the first art result. To go over the options:

  • morelikethis:A|B: Treat concatenation of A and B as a single document and determine the 25 "most important" words. Search for those words and require at least 30% of those words to exist in any result.
  • morelikethis:A morelikethis:B: Determine 25 words for each document separately. All result documents must match 30% of the words selected on both sides. Scores are the sum of each individual word that matched from all queries. The same word could be selected on both sides and will be counted twice.
  • proposed (syntax tbd) morelikethis_dismax:A|B: Determine 25 words for each document separately. Result documents only need to match a single sub-query. Final score is the sum of word-matching scores for one of the two provided source articles. As mentioned, if one of the sub-queries returns higher scores than the other, it will win. I would expect it to be rare that two different morelikethis queries return results in the same numerical range; it is heavily dependent on the words that are selected and how common they are in our corpus.
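The scoring difference between the sum and dismax variants can be illustrated numerically. The per-sub-query scores below are toy values in the ranges quoted above, not real Lucene output:

```python
def score_sum(art, physics):
    # morelikethis:A morelikethis:B -- sub-query scores are added,
    # and the document must match 30% of the words on both sides.
    return art + physics

def score_dismax(art, physics):
    # proposed morelikethis_dismax:A|B -- the best sub-query wins.
    return max(art, physics)

# Physics sub-query scores run much higher than art ones, so with
# dismax a mid-pack physics article still outranks the top art article.
best_art_result = score_dismax(86, 0)       # top art result: 86
mid_physics_result = score_dismax(0, 140)   # 3rd physics result: 140
```

Here `best_art_result < mid_physics_result`, which is exactly the dominance problem: picking the max per document does nothing to equalize the score ranges of the two sub-queries.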

> @Tgr -- I think I prefer "a mix of very typical art articles and very typical physics articles". I think that trying to do the "AND" is more fancy than our users will realize or benefit from. If we somehow do it easily, that's a bonus -- but I wouldn't want us to try to actually be able to find articles that are related to both, say, physics and fashion.

Unfortunately the fancy part is creating a balanced OR statement; the AND case seems more tractable. As mentioned above, it sounds like the best course of action for your desired results is to issue two separate search queries and display some subset of both result sets.

Notes on testing morelikethis variations:

Experimenting with this is a bit verbose, but here are some basic instructions. Feel free to set up some time and we can work through this together and find something reasonable.

  • take the __main__.query object from https://en.wikipedia.org/wiki/?search=morelikethis:Art|Physics&cirrusDumpQuery
    • The highlight object can be dropped, it's not necessary for current needs and is quite verbose.
    • The rescore object can be dropped, or could be retained (and possibly tuned) if we want to include page popularity adjustments in the final score.
  • Submit the query to our replica in cloud (only accessible from wmf cloud instances) by putting the query in a file (here named test.q) and running the following:
curl -XGET -H 'Content-Type: application/json' https://cloudelastic.wikimedia.org:8243/enwiki_content/page/_search -d @test.q | jq '.hits.hits | map(._source.title)'

A few example queries:

  • P10306: Currently deployed morelikethis:Art|Physics for enwiki prod. Produces a single set of words for both source documents and searches for them.
  • P10307: Currently deployed morelikethis:Art morelikethis:Physics. Produces two sets of words and separately adds the scores together. All results must match 30% of the selected words from each set.
  • P10308: max(morelikethis:Art, morelikethis:Physics). Scores each word set and chooses per-article the word set that has the highest score. Because the physics scores are so much higher, this is still basically just physics results.
  • P10309: max(sigmoid(morelikethis:Art), sigmoid(morelikethis:Physics)). Because the sigmoid is a monotonic function (it will not change the order), this returns the same results as above. At a high level, instead of max(120, 80) we get max(0.95, 0.91).
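The monotonicity point in the P10309 example can be checked directly: applying the sigmoid to each sub-query score before taking the max never changes which sub-query wins, so the ranking is identical. A sketch using the x^a / (x^a + m^a) sigmoid from earlier in the thread, with illustrative score values:

```python
def sigmoid(x, m=25, a=3):
    # x^a / (x^a + m^a) is strictly increasing in x for x, m, a > 0,
    # so it preserves the relative order of any two raw scores.
    return x ** a / (x ** a + m ** a)

raw = [(120, "physics article"), (80, "art article")]

raw_winner = max(raw)[1]
squashed_winner = max((sigmoid(score), title) for score, title in raw)[1]
# max(120, 80) and max(sigmoid(120), sigmoid(80)) pick the same document,
# which is why P10309 returns the same results as P10308.
```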
Tgr added a comment (edited). Thu, Feb 13, 1:15 AM

Thanks for the detailed writeup and instructions @EBernhardson!

> Unfortunately the fancy part is creating a balanced OR statement; the AND case seems more tractable. As mentioned above, it sounds like the best course of action for your desired results is to issue two separate search queries and display some subset of both result sets.

We do that currently (do multiple queries and interleave the results), but if the user selects a lot of topics it can result in a hundred separate queries, which blows up.

Luckily, I think this problem goes away now that we'll apply the threshold on the score upload side. Since for the Growth use case we only care about filtering to a set of tasks with a reasonably good match to the selected topics, and not so much about sorting up the best matches within those topics, we can just use a single similarity search with sort=random - no hundred separate queries (which I imagine would be problematic even if they would be sub-queries within a single ES query), no bias in the results, but still an acceptable match quality (assuming the thresholds discussed in T244297 work out well).