
Newcomer tasks: Morelike backend for topic matching
Closed, Resolved (Public)

Description

Because the timeline to use ORES models (ticketed here) for newcomer tasks is longer than initially planned, we want to roll out an initial version that uses the "morelike" algorithm first prototyped in T231506: Newcomer tasks: prototype topic matching.

We need to set up the ability to back the interface with morelike results, drawing on the articles lists that the ambassadors generated in T233465: Newcomer tasks: article configurations for topics.


@kostajh's original task description

This task proposes that we build the backend for topic matching that uses the "morelike" search we experimented with, and eventually declined to pursue in favor of the ORES backed solution.

I'm proposing this task for a few reasons:

  1. Implementing the morelike backend is really straightforward based on our current code
  2. Having a backend in place allows us to build the topic filter widgets using actual data rather than faking it
  3. Adding the backend seems prudent while we sort out all of the un/known un/knowns with the ORES approach
  4. Even once the ORES backed approach is ready, it might be interesting to pursue as a variant test for some period of time
  5. The most difficult and time-consuming part of this, gathering articles to use for each "topic", was already done by the ambassadors for our target wikis (except for viwiki, where they would still have to do it)

Details

Related Gerrit Patches:
mediawiki/extensions/GrowthExperiments : wmf/1.35.0-wmf.14 : Suggested Edits: Use classic_noboostlinks for morelike query
mediawiki/extensions/GrowthExperiments : wmf/1.35.0-wmf.14 : Newcomer tasks: Log search errors in task backend
mediawiki/extensions/GrowthExperiments : wmf/1.35.0-wmf.14 : Add test for LocalSearchTaskSuggester
mediawiki/extensions/GrowthExperiments : master : Suggested Edits: Use classic_noboostlinks for morelike query
mediawiki/extensions/GrowthExperiments : wmf/1.35.0-wmf.14 : Newcomer tasks: Expose task type / topic set in API parameter info
mediawiki/extensions/GrowthExperiments : master : Newcomer tasks: Expose task type / topic set in API parameter info
mediawiki/extensions/GrowthExperiments : master : Newcomer tasks: Log search errors in task backend
mediawiki/extensions/GrowthExperiments : master : Newcomer tasks: Add test for API parameter info
mediawiki/extensions/GrowthExperiments : master : Add test for LocalSearchTaskSuggester
mediawiki/extensions/GrowthExperiments : master : Add morelike-based topic matching functionality to the backend

Event Timeline

kostajh created this task. Dec 11 2019, 10:03 PM
Tgr added a comment. Dec 11 2019, 10:21 PM

Also, it can be used easily on a development/test setup with local search (it just requires setting up CirrusSearch plus some config), while ORES would probably be a pain to set up for, e.g., the beta cluster wikis. (Also, if we want to have Selenium tests for filtering at some point, all the services involved need to be local for that configuration.)

MMiller_WMF updated the task description. Dec 18 2019, 2:05 AM

We definitely want to do this, so I've edited the task description. I also created T241021: Newcomer tasks: article configurations for topics (viwiki) so that we can have article lists for Vietnamese Wikipedia.

Tgr claimed this task. Dec 19 2019, 3:44 AM

@MMiller_WMF do we plan to use the same topic labels (e.g. the ones defined here https://www.mediawiki.org/wiki/Growth/Personalized_first_day/Newcomer_tasks/Prototype/topics/cs.json) when we switch to the ORES backend (T240517)? If so, are we going to have a mapping of the topics we show in the UI and the topics that ORES provides?

@kostajh -- for this initial morelike version, I think we should stick with the same topic labels that we already have in the JSON. They will need to change when we switch to ORES because the ORES taxonomy is in the process of changing via T240286: Re-train English Wikipedia topic model using new WikiProject Taxonomy.

That new ORES taxonomy will probably have something like 60 leaf nodes, and once it has settled down, I will list the logic for how we should combine those nodes into the topics we actually want to display. This is probably a good time to re-confirm that we'll be able to do that: store the fully detailed ORES taxonomy in Elasticsearch and then have our own logic in our feature to do something like "when the user selects the 'Africa geography' topic in the UI, return articles that have a score above 0.95 on 'East African geography', 'North African geography', 'West African geography', or 'South African geography' in ORES".

How does this sound?
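
For illustration, the combination rule described above could look roughly like the following. This is only a sketch under assumptions: the mapping table, the function name, and the shape of the score dictionary are hypothetical, and the leaf topic names are taken from the example in the comment rather than from a settled ORES taxonomy.

```python
from typing import Dict, List

# Hypothetical mapping from a topic shown in the UI to ORES leaf topics.
# The names come from the example above, not from a final taxonomy.
DISPLAY_TOPICS: Dict[str, List[str]] = {
    "Africa geography": [
        "East African geography",
        "North African geography",
        "West African geography",
        "South African geography",
    ],
}

# Threshold from the example above ("a score above 0.95").
SCORE_THRESHOLD = 0.95


def matches_display_topic(
    display_topic: str,
    ores_scores: Dict[str, float],
    threshold: float = SCORE_THRESHOLD,
) -> bool:
    """Return True if any leaf topic of the display topic scores above the threshold."""
    leaves = DISPLAY_TOPICS.get(display_topic, [])
    return any(ores_scores.get(leaf, 0.0) > threshold for leaf in leaves)
```

An article whose only high score is on 'East African geography' would match 'Africa geography', because the leaf topics are combined with OR.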

Change 559646 had a related patch set uploaded (by Gergő Tisza; owner: Gergő Tisza):
[mediawiki/extensions/GrowthExperiments@master] [WIP] Add morelike-based topic matching functionality to the backend

https://gerrit.wikimedia.org/r/559646

Tgr added a comment. Edited Jan 3 2020, 3:07 AM

I'm moving this to code-review because the patch does what it says, and 95% of it is generic topic matching machinery which we'll need in any case (the morelike logic itself is super simple).

But I've run into a problem: when using the morelikethis: search, it seems the results are not in random order anymore, so every user would get the same search results. (Which makes sense - morelike is basically a sorting algorithm so it probably overrides other sort criteria.) I'll talk with the Search team and see what can be done about that.

@Tgr -- I agree that it is important for the results to come back randomly so that newcomers are not all working on the same articles. Would it be possible to randomize them in a separate step after they are returned? How much slower would that cause things to be? Or would it be possible to just calculate all these lists (topic x difficulty) in advance and randomize them? Maybe @kostajh thought about this when he made the morelike prototype in T231506: Newcomer tasks: prototype topic matching.

But I've run into a problem: when using the morelikethis: search, it seems the results are not in random order anymore, so every user would get the same search results. (Which makes sense - morelike is basically a sorting algorithm so it probably overrides other sort criteria.) I'll talk with the Search team and see what can be done about that.

There's not much to be done. It returns results based on a score, with the more relevant results at the top of the result set. I would suggest we take the results that are returned and randomize their order on the client side or in our API module.
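
A minimal sketch of that suggestion, assuming the search results arrive as a plain list of page titles (the function name and the input shape are hypothetical, not the extension's actual code):

```python
import random
from typing import List, Optional, Sequence


def shuffle_tasks(tasks: Sequence[str], rng: Optional[random.Random] = None) -> List[str]:
    """Return the score-ordered results in random order, leaving the input untouched."""
    rng = rng or random.Random()
    shuffled = list(tasks)  # copy, so the relevance-ordered list is preserved
    rng.shuffle(shuffled)
    return shuffled
```

Note that this only reorders whatever the search returned. It does not change which articles end up in the result set.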

It's also worth noting that unlike with a purely task type based set of tasks, where there might be 1500-2000 articles that a user could choose from, once topics are introduced the result set becomes much, much smaller. IIRC from the prototyping work, we end up with 20-150 articles in many cases, so collisions are going to be more likely (but probably not that common).

Tgr added a comment. Jan 3 2020, 9:29 PM

Elasticsearch always returns results based on a score; random sorting just makes it use a random scorer. But there are ways to combine multiple scorers (definitely using the CirrusSearch interface, but I think even with the default search API, using an extension-defined search profile), and it's probably also possible to randomize via boosting or rescoring. I'm just not sure if those are easy enough to be viable for something we don't mean to use for very long anyway.

The result set size is a fair point; although one of the more promising ways to speed up loading would be to use much smaller limits (fetching 10 tasks at a time instead of 250), in which case fetching random results vs. randomizing on our side would still make a difference. But yeah, I guess in the short term giving the same 250 results in random order to everyone with the same filter set would be good enough, and if the total resultset is below 250 it's not possible to do better than that anyway.

Or would it be possible to just calculate all these lists (topic x difficulty) in advance and randomize them?

Possible, but I would expect that to cause far more problems than it would solve; precaching brings all kinds of complexity around invalidation ("two hard things") and storage costs.

Tgr added a comment. Jan 3 2020, 9:35 PM

I'm just not sure if those are easy enough to be viable for something we don't mean to use for very long anyway.

On second thought, there really isn't anything in this problem that's specific to our temporary morelike backend plans. ORES topic matching will probably also be implemented as a score and consequently also have the same problem. (I think I even brought this up in the past somewhere, and then forgot about it.) T240559 has some discussion on that.

But yeah, I guess in the short term giving the same 250 results in random order to everyone with the same filter set would be good enough, and if the total resultset is below 250 it's not possible to do better than that anyway.

So, I think it might make sense to wait and see what the result sets look like once we have topic matching implemented before attempting to do anything further. If we find that there are a bunch of topics that end up having more than ~200 results then we could look at the multiple scorers approach you suggested.

when using the morelikethis: search, it seems the results are not in random order anymore, so every user would get the same search results.

Filed as T242057: Newcomer tasks: keep task list non-deterministic once topic matching is introduced.

Change 562407 had a related patch set uploaded (by Gergő Tisza; owner: Gergő Tisza):
[mediawiki/extensions/GrowthExperiments@master] Newcomer tasks: Log search errors in task backend

https://gerrit.wikimedia.org/r/562407

Change 559646 merged by jenkins-bot:
[mediawiki/extensions/GrowthExperiments@master] Add morelike-based topic matching functionality to the backend

https://gerrit.wikimedia.org/r/559646

Change 562667 had a related patch set uploaded (by Gergő Tisza; owner: Gergő Tisza):
[mediawiki/extensions/GrowthExperiments@master] Add test for LocalSearchTaskSuggester

https://gerrit.wikimedia.org/r/562667

Change 562635 had a related patch set uploaded (by Gergő Tisza; owner: Gergő Tisza):
[mediawiki/extensions/GrowthExperiments@master] Newcomer tasks: Expose task type / topic set in API parameter info

https://gerrit.wikimedia.org/r/562635

Change 562653 had a related patch set uploaded (by Gergő Tisza; owner: Gergő Tisza):
[mediawiki/extensions/GrowthExperiments@master] Newcomer tasks: Add test for API parameter info

https://gerrit.wikimedia.org/r/562653

Tgr added a comment. Jan 8 2020, 1:42 AM

(None of the remaining patches block v1.1.)

Change 562667 merged by jenkins-bot:
[mediawiki/extensions/GrowthExperiments@master] Add test for LocalSearchTaskSuggester

https://gerrit.wikimedia.org/r/562667

I just +2'ed the patch to add task type / topic in API param info. To use it, go to Special:ApiSandbox#action=query&format=json&list=growthtasks and click inside the task types and topics fields; you'll then be able to select valid options. Note that we still need to do some set up in beta (and prod, which this won't be live in until next Thursday) for this to work.

Change 562635 merged by jenkins-bot:
[mediawiki/extensions/GrowthExperiments@master] Newcomer tasks: Expose task type / topic set in API parameter info

https://gerrit.wikimedia.org/r/562635

Change 562653 merged by jenkins-bot:
[mediawiki/extensions/GrowthExperiments@master] Newcomer tasks: Add test for API parameter info

https://gerrit.wikimedia.org/r/562653

Change 562407 merged by jenkins-bot:
[mediawiki/extensions/GrowthExperiments@master] Newcomer tasks: Log search errors in task backend

https://gerrit.wikimedia.org/r/562407

@kostajh @Tgr @Catrope -- when I try this on Czech or Korean Beta, I am able to select a task type from the dropdown, but the topic field remains free text, with no choices. Does something need to be changed?

I'm here: https://cs.wikipedia.beta.wmflabs.org/wiki/Speci%C3%A1ln%C3%AD:API_p%C3%ADskovi%C5%A1t%C4%9B#action=query&format=json&list=growthtasks

@kostajh @Tgr @Catrope -- when I try this on Czech or Korean Beta, I am able to select a task type from the dropdown, but the topic field remains free text, with no choices. Does something need to be changed?
I'm here: https://cs.wikipedia.beta.wmflabs.org/wiki/Speci%C3%A1ln%C3%AD:API_p%C3%ADskovi%C5%A1t%C4%9B#action=query&format=json&list=growthtasks

You need to wait until the train deployment has finished tomorrow (Thursday), as Kosta said here:

Note that we still need to do some set up in beta (and prod, which this won't be live in until next Thursday) for this to work.

Tgr added a comment. Jan 8 2020, 6:54 PM

@MMiller_WMF I forgot to mention this in the discussion: the topic names are currently not translatable (they are read from the configuration page, so the four Growth wikis would show them in their own language, but if you visit cswiki and set your interface language to English for example, they'd still show up in Czech). We originally planned to switch to ORES topics very soon, and expected that to result in a different set of topics, so it seemed like a waste of translator time to translate the temporary names. If I misunderstood and / or you want to reconsider that, it would be a trivial change to add translatable strings.

@Tgr -- I think that is fine for now. The ambassadors will be doing the testing in their own languages anyway. I wonder whether Google Translate will be able to read the buttons and translate them in the browser.

Change 563287 had a related patch set uploaded (by Catrope; owner: Gergő Tisza):
[mediawiki/extensions/GrowthExperiments@wmf/1.35.0-wmf.14] Newcomer tasks: Expose task type / topic set in API parameter info

https://gerrit.wikimedia.org/r/563287

@kostajh @Tgr -- could you tell me some of the specifications of the morelike backend, to help guide us testing it? Here are some questions I have, which may make you think of other things to list:

  • Is it using "classic_noboostlinks"?
  • When a topic has multiple seed articles, is it searching all of those titles together, or is it searching them separately and combining the results?
  • Is it returning all results? Or is it capped at some number?

Tgr added a comment. Jan 10 2020, 12:36 AM

  • Is it using "classic_noboostlinks"?

It uses the default settings (I couldn't figure out from the source code what those might be), as it did before the morelike component was added.

  • When a topic has multiple seed articles, is it searching all of those titles together, or is it searching them separately and combining the results?

All together. Doing tasktypes x topics separate searches is not really manageable.

  • Is it returning all results? Or is it capped at some number?

Capped at 250. The API does return the number of total results, although we don't show it anywhere currently.

Change 563287 merged by jenkins-bot:
[mediawiki/extensions/GrowthExperiments@wmf/1.35.0-wmf.14] Newcomer tasks: Expose task type / topic set in API parameter info

https://gerrit.wikimedia.org/r/563287

Mentioned in SAL (#wikimedia-operations) [2020-01-10T00:45:08Z] <catrope@deploy1001> Synchronized php-1.35.0-wmf.14/extensions/GrowthExperiments/: Expose tasktype/topic API parameter info (T240512) (duration: 01m 01s)

kostajh removed Tgr as the assignee of this task. Jan 10 2020, 2:35 PM
kostajh moved this task from QA to Ready for Development on the Growth-Team (Current Sprint) board.

All together. Doing tasktypes x topics separate searches is not really manageable.

We do need to make a related adjustment to this, though, which is that if I search for "arts" and "philosophy" with "expand" task type, we should be performing two searches: hastemplate:"Pahýl část" morelike:"Umění"|"Malířství"|"Výtvarné umění"|"Múzická umění" and hastemplate:"Pahýl část" morelike:"Filosofie"|"Současná filosofie v Česku".

Instead, our backend runs a single query like hastemplate:"Pahýl část" morelike:"Umění"|"Malířství"|"Výtvarné umění"|"Múzická umění"|"Filosofie"|"Současná filosofie v Česku", which is kind of like saying "Show me articles about art and philosophy". But per the spec for T238610: Newcomer tasks: include topics in intro overlay, we should be using OR logic.
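
To make the difference concrete, here is a rough sketch of the two ways of building the query strings. The helper names and the dictionary of seed articles are hypothetical; only the query syntax is copied from the examples above.

```python
from typing import Dict, List


def morelike_clause(titles: List[str]) -> str:
    """Build a morelike: clause with quoted titles joined by '|', as in the examples above."""
    return "morelike:" + "|".join(f'"{title}"' for title in titles)


def combined_query(template: str, topics: Dict[str, List[str]]) -> str:
    """One query with the seed articles of all selected topics in a single morelike clause."""
    all_titles = [title for titles in topics.values() for title in titles]
    return f'hastemplate:"{template}" ' + morelike_clause(all_titles)


def per_topic_queries(template: str, topics: Dict[str, List[str]]) -> List[str]:
    """One query per selected topic; the results would need to be merged afterwards."""
    return [
        f'hastemplate:"{template}" ' + morelike_clause(titles)
        for titles in topics.values()
    ]
```

With the "arts" and "philosophy" seed lists above, per_topic_queries yields the two proposed searches, while combined_query yields the single query the backend currently sends.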

I spotted this while building out the topic filter dialog in T238612.

Moving this task back to ready for development.

As far as I can tell that's not how it works:

  • morelikethis:Art - 905K results, starting with The arts, Aesthetics, Work of art, The Story of Art, Feminist aesthetics, Visual arts, Primitivism, Fine art, Formalism (art), Neuroesthetics
  • morelikethis:Physics - 70K results, starting with Outline of physics, Branches of physics, Theoretical physics, Quantum mechanics, History of physics, Quantum gravity, Theory of everything, Outline of physical science, Mechanics, List of effects
  • morelikethis:Art|Physics - 1M results (so slightly more than the previous two together), starting with Branches of physics, Quantum mechanics, Classical mechanics, Interpretations of quantum mechanics, Mechanics, Theoretical physics, History of physics, Louis de Broglie, Outline of physics, Philosophy of physics (order does not matter, Physics|Art would give the same result). So this is OR as far as the full result set goes, but the topics are not weighted equally. If I had to guess, morelike search scores matches by inverse frequency, and there are fewer physics articles, so those are more valuable. Or maybe the physics entries are more densely packed with physics terms of art than the art entries are with art-specific words.
  • morelikethis:Art morelikethis:Physics - 63K results, starting with Branches of physics, Outline of physics, Theoretical physics, History of physics, Quantum mechanics, Mechanics, Classical mechanics, Outline of physical science, Natural science, Physicist. So this is AND, which makes sense as Cirrus generally ANDs search keywords together. The scoring probably shows the same effects as before.
  • morelikethis:Art OR morelikethis:Physics (60K results) - this just gives a bunch of results with a literal "or" in them. Which is surprising because the docs claim Cirrus supports OR (somewhat), but that does not seem to be the case here.

If we want more OR-like behavior then we'd have to go down to the ElasticSearch level, I think.

Tgr added a comment. Jan 10 2020, 10:34 PM

Erik pointed to the cirrusDumpResult flag. Using that, morelikethis:Art shows top scores around 85 and morelikethis:Physics around 150. So when the scores from those two search terms are added, the "art factor" ends up being relatively irrelevant.

Tgr added a comment. Jan 10 2020, 11:09 PM

As far as I can tell that's not how it works:

I asked about the precise logic. If you search for morelike:A|B|C|X|Y|Z, Cirrus will select the top 50 words that it thinks represent that article set best, and search for those words (OR-ed). With morelike:A|B|C morelike:X|Y|Z it will be (top 50 A/B/C words OR-ed) AND (top 50 X/Y/Z words OR-ed). So I think the current logic is more appropriate.

I filed T242476: Newcomer tasks: when selecting multiple topics, one topic should not dominate over the others, to look into longer-term solutions.

Hmm, thanks for digging into this @Tgr. It might still be worth making the change I proposed so that we avoid the errors when your query is over 300 characters long, which on Czech happens with ~5 topics and the default filters enabled.

Tgr added a comment. Jan 12 2020, 11:26 PM

If we put them into completely separate search requests, we'll end up with task types x topics requests (that's up to 100 currently). That seems crippling, unless we maybe selectively omit some of them when too many options are checked, or limit how many options the user can check in the first place. (That said, if we go for a Tuesday deploy, for the time being a slow response is better than erroring out...) I filed T242560: Newcomer tasks: task suggestions fail because of search queries exceeding length limits so we have a dedicated task.
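
The trade-off can be put into numbers with a small sketch. The 300-character limit is the one mentioned earlier in this thread; the function names are hypothetical, and counting characters with len() is an assumption about how the limit is measured.

```python
from typing import Sequence

# Query length limit mentioned earlier in this thread (errors above 300 characters).
MAX_QUERY_LENGTH = 300


def separate_request_count(task_types: Sequence[str], topics: Sequence[str]) -> int:
    """Number of searches needed if every task type / topic pair gets its own request."""
    return len(task_types) * len(topics)


def exceeds_length_limit(query: str, limit: int = MAX_QUERY_LENGTH) -> bool:
    """True if a combined query string would be longer than the limit."""
    return len(query) > limit
```

A caller could, for example, send the combined query when it fits and fall back to separate requests only when exceeds_length_limit returns True, which keeps the request count low in the common case.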

Change 563992 had a related patch set uploaded (by Kosta Harlan; owner: Kosta Harlan):
[mediawiki/extensions/GrowthExperiments@master] Suggested Edits: Use classic_noboostlinks for morelike query

https://gerrit.wikimedia.org/r/563992

Change 563992 merged by jenkins-bot:
[mediawiki/extensions/GrowthExperiments@master] Suggested Edits: Use classic_noboostlinks for morelike query

https://gerrit.wikimedia.org/r/563992

Change 564159 had a related patch set uploaded (by Catrope; owner: Gergő Tisza):
[mediawiki/extensions/GrowthExperiments@wmf/1.35.0-wmf.14] Add test for LocalSearchTaskSuggester

https://gerrit.wikimedia.org/r/564159

Change 564160 had a related patch set uploaded (by Catrope; owner: Gergő Tisza):
[mediawiki/extensions/GrowthExperiments@wmf/1.35.0-wmf.14] Newcomer tasks: Log search errors in task backend

https://gerrit.wikimedia.org/r/564160

Change 564163 had a related patch set uploaded (by Catrope; owner: Kosta Harlan):
[mediawiki/extensions/GrowthExperiments@wmf/1.35.0-wmf.14] Suggested Edits: Use classic_noboostlinks for morelike query

https://gerrit.wikimedia.org/r/564163

Change 564159 merged by jenkins-bot:
[mediawiki/extensions/GrowthExperiments@wmf/1.35.0-wmf.14] Add test for LocalSearchTaskSuggester

https://gerrit.wikimedia.org/r/564159

Change 564160 merged by jenkins-bot:
[mediawiki/extensions/GrowthExperiments@wmf/1.35.0-wmf.14] Newcomer tasks: Log search errors in task backend

https://gerrit.wikimedia.org/r/564160

Change 564163 merged by jenkins-bot:
[mediawiki/extensions/GrowthExperiments@wmf/1.35.0-wmf.14] Suggested Edits: Use classic_noboostlinks for morelike query

https://gerrit.wikimedia.org/r/564163

Etonkovidova closed this task as Resolved. Jan 23 2020, 5:23 PM
Etonkovidova claimed this task.