
Bump threshold for confidence score on link recommendation service suggestions
Open, Medium, Public

Description

The default threshold for generating a link suggestion is 0.5. We can consider raising this to 0.6 or 0.7. That would have the following effects:

  • The suggestions presented to the end user will have a higher likelihood of being good quality links, and will be less likely to be reverted.
  • For each article, the link recommendation service will identify fewer phrases as link suggestions (e.g. instead of 5 phrases, it might find 1 or 2).
    • It's hard to say how many fewer suggestions we would get per article. If we wanted to find out, we could write a fairly straightforward script to iterate over cached link recommendations in the database and gather statistics about the confidence score for each suggestion.
  • Because an article needs at least two suggestions to be considered a candidate link recommendation task, there will be fewer articles in the task queue, and/or it will take longer to repopulate the task queue for each wiki.
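The statistics script suggested above could look roughly like the minimal sketch below. It assumes the cached recommendations have already been loaded into a dict mapping each article title to the confidence scores of its suggested links (the loading step, this data shape, and the function name `threshold_impact` are hypothetical), and it assumes a score exactly equal to the threshold is kept.

```python
from statistics import mean

# Hypothetical input shape: article title -> confidence scores of its cached link suggestions.
Recommendations = dict[str, list[float]]

MIN_LINKS_PER_TASK = 2  # an article needs at least this many suggestions to become a task


def threshold_impact(recs: Recommendations, threshold: float) -> dict[str, float]:
    """Summarize what a given score threshold would leave in the task pool."""
    # Count, per article, the suggestions whose score meets the threshold.
    kept = [sum(1 for score in scores if score >= threshold) for scores in recs.values()]
    # Articles that would still satisfy the minimum-links requirement.
    qualifying = [k for k in kept if k >= MIN_LINKS_PER_TASK]
    return {
        "threshold": threshold,
        "mean_links_per_article": mean(kept) if kept else 0.0,
        "qualifying_articles": len(qualifying),
    }


if __name__ == "__main__":
    sample = {
        "Article A": [0.52, 0.58, 0.63, 0.71, 0.90],
        "Article B": [0.51, 0.55, 0.66],
        "Article C": [0.74, 0.81],
    }
    for t in (0.5, 0.6, 0.7):
        print(threshold_impact(sample, t))
```

Running this per wiki for 0.5, 0.6 and 0.7 would show both effects listed above: fewer links per article, and fewer articles that still meet the two-link minimum.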

Acceptance Criteria

  1. The threshold for link suggestions is set at a higher value: 0.6
  2. Run revalidateLinkRecommendations.php on the affected wikis

Completion checklist

Functionality

  • The patches have been code reviewed and merged
  • The task passes its acceptance criteria

Engineering

  • There are existing and passing unit/integration tests
  • Tests for every involved patch should pass
  • Coverage for every involved project should have improved or stayed the same

Design & QA

  • If the task is UX/Design related: it must be reviewed and approved by the UX/Design team
  • Must be reviewed and approved by Quality Assurance.

Documentation

  • Related and updated documentation done where necessary
  • Announce it where needed (mostly in the Growth Newsletter)

Event Timeline

I think we should consider moving forward with this as a way to address some of the Patroller concerns about Add a Link.

My main question is whether we should simply decide on one confidence score for all wikis, configure it automatically based on wiki size, or make it easier to configure via Growth Configuration. It might be nice for a community that is seeing too many poor "Add a Link" edits to be able to raise the confidence score. But it could also lead to too few suggestions if the confidence score is set too high.

We can set it per-wiki now by modifying MediaWiki:NewcomerTasks.json and setting minimumLinkScore: 0.6 in the JSON. I am not sure that we should add this to Special:EditGrowthConfig as it is not especially intuitive to users and also takes time to take effect.
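The precedence described here (a per-wiki `minimumLinkScore` overriding the global default) can be sketched as follows. This assumes the parsed MediaWiki:NewcomerTasks.json is a JSON object keyed by task type, with `minimumLinkScore` nested under a `link-recommendation` entry; that layout and the helper name `effective_min_link_score` are assumptions for illustration, not the extension's actual code.

```python
import json

DEFAULT_MINIMUM_LINK_SCORE = 0.6  # proposed new default (previously 0.5)


def effective_min_link_score(newcomer_tasks_json: str) -> float:
    """Return the wiki's minimumLinkScore override, or the default if none is set.

    Assumes (hypothetically) a layout like:
    {"link-recommendation": {"minimumLinkScore": 0.7, ...}, ...}
    """
    config = json.loads(newcomer_tasks_json)
    task_config = config.get("link-recommendation", {})
    return float(task_config.get("minimumLinkScore", DEFAULT_MINIMUM_LINK_SCORE))
```

A wiki with no override falls back to the default, so bumping the default changes every wiki except those that explicitly set their own value.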

It's probably safe enough to simply bump the default to 0.6 across all wikis and keep an eye on task pool sizes at https://grafana.wikimedia.org/d/vGq7hbnMz/special-homepage-and-suggested-edits?orgId=1&from=now-7d&to=now&forceLogin. If that looks good, we could then further increase to 0.7.

We could also write a script to get a breakdown for each wiki of where each cached suggestion currently falls in the threshold limits (e.g. 20% of suggestions are within 0.5-0.6, 30% are within 0.6-0.7 range, etc), but I am not sure that this would be especially helpful in predicting the impact of changing the default to 0.6 or 0.7.
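Such a per-wiki breakdown could be computed with a small helper like the one sketched here, assuming the confidence scores of all cached suggestions for one wiki have already been collected into a list (the collection step and the name `score_buckets` are hypothetical). It reports the percentage of scores in each 0.1-wide range.

```python
import math
from collections import Counter


def score_buckets(scores: list[float]) -> dict[str, float]:
    """Percentage of confidence scores in each 0.1-wide bucket, e.g. '0.5-0.6'."""
    if not scores:
        return {}
    counts: Counter[int] = Counter()
    for s in scores:
        # Round before flooring so a score like 0.6 lands in 0.6-0.7 rather than
        # 0.5-0.6 because of binary floating-point representation error.
        counts[math.floor(round(s * 10, 6))] += 1
    total = len(scores)
    return {
        f"{i / 10:.1f}-{(i + 1) / 10:.1f}": 100 * n / total
        for i, n in sorted(counts.items())
    }
```

For example, `score_buckets([0.55, 0.6, 0.65, 0.72])` yields `{"0.5-0.6": 25.0, "0.6-0.7": 50.0, "0.7-0.8": 25.0}`.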

Thanks for the info! Ok, it sounds like we should bump to 0.6 and monitor. Then we could consider increasing to 0.7, or simply reserve that as an option for wikis that request further improvements to the algorithm.

We would need to run revalidateLinkRecommendations.php on the affected wikis; otherwise it will take forever for the change to take effect. We should probably add a validation option to the script that checks the "cheap" task type properties (link score, minimum links per task), and maybe even updates the recommendation by filtering out below-threshold links if there are enough links left to do that.
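The proposed validation option could follow logic along these lines. This is a rough sketch only: it assumes a cached recommendation can be treated as a list of (link text, score) pairs and that the threshold and minimum link count come from the wiki's task type configuration. The function name and return convention are hypothetical and not part of revalidateLinkRecommendations.php.

```python
from typing import Optional

Link = tuple[str, float]  # (link text, confidence score); hypothetical shape


def revalidate(links: list[Link], min_link_score: float,
               min_links_per_task: int) -> Optional[list[Link]]:
    """Apply the "cheap" task type checks to one cached recommendation.

    Returns the recommendation with below-threshold links filtered out if
    enough links remain, or None if the task no longer qualifies and should
    be removed from the task pool.
    """
    kept = [(text, score) for text, score in links if score >= min_link_score]
    if len(kept) < min_links_per_task:
        return None
    return kept
```

Filtering in place keeps tasks that still have enough good links, so the pool shrinks less than it would if every affected recommendation were discarded and regenerated.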

@KStoller-WMF is this task something that we should prioritize doing in the next week or two?

It's not urgent, but I agree this is a task that we should work on soon.

Change 832639 had a related patch set uploaded (by Kosta Harlan; author: Kosta Harlan):

[mediawiki/extensions/GrowthExperiments@master] LinkRecommendationTaskType: Raise score threshold to 0.6

https://gerrit.wikimedia.org/r/832639

Change 832639 merged by jenkins-bot:

[mediawiki/extensions/GrowthExperiments@master] LinkRecommendationTaskType: Raise score threshold to 0.6

https://gerrit.wikimedia.org/r/832639

Etonkovidova closed this task as Resolved. (Edited Mon, Jan 9, 11:03 PM)
Etonkovidova added a subscriber: Etonkovidova.

Over quite a few deployments, no regression was noticed in the pool size of suggested links or in any other area. I looked at several wikis (e.g. at https://grafana.wikimedia.org/d/vGq7hbnMz/special-homepage-and-suggested-edits?orgId=1&from=now-30d&to=now&viewPanel=31): some wikis have a declining pool size, but their task counts are still sufficiently high, and several wikis have recovered from their drop in the number of tasks.

Don't we want to run the revalidate script, though? Some (maybe most) cached tasks were probably generated under the old 0.5 threshold.

kostajh moved this task from QA to Ready for Development on the Growth-Team (Current Sprint) board.

Yeah, that was the second checkmark in the task description.

I would also be interested in a maintenance script that could pull the cached entries and provide some aggregate data to us about the metadata for the cache entries, like the confidence score, the dataset used to generate the recommendation, number of links, etc.
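The aggregation part of such a maintenance script might look like this sketch. It assumes each cached entry has already been decoded into a record holding the dataset identifier and the per-link confidence scores; the `CacheEntry` fields and the `summarize` function are hypothetical and do not reflect the actual cache format.

```python
from collections import Counter
from dataclasses import dataclass
from statistics import mean


@dataclass
class CacheEntry:
    """Hypothetical view of one cached link recommendation and its metadata."""
    page_title: str
    dataset: str          # identifier of the dataset used to generate the recommendation
    scores: list[float]   # confidence score of each suggested link


def summarize(entries: list[CacheEntry]) -> dict:
    """Aggregate per-entry metadata across the whole cache."""
    all_scores = [s for e in entries for s in e.scores]
    return {
        "entries": len(entries),
        "by_dataset": dict(Counter(e.dataset for e in entries)),
        "mean_links_per_entry": mean(len(e.scores) for e in entries) if entries else 0.0,
        "min_score": min(all_scores, default=None),
        "mean_score": mean(all_scores) if all_scores else None,
    }
```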