
Surfacing structured tasks: Populate Add Link suggestions for more articles
Closed, Resolved · Public · 5 Estimated Story Points

Description

Background

The Add Link task pool is maintained by the refreshLinkRecommendations.php maintenance script, which runs in the background. This script attempts to ensure at least 500 Add Link suggestions for each article topic. If this threshold isn't met for a given topic, it iterates over all articles in that topic and attempts to convert each into an Add Link suggestion until the 500 limit is reached.
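As a rough illustration of the top-up behaviour described above, here is a minimal Python sketch. It is not the GrowthExperiments PHP code; `refresh_topic`, `pool`, `candidates`, and `recommend` are hypothetical names, and only the 500 threshold comes from the description.

```python
from typing import Callable, Dict, Iterable, Optional, Set

MIN_TASKS_PER_TOPIC = 500  # threshold named in the task description


def refresh_topic(
    topic: str,
    pool: Dict[str, Set[str]],
    candidates: Iterable[str],
    recommend: Callable[[str], Optional[list]],
    minimum: int = MIN_TASKS_PER_TOPIC,
) -> int:
    """Top up one topic until it holds `minimum` tasks or candidates run out.

    Returns the number of tasks added in this run.
    """
    tasks = pool.setdefault(topic, set())
    added = 0
    for title in candidates:
        if len(tasks) >= minimum:
            break  # threshold met, nothing more to do for this topic
        if title in tasks:
            continue  # already has a queued suggestion
        links = recommend(title)  # hypothetical call to the link recommender
        if links:  # None or an empty list means no usable suggestion
            tasks.add(title)
            added += 1
    return added
```

The sketch makes the limitation visible: once a topic reaches the threshold, no further articles in it are evaluated.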

Problem

Having 500 articles per topic is sufficient when viewing articles via Special:Homepage, as users are looking at a task queue in that context. However, it is not sufficient for the Surfacing Structured Edits project, where we want to invite users to Add Link editing from within the reading mode, as users can be reading any article. Within this task, we should ensure as many articles as possible have a link recommendation queued. We should run a trial at one wiki and evaluate the success of that model.

Event Timeline

Restricted Application added a subscriber: Aklapper.
Urbanecm_WMF renamed this task from Create a proof of concept solution for populating Add Link suggestions for all articles to Populate Add Link suggestions for all articles. Oct 29 2024, 5:50 PM
Urbanecm_WMF renamed this task from Populate Add Link suggestions for all articles to Surfacing structured tasks: Populate Add Link suggestions for all articles.
KStoller-WMF moved this task from Inbox to Up Next (estimated tasks) on the Growth-Team board.
KStoller-WMF raised the priority of this task from Medium to High. Dec 4 2024, 5:49 PM

I'm pondering this task and how to optimize the logic of refreshing tasks beyond raising the limit of having 500 per topic. Some brainstorm-y thoughts:

  • can we somehow store the information that we did not find any suggestions for a page? Currently we just plain do nothing in that case and we then try again the next time around. It would be nice to be able to skip that extra work by somewhere (page prop? cirrus search weighted tag? GE-specific db entry? ...?) storing that for the combination of page-revisionID + community-config-hash (+ model number?) we did not have any suggestion, so we do not need to try again.
  • can we somehow intentionally request link-suggestions for articles without an articletopic? There was little point for that when showing tasks on the homepage, because there we have topics, but it might make sense when volunteers can discover articles with suggestions on their own
  • the limit of "don't show a task for a page that has been edited in the last X days" should be very simple, at least for link-suggestions, because there we do that already with self::FIELD_MIN_TIME_SINCE_LAST_EDIT => ExpirationAwareness::TTL_DAY, in LinkRecommendationTaskType, and we just have to change that to whatever value we want (either in the defaults or somehow in the specific settings)
  • currently, the script stops asking for more titles in a topic if it did not find any new link-suggestions in the last batch of 500 candidate titles. Maybe we could/should also do that differently, because a topic might not actually be exhausted at that point.
  • Would it make sense to track performance metrics in LinkRecommendationUpdater::evaluateTitle to get an idea how many articles we're filtering out for the various reasons?
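To make the first idea in the list above more concrete, here is a minimal Python sketch of a "no suggestion found" record keyed by revision ID, a hash of the configuration, and a model version. Everything here is an assumption for illustration: the class, the in-memory set (a real store could be any of the options listed above), and the `min_link_score` key are hypothetical.

```python
import hashlib
import json
from typing import Dict, Set, Tuple


def config_hash(config: Dict) -> str:
    """Stable short hash of a configuration dict (keys sorted for determinism)."""
    payload = json.dumps(config, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:16]


class NoSuggestionCache:
    """Remembers (revision, config, model) combinations that yielded nothing."""

    def __init__(self) -> None:
        self._seen: Set[Tuple[int, str, str]] = set()

    def _key(self, revision_id: int, config: Dict, model_version: str) -> Tuple[int, str, str]:
        return (revision_id, config_hash(config), model_version)

    def mark_empty(self, revision_id: int, config: Dict, model_version: str) -> None:
        self._seen.add(self._key(revision_id, config, model_version))

    def should_skip(self, revision_id: int, config: Dict, model_version: str) -> bool:
        # Skip only if nothing relevant changed since the last empty result.
        return self._key(revision_id, config, model_version) in self._seen
```

Because the key includes the revision ID, a new edit to the page naturally invalidates the record, and a configuration or model change does the same.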
KStoller-WMF renamed this task from Surfacing structured tasks: Populate Add Link suggestions for all articles to Surfacing structured tasks: Populate Add Link suggestions for more articles. Dec 7 2024, 11:42 PM
KStoller-WMF updated the task description.

I adjusted the title slightly to make it clear that we don't expect to have link suggestions for ALL articles. That's probably not a scalable approach, and there are many articles that are edit-protected and that newcomers should avoid. But ideally we are able to explore ways to populate more "add a link" suggestions.

I would suggest using Spanish Wikipedia as a pilot for this, as it is one of Growth's main pilot wikis included in the alpha test.

KStoller-WMF set the point value for this task to 5. Dec 10 2024, 5:18 PM

I'm pondering this task and how to optimize the logic of refreshing tasks beyond raising the limit of having 500 per topic. Some brainstorm-y thoughts:

  • can we somehow store the information that we did not find any suggestions for a page? Currently we just plain do nothing in that case and we then try again the next time around. It would be nice to be able to skip that extra work by somewhere (page prop? cirrus search weighted tag? GE-specific db entry? ...?) storing that for the combination of page-revisionID + community-config-hash (+ model number?) we did not have any suggestion, so we do not need to try again.

Seems like a good thing to do, but it isn't needed beforehand to prove that the impression rate of suggestions increases. I'd leave it as a follow-up.

  • can we somehow intentionally request link-suggestions for articles without an articletopic? There was little point for that when showing tasks on the homepage, because there we have topics, but it might make sense when volunteers can discover articles with suggestions on their own

Maybe we could introduce different refresh strategies and alternate between them when running the script. I remember Peter Pelberg suggesting looking up suggestions by article pageviews, which makes sense in this case. Generating suggestions for all articles was already considered in T307902: Assess database requirements for link recommendations reading entry point, and the DB size boundaries were deemed reasonable. Would it make sense to fetch suggestions at random, without any articletopic constraint? Or by page ID, storing a cursor to the last processed batch? If we could target a smaller wiki than our current pilot wikis, I'd give it a try.
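The page ID cursor idea could look roughly like the following Python sketch. It assumes the cursor is simply the last processed page ID and that the script wraps around to the start once the end is reached; `next_batch` and its parameters are hypothetical, and a real implementation would query the page table instead of sorting a list.

```python
from typing import List, Tuple


def next_batch(page_ids: List[int], cursor: int, batch_size: int) -> Tuple[List[int], int]:
    """Return up to `batch_size` page IDs greater than `cursor`, plus the new cursor.

    When no IDs remain, the batch is empty and the cursor resets to 0 so the
    next run starts over from the beginning.
    """
    batch = [pid for pid in sorted(page_ids) if pid > cursor][:batch_size]
    new_cursor = batch[-1] if batch else 0
    return batch, new_cursor
```

Each script run would load the stored cursor, process one or more batches, and persist the returned cursor for the next run.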

  • the limit of "don't show a task for a page that has been edited in the last X days" should be very simple, at least for link-suggestions, because there we do that already with self::FIELD_MIN_TIME_SINCE_LAST_EDIT => ExpirationAwareness::TTL_DAY, in LinkRecommendationTaskType, and we just have to change that to whatever value we want (either in the defaults or somehow in the specific settings)
  • currently, the script stops asking for more titles in a topic if it did not find any new link-suggestions in the last batch of 500 candidate titles. Maybe we could/should also do that differently, because a topic might not actually be exhausted at that point.

Agreed
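One possible variation on the stopping rule, sketched in Python under stated assumptions: instead of giving up on a topic after a single batch with no new suggestions, tolerate a configurable number of consecutive empty batches. `fetch_until_exhausted`, `find_new`, and `max_empty_batches` are hypothetical names; with `max_empty_batches=1` the sketch mirrors the current behaviour as described in the quoted bullet.

```python
from typing import Callable, Iterable, List


def fetch_until_exhausted(
    batches: Iterable[List[str]],
    find_new: Callable[[List[str]], int],
    max_empty_batches: int = 1,
) -> int:
    """Process candidate batches until too many in a row yield nothing new.

    Returns the total number of new suggestions found.
    """
    total = 0
    empty_streak = 0
    for batch in batches:
        found = find_new(batch)  # number of new suggestions from this batch
        total += found
        empty_streak = empty_streak + 1 if found == 0 else 0
        if empty_streak >= max_empty_batches:
            break  # treat the topic as exhausted
    return total
```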

  • Would it make sense to track performance metrics in LinkRecommendationUpdater::evaluateTitle to get an idea how many articles we're filtering out for the various reasons?

Agreed
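A minimal Python sketch of what such tracking might count, assuming each evaluation returns either a rejection reason or nothing when the title is accepted. The reason strings and the `evaluate` callable are hypothetical; the actual LinkRecommendationUpdater::evaluateTitle is PHP and would report through whatever metrics backend is in use.

```python
from collections import Counter
from typing import Callable, Iterable, Optional


def evaluate_titles(
    titles: Iterable[str],
    evaluate: Callable[[str], Optional[str]],
) -> Counter:
    """Tally outcomes per rejection reason; None from `evaluate` means accepted."""
    stats: Counter = Counter()
    for title in titles:
        reason = evaluate(title)
        stats[reason or "accepted"] += 1
    return stats
```

The resulting counts show at a glance which filter discards the most candidates.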

I took a look at the task. It appears the number of minimum tasks per topic we want to have is (partially) Community configurable already. LinkRecommendationTaskType loads the data from Community Configuration, but SuggestedEditsSchema doesn't know about minimumTasksPerTopic. In other words, if minimumTasksPerTopic was defined in Community Configuration, GrowthExperiments would respect that value, but this can never happen, because minimumTasksPerTopic is not mentioned in the schema. This makes

The easiest path forward here could be to add minimumTasksPerTopic to the schema, and make a Community Configuration edit. That would also add the value to the configuration editor form, which might not be what we want (cc @KStoller-WMF for thoughts). Add Link already has options that are harder to understand (such as the link score), and adding new hard-to-understand config options doesn't seem like the best idea.

FWIW, this is the case for other Add Link related configuration options as well, such as FIELD_MIN_TIME_SINCE_LAST_EDIT (mentioned by @Michael in his comment above).

If we want to not expose this configuration via Community Configuration (and I'd say that would be reasonable), we might need to update AbstractDataConfigurationLoader::parseTaskTypesFromConfig to merge Community Configuration data from a different data source (server side config, maybe).
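A sketch of that merge in Python, assuming server-side values always win for keys that should not be community-editable. This is not the AbstractDataConfigurationLoader code; `merge_task_type_config`, `server_only_keys`, and `exampleCommunityOption` are hypothetical, and only `minimumTasksPerTopic` comes from the discussion.

```python
from typing import Dict, Iterable


def merge_task_type_config(
    community: Dict,
    server: Dict,
    server_only_keys: Iterable[str],
) -> Dict:
    """Combine Community Configuration data with server-side settings.

    Keys listed in `server_only_keys` are taken only from `server`; any value
    the community config carries for them is ignored.
    """
    merged = dict(community)
    for key in server_only_keys:
        merged.pop(key, None)  # drop any community-supplied value
        if key in server:
            merged[key] = server[key]
    return merged
```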

Regarding iterating over all articles rather than over topics, I haven't yet made up my mind on whether that would be a good approach. It would definitely be reasonable if we want suggestions for all or most articles, but otherwise, the topic distribution (however imperfect) seems potentially useful wrt Special:Homepage users. Iterating through articles at random might be a better approach (letting each script run compute N suggestions). In that case, we would probably need to figure out where to store "no recommendation".

The easiest path forward here could be to add minimumTasksPerTopic to the schema, and make a Community Configuration edit. That would also add the value to the configuration editor form, which might not be what we want (cc @KStoller-WMF for thoughts). Add Link already has options that are harder to understand (such as the link score), and adding new hard-to-understand config options doesn't seem like the best idea.

I'm torn on this. On one hand, I generally support taking the simplest path and empowering communities to make their own configuration decisions—unless there's a clear downside. On the other hand, adding minimumTasksPerTopic to the configuration form does introduce additional complexity. (I'm struggling to identify a clear community need for adjusting minimumTasksPerTopic, apart from cases where communities might want to limit the number of tasks available. Let me know if you can think of others).

My indecisive answer is that I'm not opposed to adding the minimumTasksPerTopic to the schema, especially if it's a lot easier. However I'm in support of taking the longer approach if others feel strongly here.

I'm also hesitant about the idea of adding this to CommunityConfiguration. Is the number of tasks available really something that makes sense for the community to edit? To me that is something where the main trade-offs lie in how our infrastructure is utilized, and that is not something where the community is the main stakeholder. So my suggestion would be to either change it in PHP config or, maybe even simpler, just change the default values in code.
But maybe I'm missing something?

On the other hand, FIELD_MIN_TIME_SINCE_LAST_EDIT does sound like something that could make sense as part of community configuration. But maybe not as a link-specific setting, but rather as a configuration for all the suggested edits? It probably also does not make sense to send newcomers to copy-edit an article involved in an active edit war.
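For reference, the check behind such a setting is simple. The Python sketch below assumes timestamps in seconds and a one-day default (86400 seconds, matching TTL_DAY); `is_eligible` and its parameters are hypothetical names, not the GrowthExperiments implementation.

```python
TTL_DAY = 86400  # one day in seconds


def is_eligible(last_edit_ts: float, now: float, min_seconds: int = TTL_DAY) -> bool:
    """True if at least `min_seconds` have passed since the page's last edit."""
    return now - last_edit_ts >= min_seconds
```

Making `min_seconds` a shared setting rather than a link-specific one would let the same check apply to every suggested-edit type.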

I agree that keeping minimumTasksPerTopic in community configuration doesn't really make sense, as the community is not the one (most) impacted by the change (the infrastructure is). I filed T383714: Move minimumTasksPerTopic from CommunityConfiguration to PHP configuration to change that. Once the config is moved, we will be able to adjust it for the pilot wikis.

Moving off-sprint, as this requires T383714 to be completed first.

Change #1113840 had a related patch set uploaded (by Cyndywikime; author: Cyndywikime):

[mediawiki/extensions/GrowthExperiments@master] Move link recommendation minimum tasks per topic to PHP configuration

https://gerrit.wikimedia.org/r/1113840

Change #1115791 had a related patch set uploaded (by Urbanecm; author: Urbanecm):

[operations/mediawiki-config@master] [Growth] Increase minimum tasks per topic to 2000 for eswiki, frwiki

https://gerrit.wikimedia.org/r/1115791

Change #1115791 merged by jenkins-bot:

[operations/mediawiki-config@master] [Growth] Increase minimum tasks per topic to 2000 for eswiki, frwiki

https://gerrit.wikimedia.org/r/1115791

Mentioned in SAL (#wikimedia-operations) [2025-02-04T08:04:07Z] <urbanecm@deploy2002> Started scap sync-world: Backport for [[gerrit:1113984|Add configurable MinimumTasksPerTopic (T383714)]], [[gerrit:1115791|[Growth] Increase minimum tasks per topic to 2000 for eswiki, frwiki (T378527)]]

Mentioned in SAL (#wikimedia-operations) [2025-02-04T08:13:17Z] <urbanecm@deploy2002> urbanecm, cyndywikime: Backport for [[gerrit:1113984|Add configurable MinimumTasksPerTopic (T383714)]], [[gerrit:1115791|[Growth] Increase minimum tasks per topic to 2000 for eswiki, frwiki (T378527)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)

Mentioned in SAL (#wikimedia-operations) [2025-02-04T08:30:04Z] <urbanecm@deploy2002> Finished scap sync-world: Backport for [[gerrit:1113984|Add configurable MinimumTasksPerTopic (T383714)]], [[gerrit:1115791|[Growth] Increase minimum tasks per topic to 2000 for eswiki, frwiki (T378527)]] (duration: 25m 56s)

Mentioned in SAL (#wikimedia-operations) [2025-02-04T09:38:07Z] <urbanecm> mwmaint2002: Kill mediawiki_job_growthexperiments-refreshLinkRecommendations-s6[6640] to pick new config (T378527)

Urbanecm_WMF edited projects, added Growth-Team (Current Sprint); removed Growth-Team.

Not really blocked, but ongoing. We are now collecting more add link recommendations on frwiki and eswiki, and we will need to take a look at the charts to determine whether that should be done for more wikis.

I decided to split the work into separate tasks that would be more clear and easier to take action on:

As far as I know, there isn't really anything else that would need to be completed here. Closing as Resolved. Feel free to reopen if you think this needs further work (and/or more tasks).