
Rewrite refreshLinkRecommendations to not iterate through article topics
In Progress · Medium · Public · 5 Estimated Story Points

Description

Background

As of writing, refreshLinkRecommendations iterates through all article topics and for each of them, it does the following steps:

  1. Requests a batch of 500 random articles belonging to that topic
  2. For each article, it attempts to generate an Add Link recommendation. If it succeeds, it adds it to the task pool.
  3. If at least one new task was generated in the last batch of 500 articles, it moves back to step 1. Otherwise, it considers the topic exhausted, and moves to the next topic in the list.

This is done under the assumption that if none of the 500 articles yields a viable suggestion, then there are simply no viable suggestions in that topic. However, that isn't necessarily true – we can be unlucky: out of the 10,000 articles the topic has, we may receive 500 that indeed have no recommendation (while the next batch would). It is also done because we do not know which of the random batches is the last one.
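For clarity, the per-topic loop described above could be sketched roughly like this (all helper names are hypothetical stand-ins for the real task-suggester machinery):

```python
def refresh_link_recommendations(topics, get_random_batch, try_generate_task,
                                 batch_size=500):
    """Sketch of the current per-topic refresh loop.

    get_random_batch(topic, n) and try_generate_task(article) are
    hypothetical stand-ins for the real task-suggester calls.
    """
    pool = set()
    for topic in topics:
        while True:
            batch = get_random_batch(topic, batch_size)
            # Articles in this batch that yielded a new task.
            new_tasks = [a for a in batch
                         if a not in pool and try_generate_task(a)]
            pool.update(new_tasks)
            # A single batch without new tasks marks the topic as exhausted --
            # which can be a false negative when the topic has many more articles.
            if not new_tasks:
                break
    return pool
```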

Originally, this method was likely selected to control how many recommendations are in the task pool for each topic. Unfortunately, it does not really work, as each article can be (and usually is) in more than one topic. If most articles in the africa topic are about notable Africans, then getting more tasks for the africa topic also means getting more tasks for the biography topic. This results in significant differences between topics. For example, the smallest non-empty topic at eswiki (architecture) has a single task, while the biography topic has over 10k tasks (the threshold is set to 2k tasks per topic).

Problem

Within this task, we should implement task pool refreshing logic that does not involve iterating across topics. We will need to both think through the options and implement the final choice. Several options are listed below.

Options
  1. Iterate over all articles ordered by their page ID
  2. Iterate over all articles randomly (for example, using the page_random column)
  3. Iterate over articles ordered by their last edit timestamp (perhaps excluding articles that were edited too recently)
  4. Something else

In all cases, we would need to introduce a new threshold (the desired total task pool size). A good starting value might be 500 * <topic count> (we have 39 topics, so around 20,000 in total). Considering the recent bumping (T386248), it might make sense to set it even higher (50,000 or 100,000).

Iterating in a stable order would allow us to simplify the code a lot (we would not need to worry about iterating over an article twice). However, it might skew the distribution of the task pool content in a particular manner, which is something we should avoid as much as possible.
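As a rough sketch of option 1, iterating by page ID could use keyset pagination, so no OFFSET is needed; `fetch_batch` is a hypothetical stand-in for the actual DB query:

```python
def iterate_all_pages(fetch_batch, process, batch_size=500):
    """Sketch of option 1: iterate over all articles in page-ID order.

    fetch_batch(after_id, n) stands in for a query like
        SELECT page_id FROM page
        WHERE page_id > after_id ORDER BY page_id LIMIT n
    """
    last_id = 0
    while True:
        batch = fetch_batch(last_id, batch_size)
        if not batch:
            break  # reached the end of the table
        for page_id in batch:
            process(page_id)
        # Keyset pagination: continue from the last seen ID.
        last_id = batch[-1]
```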

Final solution

To be determined.

Related Objects

Event Timeline

Restricted Application added a subscriber: Aklapper.
Urbanecm_WMF added a subscriber: KStoller-WMF.

Hi @KStoller-WMF, this is the task we discussed during Backlog Refinement. I'm moving it to Backlog for now, but maybe it is something we should consider for the next sprint as well? Curious to hear your thoughts.

KStoller-WMF moved this task from Backlog to Up Next (estimated tasks) on the Growth-Team board.

Let's move it to "Up Next" so we have a chance to discuss it in estimation. I really like this idea, but I'm unclear on the effort of the various options.

KStoller-WMF set the point value for this task to 5. · Feb 18 2025, 5:55 PM

My goal for now is to write a very quick Proof of Concept of my current idea, so that we can talk about that further.

Change #1121387 had a related patch set uploaded (by Michael Große; author: Michael Große):

[mediawiki/extensions/GrowthExperiments@master] poc(AddLink): iterate through pages to refresh LinkRecommendations

https://gerrit.wikimedia.org/r/1121387

The Proof-of-Concept change "poc(AddLink): iterate through pages to refresh LinkRecommendations" is ready for thoughts on the overall approach. Does it make sense? Did I forget something fundamental? What concerns do you have? Do you have a different approach that I should consider?

Obviously, the change will need much more polishing, tests, etc.

I talked about this task with @Tgr yesterday, which gave me some new thoughts. One of the things we discussed is the duration of the script, as it might now be forced to go across all articles on a Wikipedia. For a very large project (such as enwiki), this might be a significant problem. This might be resolved by making the script only work on a set amount of articles per each execution (making it reasonably fast to finish), continuing where we left off on a following run. The handover point (between the two runs) might be stored in a cache (such as MainStash, which is DB backed).

This makes me think whether it wouldn't make sense to rewrite the script so that it generates N random batches of articles, excludes from each batch any articles where we know Add Link recommendations don't exist, and attempts to generate a recommendation for the rest. That wouldn't allow us to (easily) generate recommendations for all articles in a project, but it would help with not biasing the pool by article age that much.

Another thought Gergo had on this was to store the "no link recommendation" state in Search (as a weighted tag). That way, we would be able to (directly) search for tasks that neither have a recommendation nor a "no recommendation available" flag. Is that something we should consider in this round of rewriting?

> Another thought Gergo had on this was to store the "no link recommendation" state in Search (as a weighted tag). That way, we would be able to (directly) search for tasks that neither have a recommendation nor a "no recommendation available" flag. Is that something we should consider in this round of rewriting?

A good idea that would make the current process more efficient, but it still leaves the fundamental problem that we can't meaningfully paginate through the entire search index, because we have a 10K limit on the offset.

> This makes me think whether it wouldn't make sense to rewrite the script so that it generates N random batches of articles, excludes from each batch any articles where we know Add Link recommendations don't exist, and attempts to generate a recommendation for the rest. That wouldn't allow us to (easily) generate recommendations for all articles in a project, but it would help with not biasing the pool by article age that much.

True, but if our goal is to have recommendations for all pages that can have them, then this will keep having us fall frustratingly short of that goal.

> One of the things we discussed is the duration of the script, as it might now be forced to go across all articles on a Wikipedia. For a very large project (such as enwiki), this might be a significant problem. This might be resolved by making the script only work on a set amount of articles per each execution (making it reasonably fast to finish), continuing where we left off on a following run. The handover point (between the two runs) might be stored in a cache (such as MainStash, which is DB backed).

That sounds doable. What would a sensible limit be? Currently, we have ca. 40 topics, and for some wikis we process at least 500 articles per topic (if available), and that seems to be fine. That would put us at 20,000 as a known safe number of articles to process. Does that sound sensible? It probably also makes sense to track how long each invocation of the script takes per wiki.
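A bounded, resumable run as discussed above might look roughly like this; `stash` stands in for MainStash (here just a dict), the cursor key name is made up for illustration, and `fetch_batch` is as in the page-ID iteration:

```python
def run_once(stash, fetch_batch, process, max_pages=5000, batch_size=500):
    """Sketch of one bounded invocation that resumes from a stored cursor.

    fetch_batch(after_id, n) stands in for a keyset-paginated page query;
    stash stands in for MainStash (DB-backed cache).
    """
    cursor_key = 'refreshLinkRecommendations-cursor'  # hypothetical key name
    last_id = stash.get(cursor_key, 0)
    processed = 0
    while processed < max_pages:
        batch = fetch_batch(last_id, min(batch_size, max_pages - processed))
        if not batch:
            last_id = 0  # wrapped around; the next run starts from the top
            break
        for page_id in batch:
            process(page_id)
        processed += len(batch)
        last_id = batch[-1]
    stash[cursor_key] = last_id  # handover point for the next invocation
    return processed
```

Each invocation processes at most `max_pages` pages and records where it stopped, so repeated runs eventually cover the whole wiki without any single run having to.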

T347603: Expose search_after in SearchEngine has some older discussion about the CirrusSearch limit.

> Another thought Gergo had on this was to store the "no link recommendation" state in Search (as a weighted tag). That way, we would be able to (directly) search for tasks that neither have a recommendation nor a "no recommendation available" flag. Is that something we should consider in this round of rewriting?

> A good idea that would make the current process more efficient, but it still leaves the fundamental problem that we can't meaningfully paginate through the entire search index, because we have a 10K limit on the offset.

Not necessarily. If we only process up to 10k pages at a time, and if we have the "no link recommendation" state in Search, then we should be able to (eventually) process the whole wiki by requesting 10k of random unprocessed articles at a time. We wouldn't be able to do that in a single execution of the script, but we don't really need to either.
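The Search-based variant sketched in this exchange could look roughly like this; all helper names are hypothetical, and the "no recommendation" flag stands in for the proposed weighted tag in the search index:

```python
def refresh_from_search(search_unprocessed, try_generate,
                        mark_no_recommendation, limit=10000):
    """Sketch: ask Search for up to `limit` random articles that have
    neither a recommendation nor a 'no recommendation' flag, and record
    the outcome for each, so later searches skip them.
    """
    generated = 0
    for page in search_unprocessed(limit):
        if try_generate(page):
            generated += 1
        else:
            # Stored as e.g. a weighted tag, so the next search skips the page.
            mark_no_recommendation(page)
    return generated
```

Because every processed page ends up flagged one way or the other, repeated bounded runs shrink the unprocessed set until the whole wiki is covered.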

> This makes me think whether it wouldn't make sense to rewrite the script so that it generates N random batches of articles, excludes from each batch any articles where we know Add Link recommendations don't exist, and attempts to generate a recommendation for the rest. That wouldn't allow us to (easily) generate recommendations for all articles in a project, but it would help with not biasing the pool by article age that much.

> True, but if our goal is to have recommendations for all pages that can have them, then this will keep having us fall frustratingly short of that goal.

If implemented alone, yes, I agree with that reservation. If implemented together with some of the other ideas (such as the Search one), it might be able to fulfil that goal just as well. That being said, I recognise adding the recommendations to Search (and ensuring they are reasonably in sync) would be more work than originally assumed here, and we might want to decide to be iterative here.

> One of the things we discussed is the duration of the script, as it might now be forced to go across all articles on a Wikipedia. For a very large project (such as enwiki), this might be a significant problem. This might be resolved by making the script only work on a set amount of articles per each execution (making it reasonably fast to finish), continuing where we left off on a following run. The handover point (between the two runs) might be stored in a cache (such as MainStash, which is DB backed).

> That sounds doable. What would a sensible limit be? Currently, we have ca. 40 topics, and for some wikis we process at least 500 articles per topic (if available), and that seems to be fine. That would put us at 20,000 as a known safe number of articles to process. Does that sound sensible? It probably also makes sense to track how long each invocation of the script takes per wiki.

No objections to that figure, but I do think we should aim for shorter and more frequent scripts rather than the other way around. Something like 5k or 10k might be better from that perspective. But we also don't really need to do that change at this time.

> Another thought Gergo had on this was to store the "no link recommendation" state in Search (as a weighted tag). That way, we would be able to (directly) search for tasks that neither have a recommendation nor a "no recommendation available" flag. Is that something we should consider in this round of rewriting?

> A good idea that would make the current process more efficient, but it still leaves the fundamental problem that we can't meaningfully paginate through the entire search index, because we have a 10K limit on the offset.

> Not necessarily. If we only process up to 10k pages at a time, and if we have the "no link recommendation" state in Search, then we should be able to (eventually) process the whole wiki by requesting 10k of random unprocessed articles at a time. We wouldn't be able to do that in a single execution of the script, but we don't really need to either.

> This makes me think whether it wouldn't make sense to rewrite the script so that it generates N random batches of articles, excludes from each batch any articles where we know Add Link recommendations don't exist, and attempts to generate a recommendation for the rest. That wouldn't allow us to (easily) generate recommendations for all articles in a project, but it would help with not biasing the pool by article age that much.

> True, but if our goal is to have recommendations for all pages that can have them, then this will keep having us fall frustratingly short of that goal.

> If implemented alone, yes, I agree with that reservation. If implemented together with some of the other ideas (such as the Search one), it might be able to fulfil that goal just as well. That being said, I recognise adding the recommendations to Search (and ensuring they are reasonably in sync) would be more work than originally assumed here, and we might want to decide to be iterative here.

This assumes that most pages either have a recommendation or lack one in a way that can be stored in the search index. I'm not sure that assumption holds when considering recently edited pages, pages with red links pruned, disambiguation pages, protected pages, etc.
Further, this introduces another tag in the search index that could drift out of sync with the database in new, additional ways, which is the reason we did not go for this solution in T382270: Store the fact that Add Link did not generate any recommendation for a page, don't try again.
Finally, I feel like this would blow up the effort required for this task into the 8-13 range at least.

> One of the things we discussed is the duration of the script, as it might now be forced to go across all articles on a Wikipedia. For a very large project (such as enwiki), this might be a significant problem. This might be resolved by making the script only work on a set amount of articles per each execution (making it reasonably fast to finish), continuing where we left off on a following run. The handover point (between the two runs) might be stored in a cache (such as MainStash, which is DB backed).

> That sounds doable. What would a sensible limit be? Currently, we have ca. 40 topics, and for some wikis we process at least 500 articles per topic (if available), and that seems to be fine. That would put us at 20,000 as a known safe number of articles to process. Does that sound sensible? It probably also makes sense to track how long each invocation of the script takes per wiki.

> No objections to that figure, but I do think we should aim for shorter and more frequent scripts rather than the other way around. Something like 5k or 10k might be better from that perspective. But we also don't really need to do that change at this time.

Mh, if we do not do this now, then by default the script would run until it has processed all the pages on the wiki, which, as I understand this conversation, is undesirable. I'm fine with 5K just as well. We probably want this limit to be a CLI parameter, so we can adjust it as we see fit.

I've implemented a version that stores and retrieves the handover point in the MainObjectStash as suggested. My main quibble is that the expiry behavior/contract is not obvious from the on-wiki docs.

> [...] it would help with not biasing the pool by article age that much. [...]

Idea: we could iterate through the table not by page_id but by page_random (and we have an index for that). That would take care of the bias.

If the bias is a big enough issue, this should be relatively simple to implement. Though we could also introduce it at a later stage.
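A sketch of the page_random variant, assuming a query like `WHERE page_random > last ORDER BY page_random LIMIT n` (the fetch helper is hypothetical):

```python
def iterate_by_page_random(fetch_batch, process, batch_size=500):
    """Sketch of iterating in page_random order instead of page-ID order.

    fetch_batch(last, n) stands in for a query like
        SELECT page_id, page_random FROM page
        WHERE page_random > last ORDER BY page_random LIMIT n
    page_random is a uniform float in [0, 1), so the visit order is stable
    but uncorrelated with article age.
    """
    last = -1.0
    while True:
        batch = fetch_batch(last, batch_size)  # [(page_id, page_random), ...]
        if not batch:
            break
        for page_id, _ in batch:
            process(page_id)
        last = batch[-1][1]
```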

Change #1121387 merged by jenkins-bot:

[mediawiki/extensions/GrowthExperiments@master] feat(AddLink): allow generating Link Recommendations for all pages

https://gerrit.wikimedia.org/r/1121387

Michael raised the priority of this task from Medium to High. · Mar 11 2025, 10:48 AM
Michael moved this task from Code Review to Doing on the Growth-Team (Current Sprint) board.

Now that we have the functional change merged, we want to switch to it on next Monday, 2025-03-17, for eswiki and cswiki to see the results.

Moving this to doing to create the config patch.

Change #1126533 had a related patch set uploaded (by Michael Große; author: Michael Große):

[operations/mediawiki-config@master] Growth: eswiki+cswiki - enable new way of refreshing LinkRecommendations

https://gerrit.wikimedia.org/r/1126533

Change #1126533 merged by jenkins-bot:

[operations/mediawiki-config@master] Growth: eswiki+cswiki - enable new way of refreshing LinkRecommendations

https://gerrit.wikimedia.org/r/1126533

Mentioned in SAL (#wikimedia-operations) [2025-03-17T13:06:22Z] <tgr@deploy2002> Started scap sync-world: Backport for [[gerrit:1128403|Revert "Disable new WebAuthn credentials creation" (T378402 T389064)]], [[gerrit:1128032|sqwiktionary: update logo, wordmark, tagline and icon (T342172)]], [[gerrit:1126533|Growth: eswiki+cswiki - enable new way of refreshing LinkRecommendations (T386250)]]

Mentioned in SAL (#wikimedia-operations) [2025-03-17T13:10:17Z] <tgr@deploy2002> tgr, migr, anzx: Backport for [[gerrit:1128403|Revert "Disable new WebAuthn credentials creation" (T378402 T389064)]], [[gerrit:1128032|sqwiktionary: update logo, wordmark, tagline and icon (T342172)]], [[gerrit:1126533|Growth: eswiki+cswiki - enable new way of refreshing LinkRecommendations (T386250)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)

Mentioned in SAL (#wikimedia-operations) [2025-03-17T13:19:49Z] <tgr@deploy2002> Finished scap sync-world: Backport for [[gerrit:1128403|Revert "Disable new WebAuthn credentials creation" (T378402 T389064)]], [[gerrit:1128032|sqwiktionary: update logo, wordmark, tagline and icon (T342172)]], [[gerrit:1126533|Growth: eswiki+cswiki - enable new way of refreshing LinkRecommendations (T386250)]] (duration: 13m 27s)

Change #1128777 had a related patch set uploaded (by Michael Große; author: Michael Große):

[operations/mediawiki-config@master] Growth: enable new way of refreshing LinkRecommendations for pilots

https://gerrit.wikimedia.org/r/1128777

Change #1128777 merged by jenkins-bot:

[operations/mediawiki-config@master] Growth: enable new way of refreshing LinkRecommendations for pilots

https://gerrit.wikimedia.org/r/1128777

Mentioned in SAL (#wikimedia-operations) [2025-03-18T20:08:58Z] <tgr@deploy2002> Started scap sync-world: Backport for [[gerrit:1128922|Edit check: set up the multi-check a/b test (T384372)]], [[gerrit:1127945|Enable VisualEditor EditCheck multi-check a/b test on test2wiki (T384372)]], [[gerrit:1128777|Growth: enable new way of refreshing LinkRecommendations for pilots (T386250)]]

Mentioned in SAL (#wikimedia-operations) [2025-03-18T20:16:07Z] <tgr@deploy2002> migr, kemayo, tgr: Backport for [[gerrit:1128922|Edit check: set up the multi-check a/b test (T384372)]], [[gerrit:1127945|Enable VisualEditor EditCheck multi-check a/b test on test2wiki (T384372)]], [[gerrit:1128777|Growth: enable new way of refreshing LinkRecommendations for pilots (T386250)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)

Mentioned in SAL (#wikimedia-operations) [2025-03-18T20:25:34Z] <tgr@deploy2002> Finished scap sync-world: Backport for [[gerrit:1128922|Edit check: set up the multi-check a/b test (T384372)]], [[gerrit:1127945|Enable VisualEditor EditCheck multi-check a/b test on test2wiki (T384372)]], [[gerrit:1128777|Growth: enable new way of refreshing LinkRecommendations for pilots (T386250)]] (duration: 16m 36s)

It should now be enabled for all Surfacing Add Link experiment wikis + cswiki. Let's move this to blocked for now and wait and see what the data looks like.

Since we enabled the new way of doing things a while ago and things are looking good so far, we can maybe take the next step and enable it on a bunch more wikis.

My suggestion, separated by DB section, would be:

Section 1:

  • enwiki?

Section 2:

  • itwiki
  • nlwiki

Section 5:

  • dewiki
  • srwiki
  • shwiki

Section 6:

  • ruwiki

Section 7:

  • cawiki
  • hewiki
  • viwiki

Change #1164287 had a related patch set uploaded (by Michael Große; author: Michael Große):

[operations/mediawiki-config@master] Growth: enable new way of refreshing LinkRecommendations for more wikis

https://gerrit.wikimedia.org/r/1164287

Michael lowered the priority of this task from High to Medium. · Jul 8 2025, 4:37 PM

Change #1164287 merged by jenkins-bot:

[operations/mediawiki-config@master] Growth: enable new way of refreshing LinkRecommendations for more wikis

https://gerrit.wikimedia.org/r/1164287

Mentioned in SAL (#wikimedia-operations) [2025-07-28T13:27:24Z] <lucaswerkmeister-wmde@deploy1003> Started scap sync-world: Backport for [[gerrit:1164287|Growth: enable new way of refreshing LinkRecommendations for more wikis (T386250 T392944)]], [[gerrit:1171548|Echo: be explicit about special wikis using Wikipedia logo (T400070)]]

Mentioned in SAL (#wikimedia-operations) [2025-07-28T13:29:21Z] <lucaswerkmeister-wmde@deploy1003> lucaswerkmeister-wmde, migr: Backport for [[gerrit:1164287|Growth: enable new way of refreshing LinkRecommendations for more wikis (T386250 T392944)]], [[gerrit:1171548|Echo: be explicit about special wikis using Wikipedia logo (T400070)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.

Mentioned in SAL (#wikimedia-operations) [2025-07-28T13:35:31Z] <lucaswerkmeister-wmde@deploy1003> Finished scap sync-world: Backport for [[gerrit:1164287|Growth: enable new way of refreshing LinkRecommendations for more wikis (T386250 T392944)]], [[gerrit:1171548|Echo: be explicit about special wikis using Wikipedia logo (T400070)]] (duration: 08m 06s)

Michael changed the task status from Open to In Progress. · Sep 29 2025, 11:11 AM

While this is still ongoing, this is not our focus this sprint. Also, we may wish to rethink our Add a Link code a bit more comprehensively, in light of other improvements becoming newly unlocked like: