Background
As of writing, refreshLinkRecommendations iterates through all article topics and performs the following steps for each of them:
1. Request a batch of 500 random articles belonging to that topic.
2. For each article, attempt to generate an Add Link recommendation. On success, add it to the task pool.
3. If at least one new task was generated from the last batch of 500 articles, go back to step 1. Otherwise, consider the topic exhausted and move on to the next topic in the list.
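The steps above can be sketched as follows. This is only an illustrative simulation; the helper names (fetch_random_batch, try_generate) are hypothetical stand-ins, not the actual GrowthExperiments code:

```python
import random

def refresh_topic(fetch_random_batch, try_generate, task_pool, batch_size=500):
    """Sketch of the current per-topic refresh: keep requesting random
    batches until a batch produces no new task, then treat the topic
    as exhausted."""
    while True:
        new_tasks = 0
        for article in fetch_random_batch(batch_size):
            recommendation = try_generate(article)
            if recommendation is not None and article not in task_pool:
                task_pool[article] = recommendation
                new_tasks += 1
        if new_tasks == 0:
            break  # whole batch yielded nothing new: topic considered exhausted

# Invented example: 20 articles, only even-numbered IDs yield a recommendation.
random.seed(0)
articles = list(range(20))
pool = {}
refresh_topic(
    lambda size: random.sample(articles, min(size, len(articles))),
    lambda a: f"rec-{a}" if a % 2 == 0 else None,
    pool,
    batch_size=5,
)
```

Note that with a small batch size relative to the topic, the loop can stop while viable articles remain, which is exactly the "unlucky batch" problem described below.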
This is done under the assumption that if none of the 500 articles yields a recommendation, then there are simply no viable suggestions left in that topic. However, that is not necessarily true – we can be unlucky and, out of the 10,000 articles the topic has, receive 500 that indeed have no recommendations (even though the next batch would have). The batch-based stopping rule is also used because we do not know which of the random batches is the last one.
Originally, this method was likely selected to control how many recommendations the task pool contains for each topic. Unfortunately, this method does not really work, as each article can be (and usually is) in more than one topic. If most articles in the africa topic are about notable Africans, then getting more tasks for the africa topic also means getting more tasks for the biography topic. This results in significant differences between various topics. For example, the smallest non-empty topic at eswiki (the architecture topic) has a single task, while the biography topic has over 10,000 articles (the threshold is set to 2,000 articles per topic).
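The overlap effect can be illustrated with made-up data (the article-to-topic mapping below is invented purely for the example):

```python
from collections import Counter

# Invented example data: each article belongs to one or more topics.
article_topics = {
    "Wangari Maathai": {"africa", "biography"},
    "Chinua Achebe": {"africa", "biography"},
    "Mount Kilimanjaro": {"africa"},
    "Marie Curie": {"biography"},
}

# Turning every africa article into a task also counts toward biography,
# so per-topic totals cannot be controlled independently.
per_topic = Counter(
    topic for topics in article_topics.values() for topic in topics
)
print(per_topic["africa"], per_topic["biography"])  # 3 3
```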
Problem
Within this task, we should implement task pool refreshing logic that does not involve iterating across topics. We need to both evaluate the options and implement the chosen one. Several options are listed below.
Options
- Iterate over all articles ordered by their page ID
- Iterate over all articles randomly (for example, using the page_random column)
- Iterate over articles ordered by their last edit timestamp (perhaps excluding articles that were edited too recently)
- Something else
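For the second option, the idea behind the page_random column can be mimicked with a stable pseudo-random key (this helper is only a sketch, not MediaWiki code): the order looks random, but is reproducible, so a run can resume from a cursor value without revisiting articles.

```python
import random

def page_random(page_id):
    """Stand-in for the stored page_random value: a stable pseudo-random
    float derived deterministically from the page ID."""
    return random.Random(page_id).random()

def random_order(page_ids):
    """Iterate articles in a random-looking but reproducible order, so a
    cursor ('continue after this page_random value') can resume a run
    without revisiting articles."""
    return sorted(page_ids, key=page_random)

ids = list(range(1, 11))
assert random_order(ids) == random_order(ids)  # same order on every run
```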
In all cases, we would need to introduce a new threshold: the desired total task pool size. A good starting value might be 500*<topic count> (we have 39 topics, so around 20,000 in total). Considering the recent bump (T386248), it might make sense to set it even higher (50,000 or 100,000).
Iterating in a stable sort order would let us simplify the code considerably (we would not need to worry about iterating over an article twice). However, it might skew the distribution of the task pool content in a particular manner (for example, ordering by page ID favors the oldest articles, and ordering by last edit favors recently edited ones), which is something we should attempt to avoid as much as possible.
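A minimal sketch of the stable-iteration approach with a resumable cursor, assuming a hypothetical fetch_ids_after query helper (this is not the actual implementation):

```python
def refresh_pool(fetch_ids_after, try_generate, pool_target, cursor=0):
    """Walk articles in ascending page-ID order, remembering the last ID
    processed so a resumed run never revisits an article. Stops once the
    desired total pool size is reached or the wiki is exhausted."""
    pool = []
    while len(pool) < pool_target:
        batch = fetch_ids_after(cursor, 500)
        if not batch:
            break  # no articles left: the wiki is exhausted
        for page_id in batch:
            cursor = page_id
            recommendation = try_generate(page_id)
            if recommendation is not None:
                pool.append((page_id, recommendation))
                if len(pool) >= pool_target:
                    break
    return pool, cursor

# Invented example: page IDs 1..100, every third article yields a task.
ids = list(range(1, 101))
pool, cursor = refresh_pool(
    lambda after, limit: [i for i in ids if i > after][:limit],
    lambda i: f"rec-{i}" if i % 3 == 0 else None,
    pool_target=10,
)
print(len(pool), cursor)  # 10 30
```

Returning the cursor alongside the pool is what makes the stable ordering pay off: the next run can pass it back in and continue exactly where the previous one stopped.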
Final solution
To be determined.