Provide a mechanism to regenerate link recommendation task pool after configuration changes
Open, Needs Triage, Public

Description

Most of the changes made to Growth feature configuration via Special:EditGrowthConfig are effective immediately.

But some changes related to link recommendations only take effect after the task pool has been rebuilt. This applies to changes to:

  • minimum threshold per link recommendation
  • section names in which links should no longer be recommended
  • minimum/maximum links per task

The task pool is currently rebuilt regularly as users complete link recommendation tasks, or as non-link recommendation edits are made to relevant articles. That means it can take a long time for the pool to be completely rebuilt.

Ideally, there would be a mechanism that would allow for rapidly removing pre-configuration-change tasks and replacing them with post-configuration-change tasks. Since that's an expensive process, we need some kind of protection around who can trigger it and how often it can be done. And we'd want some kind of monitoring so users can see how far along the process is.

We currently store a hash of the dataset IDs used for creating a cached link recommendation; we should probably also store a hash of the link-recommendation settings from Special:EditGrowthConfig.
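For illustration, a minimal sketch of what such a settings hash could look like (the function name and config keys below are assumptions, not the actual GrowthExperiments names):

```php
/**
 * Sketch only: derive a stable hash of the settings that affect task
 * validity. The key names here are illustrative, not the real config keys.
 */
function getLinkRecommendationConfigHash( array $config ): string {
	$relevant = [
		'minimumScore' => $config['minimumScore'] ?? null,
		'excludedSections' => (array)( $config['excludedSections'] ?? [] ),
		'minimumLinksPerTask' => $config['minimumLinksPerTask'] ?? null,
		'maximumLinksPerTask' => $config['maximumLinksPerTask'] ?? null,
	];
	// Normalize ordering so logically identical configurations hash identically.
	sort( $relevant['excludedSections'] );
	ksort( $relevant );
	return hash( 'sha256', json_encode( $relevant ) );
}
```

Storing this hash next to the existing dataset-ID hash would let the regeneration process cheaply tell which cached recommendations predate a config change.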

Acceptance Criteria

  1. EditGrowthConfig users should be able to adjust link-recommendation settings and see the changes take effect within 24 hours
  2. Monitoring should be available to see how far along task pool regeneration is after a config change
  3. ...?
Completion checklist

Functionality

  • The patches have been code reviewed and merged
  • The task passes its acceptance criteria

Engineering

  • There are existing and passing unit/integration tests
  • Tests for every involved patch should pass
  • Coverage for every involved project should have improved or stayed the same

Design & QA

  • If the task is UX/Design related: it must be reviewed and approved by the UX/Design team
  • Must be reviewed and approved by Quality Assurance.

Documentation

  • Related and updated documentation done where necessary

Event Timeline

Editing the configuration is already limited to admins and interface editors; I don't think we need more protection than that.

I think the new functionality we need is:

  • Move most of the revalidateLinkRecommendations.php maintenance script into a shared utility class, and create a RevalidateLinkRecommendationsJob which does the same thing as the maintenance script. Alternatively, instead of revalidation, we can just drop all existing tasks (in some staggered way) and rely on the existing task generation process for replacing them.
    • Find out if a long-running job is a problem (I think it would take a few hours to 1-2 days, depending on the size of the wiki), and if so, find a way to break it up into smaller chunks. (GWToolset has an example of a parent job periodically rescheduling itself and scheduling child jobs to do chunks of the work; I'm not sure it's considered a good pattern, though. Another approach is the one taken by GlobalRename, where status is tracked in a DB table and jobs reschedule their own follow-ups until the status table is cleared. That's also not necessarily a good example, as it can be fragile; but maybe a similar approach using a maintenance script to schedule jobs would work. A minimal sketch of the self-rescheduling pattern follows this list.)
  • Trigger the job when the configuration is edited in a relevant way (e.g. the minimum threshold is raised).
  • Make sure that whenever the configuration is edited, ongoing jobs which have been triggered by a previous edit get shut down. (If the jobs run in small chunks, this might not be necessary.) Maybe possible using job deduplication?
  • Adjust the "Changes to this setting won't take effect immediately" message shown on relevant configuration form fields as needed.
  • Maybe speed up the revalidation script by making sure it only revalidates tasks when needed. In some cases this is easy (for the score threshold, we can just check the scores of each recommendation against the new threshold; we might even skip requesting a new recommendation if, after filtering out the links with a low score, we are still left with enough links); in some cases it might be too much effort to be worth it (when changing the section exclusion list, we'd have to know which section each link is in, and we'd have to update the service to add that information).
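A minimal sketch of the self-rescheduling, chunked-job pattern mentioned above, assuming the shared utility extracted from revalidateLinkRecommendations.php is called RevalidationHelper (that class, its revalidateBatch() method and the batch size are placeholders, not existing code):

```php
use MediaWiki\MediaWikiServices;

class RevalidateLinkRecommendationsJob extends Job implements GenericParameterJob {

	private const BATCH_SIZE = 100;

	public function __construct( array $params ) {
		parent::__construct( 'revalidateLinkRecommendations', $params );
	}

	public function run() {
		// Hypothetical shared utility extracted from the maintenance script.
		$helper = new RevalidationHelper();
		$offset = $this->params['offset'] ?? 0;

		// Revalidate one small batch, starting where the previous job stopped.
		$processed = $helper->revalidateBatch( $offset, self::BATCH_SIZE );

		// A full batch suggests there is more work, so the job re-queues itself;
		// a short batch means the pool has been processed and the chain stops.
		if ( $processed === self::BATCH_SIZE ) {
			MediaWikiServices::getInstance()->getJobQueueGroup()->lazyPush(
				new self( [ 'offset' => $offset + $processed ] )
			);
		}
		return true;
	}
}
```

Keeping each job small sidesteps the long-running-job question, at the cost of having to carry continuation state (here, an offset) between jobs.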

The job batching/continuing/controlling problem seems generic enough that it would benefit from wider discussion.

> Editing the configuration is already limited to admins and interface editors; I don't think we need more protection than that.

What if a well-intentioned admin makes multiple changes that trigger link recommendation regeneration in rapid succession? That seems potentially problematic.

> The job batching/continuing/controlling problem seems generic enough that it would benefit from wider discussion.

Right, we can discuss that more in T299021, and treat this task as a specific use case to fix.

> What if a well-intentioned admin makes multiple changes that trigger link recommendation regeneration in rapid succession? That seems potentially problematic.

It would require some sort of deduplication or semaphore, which is a good idea anyway, as the process might take a long time on large wikis and there is no way to tell when it's finished. Depending on the implementation used, it might or might not be easy to add. I don't think it's easy to avoid the problem via permissions, though. If we want to limit changes to developers, we just shouldn't expose it in community configuration; otherwise, I don't think an interface admin is that much less likely to make multiple changes than an admin.
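Continuing the job sketch above, MediaWiki's built-in job-queue deduplication could cover the rapid-successive-edits case; roughly (the triggeredBy parameter is a made-up example of a volatile parameter):

```php
class RevalidateLinkRecommendationsJob extends Job implements GenericParameterJob {
	// ... constructor and run() as sketched earlier ...

	public function ignoreDuplicates() {
		// Let the queue skip insertion when an identical, not-yet-claimed
		// copy of this job is already enqueued.
		return true;
	}

	public function getDeduplicationInfo() {
		$info = parent::getDeduplicationInfo();
		// Drop volatile parameters (hypothetical name) so that rapid repeat
		// config edits produce jobs the queue treats as duplicates.
		unset( $info['params']['triggeredBy'] );
		return $info;
	}
}
```

A semaphore (e.g. a flag in a DB table or the main stash) would be the alternative if we also need to suppress new work while a previous regeneration run is still in flight.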

OK, we've discussed this task and have some ideas. I believe we can move it into "Triaged" until we're ready to focus on T315732: [EPIC] Structured Tasks: Patroller Focus, or maybe since that epic is in progress, the correct place for this task is the Current Sprint board? @KStoller-WMF @MShilova_WMF what do you think?

Let's move to the current sprint with the hope that we can start this within the next month.

We don't seem to have the capacity to work on this now, so I'm moving the task out of current sprint and shifting to a later epic.

Michael raised the priority of this task from Low to Needs Triage. Wed, Dec 18, 6:34 PM
Michael added a project: GrowthExperiments.

Idea for how this could be implemented (a rough code sketch follows the list):

  1. Store hash (rev-id?) for link recommendation config in the DB in its own column (we are considering doing this anyway)
  2. When the link recommendation config changes, then schedule job with the new changed hash
  3. When the job runs, it gets the current hash of the config and compares that to its own. If the hash is different, the job exits (config changed again -> outdated job)
  4. Then the job asks the database (link-recommendation table) for 10 (100?) suggestions which do not have the hash of the job
  5. Then the job revalidates them and stores the revalidated suggestion together with the new hash
  6. If the job got all recommendations it requested from the DB, then it schedules itself again (because there might still be more to revalidate)
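A sketch of how steps 2-6 could map onto the job's run() method, as a variant of the job class sketched earlier in this task (loadCurrentConfigHash(), fetchRecommendationsWithStaleHash() and revalidateAndStore() are hypothetical helpers whose implementation depends on the schema chosen in step 1):

```php
use MediaWiki\MediaWikiServices;

class RevalidateLinkRecommendationsJob extends Job implements GenericParameterJob {

	private const BATCH_SIZE = 100;

	public function __construct( array $params ) {
		// 'configHash' is the hash produced by the triggering config edit (step 2).
		parent::__construct( 'revalidateLinkRecommendations', $params );
	}

	public function run() {
		// Step 3: exit if the config changed again after this job was scheduled;
		// a newer job carrying the newer hash will take over.
		$currentHash = $this->loadCurrentConfigHash();
		if ( $currentHash !== $this->params['configHash'] ) {
			return true;
		}

		// Step 4: fetch a small batch of cached recommendations whose stored
		// hash differs from the current config hash.
		$rows = $this->fetchRecommendationsWithStaleHash( $currentHash, self::BATCH_SIZE );

		// Step 5: revalidate each one and store it together with the new hash.
		foreach ( $rows as $row ) {
			$this->revalidateAndStore( $row, $currentHash );
		}

		// Step 6: a full batch means there is probably more left, so reschedule.
		if ( count( $rows ) === self::BATCH_SIZE ) {
			MediaWikiServices::getInstance()->getJobQueueGroup()->lazyPush(
				new self( [ 'configHash' => $currentHash ] )
			);
		}
		return true;
	}

	// loadCurrentConfigHash(), fetchRecommendationsWithStaleHash() and
	// revalidateAndStore() are hypothetical helpers, omitted here.
}
```

Because a job scheduled by an outdated edit exits immediately at step 3, rapid successive config changes degrade gracefully without extra locking.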