When we first built newcomer tasks, the user had to wait for the client-side JavaScript to load, and then for that JavaScript to run a search query against ElasticSearch. This process was slow.
Eventually, we did the following:
- moved to server-side rendering of the suggested edits module (T236738: Newcomer tasks: server-side rendered version of suggested edits module)
- cached results from ElasticSearch in a TaskSet object that we store in WANObjectCache (T260758: Cache newcomer tasks per user)
- exported the task queue from the server side (T308542: Suggested edits: Populate task queue from server-side)
A downside to our TaskSet caching implementation, however, is that it's only accessible per user ID. That means we can't efficiently update the contents of the TaskSet in response to events that should invalidate a task within the TaskSet (for example: a maintenance template is removed from an article so it should no longer be considered a "copyedit" task; or a user completes a link recommendation task, so that article/task-type pair should be removed from all user tasksets).
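To make the cost concrete, here is a toy model of a cache keyed only by user ID. It is a plain Python dict, not WANObjectCache, and the function name `invalidate_everywhere` is made up for illustration. The point it sketches is that removing one article/task-type pair means touching every user's entry, including users who never had that task.

```python
# Toy model only: a dict standing in for a cache keyed by user ID.
cache = {
    456: [(123, "copyedit"), (124, "link-recommendation")],
    789: [(123, "copyedit")],
    1011: [(300, "copyedit")],
}


def invalidate_everywhere(cache, article_id, task_type):
    """Remove one task from every user's TaskSet.

    Returns how many cache entries had to be inspected.
    """
    inspected = 0
    for user_id, tasks in cache.items():
        inspected += 1
        # Reassigning an existing key is safe while iterating.
        cache[user_id] = [t for t in tasks if t != (article_id, task_type)]
    return inspected


inspected = invalidate_everywhere(cache, 123, "copyedit")
```

In this model three entries are inspected although only two contained the task. A real key-value cache is worse off, since it typically offers no way to enumerate all per-user keys in the first place.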
We have a number of workarounds to prevent the user from seeing invalid tasks in their suggested edits queue:
- LinkRecommendationFilter: checks that there is a database entry associated with the link recommendation task
- ImageRecommendationFilter: checks that the image recommendation is not in a cache bin of invalidated image recommendations
- ProtectionFilter: does a DB query to see if the article is protected
- SearchTaskSuggester::filter: does an ElasticSearch query with the page IDs in the task set to validate that the tasks associated with maintenance templates are still valid. (We used to have a TemplateFilter that did this via DB queries, but it had performance issues; see T267216: Slow load times for Special:Homepage on cswiki.)
These approaches all work, more or less, although T317187 seems to be related to SearchTaskSuggester::filter. That is something of a special case with the ElasticSearch upgrade, but it still highlights some fragility in our implementation. Beyond that, each additional filter adds load time and complexity to the code.
Thinking of potential improvements, here are some ideas:
- ElasticSearch remains the source of truth for which tasks exist, and which articles those tasks are associated with. Let's not change that.
- Instead of caching TaskSets in per-user cache bins, move the cache to database tables.
I'm thinking there would be a few database tables:
growthcachetasks:
|ID||article ID||task type ID||dateCreated|
|auto-increment||the page ID, e.g. 123||e.g. "copyedit" or "link-recommendation"||timestamp of when the task was seen when querying ElasticSearch|
growthcachetasksets:
|user ID||task ID||dateCreated||filter ID|
|user ID, e.g. 456||ID from growthcachetasks||timestamp of when the taskset was created||reference to taskset filters used|
A table for taskset filters, referenced by filter ID above:
|auto-increment||user ID, e.g. 456||JSON encoding of the topic/task type filters, as well as revision ID of MediaWiki:NewcomerTasks.json, used to generate a taskset|
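As a rough sketch, the three tables could be expressed as the following schema, using SQLite through Python purely for illustration. The column names, the unique constraint on article/task-type pairs, the composite primary key, and the table name `growthcachetasksetfilters` are assumptions, not part of the proposal; the real schema would follow MediaWiki conventions.

```python
import sqlite3

# Illustrative schema only; names and constraints are assumptions.
SCHEMA = """
CREATE TABLE growthcachetasks (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    article_id INTEGER NOT NULL,       -- page ID, e.g. 123
    task_type_id TEXT NOT NULL,        -- e.g. 'copyedit'
    date_created TEXT NOT NULL,        -- when ElasticSearch returned it
    UNIQUE (article_id, task_type_id)  -- assumed: one row per pair
);
CREATE TABLE growthcachetasksetfilters (  -- hypothetical table name
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    user_id INTEGER NOT NULL,
    filters_json TEXT NOT NULL         -- topic/task type filters, config rev ID
);
CREATE TABLE growthcachetasksets (
    user_id INTEGER NOT NULL,
    task_id INTEGER NOT NULL REFERENCES growthcachetasks (id),
    date_created TEXT NOT NULL,
    filter_id INTEGER NOT NULL REFERENCES growthcachetasksetfilters (id),
    PRIMARY KEY (user_id, task_id)     -- assumed: a task appears once per user
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

tables = sorted(
    row[0]
    for row in conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'table' AND name LIKE 'growthcache%'"
    )
)
```

Storing each article/task-type pair once is what would let a single update reach every taskset that references it.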
We would populate growthcachetasks when SearchTaskSuggester gets a result set from ElasticSearch. Each article/task-type pair would go into growthcachetasks as an individual row.
When constructing the TaskSet, we would save each item as a row in growthcachetasksets, referencing the IDs from growthcachetasks. To fetch a TaskSet for a user, we'd select rows from growthcachetasksets by user ID and join on growthcachetasks.
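The populate-and-fetch flow described above might look roughly like this. It is a minimal sketch in Python with SQLite, not the extension's code: the function names (`store_search_results`, `save_task_set`, `fetch_task_set`), column names, and constraints are all assumptions, and the ElasticSearch results are stubbed as plain tuples.

```python
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE growthcachetasks (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    article_id INTEGER NOT NULL,
    task_type_id TEXT NOT NULL,
    date_created TEXT NOT NULL,
    UNIQUE (article_id, task_type_id)
);
CREATE TABLE growthcachetasksets (
    user_id INTEGER NOT NULL,
    task_id INTEGER NOT NULL REFERENCES growthcachetasks (id),
    date_created TEXT NOT NULL,
    filter_id INTEGER NOT NULL,
    PRIMARY KEY (user_id, task_id)
);
""")


def now():
    return datetime.now(timezone.utc).isoformat()


def store_search_results(conn, results):
    """Store each (article_id, task_type_id) pair once; return task IDs."""
    task_ids = []
    for article_id, task_type_id in results:
        conn.execute(
            "INSERT OR IGNORE INTO growthcachetasks "
            "(article_id, task_type_id, date_created) VALUES (?, ?, ?)",
            (article_id, task_type_id, now()),
        )
        row = conn.execute(
            "SELECT id FROM growthcachetasks "
            "WHERE article_id = ? AND task_type_id = ?",
            (article_id, task_type_id),
        ).fetchone()
        task_ids.append(row[0])
    return task_ids


def save_task_set(conn, user_id, task_ids, filter_id):
    """Save one growthcachetasksets row per task in the user's TaskSet."""
    conn.executemany(
        "INSERT OR IGNORE INTO growthcachetasksets "
        "(user_id, task_id, date_created, filter_id) VALUES (?, ?, ?, ?)",
        [(user_id, task_id, now(), filter_id) for task_id in task_ids],
    )


def fetch_task_set(conn, user_id):
    """Select the user's rows and join on growthcachetasks."""
    return conn.execute(
        "SELECT t.article_id, t.task_type_id "
        "FROM growthcachetasksets s "
        "JOIN growthcachetasks t ON t.id = s.task_id "
        "WHERE s.user_id = ? ORDER BY t.id",
        (user_id,),
    ).fetchall()


# Two users whose search results overlap on article 123 / copyedit.
ids = store_search_results(conn, [(123, "copyedit"), (124, "link-recommendation")])
save_task_set(conn, 456, ids, filter_id=1)
save_task_set(conn, 789, store_search_results(conn, [(123, "copyedit")]), filter_id=2)

task_set_456 = fetch_task_set(conn, 456)
task_set_789 = fetch_task_set(conn, 789)
task_rows = conn.execute("SELECT COUNT(*) FROM growthcachetasks").fetchone()[0]
```

Both users end up pointing at the same growthcachetasks row for article 123, so the shared task is stored only once.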
We could then change a few things in our workflow. For example, when tasks are invalidated because a maintenance template is removed, the article protection status changes, or a link recommendation/image recommendation task is completed, we can update the growthcachetasks table to identify and remove the relevant row, and then update the growthcachetasksets table to remove the invalidated task from tasksets. We could also kick off a job to find a new task to put in the user's taskset to replace what was removed.
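A hedged sketch of that invalidation step, again in Python with SQLite and a trimmed-down version of the tables: the function name `invalidate_task` and its signature are hypothetical. It removes the task from growthcachetasks and from every taskset, and returns the affected user IDs so that a replacement job could be queued for each of them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE growthcachetasks (
    id INTEGER PRIMARY KEY,
    article_id INTEGER NOT NULL,
    task_type_id TEXT NOT NULL
);
CREATE TABLE growthcachetasksets (
    user_id INTEGER NOT NULL,
    task_id INTEGER NOT NULL,
    PRIMARY KEY (user_id, task_id)
);
INSERT INTO growthcachetasks (id, article_id, task_type_id) VALUES
    (1, 123, 'copyedit'), (2, 123, 'link-recommendation'), (3, 200, 'copyedit');
INSERT INTO growthcachetasksets (user_id, task_id) VALUES
    (456, 1), (456, 2), (789, 1), (789, 3);
""")


def invalidate_task(conn, article_id, task_type_id=None):
    """Remove a task from the cache and from all tasksets.

    With task_type_id=None, every task for the article is removed
    (e.g. when the article becomes protected). Returns the user IDs
    whose tasksets lost a task.
    """
    where = "article_id = ?"
    params = [article_id]
    if task_type_id is not None:
        where += " AND task_type_id = ?"
        params.append(task_type_id)
    task_ids = [
        row[0]
        for row in conn.execute(
            f"SELECT id FROM growthcachetasks WHERE {where}", params
        )
    ]
    if not task_ids:
        return []
    marks = ",".join("?" for _ in task_ids)
    users = [
        row[0]
        for row in conn.execute(
            "SELECT DISTINCT user_id FROM growthcachetasksets "
            f"WHERE task_id IN ({marks}) ORDER BY user_id",
            task_ids,
        )
    ]
    with conn:  # both deletes commit together
        conn.execute(
            f"DELETE FROM growthcachetasksets WHERE task_id IN ({marks})", task_ids
        )
        conn.execute(f"DELETE FROM growthcachetasks WHERE id IN ({marks})", task_ids)
    return users


# A maintenance template was removed from article 123.
affected_users = invalidate_task(conn, 123, "copyedit")
remaining = conn.execute(
    "SELECT user_id, task_id FROM growthcachetasksets ORDER BY user_id, task_id"
).fetchall()
```

Here one call cleans up the tasksets of both users 456 and 789 without reading any per-user cache entry; the returned list is where a replacement job would start.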
We'd then be able to remove LinkRecommendationFilter, ImageRecommendationFilter, ProtectionFilter, and SearchTaskSuggester::filter, because the cache tables would be kept up to date in response to events that invalidate tasks. We could also remove the infrastructure we have for automatically refreshing a user's TaskSet every 6 days.