
Newcomer tasks: Improve how we cache TaskSets
Open, Needs Triage, Public


When we first built newcomer tasks, the user had to wait for client-JS to load, and for the JS to perform a search query to ElasticSearch. This process was slow.

Eventually, we did the following:

The benefit to the user is that when they visit Special:Homepage, they should see an up-to-date set of task suggestions more or less instantly. Once the (cached) JavaScript for the Suggested Edits module loads, they can page through the task set that we sent as part of the server-side render response.

A downside to our TaskSet caching implementation, however, is that it's only accessible per user ID. That means we can't efficiently update the contents of the TaskSet in response to events that should invalidate a task within the TaskSet (for example: a maintenance template is removed from an article so it should no longer be considered a "copyedit" task; or a user completes a link recommendation task, so that article/task-type pair should be removed from all user tasksets).
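To make the cost concrete, here is a minimal sketch in Python. It assumes the cache can be modelled as a plain dict keyed by user ID, which is a simplification: the names `cache` and `invalidate_task` are hypothetical, and a real object cache may not allow enumerating keys at all.

```python
# Hypothetical stand-in for the per-user cache: user ID -> list of (page ID, task type).
cache = {
    456: [(123, "copyedit"), (124, "link-recommendation")],
    789: [(123, "copyedit")],
}


def invalidate_task(cache, page_id, task_type):
    """Remove one article/task-type pair from every cached task set.

    Entries are keyed by user ID only, so there is no index from a task
    back to the users who hold it: every entry has to be visited.
    Returns the number of cache entries inspected.
    """
    inspected = 0
    for user_id, tasks in list(cache.items()):
        inspected += 1
        cache[user_id] = [t for t in tasks if t != (page_id, task_type)]
    return inspected
```

The work grows with the number of cached users rather than with the number of users who actually hold the invalidated task.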

We have a number of workarounds to prevent the user from seeing invalid tasks in their suggested edits queue:

  • LinkRecommendationFilter: checks to see if there is a database entry associated with the link recommendation task
  • ImageRecommendationFilter: checks to see if the image recommendation is not in a cache bin of invalidated image recommendations
  • ProtectionFilter: does a DB query to see if the article is protected
  • SearchTaskSuggester::filter: does an ElasticSearch query with the page IDs in the task set to validate that the tasks associated with maintenance templates are still valid. (We used to have a TemplateFilter that did this via DB queries, but it had performance issues T267216: Slow load times for Special:Homepage on cswiki.)
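The workarounds above amount to a chain of checks that runs over the cached task set on every read. A rough Python sketch of that shape follows; the lookup sets stand in for the real database and cache queries, and all names here are illustrative rather than the actual GrowthExperiments classes.

```python
from typing import Callable, List, Tuple

Task = Tuple[int, str]  # (page ID, task type)
TaskFilter = Callable[[List[Task]], List[Task]]

# Hypothetical stand-ins for the lookups each filter performs.
PROTECTED_PAGES = {124}            # pages a protection query would flag
LINK_RECOMMENDATION_ROWS = {125}   # pages that still have a link recommendation entry


def protection_filter(tasks: List[Task]) -> List[Task]:
    """Drop tasks whose article is protected."""
    return [t for t in tasks if t[0] not in PROTECTED_PAGES]


def link_recommendation_filter(tasks: List[Task]) -> List[Task]:
    """Keep link recommendation tasks only if an entry still exists for the page."""
    return [
        t for t in tasks
        if t[1] != "link-recommendation" or t[0] in LINK_RECOMMENDATION_ROWS
    ]


def apply_filters(tasks: List[Task], filters: List[TaskFilter]) -> List[Task]:
    """Run every filter in order; each one adds its own lookup cost per read."""
    for task_filter in filters:
        tasks = task_filter(tasks)
    return tasks
```

Every filter added to the list is another lookup on the read path, which is where the load time goes.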

These approaches all work, more or less, although T317187 seems to be related to SearchTaskSuggester::filter. That is something of a special case involving the ElasticSearch upgrade, but it still highlights some fragility in our implementation. In addition, each filter adds load time and complexity to the code.

Thinking of potential improvements, here are some ideas:

  • ElasticSearch remains the source of truth for which tasks exist, and which articles those tasks are associated with. Let's not change that.
  • Instead of caching TaskSets in per-user cache bins, move the cache to database tables

I'm thinking there would be a few database tables:


growthcachetasks:

| ID | article ID | task type ID | dateCreated |
| --- | --- | --- | --- |
| auto-increment | the page ID, e.g. 123 | e.g. "copyedit" or "link-recommendation" | timestamp of when the task was seen when querying ElasticSearch |


growthcachetasksets:

| user ID | task ID | dateCreated | filter ID |
| --- | --- | --- | --- |
| user ID, e.g. 456 | ID from growthtasks | timestamp of when the taskset was created | reference to taskset filters used |


Taskset filters:

| ID | user ID | filters |
| --- | --- | --- |
| auto-increment | user ID, e.g. 456 | JSON encoding of the topic/task type filters, as well as revision ID of MediaWiki:NewcomerTasks.json, used to generate a taskset |
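As a rough sketch of what these tables could look like, here is one possible schema, written against SQLite from Python purely for illustration. The column names, the `growthcachetasksetfilters` table name, the text timestamps, and the uniqueness constraints are all assumptions; only `growthcachetasks` and `growthcachetasksets` are named in this task.

```python
import sqlite3

# Illustrative schema only: column names and constraints are assumptions.
SCHEMA = """
CREATE TABLE growthcachetasks (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    article_id INTEGER NOT NULL,          -- the page ID, e.g. 123
    task_type_id TEXT NOT NULL,           -- e.g. 'copyedit'
    date_created TEXT NOT NULL,           -- when ElasticSearch last returned it
    UNIQUE (article_id, task_type_id)     -- assumed: one row per article/task-type pair
);
CREATE TABLE growthcachetasksetfilters (  -- hypothetical table name
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    user_id INTEGER NOT NULL,
    filters TEXT NOT NULL                 -- JSON-encoded topic/task type filters
);
CREATE TABLE growthcachetasksets (
    user_id INTEGER NOT NULL,
    task_id INTEGER NOT NULL REFERENCES growthcachetasks (id),
    date_created TEXT NOT NULL,
    filter_id INTEGER NOT NULL REFERENCES growthcachetasksetfilters (id),
    PRIMARY KEY (user_id, task_id)        -- assumed: a task appears once per user
);
"""


def create_schema(conn: sqlite3.Connection) -> None:
    """Create the three cache tables on the given connection."""
    conn.executescript(SCHEMA)


conn = sqlite3.connect(":memory:")
create_schema(conn)
```

The key on (user ID, task ID) is a design guess that a task should appear at most once in a given user's taskset.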

We would populate growthcachetasks when SearchTaskSuggester gets a result set from ElasticSearch. Each article/task-type pair would go into growthcachetasks as individual rows.

When constructing the TaskSet, we would save each item as a row in growthcachetasksets, referencing the IDs from growthcachetasks. To fetch a TaskSet for a user, we'd select rows from growthcachetasksets by user ID and join on growthcachetasks.
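The populate-and-fetch flow might look roughly like the following sketch, again using SQLite from Python for illustration. The function names and column names are hypothetical, and the filter reference column is left out to keep the example short.

```python
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect(":memory:")
# Simplified, assumed schema (filter reference omitted for brevity).
conn.executescript("""
CREATE TABLE growthcachetasks (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    article_id INTEGER NOT NULL,
    task_type_id TEXT NOT NULL,
    date_created TEXT NOT NULL,
    UNIQUE (article_id, task_type_id)
);
CREATE TABLE growthcachetasksets (
    user_id INTEGER NOT NULL,
    task_id INTEGER NOT NULL REFERENCES growthcachetasks (id),
    date_created TEXT NOT NULL,
    PRIMARY KEY (user_id, task_id)
);
""")


def store_search_results(conn, results):
    """Store each (article ID, task type) pair as its own row; return the row IDs."""
    now = datetime.now(timezone.utc).isoformat()
    task_ids = []
    for article_id, task_type in results:
        # Pairs already seen keep their existing row and ID.
        conn.execute(
            "INSERT OR IGNORE INTO growthcachetasks "
            "(article_id, task_type_id, date_created) VALUES (?, ?, ?)",
            (article_id, task_type, now),
        )
        row = conn.execute(
            "SELECT id FROM growthcachetasks WHERE article_id = ? AND task_type_id = ?",
            (article_id, task_type),
        ).fetchone()
        task_ids.append(row[0])
    return task_ids


def save_taskset(conn, user_id, task_ids):
    """Save one taskset row per task, referencing growthcachetasks by ID."""
    now = datetime.now(timezone.utc).isoformat()
    conn.executemany(
        "INSERT OR REPLACE INTO growthcachetasksets (user_id, task_id, date_created) "
        "VALUES (?, ?, ?)",
        [(user_id, task_id, now) for task_id in task_ids],
    )


def fetch_taskset(conn, user_id):
    """Select the user's taskset rows and join on growthcachetasks."""
    return conn.execute(
        "SELECT t.article_id, t.task_type_id "
        "FROM growthcachetasksets s JOIN growthcachetasks t ON t.id = s.task_id "
        "WHERE s.user_id = ? ORDER BY t.id",
        (user_id,),
    ).fetchall()
```

Because tasks are stored once and referenced by ID, two users whose search results overlap share the same growthcachetasks rows.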

We could then change a few things in our workflow. For example, when tasks are invalidated because a maintenance template is removed, the article protection status changes, or a link recommendation/image recommendation task is completed, we can update the growthcachetasks table to identify and remove the relevant row, and then update the growthcachetasksets table to remove the invalidated task from tasksets. We could also kick off a job to find a new task to put in the user's taskset to replace what was removed.
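A minimal sketch of that invalidation step, using SQLite from Python for illustration: the schema is a simplified assumption, the seed rows are made-up example data, and `invalidate_task` is a hypothetical helper. It returns the affected user IDs so that a caller could queue the replacement job mentioned above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Simplified, assumed schema plus made-up example rows.
conn.executescript("""
CREATE TABLE growthcachetasks (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    article_id INTEGER NOT NULL,
    task_type_id TEXT NOT NULL,
    UNIQUE (article_id, task_type_id)
);
CREATE TABLE growthcachetasksets (
    user_id INTEGER NOT NULL,
    task_id INTEGER NOT NULL REFERENCES growthcachetasks (id),
    PRIMARY KEY (user_id, task_id)
);
INSERT INTO growthcachetasks (id, article_id, task_type_id) VALUES
    (1, 123, 'copyedit'),
    (2, 124, 'link-recommendation');
INSERT INTO growthcachetasksets (user_id, task_id) VALUES
    (456, 1), (456, 2), (789, 1);
""")


def invalidate_task(conn, article_id, task_type):
    """Remove a task and every taskset row that references it.

    Returns the IDs of users whose tasksets lost a task.
    """
    row = conn.execute(
        "SELECT id FROM growthcachetasks WHERE article_id = ? AND task_type_id = ?",
        (article_id, task_type),
    ).fetchone()
    if row is None:
        return []  # nothing cached for this pair
    task_id = row[0]
    users = [
        r[0]
        for r in conn.execute(
            "SELECT user_id FROM growthcachetasksets WHERE task_id = ? ORDER BY user_id",
            (task_id,),
        )
    ]
    with conn:  # both deletes commit together, or roll back together on error
        conn.execute("DELETE FROM growthcachetasksets WHERE task_id = ?", (task_id,))
        conn.execute("DELETE FROM growthcachetasks WHERE id = ?", (task_id,))
    return users
```

Unlike the per-user cache, the lookup here goes from the task to the users holding it, so only the affected rows are touched.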

We'd then be able to remove LinkRecommendationFilter, ImageRecommendationFilter, ProtectionFilter, and SearchTaskSuggester::filter, because the cache tables would be kept up to date in response to events that invalidate tasks. We could also remove the infrastructure we have for automatically refreshing a user's TaskSet every 6 days.

Event Timeline

kostajh added subscribers: Sgs, Tgr.

cc @Tgr, @Sgs and @Urbanecm. After writing all of this out, I am not entirely convinced it's worth the effort, but I think it's a good idea to think through how we can improve the resiliency and performance of our existing setup. I welcome your thoughts and proposals!

A simpler form of this might be to skip the task and tasksetfilters tables, and just create a table for tasksets to begin with.

I'm moving this to upcoming work, with the idea that we'd start prototyping to see if there are any useful, iterative improvements we can make.