Assess database requirements for link recommendations reading entry point
Open, Needs Triage, Public

Description

Link recommendations use a set of tables called growthexperiments_link_recommendations (one per wiki) on x1, which cache data from a recommendation system (which is slow). Currently we are keeping a constant pool of ~20K articles per wiki, which is enough to give users a feed of link recommendation tasks within some article topic they choose. But if we wanted to suggest people link recommendation tasks about the article they are reading at the moment (the project name for this is "entry point in reading experience"), we'd need this data for all articles.

We want to assess 1) whether it would be reasonable to run an experiment on a few mid-size wikis to test how much a reading entry point would help with turning readers into editors and retaining new editors; 2) whether it would be feasible to scale up to all wikis eventually; 3) whether moving these tables out of MediaWiki would help, hurt, be necessary, or be impossible (they just cache responses from a Kubernetes-based web service, so logically they could just as easily live in a database belonging to that service).

Currently the table size is something like 50-100M (so about 5K per article, given the ~20K-article pool). On cswiki, which is our go-to wiki for testing new features, including every article would take about 2G. On enwiki, it would be about 20G.
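The sizing above can be reproduced with simple arithmetic. The following is a rough sketch, not a measurement: it reads the 50-100M figure against the ~20K-article pool to get a per-article cost, then scales it up. The function names and the 400,000 article count are illustrative assumptions (the count is chosen only because it reproduces the ~2G cswiki figure at 5K per article); real article counts and row sizes would need to be checked against the actual tables.

```python
def bytes_per_article(table_bytes: int, pooled_articles: int) -> float:
    """Average cached bytes per article, from a table size and pool size."""
    return table_bytes / pooled_articles


def estimate_full_coverage_bytes(article_count: int, per_article_bytes: float) -> float:
    """Projected table size if every article had cached recommendations."""
    return article_count * per_article_bytes


# Upper end of the quoted range: 100M spread over a ~20K article pool.
per_article = bytes_per_article(100 * 10**6, 20_000)

# ASSUMPTION: 400_000 is an illustrative article count, not a measured one.
projected = estimate_full_coverage_bytes(400_000, per_article)

print(f"{per_article:.0f} bytes per article")
print(f"{projected / 10**9:.1f} GB for full coverage")
```

Using the lower end of the range (50M) halves both numbers, so the projections are best treated as order-of-magnitude estimates.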

Background:

Event Timeline

Restricted Application added a subscriber: Aklapper.

But if we wanted to suggest people link recommendation tasks about the article they are reading at the moment (the project name for this is "entry point in reading experience"), we'd need this data for all articles.

An alternative to caching recommendations for all articles would be to fetch data on the fly. Especially when limiting the maximum number of recommendations to 2 (the minimum on most wikis), getting the data takes about 2 seconds. Presumably anything under 10 seconds would work from a product perspective, since the user is actually reading or skimming the article before seeing that there are possible changes to make?

Also, given that this would be shown to logged-in newcomers only (who would probably also be able to opt out of this feature, I imagine), the traffic should be manageable.

That's an interesting idea, @kostajh -- but perhaps then there would be issues around the volume of API requests? Hitting the API to get link suggestions on every page load of an article would be a lot. I imagine we would need business rules to decide which users get it, and for which articles?

Right. Calling on the fly would be something to consider if this reading entry point is done for a subset of page views, not for all traffic to a page. Some variables we could play with:

  • all users or only authenticated users?
  • all authenticated users or only newcomers?
  • all newcomers or only accounts created in the last N days?
  • issue the API request immediately on page load or wait N seconds (or after the article is scrolled) before querying the link recommendation API?

If the goal is to tell the user that link suggestions exist for the article immediately on page load, then we should go with the proposal in this task, which is to generate recommendations for as many articles as possible on the wiki.
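The variables listed above could be combined into a single gating check that runs before any on-the-fly request is made. This is only a minimal sketch of that idea: `Viewer`, `GatingRules`, `should_fetch`, and all field names are hypothetical and are not part of GrowthExperiments or the link recommendation service, and how a "newcomer" is determined is left to the caller.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Viewer:
    """Hypothetical description of whoever is viewing the article."""
    is_authenticated: bool
    is_newcomer: bool = False
    registered_at: Optional[datetime] = None  # None for anonymous viewers


@dataclass
class GatingRules:
    """Hypothetical knobs, one per variable in the list above."""
    authenticated_only: bool = True
    newcomers_only: bool = True
    max_account_age_days: Optional[int] = None  # None disables the age check
    delay_seconds: float = 0.0  # how long the client waits before querying


def should_fetch(viewer: Viewer, rules: GatingRules, now: datetime) -> bool:
    """Return True if this page view qualifies for an on-the-fly request."""
    if rules.authenticated_only and not viewer.is_authenticated:
        return False
    if rules.newcomers_only and not viewer.is_newcomer:
        return False
    if rules.max_account_age_days is not None:
        if viewer.registered_at is None:
            return False
        if now - viewer.registered_at > timedelta(days=rules.max_account_age_days):
            return False
    return True
```

The `delay_seconds` value is not used by the check itself; in this sketch it would be handed to the client, which waits that long (or for a scroll event) before issuing the request.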

Per discussion with @Ladsgroup:

  • the +2G of enabling for all articles on cswiki shouldn't be a problem
  • if we want to do this for all wikis (~50G?), we should offset it by freeing up space. T308084: Reduce DB space used by Echo notifications seems fairly easy to do and would probably free up more space than that.

Yup. My only request would be to be careful about massive writes or reads when enabling it on really large wikis (coordinate with the DBAs beforehand), but generally x1 is in a healthy state.

So, what is the next step? Should we try enabling it for all articles on cswiki and see how it goes?

The reading entry point is probably a large enough project that it needs to be slotted into our annual plan. Generating tasks for all cswiki articles does not take much coding, but it doesn't seem that valuable as an experiment so I'd wait until we are at the point where we actually need it.

Alright, moving off the current sprint board then and into Triaged.

Thank you all! Yes, that's right @kostajh -- this would be an annual planned project. This ticket was just to begin the thought process, and we'll figure out when to actually proceed here.

Marostegui removed a project: DBA.