
Deploy "add a link" to all Wikipedias
Open, In Progress, Medium, Public

Description

After having "add a link" on about 10 wikis for several months, we learned about valuable improvements to make. Those improvements are collected in this epic: T300851: [EPIC] Growth: "add a link" structured task 2.0. Once those improvements are complete, we will be comfortable deploying "add a link" more broadly. This task is about generating the suggestions for another set of wikis and deploying there.

This process was started in T290011: [OLD] Deploy Add a link to a third round of wikis, which is kept under this task for reference.

Event Timeline

@Trizek-WMF -- in T290011: [OLD] Deploy Add a link to a third round of wikis, you proposed a list of wikis to be in the third round of "add a link". I made this fresh task to restart the process since it happened several months ago. Could you please assemble a new list? I actually think it can be a lot bigger than your original list because we have strong data about the value of "add a link", few complaints from communities, and have made many improvements. I think we should deploy to something like 30-50 more wikis.

When the list of wikis is shared, the ML team will generate models and datasets for the specified wikis. It will then publish the datasets and hand them over to @kostajh for the MediaWiki-specific configuration.

We can keep the list we had already defined, and we can add some more wikis to it. A few wikis would need close Community Relations work, but the list will mainly depend on when the models become available.

@kevinbazira, how much time would it take to generate models and datasets for, let's say, 10, 30, 50, 100, or all Wikipedias? Just to know how many rounds we could reasonably have, and to get a time estimate.

@Trizek-WMF, the time it takes to generate models and datasets depends on how big a wiki is: a smaller wiki takes approximately 2 to 3 hours; a big wiki could take upwards of 6 hours.

So for 10 wikis, we could be looking at approximately 30 to 60 hours depending on the size of each of these wikis.
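
As a rough illustration of the estimate above: 30 to 60 hours for 10 wikis works out to about 3 to 6 hours per wiki. The sketch below scales that range to the other counts asked about. It assumes wikis are processed one after another and that time grows linearly with the number of wikis; both are assumptions for illustration, not something stated in this task, and `training_hours` is a made-up helper name.

```python
def training_hours(num_wikis, low_per_wiki=3.0, high_per_wiki=6.0):
    """Return (low, high) total hours, assuming wikis are processed one at a time.

    The per-wiki defaults are derived from the "30 to 60 hours for 10 wikis"
    figure quoted above; real wikis may fall outside this range.
    """
    return num_wikis * low_per_wiki, num_wikis * high_per_wiki


for n in (10, 30, 50, 100):
    low, high = training_hours(n)
    print(f"{n} wikis: {low:.0f} to {high:.0f} hours "
          f"({low / 24:.1f} to {high / 24:.1f} days of compute)")
```

This covers only model and dataset generation, not the time GrowthExperiments needs afterwards to fill its task pool.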

AFAIK, once the models are generated and published, GrowthExperiments needs to populate its task pool. I'm not sure if the time for that is included in the 30-60 hour estimate.

@Urbanecm_WMF, no, it's not included. The 30-60 hour estimate covers only the time it takes to generate models and datasets.

From the Growth side, training the models, along with creating and publishing the datasets, could begin at any time.

We are not ready to start filling up the task pools, though, because we first need to put some configuration in place to support T279519: Add a link: algorithm improvements: Avoid recommending links in sections that usually don't have links.

As an aside, @kevinbazira it is probably worth retraining and recreating the datasets for all the existing wikis, as many of them are nearly a year old, and would benefit from being recreated with fresh data (cc @MGerlach if he has opinions on that).

Following the comments highlighted above, the ML team has started generating models and datasets for the wikis below:

  • Catalan Wikipedia
  • Hebrew Wikipedia
  • Hindi Wikipedia
  • Korean Wikipedia
  • Norwegian Bokmål Wikipedia
  • Portuguese Wikipedia
  • Simple English Wikipedia
  • Swedish Wikipedia
  • Ukrainian Wikipedia

We will check off each wiki as it is completed and will let you know when they are all done and ready to publish.

Trizek-WMF renamed this task from Deploy "add a link" to third round of wikis to Deploy "add a link" to all Wikipedias. Mar 23 2022, 4:51 PM

I renamed the task so that we can have sub-tasks that make sense. I will create a sub-task for each round, and we will then be able to coordinate on models, community information and deployment.

@kevinbazira, is there a limit on the Wikipedias we can train the model on? Can brand-new wikis be included?

@Trizek-WMF, there is no limit per se. However, the model training pipeline uses dumps, so we can only train a model for a wiki that exists in the database dumps.

I created several groups. They combine small, medium and big wikis; by chance, alphabetical order happened to produce that mix. Let me know if these groups are the right size. I can add more wikis to them if needed.

Round 3 is ready. I will announce it to the communities. To do so, I need to know what the next steps are to deploy Add a link to new wikis.

How much time will it take to deploy to 10 more wikis?
It must be deployed using the train, right?
Who will do the deployment? The Growth team, I guess? (ping @kostajh)

How much time will it take to deploy to 10 more wikis?

Which ones?

It must be deployed using the train, right?

No, the steps are outlined in the various subtasks -- train the models and make the datasets, then a config patch to enable the backend for each wiki, then after there are enough available tasks for each wiki, we enable the frontend.

Who will do the deployment? The Growth team, I guess? (ping @kostajh)

The Growth team does the frontend/backend config patches, the Machine-Learning-Team does the training/dataset steps.
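
The order of steps described above can be sketched as a small decision function. This is only an illustration of the sequence (train and publish, enable the backend, wait for tasks, enable the frontend); the function name, the `Step` enum, and the `min_tasks` threshold of 500 are hypothetical and do not come from GrowthExperiments or its configuration.

```python
from enum import Enum


class Step(Enum):
    TRAIN_AND_PUBLISH = "train the model and publish the dataset"
    ENABLE_BACKEND = "config patch: enable the backend"
    WAIT_FOR_TASKS = "wait for the task pool to fill"
    ENABLE_FRONTEND = "config patch: enable the frontend"
    DONE = "nothing left to do"


def next_step(dataset_published, backend_enabled, tasks_available,
              frontend_enabled, min_tasks=500):
    """Return the next rollout step for one wiki.

    min_tasks is a placeholder: the thread only says "enough available
    tasks", without giving a number.
    """
    if not dataset_published:
        return Step.TRAIN_AND_PUBLISH
    if not backend_enabled:
        return Step.ENABLE_BACKEND
    if frontend_enabled:
        return Step.DONE
    if tasks_available < min_tasks:
        return Step.WAIT_FOR_TASKS
    return Step.ENABLE_FRONTEND


# A wiki whose backend is enabled but whose pool is still filling up.
print(next_step(True, True, tasks_available=120, frontend_enabled=False).value)
```

The first step belongs to the Machine-Learning-Team and the two config patches to the Growth team, as noted above.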

Thank you @kostajh. The wikis that are already ready are listed in T304542: Deploy "add a link" to third round of wikis. I'm editing the task accordingly.

We can announce a deployment for the week after the training is done and, for announcement purposes, on a Wednesday (this gives Tech News time to reach all wikis, and leaves enough time before Friday).

I would adjust the timing – after the backend patch is enabled, we need to wait some period of time (days, maybe 1-2 weeks) to see that there are enough link recommendation tasks for the wiki (see the task pool section of https://grafana.wikimedia.org/d/vGq7hbnMz/special-homepage-and-suggested-edits?orgId=1). Once we see there are enough tasks, then we can enable the frontend.

So it would be a deployment around April 15, on a Wednesday.

Maybe. With everything happening in Growth-Team (Current Sprint) and various people being away around the spring holidays, I am not confident about that date.

@kevinbazira, all rounds have been created; they are sub-tasks of the current one. Please proceed with model training when you can! :)

Trizek-WMF changed the task status from Open to In Progress. May 11 2022, 5:44 PM

Also @kevinbazira, I tried to make rounds that gather approximately the same number of wikis, with a lot of small ones plus a big one, or a few mid-sized ones with some small wikis. Let me know if I should change the number of wikis there to help your work building the models (make bigger batches, or smaller ones).

@Trizek-WMF, thank you for creating all the rounds. I am working on generating datasets and models round by round and will be sharing updates on the sub-tasks.

Task generation for the Add Link task pool runs on one thread per DB server group (s1-s7). The two largest DB server groups (in terms of number of wikis) are s5 (21 Wikipedias) and s3 (280 Wikipedias). For mid-sized wikis, task generation took about 5-6 hours for a single wiki. It will probably take the same time for large wikis as well (we are going for a constant number of tasks, regardless of wiki size); it might take less time for small wikis, where candidates can be exhausted before the required number of tasks is found. But assuming it doesn't, 280 Wikipedias at about 6 hours each is ~70 days, so we should probably run 4 instances of the refresh script in parallel on s3.
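
The ~70 day figure can be checked with a short calculation. The sketch below assumes 6 hours per wiki (the upper end of the 5-6 hour range above) and an even split of wikis across parallel instances; `refresh_days` is a hypothetical helper, not part of the refresh script.

```python
import math


def refresh_days(num_wikis, hours_per_wiki=6.0, instances=1):
    """Wall-clock days to fill the task pools of num_wikis wikis.

    Wikis are assumed to be split evenly across instances; the instance
    with the most wikis determines the total time.
    """
    wikis_per_instance = math.ceil(num_wikis / instances)
    return wikis_per_instance * hours_per_wiki / 24


print(refresh_days(280))               # s3 with a single thread
print(refresh_days(280, instances=4))  # s3 with four parallel instances
print(refresh_days(21))                # s5 with a single thread
```

Under these assumptions, one thread on s3 needs 70 days, four instances bring that down to 17.5 days, and s5 needs a little over 5 days on a single thread.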

The script creates a lock with the ID of the wiki it is processing, so running multiple instances in parallel on the same dblist should be fine - the second thread will just skip the wiki that the first is already processing.
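
The skip-if-locked behaviour can be illustrated with a minimal in-memory sketch. It is not the refresh script's actual implementation: `WikiLocks` and `run_instance` are hypothetical names, and a real deployment would need a lock shared between processes rather than a Python set.

```python
import threading


class WikiLocks:
    """In-memory stand-in for the per-wiki lock (hypothetical helper)."""

    def __init__(self):
        self._guard = threading.Lock()
        self._held = set()

    def try_acquire(self, wiki_id):
        """Take the lock for wiki_id without blocking; False if already held."""
        with self._guard:
            if wiki_id in self._held:
                return False
            self._held.add(wiki_id)
            return True

    def release(self, wiki_id):
        with self._guard:
            self._held.discard(wiki_id)


def run_instance(dblist, locks, refresh):
    """Refresh every wiki in dblist that no other instance is working on."""
    processed = []
    for wiki_id in dblist:
        if not locks.try_acquire(wiki_id):
            continue  # another instance holds this wiki's lock: skip it
        try:
            refresh(wiki_id)
            processed.append(wiki_id)
        finally:
            locks.release(wiki_id)
    return processed


locks = WikiLocks()
locks.try_acquire("cawiki")  # pretend a first instance is busy with cawiki
print(run_instance(["cawiki", "hewiki", "hiwiki"], locks, refresh=lambda w: None))
# prints ['hewiki', 'hiwiki']
```

Because the lock is only held while a wiki is being processed, a later instance may revisit a wiki that has already been refreshed; the sketch assumes such a repeat run is harmless.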

Proposal for streamlining the completion of remaining rounds:

  1. Train models, verify models, publish datasets for all remaining wikis. This involves work from Machine-Learning-Team (cc @kevinbazira) and Research (cc @MGerlach). Growth doesn't need to be consulted on this phase; once these teams are happy with the models, the datasets can be published.
  2. Growth engineers will populate the excluded section titles for all remaining wikis
  3. Growth engineers will enable the backend for all remaining wikis, so the task pools begin to fill up
  4. @Trizek-WMF can then verify the hasrecommendation:link results and check the API https://api.wikimedia.org/service/linkrecommendation/apidocs/#/default/get_v1_linkrecommendations__project___domain___page_title_
  5. @Trizek-WMF can inform communities, and Growth engineers can enable the front-end, either staggered or en masse, depending on what works better for @Trizek-WMF
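
For the API check in step 4, a request URL could be assembled as in the sketch below. The path layout is an assumption inferred from the operation name in the documentation link above (`get_v1_linkrecommendations__project___domain___page_title_`); please confirm it against the API docs before relying on it. The helper name and the example article are illustrative only, and the code builds the URL without sending a request.

```python
from urllib.parse import quote

# Assumed base path, taken from the documentation URL quoted in step 4.
BASE_URL = "https://api.wikimedia.org/service/linkrecommendation"


def link_recommendation_url(project, domain, page_title):
    """Build a link recommendation URL for one article (assumed path layout)."""
    # MediaWiki titles use underscores instead of spaces; percent-encode the rest.
    title = quote(page_title.replace(" ", "_"), safe="")
    return f"{BASE_URL}/v1/linkrecommendations/{project}/{domain}/{title}"


print(link_recommendation_url("wikipedia", "ca", "Barcelona"))
```

The `hasrecommendation:link` check mentioned in the same step is a search keyword and is not covered by this sketch.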

tl;dr I think we would all save time on context switching if we do the model training, section title population, and backend enabling all at once, rather than in phases over a period of months. Then the actual presentation to communities could be done in a more staggered way if that is better from a community relations standpoint.

@KStoller-WMF @kevinbazira @MGerlach what do you think?

@kostajh +1 on doing the initial phases at once to avoid context switching.

The ML team will proceed with training models, evaluating them, and publishing datasets for all the remaining rounds.

As long as we keep a staggered deployment, I'm fine. :)

+1
This sounds like a great approach!

Trizek-WMF changed the task status from In Progress to Stalled. Jan 17 2023, 5:36 PM
Trizek-WMF changed the status of subtask T304953: Schedule the deployment of "Add a link" to more wikis from In Progress to Stalled.

We moved to a staggered deployment process. Once all wikis have trained models, we will resume deployments.

@kevinbazira
Do we have a rough ETA for when model training will be done for all Wikipedias? Thanks!

@KStoller-WMF, we are currently working on the 9th out of 18 rounds of wikis. Each of the 9 remaining rounds has ~20 models. ETA to train, evaluate, and publish all these models is about a month or more depending on the size of each of these wikis and whether or not we have to fine-tune the link recommendation algorithm to support a wiki's language-specific characters.

We will be sharing progress updates on the sub-tasks.

Sgs changed the status of subtask T308133: Deploy "add a link" to 8th round of wikis from Open to In Progress.
Sgs changed the status of subtask T308134: Deploy "add a link" to 9th round of wikis from Open to In Progress.
Trizek-WMF changed the task status from Stalled to In Progress. Tue, Mar 14, 1:15 PM
Trizek-WMF added a project: Epic.