Store the fact that Add Link did not generate any recommendation for a page, don't try again
Closed, Resolved | Public | 5 Estimated Story Points

Description

From T378527#10384905

Can we somehow store the information that we did not find any suggestions for a page? Currently we just plain do nothing in that case and then try again the next time around. It would be nice to skip that extra work by storing somewhere (page prop? cirrus search weighted tag? GE-specific db entry? ...?) the fact that, for the combination of page-revisionID + community-config-hash (+ model number?), we did not have any suggestion, so we do not need to try again.

Today we discussed this task and how to move forward with the related approach of generating more suggestions (T378527, T378536 and T382404). We concluded that implementing the feature discussed in this task would be highly beneficial, regardless of which approach is ultimately picked.

We decided that the easiest way to implement this is to alter the existing growthexperiments_link_recommendations table and make the gelr_data field nullable. A null value in that field would then be interpreted to mean that we tried to get a recommendation for a page from the service but did not succeed. Such a row would usually be removed by the existing process when a page has been edited or through revalidation (currently implemented via a manual maintenance script).
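
To illustrate the three states this design distinguishes (never tried, tried without result, recommendation stored), here is a minimal sketch using Python's sqlite3 module. It is not the GrowthExperiments implementation: the columns gelr_page and gelr_revision, the JSON encoding and all function names are assumptions made for the example; only the table name and the nullable gelr_data field come from the discussion above.

```python
import json
import sqlite3
from enum import Enum


class LookupResult(Enum):
    NOT_TRIED = "not tried"                    # no row for the page
    NO_RECOMMENDATION = "no recommendation"    # row exists, gelr_data IS NULL
    HAS_RECOMMENDATION = "has recommendation"  # row exists with data


def create_table(conn):
    # Assumed, simplified schema; only gelr_data being nullable is from the task.
    conn.execute(
        """
        CREATE TABLE growthexperiments_link_recommendations (
            gelr_page INTEGER NOT NULL,
            gelr_revision INTEGER NOT NULL PRIMARY KEY,
            gelr_data TEXT NULL
        )
        """
    )


def store(conn, page_id, revision_id, recommendation):
    # Passing None records "we asked the service and got nothing back".
    data = None if recommendation is None else json.dumps(recommendation)
    conn.execute(
        "INSERT OR REPLACE INTO growthexperiments_link_recommendations "
        "(gelr_page, gelr_revision, gelr_data) VALUES (?, ?, ?)",
        (page_id, revision_id, data),
    )


def lookup(conn, page_id):
    row = conn.execute(
        "SELECT gelr_data FROM growthexperiments_link_recommendations "
        "WHERE gelr_page = ?",
        (page_id,),
    ).fetchone()
    if row is None:
        return LookupResult.NOT_TRIED
    if row[0] is None:
        return LookupResult.NO_RECOMMENDATION
    return LookupResult.HAS_RECOMMENDATION


def should_request(conn, page_id):
    # Only pages without any row are worth sending to the service.
    return lookup(conn, page_id) is LookupResult.NOT_TRIED
```

With this shape, a refresh job can skip any page for which should_request() returns False, and the null row disappears again whenever the existing cleanup removes the page's row.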

Event Timeline

Michael renamed this task from Investigate if how to store the info that we did not find suggestions for a page to Investigate how to store the info that we did not find suggestions for a page. (Jan 13 2025, 5:43 PM)

Change #1111673 had a related patch set uploaded (by Urbanecm; author: Urbanecm):

[mediawiki/extensions/GrowthExperiments@master] sql: Make gelr_data nullable

https://gerrit.wikimedia.org/r/1111673

Urbanecm_WMF renamed this task from Investigate how to store the info that we did not find suggestions for a page to Store Add Link did not generate any recommendation for a page. (Jan 15 2025, 5:58 PM)
Urbanecm_WMF added a subscriber: Sgs.

@Sgs mentioned in backlog refinement the results of the API might differ when new pages get created (if there is a linkable term without an existing article, and that article is later created). The question is, does the API actually behave that way? If it is, we might need to retry after N days, to avoid the article being forgotten forever.

@Sgs mentioned in backlog refinement the results of the API might differ when new pages get created (if there is a linkable term without an existing article, and that article is later created). The question is, does the API actually behave that way? [...]

Based on the comment, I wonder if that is what https://gerrit.wikimedia.org/r/plugins/gitiles/research/mwaddlink/+/47c0636e0053184eda19ece864405e027fc3d82b/src/scripts/utils.py#391 is doing?

If it is, we might need to retry after N days, to avoid the article being forgotten forever.

Mh. How would we know when N days are over?

@Sgs mentioned in backlog refinement the results of the API might differ when new pages get created (if there is a linkable term without an existing article, and that article is later created). The question is, does the API actually behave that way?

Since I had no luck when going through mwaddlink's code myself, I checked with @MGerlach to see how the service behaves wrt existence of articles. While mwaddlink does check the target page exists, it happens when the model is trained, not during the API call. The relevant code is src/scripts/generate_anchor_dictionary_spark.py#L213. According to Martin G.:

  • we only ever suggest links that occurred in other articles (i.e. candidates from the "anchor-dictionary")
  • in the anchor dictionary we keep only those candidates which correspond to an existing article
  • this is based on some snapshot of the dump (depending on when we trained the model), so we might get a few mistakes from outdated information
  • the code checks that the link target (resolved page title) matches a page title from the main namespace that is not a redirect
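
The filtering described in these points can be sketched roughly as follows. This is not the mwaddlink code: the Page type, the redirects mapping and the function name are hypothetical, and the real pipeline works on a dump snapshot with Spark rather than on in-memory lists.

```python
from dataclasses import dataclass

MAIN_NAMESPACE = 0


@dataclass(frozen=True)
class Page:
    title: str
    namespace: int
    is_redirect: bool


def build_anchor_dictionary(observed_links, pages, redirects):
    """Keep only anchors whose resolved target is an existing,
    non-redirect main-namespace page in the snapshot.

    observed_links: iterable of (anchor_text, target_title) pairs
    pages: iterable of Page objects from the snapshot
    redirects: dict mapping a redirect title to its target title
    """
    valid_titles = {
        page.title
        for page in pages
        if page.namespace == MAIN_NAMESPACE and not page.is_redirect
    }
    anchors = {}
    for anchor_text, target in observed_links:
        resolved = redirects.get(target, target)
        if resolved not in valid_titles:
            continue  # target missing, outside main namespace, or unresolved
        counts = anchors.setdefault(anchor_text, {})
        counts[resolved] = counts.get(resolved, 0) + 1
    return anchors
```

Because the set of valid titles is fixed when the dictionary is built, an article created afterwards cannot become a link target until the dictionary is rebuilt from a newer snapshot.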

Because page existence is checked during training, the API won't notice that new articles exist until the model is retrained. This means that removing rows where gelr_data IS NULL on revalidation (more specifically: on model retraining) should be sufficient to deal with new pages getting created. This also means we don't need to store any extra information for the "no suggestion returned" case and, as such, we can proceed with making gelr_data nullable as previously agreed.
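
As a rough illustration of that cleanup step, the sketch below deletes the "no suggestion" rows so that the affected pages get retried against a retrained model. It uses Python's sqlite3 module for the sake of a runnable example; the function names, the gelr_page and gelr_revision columns and the sample rows are assumptions, and the actual revalidation is a MediaWiki maintenance script.

```python
import sqlite3

TABLE = "growthexperiments_link_recommendations"


def purge_no_recommendation_rows(conn):
    """Delete rows that only record 'no suggestion found'.

    Intended to run after a model retraining, so those pages are
    eligible for a fresh request. Returns the number of rows removed.
    """
    cursor = conn.execute(f"DELETE FROM {TABLE} WHERE gelr_data IS NULL")
    conn.commit()
    return cursor.rowcount


def demo():
    conn = sqlite3.connect(":memory:")
    # Assumed, simplified schema for the example.
    conn.execute(
        f"CREATE TABLE {TABLE} ("
        "gelr_page INTEGER NOT NULL, "
        "gelr_revision INTEGER NOT NULL PRIMARY KEY, "
        "gelr_data TEXT NULL)"
    )
    conn.executemany(
        f"INSERT INTO {TABLE} VALUES (?, ?, ?)",
        [(1, 100, None), (2, 200, '{"links": []}'), (3, 300, None)],
    )
    deleted = purge_no_recommendation_rows(conn)
    remaining = [
        row[0]
        for row in conn.execute(f"SELECT gelr_page FROM {TABLE} ORDER BY gelr_page")
    ]
    return deleted, remaining
```

Rows that hold an actual recommendation are left untouched; only the null markers are cleared.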

Thanks @MGerlach for the pointers and @Sgs for pointing this potential problem out!

Good to know! This does mean we can indeed move forward here.

On the other hand, this does also mean that not retraining models is a problem, especially for smaller languages that might still be growing a lot...

Let's keep the scope of this task to architecture + implementing the change itself (blocked by the schema change). I filed T383864: Make growthexperiments_link_recommendations.gelr_data nullable in GrowthExperiments for the schema change (to be completed within this sprint); there will be a follow-up task for DBA to actually perform the change in production.

Urbanecm_WMF edited projects, added Growth-Team; removed Growth-Team (Current Sprint).

Since we do not aim to make the change itself within this sprint, moving this out of sprint for now.

Blocked on schema change.

Urbanecm_WMF removed a project: Patch-For-Review.
Michael renamed this task from Store Add Link did not generate any recommendation for a page to Store the fact that Add Link did not generate any recommendation for a page. (Jan 21 2025, 3:50 PM)
Michael renamed this task from Store the fact that Add Link did not generate any recommendation for a page to Store the fact that Add Link did not generate any recommendation for a page, don't try again.
Urbanecm_WMF set the point value for this task to 5. (Jan 21 2025, 3:58 PM)

Change #1114036 had a related patch set uploaded (by Michael Große; author: Michael Große):

[mediawiki/extensions/GrowthExperiments@master] feat(AddLink): store null if there is no recommendation

https://gerrit.wikimedia.org/r/1114036

Review provided, moving back to Doing.

Change #1118509 had a related patch set uploaded (by Urbanecm; author: Michael Große):

[mediawiki/extensions/GrowthExperiments@master] refactor(AddLink): ignore rows with `null` in Store

https://gerrit.wikimedia.org/r/1118509

Change #1118509 merged by jenkins-bot:

[mediawiki/extensions/GrowthExperiments@master] refactor(AddLink): ignore rows with `null` in Store

https://gerrit.wikimedia.org/r/1118509

Change #1118811 had a related patch set uploaded (by Michael Große; author: Michael Große):

[mediawiki/extensions/GrowthExperiments@wmf/1.44.0-wmf.15] refactor(AddLink): ignore rows with `null` in Store

https://gerrit.wikimedia.org/r/1118811

Change #1118811 merged by jenkins-bot:

[mediawiki/extensions/GrowthExperiments@wmf/1.44.0-wmf.15] refactor(AddLink): ignore rows with `null` in Store

https://gerrit.wikimedia.org/r/1118811

Mentioned in SAL (#wikimedia-operations) [2025-02-11T14:22:02Z] <urbanecm@deploy2002> Started scap sync-world: Backport for [[gerrit:1118810|[Experiment Platform]: Disable experiments (T373715 T383801)]], [[gerrit:1118811|refactor(AddLink): ignore rows with null in Store (T382270)]]

Mentioned in SAL (#wikimedia-operations) [2025-02-11T14:24:59Z] <urbanecm@deploy2002> phuedx, migr, urbanecm: Backport for [[gerrit:1118810|[Experiment Platform]: Disable experiments (T373715 T383801)]], [[gerrit:1118811|refactor(AddLink): ignore rows with null in Store (T382270)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)

Mentioned in SAL (#wikimedia-operations) [2025-02-11T14:33:39Z] <urbanecm@deploy2002> Finished scap sync-world: Backport for [[gerrit:1118810|[Experiment Platform]: Disable experiments (T373715 T383801)]], [[gerrit:1118811|refactor(AddLink): ignore rows with null in Store (T382270)]] (duration: 11m 36s)

Change #1114036 merged by jenkins-bot:

[mediawiki/extensions/GrowthExperiments@master] feat(AddLink): store null if there is no recommendation

https://gerrit.wikimedia.org/r/1114036

Change #1119116 had a related patch set uploaded (by Michael Große; author: Michael Große):

[mediawiki/extensions/GrowthExperiments@wmf/1.44.0-wmf.16] feat(AddLink): store null if there is no recommendation

https://gerrit.wikimedia.org/r/1119116

Change #1119116 merged by jenkins-bot:

[mediawiki/extensions/GrowthExperiments@wmf/1.44.0-wmf.16] feat(AddLink): store null if there is no recommendation

https://gerrit.wikimedia.org/r/1119116

Mentioned in SAL (#wikimedia-operations) [2025-02-12T14:36:41Z] <lucaswerkmeister-wmde@deploy2002> Started scap sync-world: Backport for [[gerrit:1119115|refactor(AddLink): Make eval steps more legible]], [[gerrit:1119116|feat(AddLink): store null if there is no recommendation (T382270)]]

Mentioned in SAL (#wikimedia-operations) [2025-02-12T14:39:41Z] <lucaswerkmeister-wmde@deploy2002> lucaswerkmeister-wmde, migr: Backport for [[gerrit:1119115|refactor(AddLink): Make eval steps more legible]], [[gerrit:1119116|feat(AddLink): store null if there is no recommendation (T382270)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)

Mentioned in SAL (#wikimedia-operations) [2025-02-12T14:48:29Z] <lucaswerkmeister-wmde@deploy2002> Finished scap sync-world: Backport for [[gerrit:1119115|refactor(AddLink): Make eval steps more legible]], [[gerrit:1119116|feat(AddLink): store null if there is no recommendation (T382270)]] (duration: 11m 47s)

Change #1118510 had a related patch set uploaded (by Michael Große; author: Michael Große):

[mediawiki/extensions/GrowthExperiments@master] metrics(AddLink): track outcomes of refreshLinkRecommendations

https://gerrit.wikimedia.org/r/1118510

Change #1118510 merged by jenkins-bot:

[mediawiki/extensions/GrowthExperiments@master] metrics(AddLink): track outcomes of refreshLinkRecommendations

https://gerrit.wikimedia.org/r/1118510