
Maintenance script for updating recommendations to newer dataset
Closed, Resolved · Public

Description

T279434: Add a link: algorithm improvements: Define filter for not linking specific article types will result in updated datasets. It would be nice to update cached recommendations to use the newest dataset / model.

Potentially, refreshLinkRecommendations.php could include a loop to iterate over existing link recommendations and find ones that don't correspond to the latest dataset/model hash. It would be best not to just drop all of these at once, because the task queue could drop to zero. Instead, it could drop some number of older cached recommendations per run (and hope that refreshLinkRecommendations.php continues to fill up the queue), or it could iterate over old cached recommendations and drop and re-request each one individually.
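
For illustration, a very rough sketch of the "drop a limited number per run" variant. The table/column names other than gelr_data, the blob layout and the way the current checksum is passed in are assumptions, not the actual implementation:

```php
<?php
// Sketch only: drop at most --limit cached recommendations whose stored checksum
// no longer matches the current dataset/model checksum.
require_once getenv( 'MW_INSTALL_PATH' ) . '/maintenance/Maintenance.php';

class PruneOutdatedLinkRecommendations extends Maintenance {
	public function __construct() {
		parent::__construct();
		$this->addDescription( 'Drop cached link recommendations generated from an outdated dataset' );
		$this->addOption( 'current-checksum', 'Checksum of the current dataset/model', true, true );
		$this->addOption( 'limit', 'Maximum number of rows to drop per run', false, true );
	}

	public function execute() {
		$currentChecksum = $this->getOption( 'current-checksum' );
		$limit = (int)$this->getOption( 'limit', 1000 );
		$dbw = $this->getDB( DB_PRIMARY );

		// The table is small enough (tens of thousands of rows) that a full scan is
		// tolerable for a sketch; a real script would batch this.
		$res = $dbw->select(
			'growthexperiments_link_recommendations',
			[ 'gelr_page', 'gelr_data' ],
			[],
			__METHOD__
		);

		$dropped = 0;
		foreach ( $res as $row ) {
			$data = json_decode( $row->gelr_data, true );
			// Where exactly the checksum lives inside the blob is an assumption.
			$storedChecksum = $data['meta']['model_checksum'] ?? null;
			if ( $storedChecksum === $currentChecksum ) {
				continue;
			}
			$dbw->delete(
				'growthexperiments_link_recommendations',
				[ 'gelr_page' => $row->gelr_page ],
				__METHOD__
			);
			// refreshLinkRecommendations.php is expected to refill the task pool on its next run.
			if ( ++$dropped >= $limit ) {
				break;
			}
		}
		$this->output( "Dropped $dropped outdated recommendations.\n" );
	}
}

$maintClass = PruneOutdatedLinkRecommendations::class;
require_once RUN_MAINTENANCE_IF_MAIN;
```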

Event Timeline

I think we should extract the "given an article, fetch, evaluate and store a recommendation" logic to a helper class, and make it configurable - currently the attempt is considered failed if the page already has a recommendation in the database; that could optionally be changed to a version check. The hacky part of the maintenance script (replacing the configuration loader & search engine) is not needed for this, so the helper could be reused without a dependency on custom service setup.
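
For illustration, something along these lines (class name, constants and signature are made up here, not the actual code):

```php
<?php
// Interface sketch only: a helper that encapsulates fetch/evaluate/store, with the
// "row already exists" behaviour made configurable.

class LinkRecommendationUpdater {

	/** Current behaviour: treat the attempt as failed if a row already exists. */
	public const EXISTING_ROW_SKIP = 'skip';
	/** Proposed option: refresh only if the stored dataset/model version is outdated. */
	public const EXISTING_ROW_CHECK_VERSION = 'check-version';

	/**
	 * Fetch, evaluate and store a link recommendation for one page.
	 *
	 * @param \Title $title Page to process
	 * @param string $existingRowPolicy One of the EXISTING_ROW_* constants
	 * @return bool Whether a new recommendation was stored
	 */
	public function processPage( \Title $title, string $existingRowPolicy = self::EXISTING_ROW_SKIP ): bool {
		// 1. Look up any stored recommendation for $title.
		// 2. Under EXISTING_ROW_SKIP, bail out if one exists; under EXISTING_ROW_CHECK_VERSION,
		//    bail out only if the stored version matches the current dataset/model.
		// 3. Otherwise fetch a fresh recommendation from the service, run the usual
		//    evaluation/filtering, store it and update the search index.
		return false; // placeholder
	}
}
```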

After that, we can just write a new maintenance script for iterating through the DB and refreshing old recommendations. It's a bit slower than having the version in a separate table column so it can be indexed, but I don't think that's going to be a problem in practice.

Wrt dropping, I think it would be OK to just update the relevant table row instead (and skip sending a search index update).

> Wrt dropping, I think it would be OK to just update the relevant table row instead (and skip sending a search index update).

Yeah that makes sense.

> I think we should extract the "given an article, fetch, evaluate and store a recommendation" logic to a helper class, and make it configurable - currently the attempt is considered failed if the page already has a recommendation in the database; that could optionally be changed to a version check. The hacky part of the maintenance script (replacing the configuration loader & search engine) is not needed for this, so the helper could be reused without a dependency on custom service setup.

Sounds good. Would be nice if the helper could be used in a job as well – for example, when the "No suggestions available" dialog is shown, we could enqueue a job to try to refresh; or when we invalidate a suggestion because an edit was made to the page, that could also be a good time to enqueue a job to refresh suggestions for the page.
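
For example, something like this (job name, class and parameters are hypothetical; the helper call inside run() stands in for whatever the extracted class ends up being, and the job would also need to be registered in extension.json):

```php
<?php
// Hypothetical job sketch for refreshing the recommendation of a single page.

class RefreshLinkRecommendationJob extends Job {
	public function __construct( array $params ) {
		parent::__construct( 'refreshLinkRecommendation', $params );
		// Collapse duplicate refresh requests for the same page.
		$this->removeDuplicates = true;
	}

	public function run() {
		$pageId = (int)$this->params['pageId'];
		// Call the extracted fetch/evaluate/store helper for $pageId here,
		// with the "refresh if outdated" behaviour enabled.
		return true;
	}
}

/**
 * Enqueue a refresh, e.g. from the "No suggestions available" code path or when an
 * edit invalidates the stored suggestion for the page.
 */
function enqueueLinkRecommendationRefresh( Title $title ): void {
	JobQueueGroup::singleton()->lazyPush(
		new RefreshLinkRecommendationJob( [ 'pageId' => $title->getArticleID() ] )
	);
}
```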

> After that, we can just write a new maintenance script for iterating through the DB and refreshing old recommendations. It's a bit slower than having the version in a separate table column so it can be indexed, but I don't think that's going to be a problem in practice.

Sounds good to me.

Hm, it looks like we never got around to adding dates in T271603: Add a link engineering: Recommendation version, so the only comparable field (in the sense that you can compare two values and tell which one is newer) is the format version, which was intended for rare, unusual changes that make reading from the API forward-incompatible. So if we want to identify outdated recommendations, we could

  • use a list of old dataset IDs but those are per language so it would be very awkward
  • use a list of old application IDs but there are way too many of those (the ID changes with every mwaddlink commit)
  • increase the format version, somewhat contrary to its intended meaning
  • just regenerate all the tasks

We also do not keep the date when the recommendation was generated, which would be needed for T282873: Add Link: Fix production discrepancies between the link recommendation table and the search index (where all recommendations are broken within a certain known date range).

@kostajh what do you think? I'd go with regenerating everything (on the wikis where Add Link was deployed before the mwaddlink improvements), as the vast majority of tasks are probably still from the initial run of the refresh script, and we don't really have a better method for T282873 anyway. At the same time, we should probably add dataset dates to the API and generation dates to the link recommendations so this doesn't become a problem the next time.

Change 711242 had a related patch set uploaded (by Gergő Tisza; author: Gergő Tisza):

[mediawiki/extensions/GrowthExperiments@master] Split out the link recommendation fetch/validate/store logic

https://gerrit.wikimedia.org/r/711242

Change 711247 had a related patch set uploaded (by Gergő Tisza; author: Gergő Tisza):

[mediawiki/extensions/GrowthExperiments@master] [WIP] Add maintenance script to revalidate link recommendations

https://gerrit.wikimedia.org/r/711247

> Hm, it looks like we never got around to adding dates in T271603: Add a link engineering: Recommendation version, so the only comparable field (in the sense that you can compare two values and tell which one is newer) is the format version, which was intended for rare, unusual changes that make reading from the API forward-incompatible. So if we want to identify outdated recommendations, we could
>
>   • use a list of old dataset IDs but those are per language so it would be very awkward
>   • use a list of old application IDs but there are way too many of those (the ID changes with every mwaddlink commit)
>   • increase the format version, somewhat contrary to its intended meaning
>   • just regenerate all the tasks

We know the newest dataset checksum for each language (e.g. https://analytics.wikimedia.org/published/datasets/one-off/research-mwaddlink/cswiki/cswiki.linkmodel.json.checksum) so the script could find that, then iterate over the rows in the table to find ones that don't match?
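
A quick sketch of that lookup; the URL pattern is the one above, while the surrounding function, the parsing and the checksum file layout are just illustrative assumptions:

```php
<?php
// Illustrative helper: fetch the latest published model checksum for one wiki.

use MediaWiki\MediaWikiServices;

function getLatestLinkModelChecksum( string $wikiId ): ?string {
	$url = "https://analytics.wikimedia.org/published/datasets/one-off/research-mwaddlink/"
		. "$wikiId/$wikiId.linkmodel.json.checksum";
	$body = MediaWikiServices::getInstance()->getHttpRequestFactory()
		->get( $url, [], __METHOD__ );
	if ( $body === null ) {
		return null;
	}
	// Assuming the file follows the usual "<hash>  <filename>" checksum format;
	// keep just the hash so it can be compared against what is stored per row.
	return preg_split( '/\s+/', trim( $body ) )[0];
}

// e.g. getLatestLinkModelChecksum( 'cswiki' ), then iterate over the
// growthexperiments_link_recommendations rows whose stored checksum differs.
```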

> We also do not keep the date when the recommendation was generated, which would be needed for T282873: Add Link: Fix production discrepancies between the link recommendation table and the search index (where all recommendations are broken within a certain known date range).

> @kostajh what do you think? I'd go with regenerating everything (on the wikis where Add Link was deployed before the mwaddlink improvements), as the vast majority of tasks are probably still from the initial run of the refresh script, and we don't really have a better method for T282873 anyway. At the same time, we should probably add dataset dates to the API and generation dates to the link recommendations so this doesn't become a problem the next time.

Adding dates seems like a good idea, yes. Should we add a column to the table or just add it to the blob in the gelr_data field?

As for refreshing all tasks – I suggested an alternative to refreshing everything above; refreshing all tasks seems like it would be kind of complicated because we need to keep track of the start/end points over multiple runs of the script.

> We know the newest dataset checksum for each language (e.g. https://analytics.wikimedia.org/published/datasets/one-off/research-mwaddlink/cswiki/cswiki.linkmodel.json.checksum) so the script could find that, then iterate over the rows in the table to find ones that don't match?

Yeah, that would work. I wanted to avoid per-language settings but maybe it's the easiest.

> Adding dates seems like a good idea, yes. Should we add a column to the table or just add it to the blob in the gelr_data field?

A blob, I think - schema changes are a lot of effort and I doubt we'd need it often. Even when it would be useful, iterating through 30K rows doesn't take too long.
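
Rough sketch of the blob variant; the 'meta' wrapper, the key name and the unique key are assumptions, only gelr_data and wfTimestamp() come from the existing schema/core:

```php
<?php
// Sketch: fold the generation time into the gelr_data blob when storing a recommendation.

use Wikimedia\Rdbms\IDatabase;

function storeLinkRecommendation( IDatabase $dbw, int $pageId, array $recommendation ): void {
	// TS_MW (YYYYMMDDHHMMSS) sorts lexicographically, so filtering by date range later
	// only needs string comparisons while iterating over the rows.
	$recommendation['meta']['generated_at'] = wfTimestamp( TS_MW );

	$dbw->replace(
		'growthexperiments_link_recommendations',
		'gelr_page', // assumed unique key
		[
			'gelr_page' => $pageId,
			'gelr_data' => json_encode( $recommendation ),
		],
		__METHOD__
	);
}
```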

> As for refreshing all tasks – I suggested an alternative to refreshing everything above; refreshing all tasks seems like it would be kind of complicated because we need to keep track of the start/end points over multiple runs of the script.

We'd only run it once (or as needed, manually) and let it run as long as needed. I don't think there is much point in making it a scheduled script.

>> We know the newest dataset checksum for each language (e.g. https://analytics.wikimedia.org/published/datasets/one-off/research-mwaddlink/cswiki/cswiki.linkmodel.json.checksum) so the script could find that, then iterate over the rows in the table to find ones that don't match?

> Yeah, that would work. I wanted to avoid per-language settings but maybe it's the easiest.

>> Adding dates seems like a good idea, yes. Should we add a column to the table or just add it to the blob in the gelr_data field?

> A blob, I think - schema changes are a lot of effort and I doubt we'd need it often. Even when it would be useful, iterating through 30K rows doesn't take too long.

Sounds good!

>> As for refreshing all tasks – I suggested an alternative to refreshing everything above; refreshing all tasks seems like it would be kind of complicated because we need to keep track of the start/end points over multiple runs of the script.

> We'd only run it once (or as needed, manually) and let it run as long as needed. I don't think there is much point in making it a scheduled script.

I was thinking that because the script would take many hours to refresh all stored suggestions (optimistically allocating 3 seconds per suggestion, it would require about 20 hours to complete), there is some chance the script would be interrupted while running. I don't know much about long-running scripts on the maintenance servers, though, and whether a process can be expected to run uninterrupted for that long.

> I was thinking that because the script would take many hours to refresh all stored suggestions (optimistically allocating 3 seconds per suggestion, it would require about 20 hours to complete), there is some chance the script would be interrupted while running. I don't know much about long-running scripts on the maintenance servers, though, and whether a process can be expected to run uninterrupted for that long.

In theory long-running scripts are fine (for the big schema migrations, multiple weeks is not unusual), but of course it would stop if it hits a bug, ops sometimes depool a server, etc. I think the standard way to handle that is to periodically write the DB offset to the output, tee the output to a logfile when running, and have a parameter for continuing from a given offset. So if the script is stopped, only the last batch needs to be redone.
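
That pattern could look something like this (again only a sketch; table/column names and the actual revalidation step are placeholders):

```php
<?php
// Sketch of a resumable revalidation script: process rows in primary-key order, print the
// offset after every batch (tee the output to a logfile), and accept --from-page to continue.
require_once getenv( 'MW_INSTALL_PATH' ) . '/maintenance/Maintenance.php';

class RevalidateLinkRecommendations extends Maintenance {
	public function __construct() {
		parent::__construct();
		$this->addDescription( 'Revalidate stored link recommendations, resumable via --from-page' );
		$this->addOption( 'from-page', 'Continue from this gelr_page value', false, true );
		$this->setBatchSize( 200 );
	}

	public function execute() {
		$fromPage = (int)$this->getOption( 'from-page', 0 );
		$dbr = $this->getDB( DB_REPLICA );

		do {
			$res = $dbr->select(
				'growthexperiments_link_recommendations',
				[ 'gelr_page' ],
				[ 'gelr_page > ' . $dbr->addQuotes( $fromPage ) ],
				__METHOD__,
				[ 'ORDER BY' => 'gelr_page', 'LIMIT' => $this->getBatchSize() ]
			);
			foreach ( $res as $row ) {
				// Re-fetch / re-validate the recommendation for $row->gelr_page here.
				$fromPage = (int)$row->gelr_page;
			}
			// If the run is interrupted, restart with --from-page=<last printed value>;
			// only the last batch needs to be redone.
			$this->output( "Processed up to gelr_page=$fromPage\n" );
		} while ( $res->numRows() === $this->getBatchSize() );
	}
}

$maintClass = RevalidateLinkRecommendations::class;
require_once RUN_MAINTENANCE_IF_MAIN;
```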

Change 711242 merged by jenkins-bot:

[mediawiki/extensions/GrowthExperiments@master] Split out the link recommendation fetch/validate/store logic

https://gerrit.wikimedia.org/r/711242

Change 712830 had a related patch set uploaded (by Gergő Tisza; author: Gergő Tisza):

[mediawiki/extensions/GrowthExperiments@master] Add Link: store when tasks were generated

https://gerrit.wikimedia.org/r/712830

Change 711247 merged by jenkins-bot:

[mediawiki/extensions/GrowthExperiments@master] Add maintenance script to revalidate link recommendations

https://gerrit.wikimedia.org/r/711247

Change 712830 merged by jenkins-bot:

[mediawiki/extensions/GrowthExperiments@master] Add Link: store when tasks were generated

https://gerrit.wikimedia.org/r/712830

Change 714158 had a related patch set uploaded (by Gergő Tisza; author: Gergő Tisza):

[mediawiki/extensions/GrowthExperiments@wmf/1.37.0-wmf.19] Add Link: store when tasks were generated

https://gerrit.wikimedia.org/r/714158

Change 714158 merged by jenkins-bot:

[mediawiki/extensions/GrowthExperiments@wmf/1.37.0-wmf.19] Add Link: store when tasks were generated

https://gerrit.wikimedia.org/r/714158

Mentioned in SAL (#wikimedia-operations) [2021-08-23T18:56:12Z] <tgr@deploy1002> Synchronized php-1.37.0-wmf.19/extensions/GrowthExperiments: Backport: [[gerrit:714158|Add Link: store when tasks were generated (T284551)]] (duration: 00m 57s)

kostajh added a subscriber: Etonkovidova.

Then I think we can mark this task as resolved (cc @Etonkovidova)