
Maintenance script for updating recommendations to newer dataset
Closed, Resolved · Public

Description

T279434: Add a link: algorithm improvements: Define filter for not linking specific article types will result in updated datasets. It would be nice to update cached recommendations to use the newest dataset / model.

Potentially, refreshLinkRecommendations.php could include a loop to iterate over existing link recommendations and find ones that don't correspond to the latest dataset/model hash. It would be best not to just drop all of these at once, because the task queue could drop to zero. Instead, it could drop some number of older cached recommendations per run (and hope that refreshLinkRecommendations.php continues to fill up the queue), or it could iterate over old cached recommendations and drop and re-request each one individually.
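
For illustration, a very rough sketch of the "drop a limited number per run" variant. The table/column names other than gelr_data, the blob layout and the way the current checksum is passed in are assumptions, not the actual implementation:

```php
<?php
// Sketch only: drop at most --limit cached recommendations whose stored checksum
// no longer matches the current dataset/model checksum.
require_once getenv( 'MW_INSTALL_PATH' ) . '/maintenance/Maintenance.php';

class PruneOutdatedLinkRecommendations extends Maintenance {
	public function __construct() {
		parent::__construct();
		$this->addDescription( 'Drop cached link recommendations generated from an outdated dataset' );
		$this->addOption( 'current-checksum', 'Checksum of the current dataset/model', true, true );
		$this->addOption( 'limit', 'Maximum number of rows to drop per run', false, true );
	}

	public function execute() {
		$currentChecksum = $this->getOption( 'current-checksum' );
		$limit = (int)$this->getOption( 'limit', 1000 );
		$dbw = $this->getDB( DB_PRIMARY );

		// The table is small enough (tens of thousands of rows) that a full scan is
		// tolerable for a sketch; a real script would batch this.
		$res = $dbw->select(
			'growthexperiments_link_recommendations',
			[ 'gelr_page', 'gelr_data' ],
			[],
			__METHOD__
		);

		$dropped = 0;
		foreach ( $res as $row ) {
			$data = json_decode( $row->gelr_data, true );
			// Where exactly the checksum lives inside the blob is an assumption.
			$storedChecksum = $data['meta']['model_checksum'] ?? null;
			if ( $storedChecksum === $currentChecksum ) {
				continue;
			}
			$dbw->delete(
				'growthexperiments_link_recommendations',
				[ 'gelr_page' => $row->gelr_page ],
				__METHOD__
			);
			// refreshLinkRecommendations.php is expected to refill the task pool on its next run.
			if ( ++$dropped >= $limit ) {
				break;
			}
		}
		$this->output( "Dropped $dropped outdated recommendations.\n" );
	}
}

$maintClass = PruneOutdatedLinkRecommendations::class;
require_once RUN_MAINTENANCE_IF_MAIN;
```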

Event Timeline

I think we should extract the "given an article, fetch, evaluate and store a recommendation" logic to a helper class, and make it configurable - currently the attempt is considered failed if the page already has a recommendation in the database; that could optionally be changed to a version check. The hacky part of the maintenance script (replacing the configuration loader & search engine) is not needed for this, so the helper could be reused without a dependency on custom service setup.
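
For illustration, something along these lines (class name, constants and signature are made up here, not the actual code):

```php
<?php
// Interface sketch only: a helper that encapsulates fetch/evaluate/store, with the
// "row already exists" behaviour made configurable.

class LinkRecommendationUpdater {

	/** Current behaviour: treat the attempt as failed if a row already exists. */
	public const EXISTING_ROW_SKIP = 'skip';
	/** Proposed option: refresh only if the stored dataset/model version is outdated. */
	public const EXISTING_ROW_CHECK_VERSION = 'check-version';

	/**
	 * Fetch, evaluate and store a link recommendation for one page.
	 *
	 * @param \Title $title Page to process
	 * @param string $existingRowPolicy One of the EXISTING_ROW_* constants
	 * @return bool Whether a new recommendation was stored
	 */
	public function processPage( \Title $title, string $existingRowPolicy = self::EXISTING_ROW_SKIP ): bool {
		// 1. Look up any stored recommendation for $title.
		// 2. Under EXISTING_ROW_SKIP, bail out if one exists; under EXISTING_ROW_CHECK_VERSION,
		//    bail out only if the stored version matches the current dataset/model.
		// 3. Otherwise fetch a fresh recommendation from the service, run the usual
		//    evaluation/filtering, store it and update the search index.
		return false; // placeholder
	}
}
```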

After that, we can just write a new maintenance script for iterating through the DB and refreshing old recommendations. It's a bit slower than having the version in a separate table column so it can be indexed, but I don't think that's going to be a problem in practice.

Wrt dropping, I think it would be OK to just update the relevant table row instead (and skip sending a search index update).

> Wrt dropping, I think it would be OK to just update the relevant table row instead (and skip sending a search index update).

Yeah that makes sense.

> I think we should extract the "given an article, fetch, evaluate and store a recommendation" logic to a helper class, and make it configurable - currently the attempt is considered failed if the page already has a recommendation in the database; that could optionally be changed to a version check. The hacky part of the maintenance script (replacing the configuration loader & search engine) is not needed for this, so the helper could be reused without a dependency on custom service setup.

Sounds good. Would be nice if the helper could be used in a job as well – for example, when the "No suggestions available" dialog is shown, we could enqueue a job to try to refresh; or when we invalidate a suggestion because an edit was made to the page, that could also be a good time to enqueue a job to refresh suggestions for the page.
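
For example, something like this (job name, class and parameters are hypothetical; the helper call inside run() stands in for whatever the extracted class ends up being, and the job would also need to be registered in extension.json):

```php
<?php
// Hypothetical job sketch for refreshing the recommendation of a single page.

class RefreshLinkRecommendationJob extends Job {
	public function __construct( array $params ) {
		parent::__construct( 'refreshLinkRecommendation', $params );
		// Collapse duplicate refresh requests for the same page.
		$this->removeDuplicates = true;
	}

	public function run() {
		$pageId = (int)$this->params['pageId'];
		// Call the extracted fetch/evaluate/store helper for $pageId here,
		// with the "refresh if outdated" behaviour enabled.
		return true;
	}
}

/**
 * Enqueue a refresh, e.g. from the "No suggestions available" code path or when an
 * edit invalidates the stored suggestion for the page.
 */
function enqueueLinkRecommendationRefresh( Title $title ): void {
	JobQueueGroup::singleton()->lazyPush(
		new RefreshLinkRecommendationJob( [ 'pageId' => $title->getArticleID() ] )
	);
}
```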

> After that, we can just write a new maintenance script for iterating through the DB and refreshing old recommendations. It's a bit slower than having the version in a separate table column so it can be indexed, but I don't think that's going to be a problem in practice.

Sounds good to me.

Hm, it looks like we never got around to adding dates in T271603: Add a link engineering: Recommendation version, so the only comparable field (in the sense that you can compare two values and tell which one is newer) is the format version, which was intended for rare, unusual changes that make reading from the API forward-incompatible. So if we want to identify outdated recommendations, we could

  • use a list of old dataset IDs but those are per language so it would be very awkward
  • use a list of old application IDs but there are way too many of those (the ID changes with every mwaddlink commit)
  • increase the format version, somewhat contrary to its intended meaning
  • just regenerate all the tasks

We also do not keep the date when the recommendation was generated, which would be needed for T282873: Add Link: Fix production discrepancies between the link recommendation table and the search index (where all recommendations are broken within a certain known date range).

@kostajh what do you think? I'd go with regenerating everything (on the wikis where Add Link was deployed before the mwaddlink improvements), as the vast majority of tasks are probably still from the initial run of the refresh script, and we don't really have a better method for T282873 anyway. At the same time, we should probably add dataset dates to the API and generation dates to the link recommendations so this doesn't become a problem the next time.

Change 711242 had a related patch set uploaded (by Gergő Tisza; author: Gergő Tisza):

[mediawiki/extensions/GrowthExperiments@master] Split out the link recommendation fetch/validate/store logic

https://gerrit.wikimedia.org/r/711242

Change 711247 had a related patch set uploaded (by Gergő Tisza; author: Gergő Tisza):

[mediawiki/extensions/GrowthExperiments@master] [WIP] Add maintenance script to revalidate link recommendations

https://gerrit.wikimedia.org/r/711247

> Hm, it looks like we never got around to adding dates in T271603: Add a link engineering: Recommendation version, so the only comparable field (in the sense that you can compare two values and tell which one is newer) is the format version, which was intended for rare, unusual changes that make reading from the API forward-incompatible. So if we want to identify outdated recommendations, we could
>
>   • use a list of old dataset IDs but those are per language so it would be very awkward
>   • use a list of old application IDs but there are way too many of those (the ID changes with every mwaddlink commit)
>   • increase the format version, somewhat contrary to its intended meaning
>   • just regenerate all the tasks

We know the newest dataset checksum for each language (e.g. https://analytics.wikimedia.org/published/datasets/one-off/research-mwaddlink/cswiki/cswiki.linkmodel.json.checksum) so the script could find that, then iterate over the rows in the table to find ones that don't match?
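
A quick sketch of that lookup; the URL pattern is the one above, while the surrounding function, the parsing and the checksum file layout are just illustrative assumptions:

```php
<?php
// Illustrative helper: fetch the latest published model checksum for one wiki.

use MediaWiki\MediaWikiServices;

function getLatestLinkModelChecksum( string $wikiId ): ?string {
	$url = "https://analytics.wikimedia.org/published/datasets/one-off/research-mwaddlink/"
		. "$wikiId/$wikiId.linkmodel.json.checksum";
	$body = MediaWikiServices::getInstance()->getHttpRequestFactory()
		->get( $url, [], __METHOD__ );
	if ( $body === null ) {
		return null;
	}
	// Assuming the file follows the usual "<hash>  <filename>" checksum format;
	// keep just the hash so it can be compared against what is stored per row.
	return preg_split( '/\s+/', trim( $body ) )[0];
}

// e.g. getLatestLinkModelChecksum( 'cswiki' ), then iterate over the
// growthexperiments_link_recommendations rows whose stored checksum differs.
```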

> We also do not keep the date when the recommendation was generated, which would be needed for T282873: Add Link: Fix production discrepancies between the link recommendation table and the search index (where all recommendations are broken within a certain known date range).

> @kostajh what do you think? I'd go with regenerating everything (on the wikis where Add Link was deployed before the mwaddlink improvements), as the vast majority of tasks are probably still from the initial run of the refresh script, and we don't really have a better method for T282873 anyway. At the same time, we should probably add dataset dates to the API and generation dates to the link recommendations so this doesn't become a problem the next time.

Adding dates seems like a good idea, yes. Should we add a column to the table or just add it to the blob in the gelr_data field?

As for refreshing all tasks – I suggested an alternative to refreshing everything above; refreshing all tasks seems like it would be kind of complicated because we need to keep track of the start/end points over multiple runs of the script.

> We know the newest dataset checksum for each language (e.g. https://analytics.wikimedia.org/published/datasets/one-off/research-mwaddlink/cswiki/cswiki.linkmodel.json.checksum) so the script could find that, then iterate over the rows in the table to find ones that don't match?

Yeah, that would work. I wanted to avoid per-language settings but maybe it's the easiest.

> Adding dates seems like a good idea, yes. Should we add a column to the table or just add it to the blob in the gelr_data field?

A blob, I think - schema changes are a lot of effort and I doubt we'd need it often. Even when it would be useful, iterating through 30K rows doesn't take too long.
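
Rough sketch of the blob variant; the 'meta' wrapper, the key name and the unique key are assumptions, only gelr_data and wfTimestamp() come from the existing schema/core:

```php
<?php
// Sketch: fold the generation time into the gelr_data blob when storing a recommendation.

use Wikimedia\Rdbms\IDatabase;

function storeLinkRecommendation( IDatabase $dbw, int $pageId, array $recommendation ): void {
	// TS_MW (YYYYMMDDHHMMSS) sorts lexicographically, so filtering by date range later
	// only needs string comparisons while iterating over the rows.
	$recommendation['meta']['generated_at'] = wfTimestamp( TS_MW );

	$dbw->replace(
		'growthexperiments_link_recommendations',
		'gelr_page', // assumed unique key
		[
			'gelr_page' => $pageId,
			'gelr_data' => json_encode( $recommendation ),
		],
		__METHOD__
	);
}
```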

> As for refreshing all tasks – I suggested an alternative to refreshing everything above; refreshing all tasks seems like it would be kind of complicated because we need to keep track of the start/end points over multiple runs of the script.

We'd only run it once (or as needed, manually) and let it run as long as needed. I don't think there is much point in making it a scheduled script.

>> We know the newest dataset checksum for each language (e.g. https://analytics.wikimedia.org/published/datasets/one-off/research-mwaddlink/cswiki/cswiki.linkmodel.json.checksum) so the script could find that, then iterate over the rows in the table to find ones that don't match?

> Yeah, that would work. I wanted to avoid per-language settings but maybe it's the easiest.

>> Adding dates seems like a good idea, yes. Should we add a column to the table or just add it to the blob in the gelr_data field?

> A blob, I think - schema changes are a lot of effort and I doubt we'd need it often. Even when it would be useful, iterating through 30K rows doesn't take too long.

Sounds good!

>> As for refreshing all tasks – I suggested an alternative to refreshing everything above; refreshing all tasks seems like it would be kind of complicated because we need to keep track of the start/end points over multiple runs of the script.

> We'd only run it once (or as needed, manually) and let it run as long as needed. I don't think there is much point in making it a scheduled script.

I was thinking that because the script would take many hours to refresh all stored suggestions (optimistically allocating 3 seconds per suggestion, it would require about 20 hours to complete), there is some chance the script would be interrupted while running. I don't know much about long-running scripts on the maintenance servers, though, and whether a process can be expected to run uninterrupted for that long.

> I was thinking that because the script would take many hours to refresh all stored suggestions (optimistically allocating 3 seconds per suggestion, it would require about 20 hours to complete), there is some chance the script would be interrupted while running. I don't know much about long-running scripts on the maintenance servers, though, and whether a process can be expected to run uninterrupted for that long.

In theory long-running scripts are fine (for the big schema migrations, multiple weeks is not unusual), but of course it would stop if it hits a bug, ops sometimes depool a server, etc. I think the standard way to handle that is to periodically write the DB offset to the output, tee the output to a logfile when running, and have a parameter for continuing from a given offset. So if the script is stopped, only the last batch needs to be redone.
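
That pattern could look something like this (again only a sketch; table/column names and the actual revalidation step are placeholders):

```php
<?php
// Sketch of a resumable revalidation script: process rows in primary-key order, print the
// offset after every batch (tee the output to a logfile), and accept --from-page to continue.
require_once getenv( 'MW_INSTALL_PATH' ) . '/maintenance/Maintenance.php';

class RevalidateLinkRecommendations extends Maintenance {
	public function __construct() {
		parent::__construct();
		$this->addDescription( 'Revalidate stored link recommendations, resumable via --from-page' );
		$this->addOption( 'from-page', 'Continue from this gelr_page value', false, true );
		$this->setBatchSize( 200 );
	}

	public function execute() {
		$fromPage = (int)$this->getOption( 'from-page', 0 );
		$dbr = $this->getDB( DB_REPLICA );

		do {
			$res = $dbr->select(
				'growthexperiments_link_recommendations',
				[ 'gelr_page' ],
				[ 'gelr_page > ' . $dbr->addQuotes( $fromPage ) ],
				__METHOD__,
				[ 'ORDER BY' => 'gelr_page', 'LIMIT' => $this->getBatchSize() ]
			);
			foreach ( $res as $row ) {
				// Re-fetch / re-validate the recommendation for $row->gelr_page here.
				$fromPage = (int)$row->gelr_page;
			}
			// If the run is interrupted, restart with --from-page=<last printed value>;
			// only the last batch needs to be redone.
			$this->output( "Processed up to gelr_page=$fromPage\n" );
		} while ( $res->numRows() === $this->getBatchSize() );
	}
}

$maintClass = RevalidateLinkRecommendations::class;
require_once RUN_MAINTENANCE_IF_MAIN;
```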

Change 711242 merged by jenkins-bot:

[mediawiki/extensions/GrowthExperiments@master] Split out the link recommendation fetch/validate/store logic

https://gerrit.wikimedia.org/r/711242

Change 712830 had a related patch set uploaded (by Gergő Tisza; author: Gergő Tisza):

[mediawiki/extensions/GrowthExperiments@master] Add Link: store when tasks were generated

https://gerrit.wikimedia.org/r/712830

Change 711247 merged by jenkins-bot:

[mediawiki/extensions/GrowthExperiments@master] Add maintenance script to revalidate link recommendations

https://gerrit.wikimedia.org/r/711247

Change 712830 merged by jenkins-bot:

[mediawiki/extensions/GrowthExperiments@master] Add Link: store when tasks were generated

https://gerrit.wikimedia.org/r/712830

Change 714158 had a related patch set uploaded (by Gergő Tisza; author: Gergő Tisza):

[mediawiki/extensions/GrowthExperiments@wmf/1.37.0-wmf.19] Add Link: store when tasks were generated

https://gerrit.wikimedia.org/r/714158

Change 714158 merged by jenkins-bot:

[mediawiki/extensions/GrowthExperiments@wmf/1.37.0-wmf.19] Add Link: store when tasks were generated

https://gerrit.wikimedia.org/r/714158

Mentioned in SAL (#wikimedia-operations) [2021-08-23T18:56:12Z] <tgr@deploy1002> Synchronized php-1.37.0-wmf.19/extensions/GrowthExperiments: Backport: [[gerrit:714158|Add Link: store when tasks were generated (T284551)]] (duration: 00m 57s)

kostajh added a subscriber: Etonkovidova.

Then I think we can mark this task as resolved (cc @Etonkovidova)