Add Link: Fix production discrepancies between the link recommendation table and the search index
Closed, Resolved (Public)

Description

As per T261407#7088136, a significant number of task creation events did not get processed (hopefully a one-time problem). When a task is stored in the DB, that disqualifies it from being generated again - the assumption is that there is an EventGate event somewhere in the pipeline. If we emit another event, there might be a few hours between the first event arriving (and thus the task becoming available) and the second event arriving; if the task gets done or invalidated during that time, the second event will bring the index and the DB out of sync.

That means all articles for which the EventGate event got lost are permanently disqualified from becoming tasks. (Or semi-permanently - an edit to the article will clear the DB record.) On wikis where the number of valid task candidates is not that huge, this can become a problem. There should be some way to fix such pages, either automatically in refreshLinkRecommendations.php or manually in some maintenance script. (fixLinkRecommendationData.php would do it, but it's barred from running in production.)
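
For a rough sense of the scale on a given wiki, one could compare the number of stored recommendations with the number of tasks the search index actually serves. A minimal sketch, assuming sql.php's --query and --cluster options are available here, and keeping in mind that some lag between the two numbers is normal:

# Tasks the (search-backed) task suggester currently returns:
mwscript extensions/GrowthExperiments/maintenance/listTaskCounts.php --wiki=testwiki --tasktype=link-recommendation --output json | jq '.taskTypeCounts."link-recommendation"'
# Rows stored on the extension1 cluster; a count well above the search number
# suggests dangling rows:
mwscript sql.php --wiki=testwiki --cluster=extension1 --query 'SELECT COUNT(*) FROM growthexperiments_link_recommendations'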

Event Timeline

So basically we need to find the DB entries which do not match the search index, and either add them to the search index or delete them from the DB. The latter is already implemented in fixLinkRecommendationData.php but disabled in production (since some level of discrepancy between the DB and index is always expected, as DB writes are immediate and index writes take effect in a few hours, so trying to "fix" those would actually introduce a permanent discrepancy for those pages). The former is a lot faster (which might or might not be important, depending on how often this happens).
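
One way to check an individual suspect page is to look at its indexed document via CirrusSearch's cirrusdoc API prop. A sketch only - the page ID is a placeholder, and the weighted_tags field / recommendation.link tag names are assumptions about how the index marks these pages:

# Dump the CirrusSearch document for a page and pull out any link-recommendation
# weighted tags; a DB row whose page has no such tag would be a dangling entry.
curl -s 'https://test.wikipedia.org/w/api.php?action=query&prop=cirrusdoc&pageids=12345&format=json&formatversion=2' | jq '[.query.pages[0].cirrusdoc[0].source.weighted_tags // [] | .[] | select(startswith("recommendation.link"))]'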

kostajh triaged this task as Medium priority. (May 19 2021, 8:09 AM)
MMiller_WMF raised the priority of this task from Medium to Needs Triage. (Jun 7 2021, 5:27 AM)
MMiller_WMF triaged this task as Medium priority.
MMiller_WMF raised the priority of this task from Medium to High. (Jun 7 2021, 5:11 PM)
kostajh added a subscriber: kostajh.

@Tgr tentatively assigning to you.

kostajh renamed this task from "Add Link: Fix production discrepancies between the link recommmendation table and the search index" to "Add Link: Fix production discrepancies between the link recommendation table and the search index". (Jun 21 2021, 9:09 AM)

This can probably be done alongside T284551: Maintenance script for updating recommendations to newer dataset - just use the recommendation date instead of the version as the criterion for refreshing. Not the most direct or efficient way to handle it, but as a one-off with low user impact, this issue is not worth a dedicated script.

Turns out we don't have dates for recommendations. That should probably be fixed, but the dataset has changed since this error happened, so running the update script for T284551: Maintenance script for updating recommendations to newer dataset will take care of this issue as well.

Just noting that adding dates happened in rEGRE990022df9301: Add Link: store when tasks were generated

So, the next step is to run the maintenance script.

On testwiki:

tgr@stat1008:~$ cat /srv/published/datasets/one-off/research-mwaddlink/*/*.linkmodel.json.checksum | cut -d' ' -f1 > allowedDatasetChecksums.txt
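
A quick sanity check of the checksum list before pointing the script at it (just illustrative - each line should be a single 64-character hex digest, so the second command should print 0):

wc -l allowedDatasetChecksums.txt
grep -cvE '^[0-9a-f]{64}$' allowedDatasetChecksums.txt
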
tgr@mwmaint2002:~$ mwscript extensions/GrowthExperiments/maintenance/listTaskCounts.php --wiki=testwiki --tasktype=link-recommendation --output json | jq '.taskTypeCounts."link-recommendation"'
632

tgr@mwmaint2002:~$ mwscript extensions/GrowthExperiments/maintenance/revalidateLinkRecommendations.php --wiki=testwiki --exceptDatasetChecksums=allowedDatasetChecksums.txt --dry-run
Revalidating link recommendations:
  fetching task batch starting with page 0
  fetching task batch starting with page 114473
Done; replaced 580, discarded 0
wikiadmin@10.192.48.134(testwiki)> select count(*) from growthexperiments_link_recommendations;
+----------+
| count(*) |
+----------+
|      664 |
+----------+
1 row in set (0.00 sec)

wikiadmin@10.192.48.134(testwiki)> select count(*) from growthexperiments_link_recommendations where gelr_data not like '%4160fb0cacbe4086a53bbf667ac9c1c961657f7b20c1b209bd4752c20653ec5b%';
+----------+
| count(*) |
+----------+
|      584 |
+----------+
1 row in set (0.00 sec)

tgr@mwmaint2002:~$ mwscript extensions/GrowthExperiments/maintenance/revalidateLinkRecommendations.php --wiki=testwiki --exceptDatasetChecksums=allowedDatasetChecksums.txt
Revalidating link recommendations:
  fetching task batch starting with page 0
  fetching task batch starting with page 114473
Done; replaced 527, discarded 53
wikiadmin@10.192.32.134(testwiki)> select count(*) from growthexperiments_link_recommendations where gelr_data like '%4160fb0cacbe4086a53bbf667ac9c1c961657f7b20c1b209bd4752c20653ec5b%';
+----------+
| count(*) |
+----------+
|      595 |
+----------+
1 row in set (0.01 sec)

wikiadmin@10.192.32.134(testwiki)> select count(*) from growthexperiments_link_recommendations where gelr_data like '%timestamp%';
+----------+
| count(*) |
+----------+
|      525 |
+----------+
1 row in set (0.01 sec)

Looks good.

Command to run:

for WIKI in `expanddblist /srv/mediawiki-staging/dblists/growthexperiments.dblist`; do mwscript extensions/GrowthExperiments/maintenance/revalidateLinkRecommendations.php --wiki=$WIKI --exceptDatasetChecksums=allowedDatasetChecksums.txt; done

Probably takes a few hours, maybe a few days. Writes to growthexperiments_link_recommendations on extension1. Light on the DB, most of the time is spent waiting for HTTP responses. Writes secondary data, can be interrupted without risk.
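
Not strictly necessary, but one way to run this so it survives a dropped SSH connection and progress can be inspected later - a sketch, with the screen/tmux session and the log file name being purely illustrative:

# Run inside a screen/tmux session on mwmaint, logging all output:
for WIKI in `expanddblist /srv/mediawiki-staging/dblists/growthexperiments.dblist`; do
  echo "=== $WIKI $(date -u +%FT%TZ) ==="
  mwscript extensions/GrowthExperiments/maintenance/revalidateLinkRecommendations.php --wiki=$WIKI --exceptDatasetChecksums=allowedDatasetChecksums.txt
done 2>&1 | tee -a ~/revalidate-linkrecommendations.log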

Deploy window scheduled for 2PM PDT on Tuesday.

👍 looks good to me, thanks!

Mentioned in SAL (#wikimedia-operations) [2021-08-24T21:10:23Z] <tgr> running extensions/GrowthExperiments/maintenance/revalidateLinkRecommendations.php on various wikis per T282873#7303828

At 40% of arwiki (3M pages processed out of 7.5M) the task pool dropped by about 20% (14K tasks left out of 17K). Not great but survivable.
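
(The current task pool size can be spot-checked at any point with the same listTaskCounts.php invocation used on testwiki above, e.g.:)

mwscript extensions/GrowthExperiments/maintenance/listTaskCounts.php --wiki=arwiki --tasktype=link-recommendation --output json | jq '.taskTypeCounts."link-recommendation"'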

This means a speed of 2K tasks per hour; for our 11 Add Link wikis with ~20K tasks each, that's an estimated 110 hours (about five days). More than I expected but not tragic.

Both estimates assume that 40% of pages correspond to 40% of tasks (i.e. tasks are randomly distributed across page IDs) - I imagine content page IDs are sparser among recent pages, and older pages are less likely to be edited and thus more likely to stay tasks for a long time, but I doubt either of those effects is large. They also assume that most tasks need to be updated - the Grafana charts imply a few hundred tasks are naturally rotated per day, so maybe that's not true, but if so, all the better.

So I think it's fine to leave this running without oversight for a while. In any case the worst that can happen is that arwiki won't have any link recommendation tasks by the morning - that would be an annoyance for analytics but still not tragic I think.

I guess I'm overcomplicating this - the script uses batches of 500 so I just need to count the batches -> we have processed 12K tasks or so. That's even better, 70% done.
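
If the script output is being captured to a log (as in the hypothetical tee sketch above), counting the batches is a one-liner:

# Each "fetching task batch" line in the output is one batch of up to 500 tasks,
# so processed tasks ≈ number of matching lines × 500.
grep -c 'fetching task batch' ~/revalidate-linkrecommendations.log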

Come to think of it, the script with the --all option is a bit broken, as it will happily re-process the tasks it has just generated. That shouldn't be an issue now, though.

> I guess I'm overcomplicating this - the script uses batches of 500 so I just need to count the batches -> we have processed 12K tasks or so. That's even better, 70% done.

...except of course there are more than 17K database rows - that's the very problem this task is trying to solve.

Anyway, arwiki is finished now. Processed about 22K rows. The script's report is "replaced 10715, discarded 3419", although Grafana shows a 6K drop - I guess that's just the search index pipeline taking a few hours? In any case, the wiki is done and still has 11K tasks left, so it worked out fine, and there is no reason to think other wikis would perform worse (more recently added wikis should have fewer of their tasks affected).

> This means a speed of 2K tasks per hour; for our 11 Add Link wikis with ~20K tasks each, that's an estimated 110 hours (about five days). More than I expected but not tragic.

Took a lot less than that (finished some time this morning) because only the pilot wikis had outdated tasks.

arwiki: 11K tasks regenerated, 4K tasks discarded; bnwiki: 11K/3K; cswiki: 9K/1K; viwiki: 10K/1K. Zero for all other wikis.
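
(For reference, these per-wiki summaries can be recovered from a captured log, assuming the hypothetical log and wiki markers from the earlier sketch, e.g.:)

# Print each wiki marker followed by the script's final "Done; replaced X, discarded Y" line
grep -E '^(=== |Done; replaced)' ~/revalidate-linkrecommendations.log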

In terms of unblocking pages with dangling database entries so they can become tasks again, and thus increasing the number of tasks - which was one of the two goals of running the script: arwiki is up from 17K to 20K, bnwiki from 11K to 13K, cswiki from 17K to 19K, viwiki from 14K to 17K. So that went well (though I'm not sure whether having the lower counts ever had any kind of user impact).

All wikis showed the same pattern of the task count dipping for 4-5 hours and then increasing, so I guess that's the real-world latency of the search pipeline. bnwiki, which is the smallest, was at 6K tasks at its negative peak, which still seems decent (still dozens of tasks per topic on average). So it seems fine to run this unsupervised in the future if we need to refresh the recommendations again.

Re-checked for general regression on wmf.20 - looks good; the tasks mentioned in the comment are Resolved - resolving this one as well.