fixLinkRecommendationData.php in GrowthExperiments finds all pages with a certain search index entry and checks if the DB is consistent with the search index. Currently this breaks because search result offsets are capped at 10K. We need to work around that in some way. (We used the quick hack of reversing search results sorting to get up to 20K but that still won't be enough in the long term.) Either find a way to iterate without using offsets (in theory easy but would require exposing more ElasticSearch options via CirrusSearch) or segment searches by articletopic (inelegant and not great for performance but easy to do).
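The "iterate without using offsets" option can be illustrated with cursor-style paging, where each request asks for results sorted after the last ID already seen instead of passing an offset. The sketch below is only a toy model over an in-memory list: `search_page` and `iterate_all` are hypothetical names, not CirrusSearch or GrowthExperiments APIs. Elasticsearch offers this kind of paging (for example `search_after`), but how CirrusSearch would expose it is not covered here.

```python
import bisect
from typing import Iterator, List, Optional


def search_page(index: List[int], after: Optional[int], size: int) -> List[int]:
    """Hypothetical stand-in for a search backend.

    Returns up to `size` page IDs strictly greater than `after`, in
    ascending order. `index` must already be sorted.
    """
    start = 0 if after is None else bisect.bisect_right(index, after)
    return index[start:start + size]


def iterate_all(index: List[int], page_size: int = 500) -> Iterator[int]:
    """Yield every page ID using a cursor instead of an offset."""
    after: Optional[int] = None
    while True:
        page = search_page(index, after, page_size)
        if not page:
            return
        yield from page
        # The cursor is the last ID seen, so no request needs a large offset.
        after = page[-1]
```

Because every request starts from a cursor rather than an offset, the number of results that can be walked is not bounded by an offset cap.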
Related Objects
- Mentioned In
  - T289550: Add Link: Set up cronjob for collecting statsd metrics about dangling search index entries
  - T283606: Add a link: too many articles have no suggestions upon arrival
- Mentioned Here
  - rEGRE0bd65426494d: fixLinkRecommendationData: stay under 10K search limit
  - rEGREaba24866d300: fixLinkRecommendationData: allow random sampling
Event Timeline
Change 712847 had a related patch set uploaded (by Gergő Tisza; author: Gergő Tisza):
[mediawiki/extensions/GrowthExperiments@master] Add --reverse option to fixLinkRecommendationData.php
Change 712847 merged by jenkins-bot:
[mediawiki/extensions/GrowthExperiments@master] Add --reverse option to fixLinkRecommendationData.php
The random flag added in rEGREaba24866d300 (fixLinkRecommendationData: allow random sampling) sort of helps with this too. @Tgr is there anything else we want to do here, or should we resolve the task?
The ideal fix would be to split the list of ORES topics into chunks of length floor(10000 / minimumTasksPerTopic), and run the search once for each of those chunks. It will slow things down a bit since most tasks are in several topics, but this isn't a performance-sensitive task anyway.
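A minimal sketch of that chunking, assuming each topic holds roughly `minimumTasksPerTopic` tasks so that a chunk of topics stays under the 10K cap. The function name and the topic names are made up for illustration and are not taken from the GrowthExperiments code.

```python
from typing import List

SEARCH_RESULT_CAP = 10_000  # the search result limit discussed in this task


def chunk_topics(
    topics: List[str],
    minimum_tasks_per_topic: int,
    cap: int = SEARCH_RESULT_CAP,
) -> List[List[str]]:
    """Split `topics` into chunks of floor(cap / minimum_tasks_per_topic) topics."""
    chunk_size = cap // minimum_tasks_per_topic
    if chunk_size < 1:
        raise ValueError("minimum_tasks_per_topic exceeds the result cap")
    return [topics[i:i + chunk_size] for i in range(0, len(topics), chunk_size)]


# Hypothetical topic list; with 500 tasks per topic each chunk holds 20 topics.
example_topics = [f"topic-{n}" for n in range(45)]
example_chunks = chunk_topics(example_topics, minimum_tasks_per_topic=500)
```

One search would then be run per chunk, with the chunk's topics combined into a single query.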
Change 714450 had a related patch set uploaded (by Gergő Tisza; author: Gergő Tisza):
[mediawiki/extensions/GrowthExperiments@master] fixLinkRecommendationData: stay under 10K search limit
Change 714450 merged by jenkins-bot:
[mediawiki/extensions/GrowthExperiments@master] fixLinkRecommendationData: stay under 10K search limit
Change 715825 had a related patch set uploaded (by Gergő Tisza; author: Gergő Tisza):
[mediawiki/extensions/GrowthExperiments@wmf/1.37.0-wmf.20] fixLinkRecommendationData: stay under 10K search limit
Change 715825 merged by jenkins-bot:
[mediawiki/extensions/GrowthExperiments@wmf/1.37.0-wmf.20] fixLinkRecommendationData: stay under 10K search limit
Mentioned in SAL (#wikimedia-operations) [2021-09-01T23:27:21Z] <urbanecm@deploy1002> Synchronized php-1.37.0-wmf.20/extensions/GrowthExperiments/maintenance/fixLinkRecommendationData.php: 0bd65426494d4df981141650211e27e17c98ee0c: fixLinkRecommendationData: stay under 10K search limit (T284531) (duration: 01m 06s)
Unfortunately this doesn't work reliably because articles can be in more than one topic, so even after the quota of 500 is met for a topic, the script will add further articles in that topic while filling up the other topics. E.g. biography (a frequent topic that also comes early in the alphabet) now has over 7K tasks in huwiki, even though the task total over all topics is only 23K.
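The overshoot can be reproduced with a toy model. The greedy filling loop below is an assumption made for illustration, not the extension's actual logic, and the data is invented; it only shows how a frequent, early topic keeps growing while later topics are being filled.

```python
from collections import Counter
from typing import Iterable, List, Set, Tuple


def fill_quotas(
    candidates: List[Tuple[int, Set[str]]],
    topics: Iterable[str],
    quota: int,
) -> Counter:
    """Fill each topic up to `quota`; every added page counts for all its topics."""
    counts: Counter = Counter()
    selected: Set[int] = set()
    for topic in topics:
        for page_id, page_topics in candidates:
            if counts[topic] >= quota:
                break  # this topic is full, move on to the next one
            if topic in page_topics and page_id not in selected:
                selected.add(page_id)
                counts.update(page_topics)
    return counts


# Invented data: every page is a biography, some also have a second topic.
toy_candidates = [
    (1, {"biography"}),
    (2, {"biography"}),
    (3, {"biography", "chemistry"}),
    (4, {"biography", "chemistry"}),
    (5, {"biography", "physics"}),
    (6, {"biography", "physics"}),
]
toy_counts = fill_quotas(toy_candidates, ["biography", "chemistry", "physics"], quota=2)
```

With a quota of 2, `toy_counts["biography"]` ends up at 6, because the pages added to fill chemistry and physics are biographies as well.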
I guess we'll just have to search each topic, one by one. Interestingly, this seems to make the script faster in production - maybe the CirrusSearch queries end up more efficient? Or there's less paging (which is relatively inefficient)?
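Searching topic by topic means the same article can come back several times, so the results need deduplicating. A rough sketch under that assumption follows; `search_topic` is a hypothetical in-memory stand-in for a single-topic search, not a CirrusSearch call.

```python
from typing import Dict, Iterable, Iterator, List, Set

SEARCH_RESULT_CAP = 10_000  # the search result limit discussed in this task


def search_topic(index: Dict[int, Set[str]], topic: str, cap: int) -> List[int]:
    """Hypothetical single-topic search over an in-memory index of page topics."""
    hits = sorted(page_id for page_id, topics in index.items() if topic in topics)
    if len(hits) > cap:
        raise RuntimeError(f"topic {topic!r} alone exceeds the result cap")
    return hits


def pages_across_topics(
    index: Dict[int, Set[str]],
    topics: Iterable[str],
    cap: int = SEARCH_RESULT_CAP,
) -> Iterator[int]:
    """Yield each page ID once, searching one topic at a time."""
    seen: Set[int] = set()
    for topic in topics:
        for page_id in search_topic(index, topic, cap):
            if page_id not in seen:  # skip pages already found via another topic
                seen.add(page_id)
                yield page_id
```

This still fails if a single topic grows past the cap, which the sketch surfaces as an error rather than silently truncating.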
Change 716083 had a related patch set uploaded (by Gergő Tisza; author: Gergő Tisza):
[mediawiki/extensions/GrowthExperiments@master] fixLinkRecommendationData: Try harder to avoid >10K result sets
Change 716083 merged by jenkins-bot:
[mediawiki/extensions/GrowthExperiments@master] fixLinkRecommendationData: Try harder to avoid >10K result sets
Change 716491 had a related patch set uploaded (by Gergő Tisza; author: Gergő Tisza):
[mediawiki/extensions/GrowthExperiments@wmf/1.37.0-wmf.21] fixLinkRecommendationData: Try harder to avoid >10K result sets
Change 716491 merged by jenkins-bot:
[mediawiki/extensions/GrowthExperiments@wmf/1.37.0-wmf.21] fixLinkRecommendationData: Try harder to avoid >10K result sets
Mentioned in SAL (#wikimedia-operations) [2021-09-03T00:31:09Z] <tgr@deploy1002> Synchronized php-1.37.0-wmf.21/extensions/GrowthExperiments/maintenance/fixLinkRecommendationData.php: Backport: [[gerrit:716491|fixLinkRecommendationData: Try harder to avoid >10K result sets (T284531)]] (duration: 00m 58s)
All patches merged; moving to Watching as the cronjob using this has just been set up.
I was monitoring this on several wikis (both those where GrowthExperiments was newly deployed and older ones). All seems to be in place - closing as Resolved.