fixLinkRecommendationData.php in GrowthExperiments finds all pages with a certain search index entry and checks whether the DB is consistent with the search index. Currently this breaks because search result offsets are capped at 10K. We need to work around that in some way. (We used the quick hack of reversing the search result sort order to get up to 20K, but that still won't be enough in the long term.) Either find a way to iterate without using offsets (in theory easy, but it would require exposing more Elasticsearch options via CirrusSearch) or segment searches by articletopic (inelegant and not great for performance, but easy to do).
- Mentioned In
- T289550: Add Link: Set up cronjob for collecting statsd metrics about dangling search index entries
- T283606: Add a link: too many articles have no suggestions upon arrival
- Mentioned Here
- rEGRE0bd65426494d: fixLinkRecommendationData: stay under 10K search limit
- rEGREaba24866d300: fixLinkRecommendationData: allow random sampling
The random flag added in rEGREaba24866d300 (fixLinkRecommendationData: allow random sampling) sort of helps with this too. @Tgr is there anything else we want to do here, or should we resolve the task?
The ideal fix would be to split the list of ORES topics into chunks of floor(10000 / minimumTasksPerTopic) topics, and run the search once for each of those chunks (see the sketch below). It will slow things down a bit since most tasks are in several topics, but this isn't a performance-sensitive task anyway.
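For illustration, a minimal sketch of that chunking, with hypothetical variable names rather than the actual maintenance script code; it assumes the `articletopic:` keyword accepts pipe-separated values for OR matching:

```lang=php
// Hypothetical sketch only: chunk the ORES topic list so that a single
// combined search can't return more than ~10K results, assuming each topic
// contributes at most $minimumTasksPerTopic tasks we need to see.
$allTopics = [ 'biography', 'physics', 'history', 'media' /* ... all ORES topics ... */ ];
$minimumTasksPerTopic = 500;
$searchLimit = 10000;

$chunkSize = max( 1, (int)floor( $searchLimit / $minimumTasksPerTopic ) );
foreach ( array_chunk( $allTopics, $chunkSize ) as $topicChunk ) {
	// One search per chunk, e.g. "articletopic:biography|physics|...",
	// plus whatever filter selects pages with link recommendation index entries.
	$query = 'articletopic:' . implode( '|', $topicChunk );
	// ...run the CirrusSearch query and verify DB consistency for each result...
}
```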
Mentioned in SAL (#wikimedia-operations) [2021-09-01T23:27:21Z] <urbanecm@deploy1002> Synchronized php-1.37.0-wmf.20/extensions/GrowthExperiments/maintenance/fixLinkRecommendationData.php: 0bd65426494d4df981141650211e27e17c98ee0c: fixLinkRecommendationData: stay under 10K search limit (T284531) (duration: 01m 06s)
Unfortunately this doesn't work reliably because articles can be in more than one topic, so even after the per-topic quota of 500 is met, the script keeps adding further articles while it works on filling up the other topics. E.g. biography (a frequent topic, and near the beginning of the alphabet) now has over 7K tasks on huwiki, even though the total task count across all topics is only 23K.
I guess we'll just have to search each topic, one by one. Interestingly, this seems to make the script faster in production - maybe the CirrusSearch queries end up more efficient? Or there's less paging (which is relatively inefficient)?
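A rough sketch of that per-topic loop (again with hypothetical helper names, not the script's real code), including deduplication by page ID since an article can appear under several topics:

```lang=php
// Hypothetical sketch only: one search per ORES topic, so no single result set
// can exceed the 10K offset cap; deduplicate by page ID because an article can
// belong to several topics.
$allTopics = [ 'biography', 'physics', 'history', 'media' /* ... all ORES topics ... */ ];

/** Stand-in for the script's real CirrusSearch query, restricted to one topic. */
function searchLinkRecommendationPages( string $query ): array {
	// Would run the search and return matching page IDs; stubbed here.
	return [];
}

$seenPageIds = [];
foreach ( $allTopics as $topic ) {
	foreach ( searchLinkRecommendationPages( "articletopic:$topic" ) as $pageId ) {
		if ( isset( $seenPageIds[$pageId] ) ) {
			continue; // already checked via another topic
		}
		$seenPageIds[$pageId] = true;
		// ...check that the DB row matches the search index entry, and fix it if not...
	}
}
```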
Mentioned in SAL (#wikimedia-operations) [2021-09-03T00:31:09Z] <tgr@deploy1002> Synchronized php-1.37.0-wmf.21/extensions/GrowthExperiments/maintenance/fixLinkRecommendationData.php: Backport: [[gerrit:716491|fixLinkRecommendationData: Try harder to avoid >10K result sets (T284531)]] (duration: 00m 58s)