Add Link: Work around 10K search result set limit in fixLinkRecommendationData.php
Closed, ResolvedPublic

Description

fixLinkRecommendationData.php in GrowthExperiments finds all pages with a certain search index entry and checks whether the DB is consistent with the search index. This currently breaks because search result offsets are capped at 10K, so we need a workaround. (We used the quick hack of reversing the search result sort order to reach up to 20K results, but that won't be enough in the long term.) The options are to either find a way to iterate without using offsets (easy in theory, but it would require exposing more ElasticSearch options via CirrusSearch) or segment the searches by articletopic (inelegant and not great for performance, but easy to do).
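To illustrate why the sort-reversal hack tops out at 20K, here is a minimal Python sketch (not the script's actual PHP code). The `search(offset, limit, reverse)` callable is a hypothetical stand-in for a paged search, and the page size is an arbitrary assumption; only the 10K offset cap comes from this task.

```python
from typing import Callable, List

OFFSET_CAP = 10_000  # offset cap described in this task


def fetch_with_reverse_hack(
    search: Callable[[int, int, bool], List[int]],
    offset_cap: int = OFFSET_CAP,
    page_size: int = 500,
) -> List[int]:
    """Page through results in forward order, then in reversed order.

    `search` is a hypothetical callable returning up to `limit` page IDs
    starting at `offset`, in ascending or (reverse=True) descending order.
    """
    seen = {}  # dict keeps insertion order and drops duplicates
    for reverse in (False, True):
        for offset in range(0, offset_cap, page_size):
            batch = search(offset, page_size, reverse)
            for page_id in batch:
                seen.setdefault(page_id, None)
            if len(batch) < page_size:
                break  # ran out of results before reaching the cap
    return list(seen)
```

Each pass can see at most 10K results, so the two passes together cover the full set only while it holds 20K pages or fewer; anything in the middle of a larger set is never reached.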

Event Timeline

Restricted Application added a subscriber: Aklapper.
kostajh triaged this task as Medium priority. Jun 10 2021, 8:06 AM

Change 712847 had a related patch set uploaded (by Gergő Tisza; author: Gergő Tisza):

[mediawiki/extensions/GrowthExperiments@master] Add --reverse option to fixLinkRecommendationData.php

https://gerrit.wikimedia.org/r/712847

Change 712847 merged by jenkins-bot:

[mediawiki/extensions/GrowthExperiments@master] Add --reverse option to fixLinkRecommendationData.php

https://gerrit.wikimedia.org/r/712847

The random flag added in rEGREaba24866d300 (fixLinkRecommendationData: allow random sampling) sort of helps with this too. @Tgr, is there anything else we want to do here, or should we resolve the task?

The ideal fix would be to split the list of ORES topics into chunks of length floor(10000 / minimumTasksPerTopic), and run the search once for each of those chunks. It will slow things down a bit since most tasks are in several topics, but this isn't a performance-sensitive task anyway.
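The chunking arithmetic can be sketched as follows, in Python rather than the script's PHP. The function and parameter names are hypothetical. The bound relies on the assumption that no topic holds more than `min_tasks_per_topic` tasks.

```python
from typing import List, Sequence

RESULT_LIMIT = 10_000  # search result cap described in this task


def chunk_topics(
    topics: Sequence[str],
    min_tasks_per_topic: int,
    result_limit: int = RESULT_LIMIT,
) -> List[List[str]]:
    """Split topics into chunks of floor(result_limit / min_tasks_per_topic)."""
    # Never go below one topic per chunk, even for very large quotas.
    chunk_size = max(1, result_limit // min_tasks_per_topic)
    return [
        list(topics[i:i + chunk_size])
        for i in range(0, len(topics), chunk_size)
    ]
```

For example, with a per-topic quota of 500, each chunk holds 20 topics, so a search over one chunk should return at most 20 * 500 = 10,000 results under that assumption.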

Change 714450 had a related patch set uploaded (by Gergő Tisza; author: Gergő Tisza):

[mediawiki/extensions/GrowthExperiments@master] fixLinkRecommendationData: stay under 10K search limit

https://gerrit.wikimedia.org/r/714450

kostajh claimed this task.
kostajh moved this task from Backlog to Done / QA on the Add-Link board.
kostajh added a subscriber: Etonkovidova.

I've +2'ed the patch & am resolving this task (cc @Etonkovidova). Thanks!

Change 714450 merged by jenkins-bot:

[mediawiki/extensions/GrowthExperiments@master] fixLinkRecommendationData: stay under 10K search limit

https://gerrit.wikimedia.org/r/714450

Change 715825 had a related patch set uploaded (by Gergő Tisza; author: Gergő Tisza):

[mediawiki/extensions/GrowthExperiments@wmf/1.37.0-wmf.20] fixLinkRecommendationData: stay under 10K search limit

https://gerrit.wikimedia.org/r/715825

Change 715825 merged by jenkins-bot:

[mediawiki/extensions/GrowthExperiments@wmf/1.37.0-wmf.20] fixLinkRecommendationData: stay under 10K search limit

https://gerrit.wikimedia.org/r/715825

Mentioned in SAL (#wikimedia-operations) [2021-09-01T23:27:21Z] <urbanecm@deploy1002> Synchronized php-1.37.0-wmf.20/extensions/GrowthExperiments/maintenance/fixLinkRecommendationData.php: 0bd65426494d4df981141650211e27e17c98ee0c: fixLinkRecommendationData: stay under 10K search limit (T284531) (duration: 01m 06s)

> The ideal fix would be to split the list of ORES topics into floor(10000 / minimumTasksPerTopic) length chunks, and run the search once for each of those chunks. It will slow things down a bit since most tasks are in several topics, but this isn't a performance-sensitive task anyway.

Unfortunately this doesn't reliably work because articles can be in more than one topic, so even after the 500 quota is met for a topic, the script adds further articles to it while working on filling up the other topics. E.g. biography (a frequent topic, and at the beginning of the alphabet) now has over 7K tasks in huwiki, even though the task total over all topics is only 23K.

I guess we'll just have to search each topic, one by one. Interestingly, this seems to make the script faster in production - maybe the CirrusSearch queries end up more efficient? Or there's less paging (which is relatively inefficient)?
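A minimal Python sketch of the one-topic-at-a-time approach (again, not the script's actual PHP code). The `search_topic` callable is a hypothetical stand-in for a per-topic search. Since an article can appear under several topics, the sketch deduplicates page IDs and flags any topic whose result set reaches the cap.

```python
from typing import Callable, Iterable, List, Sequence, Tuple

RESULT_LIMIT = 10_000  # search result cap described in this task


def collect_pages_per_topic(
    topics: Sequence[str],
    search_topic: Callable[[str], Iterable[int]],
    result_limit: int = RESULT_LIMIT,
) -> Tuple[List[int], List[str]]:
    """Search each topic separately and merge the results.

    Returns the unique page IDs in first-seen order, plus the topics whose
    result count reached the limit (and so may be incomplete).
    """
    seen = set()
    pages: List[int] = []
    truncated: List[str] = []
    for topic in topics:
        page_ids = list(search_topic(topic))
        if len(page_ids) >= result_limit:
            truncated.append(topic)
        for page_id in page_ids:
            if page_id not in seen:  # skip pages already found via another topic
                seen.add(page_id)
                pages.append(page_id)
    return pages, truncated
```

This only stays under the cap while every individual topic has fewer than 10K matching pages, which is why the sketch reports topics that hit the limit instead of silently ignoring them.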

Change 716083 had a related patch set uploaded (by Gergő Tisza; author: Gergő Tisza):

[mediawiki/extensions/GrowthExperiments@master] fixLinkRecommendationData: Try harder to avoid >10K result sets

https://gerrit.wikimedia.org/r/716083

Change 716083 merged by jenkins-bot:

[mediawiki/extensions/GrowthExperiments@master] fixLinkRecommendationData: Try harder to avoid >10K result sets

https://gerrit.wikimedia.org/r/716083

Change 716491 had a related patch set uploaded (by Gergő Tisza; author: Gergő Tisza):

[mediawiki/extensions/GrowthExperiments@wmf/1.37.0-wmf.21] fixLinkRecommendationData: Try harder to avoid >10K result sets

https://gerrit.wikimedia.org/r/716491

Change 716491 merged by jenkins-bot:

[mediawiki/extensions/GrowthExperiments@wmf/1.37.0-wmf.21] fixLinkRecommendationData: Try harder to avoid >10K result sets

https://gerrit.wikimedia.org/r/716491

Mentioned in SAL (#wikimedia-operations) [2021-09-03T00:31:09Z] <tgr@deploy1002> Synchronized php-1.37.0-wmf.21/extensions/GrowthExperiments/maintenance/fixLinkRecommendationData.php: Backport: [[gerrit:716491|fixLinkRecommendationData: Try harder to avoid >10K result sets (T284531)]] (duration: 00m 58s)

All patches merged; moving to Watching as the cronjob using this has just been set up.

I monitored this on several wikis (both ones where GrowthExperiments was newly deployed and older ones). Everything seems to be in place, so I'm closing this as Resolved.