
growthexperiments-refreshlinkrecommendations periodic jobs error out on search too busy
Open, High · Public · PRODUCTION ERROR

Description

Error trace
RuntimeException from line 328 of /srv/mediawiki/php-1.45.0-wmf.2/extensions/GrowthExperiments/maintenance/refreshLinkRecommendations.php: Search error: Search is currently too busy. Please try again later.
#0 /srv/mediawiki/php-1.45.0-wmf.2/extensions/GrowthExperiments/maintenance/refreshLinkRecommendations.php(435): GrowthExperiments\Maintenance\RefreshLinkRecommendations->findArticlesInTopic('visual-arts')
#1 /srv/mediawiki/php-1.45.0-wmf.2/extensions/GrowthExperiments/maintenance/refreshLinkRecommendations.php(148): GrowthExperiments\Maintenance\RefreshLinkRecommendations->refreshViaOresTopics(false)
#2 /srv/mediawiki/php-1.45.0-wmf.2/maintenance/includes/MaintenanceRunner.php(691): GrowthExperiments\Maintenance\RefreshLinkRecommendations->execute()
#3 /srv/mediawiki/php-1.45.0-wmf.2/maintenance/run.php(51): MediaWiki\Maintenance\MaintenanceRunner->run()
#4 /srv/mediawiki/multiversion/MWScript.php(221): require_once('/srv/mediawiki/...')
#5 {main}
Impact

Since we exit the foreachwiki loop on error, the run stops there. Subsequent runs usually succeed, but the failures create alert noise.

Notes

We could use T395245 (add a flag to the mwscript wrapper to set +e when required) to continue running on error, at the risk of obscuring repeated failures. Another option would be for Growth-Team
to implement retries on the Search error RuntimeException.

Event Timeline

Clement_Goubert triaged this task as High priority.

@Clement_Goubert: Retries sound like the preferred solution if we can implement them in a sensible way. Could you point me to an example where they have been implemented particularly well?

I agree with prioritizing this as "High" because ideally we get some mitigation in place soon. Otherwise, we have a lot of alert-noise in our internal Growth engineering channel that distracts from actual errors and trains us in alert fatigue.

@Michael I'm not well versed enough in MediaWiki code to know where retries are implemented, but I imagine the general pattern would be a loop with a try-catch for the search exception, exponential backoff, and a cap on the number of retries.

> @Clement_Goubert: Retries sound like the preferred solution if we can implement them in a sensible way. Could you point me to an example where they have been implemented particularly well?

A good example of what retry logic can look like is https://doc.wikimedia.org/wmflib/master/api/wmflib.decorators.html#wmflib.decorators.retry, which is used as the basis for our automation.
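The pattern described above (a retry loop with a try-catch for the search exception, exponential backoff, and a cap on the number of attempts) could be sketched in Python roughly like this. This is only an illustration of the shape of such a decorator, not the actual wmflib or GrowthExperiments code; the exception class, function names, and parameters are all hypothetical.

```python
import functools
import time


class SearchTooBusyError(RuntimeError):
    """Hypothetical stand-in for the 'Search is currently too busy' error."""


def retry(attempts=3, base_delay=1.0, backoff=2.0,
          exceptions=(SearchTooBusyError,)):
    """Retry the wrapped callable on the given exceptions, sleeping with
    exponential backoff between attempts and re-raising after the last one."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            delay = base_delay
            for attempt in range(1, attempts + 1):
                try:
                    return func(*args, **kwargs)
                except exceptions:
                    if attempt == attempts:
                        raise  # out of attempts: surface the failure
                    time.sleep(delay)
                    delay *= backoff
        return wrapper
    return decorator


# Usage sketch: a function that fails twice, then succeeds on the third try.
calls = []

@retry(attempts=3, base_delay=0.01)
def find_articles_in_topic(topic):
    calls.append(topic)
    if len(calls) < 3:
        raise SearchTooBusyError("Search is currently too busy.")
    return ["some article"]
```

With a capped attempt count, a persistent Search outage still fails the run (and alerts), while transient "too busy" responses are absorbed.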

Change #1150711 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/puppet@production] mw::maintenance::growthexperiment: Ignore foreachwiki errors

https://gerrit.wikimedia.org/r/1150711

Change #1150711 merged by Clément Goubert:

[operations/puppet@production] mw::maintenance::growthexperiment: Ignore foreachwiki errors

https://gerrit.wikimedia.org/r/1150711

The loop for this job now ignores errors and continues, as it was in the old wrapper.

I have deleted the failed jobs to clear alerting.