Background & Current state
The Growth team has several scheduled periodic jobs, which run via foreachwikiindblist growthexperiments. At times, those jobs fail with errors that are permanent or possibly wiki-specific. However, foreachwikiindblist in Kubernetes by default terminates the whole execution (including the remaining wikis) on the first error, even though that error might not affect the wikis that follow in the list.
A specific example is this failure of growthexperiments-deleteoldsurveys:
enwiki MediaWiki\Config\ConfigException from line 233 of /srv/mediawiki/php-1.45.0-wmf.7/includes/config/EtcdConfig.php: Failed to load configuration from etcd: (curl error: 28) Timeout was reached
enwiki #0 /srv/mediawiki/php-1.45.0-wmf.7/includes/config/EtcdConfig.php(146): MediaWiki\Config\EtcdConfig->load()
enwiki #1 /srv/mediawiki/wmf-config/CommonSettings.php(226): MediaWiki\Config\EtcdConfig->get('eqiad/dbconfig')
enwiki #2 /srv/mediawiki/php-1.45.0-wmf.7/includes/libs/rdbms/lbfactory/LBFactory.php(189): {closure}()
enwiki #3 /srv/mediawiki/php-1.45.0-wmf.7/maintenance/includes/Maintenance.php(1274): Wikimedia\Rdbms\LBFactory->autoReconfigure()
enwiki #4 /srv/mediawiki/php-1.45.0-wmf.7/maintenance/includes/Maintenance.php(1228): MediaWiki\Maintenance\Maintenance->waitForReplication()
enwiki #5 /srv/mediawiki/php-1.45.0-wmf.7/extensions/GrowthExperiments/maintenance/deleteOldSurveys.php(110): MediaWiki\Maintenance\Maintenance->commitTransaction(Object(Wikimedia\Rdbms\DBConnRef), 'GrowthExperimen...')
enwiki #6 /srv/mediawiki/php-1.45.0-wmf.7/maintenance/includes/MaintenanceRunner.php(691): GrowthExperiments\Maintenance\DeleteOldSurveys->execute()
enwiki #7 /srv/mediawiki/php-1.45.0-wmf.7/maintenance/run.php(51): MediaWiki\Maintenance\MaintenanceRunner->run()
enwiki #8 /srv/mediawiki/multiversion/MWScript.php(221): require_once('/srv/mediawiki/...')
enwiki #9 {main}
This is an etcd failure: etcd could not be reached while the script was running for enwiki. Restarting the job let it finish successfully, which indicates the failure was likely transient. Either way, the Growth team is unable to do much about this error, as maintaining etcd and ensuring its reachability is not our responsibility.
In addition, many exceptions happen only on one particular wiki, because of its specific configuration or the content of a specific wiki page. In those cases, it doesn't make much sense to stop the full job, since the rest of it might still finish successfully.
Problem
I see several problems with the situation described above:
- Alerts coming from periodic jobs are sometimes deliberately ignored (when the traceback suggests a transient error), which makes the alerts carry less weight.
- Halting the whole script on the first exception means a periodic job has a significantly lower chance of running for zhwiki than for aawiki (aawiki runs first, so it is unlikely that an exception is encountered before reaching it).
- The right action (whether to continue or halt) depends on the exception thrown, not on the script, so mechanisms like FOREACHWIKI_IGNORE_ERRORS are not that useful.
- The previous wrapper, foreachwikiindblist on bare metal, ignored exceptions thrown by the scripts and always executed the script on every wiki. The current behavior is therefore a change that MW-on-k8s brought, and one that was not really obvious during the migration.
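To put rough numbers on the second point: if each wiki fails independently with probability p and the wrapper halts on the first failure, the script reaches the wiki at position n only with probability (1 - p)^(n - 1). The following is a minimal sketch of that arithmetic. The failure rate, the list length and the independence assumption are made up for illustration and are not measurements from our jobs.

```python
def chance_of_running(position: int, failure_rate: float) -> float:
    """Probability that a halt-on-first-error wrapper reaches the wiki at
    1-based `position`, assuming each earlier wiki fails independently
    with probability `failure_rate`."""
    if position < 1:
        raise ValueError("position is 1-based")
    return (1.0 - failure_rate) ** (position - 1)


# Hypothetical numbers: a 0.5% per-wiki failure rate and a 300-wiki list.
for position in (1, 150, 300):
    print(position, round(chance_of_running(position, 0.005), 3))
```

Under these assumed numbers, the first wiki is always attempted, while the last wiki in the list is attempted in less than a quarter of runs.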
Solutions
There are several things we might want to consider:
- Within MediaWiki, catch certain exceptions (list of them TBD) and treat them as non-errors
- Unconditionally run the script on all wikis, while ensuring the alert fires at most once per run
- Something else?
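To make the first two options more concrete, here is a rough sketch of how they could combine: run the script on every wiki, sort exceptions into "ignorable" and "real" failures, and report a single exit status at the end. It is written in Python purely for illustration and is not how foreachwikiindblist is implemented. `TransientError`, `RunReport`, `run_on_all_wikis` and `demo_script` are hypothetical names, and which exceptions count as ignorable is still TBD.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable


class TransientError(Exception):
    """Hypothetical marker for failures we choose to ignore (e.g. a timeout)."""


@dataclass
class RunReport:
    succeeded: list[str] = field(default_factory=list)
    ignored: dict[str, str] = field(default_factory=dict)
    failed: dict[str, str] = field(default_factory=dict)

    @property
    def exit_code(self) -> int:
        # One non-zero exit for the whole run, so the alert fires at most once.
        return 1 if self.failed else 0


def run_on_all_wikis(wikis: Iterable[str], script: Callable[[str], None]) -> RunReport:
    """Run `script` for every wiki, never stopping early on an exception."""
    report = RunReport()
    for wiki in wikis:
        try:
            script(wiki)
        except TransientError as exc:
            # Ignorable failure: record it, but do not fail the run.
            report.ignored[wiki] = repr(exc)
        except Exception as exc:
            # Real failure: record it and keep going with the next wiki.
            report.failed[wiki] = repr(exc)
        else:
            report.succeeded.append(wiki)
    return report


def demo_script(wiki: str) -> None:
    # Stand-in for a maintenance script; the failures are invented.
    if wiki == "enwiki":
        raise TransientError("etcd timeout")
    if wiki == "cswiki":
        raise RuntimeError("broken configuration page")


report = run_on_all_wikis(["aawiki", "cswiki", "enwiki", "zhwiki"], demo_script)
print(report.succeeded, sorted(report.ignored), sorted(report.failed), report.exit_code)
```

In this sketch, zhwiki still gets its run even though two earlier wikis failed, and the per-wiki details are kept in the report so that a single alert can list every affected wiki.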