
Review the behaviour of foreachwikiindblist in mw-cron
Open, Needs Triage, Public

Description

Background & Current state

The Growth-Team has several periodic jobs scheduled, which are run using foreachwikiindblist growthexperiments. At times, those jobs error out with failures that are transient or specific to a single wiki. However, by default, foreachwikiindblist in Kubernetes terminates the whole execution (skipping all remaining wikis) on the first error, even though that error might not affect the other wikis in the list.

A specific example is this failure of growthexperiments-deleteoldsurveys:

enwiki MediaWiki\Config\ConfigException from line 233 of /srv/mediawiki/php-1.45.0-wmf.7/includes/config/EtcdConfig.php: Failed to load configuration from etcd: (curl error: 28) Timeout was reached
enwiki #0 /srv/mediawiki/php-1.45.0-wmf.7/includes/config/EtcdConfig.php(146): MediaWiki\Config\EtcdConfig->load()
enwiki #1 /srv/mediawiki/wmf-config/CommonSettings.php(226): MediaWiki\Config\EtcdConfig->get('eqiad/dbconfig')
enwiki #2 /srv/mediawiki/php-1.45.0-wmf.7/includes/libs/rdbms/lbfactory/LBFactory.php(189): {closure}()
enwiki #3 /srv/mediawiki/php-1.45.0-wmf.7/maintenance/includes/Maintenance.php(1274): Wikimedia\Rdbms\LBFactory->autoReconfigure()
enwiki #4 /srv/mediawiki/php-1.45.0-wmf.7/maintenance/includes/Maintenance.php(1228): MediaWiki\Maintenance\Maintenance->waitForReplication()
enwiki #5 /srv/mediawiki/php-1.45.0-wmf.7/extensions/GrowthExperiments/maintenance/deleteOldSurveys.php(110): MediaWiki\Maintenance\Maintenance->commitTransaction(Object(Wikimedia\Rdbms\DBConnRef), 'GrowthExperimen...')
enwiki #6 /srv/mediawiki/php-1.45.0-wmf.7/maintenance/includes/MaintenanceRunner.php(691): GrowthExperiments\Maintenance\DeleteOldSurveys->execute()
enwiki #7 /srv/mediawiki/php-1.45.0-wmf.7/maintenance/run.php(51): MediaWiki\Maintenance\MaintenanceRunner->run()
enwiki #8 /srv/mediawiki/multiversion/MWScript.php(221): require_once('/srv/mediawiki/...')
enwiki #9 {main}

This is an etcd failure: etcd could not be contacted while processing enwiki. Restarting the job made it finish successfully, which indicates the failure was likely transient. In any case, the Growth team cannot do much about this error, as maintaining etcd and ensuring its reachability is not our responsibility.

In addition, many exceptions happen only on a particular wiki, because of that wiki's configuration or the content of a specific wiki page. In those cases, it does not make much sense to stop the full job, as the rest of it might still finish successfully.

Problem

In the section above, I see several problems:

  1. Alerts coming from periodic jobs are sometimes deliberately ignored (when the traceback suggests a transient error), which erodes their signal value.
  2. Halting the whole script on the first exception means a periodic job has a significantly lower chance of actually running for zhwiki than for aawiki (aawiki runs first, so few wikis can throw an exception before its turn; zhwiki runs last, after every other wiki has had a chance to fail).
  3. The right action (whether to continue or halt) depends on the exception thrown, not on the script, so per-script mechanisms like FOREACHWIKI_IGNORE_ERRORS are not that useful.
  4. The previous wrapper, foreachwikiindblist on bare metal, ignored exceptions thrown by the scripts and always continued with the next wiki. In other words, halt-on-error is a change that MW-on-k8s introduced, and it was not really obvious during the migration.
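The behavioural difference described in points 2 and 4 can be sketched as follows. This is an illustrative POSIX-shell sketch, not the actual foreachwikiindblist source; the function name and the hard-coded wiki list are invented for the example:

```shell
# Hypothetical sketch of the two error-handling modes; not the real
# foreachwikiindblist implementation.
run_for_all_wikis() {
    ignore_errors="$1"; shift
    rc=0
    for wiki in aawiki enwiki zhwiki; do
        if "$@" "$wiki"; then
            rc=0
        else
            rc=$?
            # Current mw-cron behaviour: halt on the first error, so
            # later wikis (zhwiki here) never run at all.
            if [ "$ignore_errors" != "true" ]; then
                return "$rc"
            fi
            # Old bare-metal behaviour: keep going. Only the status of
            # the last wiki survives, so earlier failures are masked.
        fi
    done
    return "$rc"
}
```

With ignore_errors=false (the halt-on-error mode), a failure on enwiki prevents zhwiki from ever running; with ignore_errors=true, every wiki runs but only the last wiki's exit status is reported, so earlier failures produce no alert.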
Solutions

There are several things we might want to consider:

  1. Within MediaWiki, catch certain exceptions (exact list TBD) and treat them as non-fatal
  2. Unconditionally run the script on all wikis, ensuring the alert fires at most once per run
  3. Something else?

Event Timeline

@Michael, this follows the conversation on https://gerrit.wikimedia.org/r/c/mediawiki/extensions/GrowthExperiments/+/1165480 and in our 1:1. I'm curious if you have anything to add here!

Michael added subscribers: Clement_Goubert, hnowlan.

> @Michael, this follows the conversation on https://gerrit.wikimedia.org/r/c/mediawiki/extensions/GrowthExperiments/+/1165480 and in our 1:1. I'm curious if you have anything to add here!

Sounds good so far, thank you!

I'll mention T395249 here as prior context for this. ServiceOps in general, and @Clement_Goubert and @hnowlan in particular, are probably the right people to tell us whether this is a good idea and how to move it forward.

For now I'd recommend using FOREACHWIKI_IGNORE_ERRORS to restore the previous behaviour. The issue with that is you won't get an alert unless the last wiki fails. This can be done in puppet by adding foreachwiki_ignore_errors => true to your job definition. I can do it for you if you decide to go this way.
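A minimal sketch of what that could look like in puppet. The define name mediawiki::periodic_job and the other parameters are invented for illustration; only the foreachwiki_ignore_errors => true parameter comes from this comment:

```puppet
# Hypothetical job definition; the actual define name and parameters
# in operations/puppet may differ.
mediawiki::periodic_job { 'growthexperiments-deleteoldsurveys':
    command                   => 'foreachwikiindblist growthexperiments extensions/GrowthExperiments/maintenance/deleteOldSurveys.php',
    interval                  => '*-*-* 01:00:00',
    foreachwiki_ignore_errors => true,
}
```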

In the future, when T341984: Update Kubernetes clusters to 1.31 has been rolled out to both datacenters after the next switchover, we will be able to define podFailurePolicies, which will allow us, for example, to treat certain exit codes as retriable (restarting the job) and others as fatal.
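For reference, a podFailurePolicy on a Job could look roughly like this. This is a sketch only: the container name, exit-code values, and retry policy below are invented for illustration, not an agreed convention:

```yaml
# Illustrative Job fragment; requires Kubernetes with podFailurePolicy support.
apiVersion: batch/v1
kind: Job
metadata:
  name: growthexperiments-deleteoldsurveys
spec:
  backoffLimit: 3
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: mediawiki   # container name is an assumption
          image: example/mediawiki
  podFailurePolicy:
    rules:
      # Hypothetical "permanent error" exit code: fail the Job, no retry.
      - action: FailJob
        onExitCodes:
          containerName: mediawiki
          operator: In
          values: [75]
      # Any other non-zero exit (e.g. a transient etcd timeout) counts
      # against backoffLimit and is retried.
      - action: Count
        onExitCodes:
          containerName: mediawiki
          operator: NotIn
          values: [0]
```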

It seems like there are two different time horizons to this task, as initially framed:

  1. In the near term, restoring the earlier behavior from the mwmaint hosts (i.e., Problem #4) as an error-handling model that may be more appropriate for Growth-Team's use case. My understanding is that this is possible using FOREACHWIKI_IGNORE_ERRORS per T398592#10971912, but please correct me if I'm wrong here.
  2. In the longer term, introducing a more sophisticated error-handling model that, for example, allows different error conditions / exceptions to surface different exit status codes (i.e., changes to the maintenance scripts themselves and/or the abstractions they're built on), with accompanying support in mw-cron for status-dependent handling (Problem #3).

@Urbanecm_WMF - Does that sound right, and if so, has your team explored #1 for your use case?

@Urbanecm_WMF - Could you please confirm whether #1 from T398592#11539714 is correct or not? We'd like to confirm that the immediate-term need has been met. Longer term, we would try to prioritize #2 once that functionality exists in the relevant maintenance scripts. Please unassign once you've responded.

@Urbanecm_WMF gentle follow-up, could you help us confirm the status of this?

Hello, thanks for the ping and apologies for my delayed response. We are aware of FOREACHWIKI_IGNORE_ERRORS and we make use of it for some of our periodic jobs. However, for most errors, stopping the job entirely makes sense for us, as long as the errors causing the abort are not infrastructure-related.

I'm not sure whether we should migrate individual jobs to use FOREACHWIKI_IGNORE_ERRORS (and then migrate them back individually). Maybe it would make sense to (temporarily) revert the behaviour back to what it used to be (essentially, set FOREACHWIKI_IGNORE_ERRORS for all jobs by default), and once the more advanced model is ready, we can start reverting that change?

My main concern is that ignoring alerts is much easier for affected teams than having discussions like this one, and I'd like to avoid getting used to ignoring alerts.

Hope this makes sense.

I don't want to set FOREACHWIKI_IGNORE_ERRORS to true as the default for all jobs, because the current behaviour surfaces failure modes that weren't visible before, and that information is needed to learn which exit codes maintenance scripts should return on certain failures, so we can handle them correctly with podFailurePolicies in the near future.

In the meantime, it's your call whether you want to use FOREACHWIKI_IGNORE_ERRORS for all your jobs or only some of them.

We'll keep you updated when we have something that can handle failures more precisely. There will probably be some work to do, either in the Maintenance framework or in individual scripts, to differentiate exit codes, and maybe some puppet definitions to update, but we're not there yet.

In the meantime, I don't think there's much more to do for serviceops here, so I will put it on our Radar.