
growthexperiments-updatementeedata-s1 errors out on DB timeout
Open, MediumPublicPRODUCTION ERROR

Description

Error log
migr@deploy1003:~$ kubectl logs -f job/growthexperiments-updatementeedata-s1-29138595 mediawiki-main-app
extensions/GrowthExperiments/maintenance/updateMenteeData.php: Start run
extensions/GrowthExperiments/maintenance/updateMenteeData.php: Running on growthexperiments & s1
extensions/GrowthExperiments/maintenance/updateMenteeData.php: Running on growthexperiments & s1
enwiki MediaWiki\Config\ConfigException from line 231 of /srv/mediawiki/php-1.45.0-wmf.2/includes/config/EtcdConfig.php: Failed to load configuration from etcd: (curl error: 28) Timeout was reached
enwiki #0 /srv/mediawiki/php-1.45.0-wmf.2/includes/config/EtcdConfig.php(144): MediaWiki\Config\EtcdConfig->load()
enwiki #1 /srv/mediawiki/wmf-config/CommonSettings.php(226): MediaWiki\Config\EtcdConfig->get('eqiad/dbconfig')
enwiki #2 /srv/mediawiki/php-1.45.0-wmf.2/includes/libs/rdbms/lbfactory/LBFactory.php(191): {closure}()
enwiki #3 /srv/mediawiki/php-1.45.0-wmf.2/extensions/GrowthExperiments/includes/MentorDashboard/MenteeOverview/MenteeOverviewDataUpdater.php(108): Wikimedia\Rdbms\LBFactory->autoReconfigure()
enwiki #4 /srv/mediawiki/php-1.45.0-wmf.2/extensions/GrowthExperiments/maintenance/updateMenteeData.php(92): GrowthExperiments\MentorDashboard\MenteeOverview\MenteeOverviewDataUpdater->updateDataForMentor(Object(MediaWiki\User\UserIdentityValue))
enwiki #5 /srv/mediawiki/php-1.45.0-wmf.2/maintenance/includes/MaintenanceRunner.php(691): GrowthExperiments\Maintenance\UpdateMenteeData->execute()
enwiki #6 /srv/mediawiki/php-1.45.0-wmf.2/maintenance/run.php(51): MediaWiki\Maintenance\MaintenanceRunner->run()
enwiki #7 /srv/mediawiki/multiversion/MWScript.php(221): require_once('/srv/mediawiki/...')
enwiki #8 {main}

Is there something we can do about this?

Impact
  • the update of (some?) mentee data for mentors was delayed by several additional hours on enwiki. However, judging from the logs, the subsequent run of the script appears to have succeeded, so the data should be as correct as usual again.
Notes
  • I accidentally ACKed the serviceops alert on alerts.wikimedia.org, because I didn't realize that these were team-specific and that it was different from the growth alert. I'm sorry!

Event Timeline

At first sight, this looks like a transient etcd failure, given that the jobs managed to execute later on. I'd expect any etcd failure (including transient ones) to trigger a significant amount of error logs, since we heavily depend on etcd being available: among other things, etcd is the component that knows which database servers MediaWiki should talk to, which makes it one of the first services MediaWiki contacts, and one that is contacted for virtually any operation MW decides to do.

Let's check whether that is the case. First, we need a timestamp for the job @Michael cited:

[urbanecm@deploy1003 ~]$ kubectl get -o yaml job/growthexperiments-updatementeedata-s1-29138595
apiVersion: batch/v1
kind: Job
metadata:
  creationTimestamp: "2025-05-27T03:15:00Z"
[...]
status:
  conditions:
  - lastProbeTime: "2025-05-27T03:38:48Z"
    lastTransitionTime: "2025-05-27T03:38:48Z"
    message: Job has reached the specified backoff limit
    reason: BackoffLimitExceeded
    status: "True"
    type: Failed
  failed: 1
  startTime: "2025-05-27T03:15:00Z"
[urbanecm@deploy1003 ~]$

So, the job failed early this morning (EU time). Let's check Logstash for the last few days, searching for "etcd" in the error message (https://logstash.wikimedia.org/goto/f9771b44bd0b8a67664f211cefc87e16):

image.png (1×3 px, 279 KB)

There are virtually no etcd-related messages. Is it possible that only the job itself had problems contacting etcd? Why might that be the case? And more importantly: why is it not included in the mediawiki-errors dashboard? Shouldn't logs like this one be present there?

I'm curious what serviceops's take would be on this one (@Clement_Goubert, do you have any thoughts)?

> At first sight, this looks like a transient etcd failure, given that the jobs managed to execute later on. I'd expect any etcd failure (including transient ones) to trigger a significant amount of error logs, since we heavily depend on etcd being available: among other things, etcd is the component that knows which database servers MediaWiki should talk to, which makes it one of the first services MediaWiki contacts, and one that is contacted for virtually any operation MW decides to do.
>
> Let's check whether that is the case. First, we need a timestamp for the job @Michael cited:

As far as I can tell, given the runtime of the pod before failure, multiple connections to etcd were made successfully before it ended up failing. Wikimedia\Rdbms\LBFactory->autoReconfigure(), which makes the etcd call, is invoked on every iteration of the mentor update loop by updateDataForMentor.
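To make the failure mode concrete, here is a hypothetical, heavily simplified model of that loop. The class and method names in the stack trace (MenteeOverviewDataUpdater, LBFactory::autoReconfigure) are real, but the code below is not the actual extension code; it only illustrates why a single transient config-fetch failure aborts the entire run even though all earlier iterations succeeded.

```php
<?php
// Stand-in for MediaWiki\Config\ConfigException: thrown when the config
// source (etcd in production) cannot be reached.
class ConfigFetchException extends RuntimeException {
}

/**
 * Simplified sketch of the per-mentor update loop.
 *
 * @param string[] $mentors
 * @param callable $autoReconfigure may throw ConfigFetchException
 * @return string[] mentors that were actually updated
 */
function updateAllMentees( array $mentors, callable $autoReconfigure ): array {
	$updated = [];
	foreach ( $mentors as $mentor ) {
		// Mirrors LBFactory::autoReconfigure() being invoked on every
		// iteration: one etcd timeout here propagates up and kills the run.
		$autoReconfigure();
		$updated[] = $mentor;
	}
	return $updated;
}
```

With a callback that times out on, say, the third call, the first two mentors get updated and the exception then aborts everything that remains until the next scheduled run, which matches the observed behaviour.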

> [...]
>
> There are virtually no etcd-related messages. Is it possible that only the job itself had problems contacting etcd? Why might that be the case? And more importantly: why is it not included in the mediawiki-errors dashboard? Shouldn't logs like this one be present there?
>
> I'm curious what serviceops's take would be on this one (@Clement_Goubert, do you have any thoughts)?

They are in mediawiki-errors, but the search is case sensitive.

As far as the failure itself goes, I am not finding a smoking gun: the etcd cluster was healthy, and there's nothing in the Kubernetes worker node logs around that time that would explain a connection failure.

Looking at the normalized message, we can see all of these errors are from mw-cron jobs and some mw-script, mostly from updatementeedata and refreshlinkrecommendation.

There may be something going on with different configurations between the FPM and CLI instances; we'll keep digging.

FWIW, we recently encountered an etcd failure with a different job (T398592), specifically deleteoldsurveys. It affects non-Growth jobs as well.

I just happened to come across a mention of etcd when reading up about SLOs:

Certain kinds of infrastructure should only ever be used as a soft dependency: in its absence, some functionality may be degraded but the user experience shouldn’t fail completely. A good example is etcd: it’s a good place to store global configuration, because its design chooses strong consistency over high availability. If etcd is unavailable, we can’t update those configuration values, but their cached values persist, and MediaWiki should still be able to serve wiki pages without depending on reading those values on every request.

In that sense, etcd can be an “attractive nuisance.” An engineer might decide to use etcd for something critical, not fully understanding its reliability characteristics, and so inadvertently introduce a hard dependency on a service that can’t support it.

(source: https://wikitech.wikimedia.org/wiki/SLO/Runbook#Things_we_don't_do:_Intentionally_burning_error_budget)

I wonder: what is the proper way to deal with a "soft dependency" like etcd?

  • Should the config callback that sits behind $conf = ( $this->configCallback )(); in LBFactory::autoReconfigure be somehow more fault-tolerant?
  • Should LBFactory::autoReconfigure catch any ConfigException and just continue on as if no reconfiguration was needed?
  • Should our maintenance script code wrap the call to LBFactory::autoReconfigure into a try/catch and continue on a ConfigException?
  • Should we just accept that our maintenance script sometimes fails, as part of its "error budget"? (For updatementeedata, this does have a user impact, but it is fixed a few hours later by the maintenance script's next run.)
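As a sketch of what the second and third options could look like in practice: catch the config exception around the reconfiguration call and continue with the previously loaded (cached) configuration. ConfigException below stands in for MediaWiki\Config\ConfigException, and safeReconfigure is a hypothetical helper, not a patch against the real LBFactory.

```php
<?php
// Stand-in for MediaWiki\Config\ConfigException.
class ConfigException extends RuntimeException {
}

/**
 * Attempt a reconfiguration; treat failure as non-fatal.
 *
 * @param callable $autoReconfigure e.g. a wrapper around LBFactory::autoReconfigure()
 * @param callable $logWarning sink for the warning message
 * @return bool whether the reconfiguration succeeded
 */
function safeReconfigure( callable $autoReconfigure, callable $logWarning ): bool {
	try {
		$autoReconfigure();
		return true;
	} catch ( ConfigException $e ) {
		// etcd was unreachable; the db config loaded earlier is still in
		// effect, so the maintenance loop can proceed and retry later.
		$logWarning( 'autoReconfigure failed, keeping cached config: ' . $e->getMessage() );
		return false;
	}
}
```

The trade-off is the one the SLO runbook quote describes: the script keeps running on a possibly stale db config for one iteration, rather than failing hard on a service that is designed to be a soft dependency.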

cc @Ladsgroup and @daniel as recent-ish authors of LBFactory::autoReconfigure

Michael triaged this task as Medium priority.Jul 22 2025, 10:43 AM
Michael edited projects, added Growth-Team (Maintenance); removed Growth-Team.