Dumps on Airflow are not using the dbstore servers due to etcd timeout
Closed, Resolved (Public)

Assigned To
Authored By
BTullis
Jul 7 2025, 8:56 AM

Description

We noticed that during the recent run of Dumps on Airflow, the attempt to use the dbstore servers failed, seemingly due to a failure to contact etcd.
The following log extract was generated with this command in the dumps toolbox:

www-data@mediawiki-dumps-legacy-toolbox-59ccf784bc-6tvh8:/mnt/dumpsdata/xmldatadumps/private/enwiki/20250701$ grep -v ETA dumplog.txt |more

image.png (939×1 px, 334 KB)

Preparing for job xmlstubsdump of enwiki
Warning: EtcdConfig failed to fetch data: (curl error: 28) Timeout was reached in /srv/mediawiki/php-1.45.0-wmf.7/includes/config/EtcdConfig.php on line 206
[a9cc05c4246261cd71e81d05] [no req]   Error: Class "Wikimedia\MWConfig\Exception" not found
Backtrace:
from /srv/mediawiki/src/DBRecordCache.php(155)
#0 /srv/mediawiki/src/DBRecordCache.php(134): Wikimedia\MWConfig\DBRecordCache->fetch(string)
#1 /srv/mediawiki/src/DBRecordCache.php(95): Wikimedia\MWConfig\DBRecordCache->update(string)
#2 /srv/mediawiki/src/DBRecordCache.php(67): Wikimedia\MWConfig\DBRecordCache->get(string)
#3 /srv/mediawiki/wmf-config/CommonSettings.php(232): Wikimedia\MWConfig\DBRecordCache->repopulateDbConf(array)
#4 /srv/mediawiki/php-1.45.0-wmf.7/includes/libs/rdbms/lbfactory/LBFactory.php(189): {closure}()
#5 /srv/mediawiki/php-1.45.0-wmf.7/includes/export/WikiExporter.php(624): Wikimedia\Rdbms\LBFactory->autoReconfigure()
#6 /srv/mediawiki/php-1.45.0-wmf.7/includes/export/WikiExporter.php(491): WikiExporter->reloadDBConfig()
#7 /srv/mediawiki/php-1.45.0-wmf.7/includes/export/WikiExporter.php(314): WikiExporter->dumpPages(string, bool)
#8 /srv/mediawiki/php-1.45.0-wmf.7/includes/export/WikiExporter.php(212): WikiExporter->dumpFrom(string, bool)
#9 /srv/mediawiki/php-1.45.0-wmf.7/maintenance/includes/BackupDumper.php(349): WikiExporter->pagesByRange(int, int, bool)
#10 /srv/mediawiki/php-1.45.0-wmf.7/maintenance/dumpBackup.php(86): MediaWiki\Maintenance\BackupDumper->dump(int, int)
#11 /srv/mediawiki/php-1.45.0-wmf.7/maintenance/includes/MaintenanceRunner.php(691): DumpBackup->execute()
#12 /srv/mediawiki/php-1.45.0-wmf.7/maintenance/run.php(51): MediaWiki\Maintenance\MaintenanceRunner->run()
#13 /srv/mediawiki/multiversion/MWScript.php(221): require_once(string)
#14 {main}

Event Timeline

brouberol changed the task status from Open to In Progress. Jul 7 2025, 9:21 AM

The following network policy enables egress to the conf servers, on port 4001:

NAME                                          POD-SELECTOR                                    AGE
mediawiki-production                          app=mediawiki,release=production                109d

On these servers, we have etcd-tls-proxy running, bound to the port 4001:

brouberol@conf2006:~$ sudo cat /etc/nginx/sites-enabled/etcd-tls-proxy
upstream etcd {
    server conf2006.codfw.wmnet:2379 max_fails=0;
}

server {
    listen 4001 ssl default_server;
    listen [::]:4001 ssl default_server ipv6only=on;
    server_name conf2006.codfw.wmnet;

    ...

    location / {
        proxy_pass https://etcd/;
        proxy_http_version 1.1;
        proxy_set_header    X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header    X-Real-IP $remote_addr;
        client_max_body_size 20971520;
        limit_except GET HEAD OPTIONS {
           deny all;
        }
    }
    ...

}

So this means that the pods affected by this issue are lacking the app=mediawiki and release=production labels.
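
One way to test that reading is to ask Kubernetes whether a given pod carries both labels. The sketch below is an illustration only: the helper name `pod_matches_policy` is made up, it queries whatever namespace the current kubectl context points at, and it assumes the policy selects on exactly the two labels shown in the POD-SELECTOR column.

```shell
# Hypothetical check: does a pod carry both labels that the
# mediawiki-production network policy selects on?
# Uses the namespace of the current kubectl context (an assumption).
pod_matches_policy() {
    pod="$1"
    # The label selector keeps only pods with both labels; the field selector
    # narrows the result to the named pod. With -o name, a match prints
    # "pod/<name>" and a non-match prints nothing on stdout.
    kubectl get pods -l 'app=mediawiki,release=production' \
        --field-selector "metadata.name=$pod" -o name 2>/dev/null | grep -q .
}
```

For example, `pod_matches_policy <pod name> && echo selected || echo "not selected"`.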

I'm not seeing the timeout in the task logs. Example: xml/sql dump

I ran the following command in the toolbox:

www-data@mediawiki-dumps-legacy-toolbox-59ccf784bc-6tvh8:/mnt/dumpsdata/xmldatadumps/private$ grep -i etcd */20250701/dumplog.txt

After massaging the grep logs a bit, we see the following list of affected wikis:

arwikiquote
arwikisource
arzwiki
bgwiki
commonswiki
dawiki
dewiki
enwiki
enwiktionary
eswiki
frwiki
frwiktionary
gcrwiki
hewiki
idwiki
itwikisource
jawiki
mgwiktionary
ptwiki
ruwiki
siwiki
svwiki
wikidatawiki
zhwiki
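
The "massaging" step is not shown in the task. A minimal sketch of one way to get from the grep to a list like the one above, assuming the `<wiki>/<run date>/dumplog.txt` layout visible in the toolbox prompt; the function name `affected_wikis` is hypothetical.

```shell
# Hypothetical helper: list the wikis whose dumplog for a given run mentions etcd.
# Assumes the <wiki>/<run date>/dumplog.txt layout; run it from the "private" directory.
affected_wikis() {
    run_date="$1"
    # grep -l prints only the names of matching files, -i ignores case.
    # cut keeps the leading <wiki> path component, sort -u removes duplicates.
    grep -il etcd */"$run_date"/dumplog.txt 2>/dev/null |
        cut -d/ -f1 |
        sort -u
}
```

Usage: `affected_wikis 20250701`.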

My working theory was that only large wikis were affected, but gcrwiki is a regular wiki, not a large one.

Could it be that the job parallelism is causing contention on the etcd-tls-proxy side, and these are actual timeouts, not egress/firewall issues?

The timeout seems to be set at 2s, which, in case of contention, might not be enough?

	/**
	 * @param array $params Parameter map:
	 *   - host: the host address
	 *   - directory: the etcd "directory" where MediaWiki specific variables are located
	 *   - service: service name used in SRV discovery. Defaults to 'etcd'. [optional]
	 *   - port: custom host port [optional]
	 *   - protocol: one of ("http", "https"). Defaults to http. [optional]
	 *   - cache: BagOStuff instance or ObjectFactory spec thereof for a server cache.
	 *            The cache will also be used as a fallback if etcd is down. [optional]
	 *   - cacheTTL: logical cache TTL in seconds [optional]
	 *   - skewTTL: maximum seconds to randomly lower the assigned TTL on cache save [optional]
	 *   - timeout: seconds to wait for etcd before throwing an error [optional]
	 */
	public function __construct( array $params ) {
		$params += [
			'service' => 'etcd',
			'port' => null,
			'protocol' => 'http',
			'cacheTTL' => 10,
			'skewTTL' => 1,
			'timeout' => 2 // <----
		];

		$this->service = $params['service'];
		$this->host = $params['host'];
		$this->port = $params['port'];
		$this->protocol = $params['protocol'];
		$this->directory = trim( $params['directory'], '/' );
		$this->skewCacheTTL = $params['skewTTL'];
		$this->baseCacheTTL = max( $params['cacheTTL'] - $this->skewCacheTTL, 0 );
		$this->timeout = $params['timeout'];

...

I'm going to move this one to Blocked/Waiting, so that we can collect data from the subsequent dumps. We might have to increase that 2s timeout.
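
To collect data on whether requests really take longer than 2s, a probe along these lines could be run from inside an affected pod. It is a sketch under assumptions: `probe_etcd` is a hypothetical helper, the URL has to be supplied by the caller (none is taken from the production configuration), and it relies only on curl's `--max-time` and `--write-out` options. curl exits with code 28 on a timeout, which matches the `curl error: 28` in the warning.

```shell
# Hypothetical probe: time one GET with the same 2-second budget that
# EtcdConfig uses by default. The URL is supplied by the caller.
probe_etcd() {
    url="$1"
    budget="${2:-2}"
    rc=0
    # --max-time caps the whole transfer; --write-out prints the elapsed seconds.
    elapsed=$(curl --silent --output /dev/null --max-time "$budget" \
        --write-out '%{time_total}' "$url") || rc=$?
    if [ "$rc" -eq 0 ]; then
        echo "ok ${elapsed}s"
    elif [ "$rc" -eq 28 ]; then
        # 28 is curl's exit code for an operation timeout
        echo "timeout after ${budget}s"
    else
        echo "curl failed with exit code $rc"
    fi
    return "$rc"
}
```

Running it in a loop while several dump jobs are active would show whether the slow responses line up with periods of high parallelism.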

Screenshot 2026-02-04 at 13.13.03.png (1×3 px, 261 KB)
The issue still persists although much less frequently than when reported (it used to spam logstash IIRC).

Screenshot 2026-02-04 at 13.15.49.png (1×3 px, 267 KB)
Scratch that. The incidence is pretty much the same.

I can only see these logs in the rsyslog sidecar, which collects the mediawiki-cli command logs. A pod taken at random from logstash shows these warning logs, yet the pod itself seems to be running correctly:

brouberol@deploy2002:~$ k logs dewiki-sql-xml-dewiki-dump-remaining-full-j7knf36 -c mediawiki-production-rsyslog | grep -i etcd |wc -l
2
brouberol@deploy2002:~$ k get pod dewiki-sql-xml-dewiki-dump-remaining-full-j7knf36
NAME                                                READY   STATUS    RESTARTS   AGE
dewiki-sql-xml-dewiki-dump-remaining-full-j7knf36   4/4     Running   0          2d2h

Given the low-ish number of occurrences and the fact that the dumps seem to be running fine, I think we should:

  • give this a low priority
  • send this back to the backlog

WDYT @BTullis?

I just had a hunch as to what is happening: this is the same kind of issue that is causing T416345. The pods are running on hosts with a 1G NIC that gets temporarily saturated. This was not mitigated when we depooled these hosts from the k8s cluster ingress, because this is egress traffic.

Looking at the hosts on which these messages were sent, we see that they are mostly the ones with 1G NICs (https://phabricator.wikimedia.org/T415635#11556908)

Screenshot 2026-02-17 at 11.14.50.png (1×1 px, 401 KB)

I aggregated the logs per host over the last 4 weeks, and actually, that hunch seems to be wrong.

Screenshot 2026-02-17 at 11.30.00.png (1×1 px, 193 KB)

More than 50% of these log messages were sent from dse-k8s-worker1014 and dse-k8s-worker1009, both with a 10G NIC.
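
The aggregation behind the screenshot is not shown in the task. On a plain-text export with one hostname per log message, a per-host count could be sketched as follows; `count_by_host` and the input format are assumptions for illustration.

```shell
# Hypothetical helper: count log messages per host, busiest host first.
# Reads one hostname per line on stdin (the export step itself is not shown).
count_by_host() {
    # uniq -c needs sorted input; the second sort orders by count (descending),
    # then by hostname; awk swaps the columns to "<host> <count>".
    sort | uniq -c | sort -k1,1nr -k2 | awk '{ print $2, $1 }'
}
```

Usage: `count_by_host < hosts.txt`, where `hosts.txt` is a placeholder for the exported host list.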

How does this look, now that T416345 is finished and all of the dse-k8s-worker hosts are on 10 Gbps? Do you think that we can close it?

Screenshot 2026-04-27 at 15.14.33.png (1×2 px, 280 KB)
The issue is much less prevalent than it used to be, and does not prevent the dumps from running. I don't think it's worth spending more time on it.