EtcdConfig failed to fetch data: (curl error: 60) SSL peer certificate or SSH remote key was not OK
Closed, Resolved (Public)

Description

23:46:18 'mwscript eval.php --wiki aawiki' generated unexpected output: Warning: EtcdConfig failed to fetch data: (curl error: 60) SSL peer certificate or SSH remote key was not OK in /srv/mediawiki-staging/php-master/includes/config/EtcdConfig.php on line 204
23:46:18 Warning: EtcdConfig failed to fetch data: (curl error: 60) SSL peer certificate or SSH remote key was not OK in /srv/mediawiki-staging/php-master/includes/config/EtcdConfig.php on line 204

Event Timeline

Reedy triaged this task as Unbreak Now! priority. May 11 2025, 10:48 PM

wmf-config/LabsServices.php has _etcd._tcp.svc.deployment-prep.eqiad1.wikimedia.cloud. According to https://horizon.wikimedia.org/ngdetails/OS::Designate::Zone/a4a6c9da-9a6e-44cd-acf8-93c3101057c2, that is an SRV record pointing to deployment-etcd05.deployment-prep.eqiad1.wikimedia.cloud.
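
For reference, that name follows the `_service._proto.domain` layout used for SRV records (RFC 2782). A minimal Python sketch that splits it into those parts; `split_srv_name` is a made-up helper for illustration, and actually resolving the record needs a DNS library, which is not shown here:

```python
from typing import NamedTuple


class SrvName(NamedTuple):
    service: str
    proto: str
    domain: str


def split_srv_name(name: str) -> SrvName:
    """Split '_service._proto.domain' into its three parts (hypothetical helper)."""
    labels = name.rstrip(".").split(".")
    if len(labels) < 3 or not labels[0].startswith("_") or not labels[1].startswith("_"):
        raise ValueError(f"not an SRV-style name: {name!r}")
    # Drop the leading underscores from the service and protocol labels.
    return SrvName(labels[0][1:], labels[1][1:], ".".join(labels[2:]))


srv = split_srv_name("_etcd._tcp.svc.deployment-prep.eqiad1.wikimedia.cloud")
print(srv)
```

So the lookup is for the `etcd` service over `tcp` under `svc.deployment-prep.eqiad1.wikimedia.cloud`.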

When connecting on the host:

The last Puppet run was at Wed Apr 16 12:32:36 UTC 2025 (37120 minutes ago).

Because it fails with:

May 12 07:02:42 deployment-etcd05 puppet-agent[165659]: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Evaluation Error: Error while evaluating a Resource Statement, Function lookup() did not find a value for the name 'prometheus::instances_defaults' on node deployment-etcd05.deployment-prep.eqiad1.wikimedia.cloud

Even though hieradata/common/prometheus.yaml has:

prometheus::instances_defaults:
  retention_time: 4032h
  retention_size: ~
  thanos_upload: true
  k8s_cluster_name: ~
  hosts: ~
  provision_lv_size: '50g'

I have forked that to T393866

On deployment-etcd05.deployment-prep.eqiad1.wikimedia.cloud in journalctl:

May 11 18:58:01 deployment-etcd05 etcd[484]: rejected connection from "172.16.3.170:59336" (error "remote error: tls: expired certificate", ServerName "deployment-etcd05.deployment-prep.eqiad1.wikimedia.cloud")

Looking at the certificate with openssl s_client -connect deployment-etcd05.deployment-prep.eqiad1.wikimedia.cloud -port 2379:

CONNECTED(00000003)
depth=2 C = US, ST = California, L = San Francisco, O = "Wikimedia Foundation, Inc", OU = Cloud Services, CN = WMF_TEST_CA
verify return:1
depth=1 C = US, L = San Francisco, O = "Wikimedia Foundation, Inc", OU = Cloud Services, CN = etcd
verify return:1
depth=0 CN = deployment-etcd05.deployment-prep.eqiad1.wikimedia.cloud
verify error:num=10:certificate has expired
notAfter=May 11 18:58:00 2025 GMT
verify return:1
depth=0 CN = deployment-etcd05.deployment-prep.eqiad1.wikimedia.cloud
notAfter=May 11 18:58:00 2025 GMT
verify return:1
---
Certificate chain
 0 s:CN = deployment-etcd05.deployment-prep.eqiad1.wikimedia.cloud
   i:C = US, L = San Francisco, O = "Wikimedia Foundation, Inc", OU = Cloud Services, CN = etcd
 1 s:C = US, L = San Francisco, O = "Wikimedia Foundation, Inc", OU = Cloud Services, CN = etcd
   i:C = US, ST = California, L = San Francisco, O = "Wikimedia Foundation, Inc", OU = Cloud Services, CN = WMF_TEST_CA
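
The timestamps line up: notAfter is May 11 18:58:00 2025 GMT, and the etcd journal line above shows a rejection at 18:58:01 (assuming the journal is in UTC). A small Python sketch, for illustration only, that parses a notAfter line as printed by openssl and compares it against a given time; `parse_not_after` and `is_expired` are hypothetical helpers, not part of any existing tooling:

```python
from datetime import datetime, timezone


def parse_not_after(line: str) -> datetime:
    """Parse an openssl-style 'notAfter=May 11 18:58:00 2025 GMT' line into UTC."""
    value = line.strip().split("=", 1)[1]
    parsed = datetime.strptime(value, "%b %d %H:%M:%S %Y GMT")
    return parsed.replace(tzinfo=timezone.utc)


def is_expired(not_after: datetime, now: datetime) -> bool:
    """A certificate is no longer valid once 'now' is past notAfter."""
    return now > not_after


not_after = parse_not_after("notAfter=May 11 18:58:00 2025 GMT")
# Time of the rejected connection in journalctl, assumed to be UTC.
rejected_at = datetime(2025, 5, 11, 18, 58, 1, tzinfo=timezone.utc)
print(is_expired(not_after, rejected_at))
```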

I have no idea how that certificate was generated or how it can be regenerated, nor do I know anything about the etcd configuration :(

Once I got Puppet working again (T393866: deployment-etcd05: Function lookup() did not find a value for the name 'prometheus::instances_defaults'), it looks to me like the Puppet run fixed the cert issue:

Notice: /Stage[main]/Main/Cfssl::Cert[etcd__deployment-etcd05_deployment-prep_eqiad1_wikimedia_cloud]/Exec[create chained cert /var/lib/etcd/ssl/etcd__deployment-etcd05_deployment-prep_eqiad1_wikimedia_cloud.chain.pem]/returns: executed successfully (corrective)
Notice: /Stage[main]/Main/Cfssl::Cert[etcd__deployment-etcd05_deployment-prep_eqiad1_wikimedia_cloud]/Exec[create chained cert /var/lib/etcd/ssl/etcd__deployment-etcd05_deployment-prep_eqiad1_wikimedia_cloud.chain.pem]: Triggered 'refresh' from 1 event

bd808 claimed this task.

I think the lack of EtcdConfig complaints here means the certs are fixed.

$ mwscript eval.php --wiki aawiki -d 2
DEPRECATION WARNING: Maintenance scripts are moving to Kubernetes. See
https://wikitech.wikimedia.org/wiki/Maintenance_scripts for the new process.
Maintenance hosts will be going away; please submit feedback promptly if
maintenance scripts on Kubernetes don't work for you. (T341553)
[debug] [memcached] Wikimedia\ObjectCache\MemcachedPeclBagOStuff::initializeClient: initializing new client instance.
[debug] [memcached] MainWANObjectCache using store Wikimedia\ObjectCache\MemcachedPeclBagOStuff
[debug] [memcached] Wikimedia\ObjectCache\MemcachedPeclBagOStuff::initializeClient: initializing new client instance.
[debug] [memcached] MicroStash using store Wikimedia\ObjectCache\MemcachedPeclBagOStuff
[debug] [rdbms] Wikimedia\Rdbms\LoadBalancer::reallyOpenConnection: opened new connection for 0/aawiki
[debug] [memcached] Wikimedia\ObjectCache\MemcachedPeclBagOStuff debug: getMulti(WANCache:global:rdbms-server-readonly:deployment-db11|#|v)
[debug] [rdbms] Wikimedia\Rdbms\LoadBalancer::loadSessionPrimaryPos: executed chronology callback.
[debug] [memcached] Wikimedia\ObjectCache\MemcachedPeclBagOStuff debug: getMulti(WANCache:global:rdbms-gauge:3:DEFAULT:deployment-db11:deployment-db11|#|v)
[debug] [rdbms] Wikimedia\Rdbms\LoadMonitor::getStateFromWanCache: WAN cache hit for 'deployment-db11'
[debug] [memcached] Wikimedia\ObjectCache\MemcachedPeclBagOStuff debug: getMulti(WANCache:global:rdbms-gauge:3:DEFAULT:deployment-db11:deployment-db14|#|v)
[debug] [rdbms] Wikimedia\Rdbms\LoadMonitor::getStateFromWanCache: WAN cache hit for 'deployment-db14'
[debug] [memcached] Wikimedia\ObjectCache\MemcachedPeclBagOStuff debug: getMulti(WANCache:global:rdbms-lags:DEFAULT|#|v)
[debug] [rdbms] Wikimedia\Rdbms\LoadBalancer::pickReaderIndex: connecting to deployment-db14...
[debug] [rdbms] Wikimedia\Rdbms\LoadBalancer::reallyOpenConnection: opened new connection for 1/
[debug] [rdbms] Wikimedia\Rdbms\LoadBalancer::getReaderIndex: using server deployment-db14 for group ''
[debug] [rdbms] Wikimedia\Rdbms\LoadBalancer::reuseOrOpenConnectionForNewRef: reusing connection for 1/aawiki
>
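
The same check can be scripted. A rough Python sketch that scans captured command output for the warning text from the description; the function name and the idea of capturing the output into a string are assumptions for illustration, not part of scap or MediaWiki:

```python
from typing import List

ETCD_WARNING = "EtcdConfig failed to fetch data"


def etcd_warnings(output: str) -> List[str]:
    """Return the lines of captured output that contain the EtcdConfig warning."""
    return [line for line in output.splitlines() if ETCD_WARNING in line]


# Sample lines copied from this task: one from the description, one from the run above.
before = (
    "Warning: EtcdConfig failed to fetch data: (curl error: 60) "
    "SSL peer certificate or SSH remote key was not OK"
)
after = (
    "[debug] [rdbms] Wikimedia\\Rdbms\\LoadBalancer::getReaderIndex: "
    "using server deployment-db14 for group ''"
)

print(len(etcd_warnings(before)), len(etcd_warnings(after)))
```

An empty result only shows that the warning was not printed in that run; it does not inspect the certificate itself.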