23:46:18 'mwscript eval.php --wiki aawiki' generated unexpected output: Warning: EtcdConfig failed to fetch data: (curl error: 60) SSL peer certificate or SSH remote key was not OK in /srv/mediawiki-staging/php-master/includes/config/EtcdConfig.php on line 204
23:46:18 Warning: EtcdConfig failed to fetch data: (curl error: 60) SSL peer certificate or SSH remote key was not OK in /srv/mediawiki-staging/php-master/includes/config/EtcdConfig.php on line 204
Description
| Status | Subtype | Assigned | Task |
|---|---|---|---|
| Resolved | | bd808 | T393855 EtcdConfig failed to fetch data: (curl error: 60) SSL peer certificate or SSH remote key was not OK |
| Resolved | | bd808 | T393866 deployment-etcd05: Function lookup() did not find a value for the name 'prometheus::instances_defaults' |
Event Timeline
wmf-config/LabsServices.php references _etcd._tcp.svc.deployment-prep.eqiad1.wikimedia.cloud. According to https://horizon.wikimedia.org/ngdetails/OS::Designate::Zone/a4a6c9da-9a6e-44cd-acf8-93c3101057c2, that is an SRV entry pointing to deployment-etcd05.deployment-prep.eqiad1.wikimedia.cloud.
When connecting on the host:
The last Puppet run was at Wed Apr 16 12:32:36 UTC 2025 (37120 minutes ago).
Because it fails with:
May 12 07:02:42 deployment-etcd05 puppet-agent[165659]: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Evaluation Error: Error while evaluating a Resource Statement, Function lookup() did not find a value for the name 'prometheus::instances_defaults' on node deployment-etcd05.deployment-prep.eqiad1.wikimedia.cloud
Even though hieradata/common/prometheus.yaml has:
prometheus::instances_defaults:
  retention_time: 4032h
  retention_size: ~
  thanos_upload: true
  k8s_cluster_name: ~
  hosts: ~
  provision_lv_size: '50g'
I have forked that to T393866
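As a sanity check on the login banner quoted above, here is a minimal Python sketch that turns "last run at X (N minutes ago)" into the moment the banner was displayed. The function name `banner_time` is hypothetical, and the sketch assumes the banner timestamp is UTC, as its text says.

```python
from datetime import datetime, timedelta, timezone


def banner_time(last_run: str, minutes_ago: int) -> datetime:
    """Return when the banner was shown, given the last Puppet run and its age.

    `last_run` is expected in the form "Wed Apr 16 12:32:36 UTC 2025".
    """
    # strptime returns a naive datetime; the banner states UTC, so attach it.
    last = datetime.strptime(last_run, "%a %b %d %H:%M:%S %Z %Y")
    last = last.replace(tzinfo=timezone.utc)
    return last + timedelta(minutes=minutes_ago)


if __name__ == "__main__":
    print(timedelta(minutes=37120))  # 25 days, 18:40:00
    print(banner_time("Wed Apr 16 12:32:36 UTC 2025", 37120).isoformat())
    # 2025-05-12T07:12:36+00:00
```

So Puppet had been failing for a bit under 26 days at the time of that login, which lines up with the May 12 puppet-agent error quoted above.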
On deployment-etcd05.deployment-prep.eqiad1.wikimedia.cloud in journalctl:
May 11 18:58:01 deployment-etcd05 etcd[484]: rejected connection from "172.16.3.170:59336" (error "remote error: tls: expired certificate", ServerName "deployment-etcd05.deployment-prep.eqiad1.wikimedia.cloud")
Looking at the certificate with openssl s_client -connect deployment-etcd05.deployment-prep.eqiad1.wikimedia.cloud -port 2379:
CONNECTED(00000003)
depth=2 C = US, ST = California, L = San Francisco, O = "Wikimedia Foundation, Inc", OU = Cloud Services, CN = WMF_TEST_CA
verify return:1
depth=1 C = US, L = San Francisco, O = "Wikimedia Foundation, Inc", OU = Cloud Services, CN = etcd
verify return:1
depth=0 CN = deployment-etcd05.deployment-prep.eqiad1.wikimedia.cloud
verify error:num=10:certificate has expired
notAfter=May 11 18:58:00 2025 GMT
verify return:1
depth=0 CN = deployment-etcd05.deployment-prep.eqiad1.wikimedia.cloud
notAfter=May 11 18:58:00 2025 GMT
verify return:1
---
Certificate chain
 0 s:CN = deployment-etcd05.deployment-prep.eqiad1.wikimedia.cloud
   i:C = US, L = San Francisco, O = "Wikimedia Foundation, Inc", OU = Cloud Services, CN = etcd
 1 s:C = US, L = San Francisco, O = "Wikimedia Foundation, Inc", OU = Cloud Services, CN = etcd
   i:C = US, ST = California, L = San Francisco, O = "Wikimedia Foundation, Inc", OU = Cloud Services, CN = WMF_TEST_CA
I have no idea how that certificate was generated or how it can be regenerated, nor do I know about the etcd configuration :(
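For reference, the `notAfter` value printed by openssl can be compared against a point in time with Python's standard library. This is only a sketch: `cert_expired` is a hypothetical helper, not something used on the host, and it assumes the caller passes a timezone-aware `now`.

```python
import ssl
from datetime import datetime, timezone


def cert_expired(not_after: str, now: datetime) -> bool:
    """Return True if `now` is at or past the certificate's notAfter time.

    `not_after` uses the openssl text form, e.g. "May 11 18:58:00 2025 GMT".
    `now` must be timezone-aware; a naive datetime would be read as local time.
    """
    # ssl.cert_time_to_seconds parses the "Mon DD HH:MM:SS YYYY GMT" format
    # and returns seconds since the epoch.
    expiry = ssl.cert_time_to_seconds(not_after)
    return now.timestamp() >= expiry


if __name__ == "__main__":
    not_after = "May 11 18:58:00 2025 GMT"
    # Timestamp of the first rejected connection in the journal above,
    # assuming the journal on that host logs in UTC.
    rejected_at = datetime(2025, 5, 11, 18, 58, 1, tzinfo=timezone.utc)
    print(cert_expired(not_after, rejected_at))
```

If the journal timestamps are indeed UTC, the rejected connection at 18:58:01 came one second after the certificate's `notAfter`, which fits expiry being the trigger.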
Mentioned in SAL (#wikimedia-releng) [2025-05-12T07:58:41Z] <hashar> Disabled https://integration.wikimedia.org/ci/job/beta-scap-sync-world/ due to a failure with Etcd/expired certificate # T393855
Mentioned in SAL (#wikimedia-releng) [2025-05-12T08:28:05Z] <hashar> Disabled https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/ due to a failure with Etcd/expired certificate # T393855
Once I got Puppet working again in T393866 (deployment-etcd05: Function lookup() did not find a value for the name 'prometheus::instances_defaults'), the Puppet run appears to have fixed the cert issue:
Notice: /Stage[main]/Main/Cfssl::Cert[etcd__deployment-etcd05_deployment-prep_eqiad1_wikimedia_cloud]/Exec[create chained cert /var/lib/etcd/ssl/etcd__deployment-etcd05_deployment-prep_eqiad1_wikimedia_cloud.chain.pem]/returns: executed successfully (corrective)
Notice: /Stage[main]/Main/Cfssl::Cert[etcd__deployment-etcd05_deployment-prep_eqiad1_wikimedia_cloud]/Exec[create chained cert /var/lib/etcd/ssl/etcd__deployment-etcd05_deployment-prep_eqiad1_wikimedia_cloud.chain.pem]: Triggered 'refresh' from 1 event
I think the lack of EtcdConfig complaints here means the certs are fixed.
$ mwscript eval.php --wiki aawiki -d 2
DEPRECATION WARNING: Maintenance scripts are moving to Kubernetes. See https://wikitech.wikimedia.org/wiki/Maintenance_scripts for the new process. Maintenance hosts will be going away; please submit feedback promptly if maintenance scripts on Kubernetes don't work for you. (T341553)
[debug] [memcached] Wikimedia\ObjectCache\MemcachedPeclBagOStuff::initializeClient: initializing new client instance.
[debug] [memcached] MainWANObjectCache using store Wikimedia\ObjectCache\MemcachedPeclBagOStuff
[debug] [memcached] Wikimedia\ObjectCache\MemcachedPeclBagOStuff::initializeClient: initializing new client instance.
[debug] [memcached] MicroStash using store Wikimedia\ObjectCache\MemcachedPeclBagOStuff
[debug] [rdbms] Wikimedia\Rdbms\LoadBalancer::reallyOpenConnection: opened new connection for 0/aawiki
[debug] [memcached] Wikimedia\ObjectCache\MemcachedPeclBagOStuff debug: getMulti(WANCache:global:rdbms-server-readonly:deployment-db11|#|v)
[debug] [rdbms] Wikimedia\Rdbms\LoadBalancer::loadSessionPrimaryPos: executed chronology callback.
[debug] [memcached] Wikimedia\ObjectCache\MemcachedPeclBagOStuff debug: getMulti(WANCache:global:rdbms-gauge:3:DEFAULT:deployment-db11:deployment-db11|#|v)
[debug] [rdbms] Wikimedia\Rdbms\LoadMonitor::getStateFromWanCache: WAN cache hit for 'deployment-db11'
[debug] [memcached] Wikimedia\ObjectCache\MemcachedPeclBagOStuff debug: getMulti(WANCache:global:rdbms-gauge:3:DEFAULT:deployment-db11:deployment-db14|#|v)
[debug] [rdbms] Wikimedia\Rdbms\LoadMonitor::getStateFromWanCache: WAN cache hit for 'deployment-db14'
[debug] [memcached] Wikimedia\ObjectCache\MemcachedPeclBagOStuff debug: getMulti(WANCache:global:rdbms-lags:DEFAULT|#|v)
[debug] [rdbms] Wikimedia\Rdbms\LoadBalancer::pickReaderIndex: connecting to deployment-db14...
[debug] [rdbms] Wikimedia\Rdbms\LoadBalancer::reallyOpenConnection: opened new connection for 1/
[debug] [rdbms] Wikimedia\Rdbms\LoadBalancer::getReaderIndex: using server deployment-db14 for group ''
[debug] [rdbms] Wikimedia\Rdbms\LoadBalancer::reuseOrOpenConnectionForNewRef: reusing connection for 1/aawiki
>