Fatal error: Uncaught ConfigException: Failed to load configuration from etcd
Closed, ResolvedPublic

Description

01:25:54 Fatal error: Uncaught ConfigException: Failed to load configuration from etcd: (curl error: 60) SSL peer certificate or SSH remote key was not OK in /srv/mediawiki-staging/php-master/includes/config/EtcdConfig.php:205
01:25:54 Stack trace:
01:25:54 #0 /srv/mediawiki-staging/php-master/includes/config/EtcdConfig.php(126): EtcdConfig->load()
01:25:54 #1 /srv/mediawiki-staging/wmf-config/CommonSettings.php(169): EtcdConfig->getModifiedIndex()
01:25:54 #2 /srv/mediawiki-staging/php-master/LocalSettings.php(5): require('/srv/mediawiki-...')
01:25:54 #3 /srv/mediawiki-staging/php-master/includes/Setup.php(153): require_once('/srv/mediawiki-...')
01:25:54 #4 /srv/mediawiki-staging/php-master/maintenance/doMaintenance.php(90): require_once('/srv/mediawiki-...')
01:25:54 #5 /srv/mediawiki-staging/php-master/maintenance/rebuildLocalisationCache.php(248): require_once('/srv/mediawiki-...')
01:25:54 #6 /srv/mediawiki-staging/multiversion/MWScript.php(116): require_once('/srv/mediawiki-...')
01:25:54 #7 {main}
01:25:54 thrown in /srv/mediawiki-staging/php-master/includes/config/EtcdConfig.php on line 205

This happens during the Beta sync-masters step. It is unclear whether this is an etcd bug or a bug in the new scap version.
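For triage: curl error 60 is libcurl's CURLE_PEER_FAILED_VERIFICATION, meaning the client could not verify the TLS certificate presented by the etcd endpoint. That suggests looking at certificate trust first. A minimal sketch for pulling the code out of such a message when scanning logs; the function names and the small lookup table are illustrative and not part of MediaWiki or scap:

```python
import re

# Hand-picked subset of libcurl error codes (illustrative, not exhaustive).
# Code 60 is CURLE_PEER_FAILED_VERIFICATION in current libcurl; older
# releases named it CURLE_SSL_CACERT.
CURL_ERRORS = {
    6: "CURLE_COULDNT_RESOLVE_HOST",
    7: "CURLE_COULDNT_CONNECT",
    28: "CURLE_OPERATION_TIMEDOUT",
    35: "CURLE_SSL_CONNECT_ERROR",
    60: "CURLE_PEER_FAILED_VERIFICATION",
}


def curl_error_code(message):
    """Return the integer curl error code found in a log message, or None."""
    match = re.search(r"\(curl error: (\d+)\)", message)
    return int(match.group(1)) if match else None


def describe(message):
    """Map a log message to a libcurl error name, if one is recognisable."""
    code = curl_error_code(message)
    if code is None:
        return "no curl error found"
    return CURL_ERRORS.get(code, f"unknown curl error {code}")
```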

Event Timeline

Note that Puppet also appears to be failing on the etcd host:

urbanecm@deployment-etcd02:~$ sudo run-puppet-agent
Info: Using configured environment 'production'
Info: Retrieving pluginfacts
Info: Retrieving plugin
Info: Retrieving locales
Info: Loading facts
Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Evaluation Error: Error while evaluating a Function Call, Failed to parse template ssh/known_hosts.erb:
  Filepath: /var/lib/git/operations/puppet/modules/puppetdbquery/lib/puppetdb/connection.rb
  Line: 68
  Detail: PuppetDB query error: [500] Server Error, query: ["and",["=","type","Sshkey"],["~","title",".*"],["=","exported",true]]
 (file: /etc/puppet/modules/ssh/manifests/client.pp, line: 8, column: 24) on node deployment-etcd02.deployment-prep.eqiad1.wikimedia.cloud
Warning: Not using cache on failed catalog
Error: Could not retrieve catalog; skipping run
urbanecm@deployment-etcd02:~$

On further investigation, Puppet appears to fail everywhere, not just on etcd.
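The failing query in the Puppet error can be reproduced outside of a catalog compile to see whether PuppetDB itself returns the 500. A rough sketch that only builds the request URL; it assumes the standard PuppetDB `/pdb/query/v4/resources` endpoint, and the base URL in the usage note is a placeholder, not a real host:

```python
import json
from urllib.parse import urlencode, urlsplit, parse_qs

# The AST query copied from the error message above.
SSHKEY_QUERY = [
    "and",
    ["=", "type", "Sshkey"],
    ["~", "title", ".*"],
    ["=", "exported", True],
]


def resources_url(base_url, query):
    """Build a PuppetDB v4 resources query URL for the given AST query."""
    encoded = urlencode({"query": json.dumps(query, separators=(",", ":"))})
    return f"{base_url.rstrip('/')}/pdb/query/v4/resources?{encoded}"
```

For example, `resources_url("https://puppetdb.example", SSHKEY_QUERY)` gives a URL that could be requested with any HTTPS client that trusts the PuppetDB certificate.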

Restricted Application added a subscriber: RhinosF1.

Caused by: org.postgresql.util.PSQLException: SSL error: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target

This happens constantly on the puppetdb03 host.

There's also https://gerrit.wikimedia.org/r/c/operations/puppet/+/739266, a semi-recent change touching PKI. CC'ing @jbond.

Nov 20 07:16:02 deployment-puppetdb03 puppet-agent[18152]: Loading facts
Nov 20 07:16:08 deployment-puppetdb03 puppet-agent[18152]: Caching catalog for deployment-puppetdb03.deployment-prep.eqiad.wmflabs
Nov 20 07:16:08 deployment-puppetdb03 puppet-agent[18152]: Applying configuration version '(a157f596ee) root - deployment-prep: install php 7.4 on a mw appserver'
Nov 20 07:16:09 deployment-puppetdb03 puppet-agent[18152]: Computing checksum on file /etc/ssl/localcerts/WMF_TEST_CA.pem
Nov 20 07:16:09 deployment-puppetdb03 puppet-agent[18152]: (/Stage[main]/Sslcert::Trusted_ca/File[/etc/ssl/localcerts/WMF_TEST_CA.pem]) Filebucketed /etc/ssl/localcerts/WMF_TEST_CA.pem to puppet with sum 6374ff663c61cc49e3a51a66efe4a5da
Nov 20 07:16:09 deployment-puppetdb03 puppet-agent[18152]: (/Stage[main]/Sslcert::Trusted_ca/File[/etc/ssl/localcerts/WMF_TEST_CA.pem]/ensure) ensure changed 'file' to 'link'
Nov 20 07:16:09 deployment-puppetdb03 puppet-agent[18152]: (/Stage[main]/Sslcert::Trusted_ca/File[/etc/ssl/localcerts/WMF_TEST_CA.pem]) Scheduling refresh of Exec[generate trusted_ca]
Nov 20 07:16:09 deployment-puppetdb03 puppet-agent[18152]: (/Stage[main]/Sslcert::Trusted_ca/Exec[generate trusted_ca]) Triggered 'refresh' from 1 event
Nov 20 07:16:11 deployment-puppetdb03 puppet-agent[18152]: The LDAP client stack for this host is: sssd/sudo
Nov 20 07:16:11 deployment-puppetdb03 puppet-agent[18152]: (/Stage[main]/Profile::Ldap::Client::Labs/Notify[LDAP client stack]/message) defined 'message' as 'The LDAP client stack for this host is: sssd/sudo'
Nov 20 07:16:12 deployment-puppetdb03 puppet-agent[18152]: (/Stage[main]/Profile::Pki::Client/File[/etc/ssl/certs/WMF_TEST_CA.pem]/ensure) defined content as '{md5}6374ff663c61cc49e3a51a66efe4a5da' (corrective)
Nov 20 07:16:17 deployment-puppetdb03 puppet-agent[18152]: Applied catalog in 9.04 seconds
Nov 20 07:16:17 deployment-puppetdb03 puppet-agent[18152]: Applied catalog in 9.04 seconds
Nov 20 07:45:01 deployment-puppetdb03 puppet-agent-cronjob: Sleeping 20 for random splay
Nov 20 07:45:27 deployment-puppetdb03 puppet-agent[20907]: Using configured environment 'production'
Nov 20 07:45:27 deployment-puppetdb03 puppet-agent[20907]: Retrieving pluginfacts
Nov 20 07:45:27 deployment-puppetdb03 puppet-agent[20907]: Retrieving plugin
Nov 20 07:45:28 deployment-puppetdb03 puppet-agent[20907]: Retrieving locales
Nov 20 07:45:28 deployment-puppetdb03 puppet-agent[20907]: Loading facts
Nov 20 07:45:53 deployment-puppetdb03 puppet-agent[20907]: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Evaluation Error: Error while evaluating a Function Call, Failed to parse template ssh/known_hosts.erb:
Nov 20 07:45:53 deployment-puppetdb03 puppet-agent[20907]:   Filepath: /var/lib/git/operations/puppet/modules/puppetdbquery/lib/puppetdb/connection.rb
Nov 20 07:45:53 deployment-puppetdb03 puppet-agent[20907]:   Line: 68
Nov 20 07:45:53 deployment-puppetdb03 puppet-agent[20907]:   Detail: PuppetDB query error: [500] Server Error, query: ["and",["=","type","Sshkey"],["~","title",".*"],["=","exported",true]]
Nov 20 07:45:53 deployment-puppetdb03 puppet-agent[20907]:  (file: /etc/puppet/modules/ssh/manifests/client.pp, line: 8, column: 24) on node deployment-puppetdb03.deployment-prep.eqiad.wmflabs
Nov 20 07:45:53 deployment-puppetdb03 puppet-agent[20907]: Not using cache on failed catalog
Nov 20 07:45:53 deployment-puppetdb03 puppet-agent[20907]: Could not retrieve catalog; skipping run
Nov 20 08:15:01 deployment-puppetdb03 puppet-agent-cronjob: Sleeping 55 for random splay
Nov 20 08:16:24 deployment-puppetdb03 puppet-agent[23084]: Unable to fetch my node definition, but the agent run will continue:
Nov 20 08:16:24 deployment-puppetdb03 puppet-agent[23084]: Error 500 on SERVER: Server Error: Could not retrieve facts for deployment-puppetdb03.deployment-prep.eqiad.wmflabs: Failed to find facts from PuppetDB at puppet:8140: Failed to execute '/pdb/query/v4/nodes/deployment-puppetdb03.deployment-prep.eqiad.wmflabs/facts' on at least 1 of the following 'server_urls': https://deployment-puppetdb03.deployment-prep.eqiad.wmflabs

This is from puppet.log on the puppetdb host. The 07:16 run appears to apply some changes requested by the PKI Puppet profile, and from the next run (07:45) onward the agent fails.
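To locate that transition in a longer puppet.log, something like the following could help. It is a sketch that assumes the syslog-style line format shown in the excerpt above; the function name is made up for illustration:

```python
import re

# Matches lines such as:
# "Nov 20 07:16:17 deployment-puppetdb03 puppet-agent[18152]: Applied catalog in 9.04 seconds"
LINE = re.compile(
    r"^(?P<ts>\w{3} +\d+ \d{2}:\d{2}:\d{2}) \S+ puppet-agent\[\d+\]: (?P<msg>.*)$"
)


def last_success_and_first_failure(lines):
    """Return (timestamp of last successful run, timestamp of the first
    failed catalog retrieval seen after it). Either value may be None."""
    last_ok = None
    for line in lines:
        match = LINE.match(line)
        if not match:
            continue  # e.g. puppet-agent-cronjob lines
        msg = match.group("msg")
        if msg.startswith("Applied catalog"):
            last_ok = match.group("ts")
        elif msg.startswith("Could not retrieve catalog"):
            return last_ok, match.group("ts")
    return last_ok, None
```

Run against the excerpt above, this should report the 07:16:17 run as the last success and 07:45:53 as the first failure.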

The puppet failure on the second deployment host is:

thcipriani@deployment-deploy03:~$ sudo puppet agent -tv
Warning: Unable to fetch my node definition, but the agent run will continue:
Warning: Error 500 on SERVER: Server Error: Could not retrieve facts for deployment-deploy03.deployment-prep.eqiad1.wikimedia.cloud: Failed to find facts from PuppetDB at puppet:8140: Failed to execute '/pdb/query/v4/nodes/deployment-deploy03.deployment-prep.eqiad1.wikimedia.cloud/facts' on at least 1 of the following 'server_urls': https://deployment-puppetdb03.deployment-prep.eqiad.wmflabs
Info: Retrieving pluginfacts
Info: Retrieving plugin
Info: Retrieving locales
Info: Loading facts
Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Failed to execute '/pdb/cmd/v1?checksum=bfa5996e865c9eaffe5f7a9ac592ceacd0f4d120&version=5&certname=deployment-deploy03.deployment-prep.eqiad1.wikimedia.cloud&command=replace_facts&producer-timestamp=2021-11-20T09:23:23.389Z' on at least 1 of the following 'server_urls': https://deployment-puppetdb03.deployment-prep.eqiad.wmflabs
Warning: Not using cache on failed catalog
Error: Could not retrieve catalog; skipping run
thcipriani@deployment-deploy03:~$

Mentioned in SAL (#wikimedia-releng) [2021-11-20T10:01:53Z] <majavah> root@deployment-puppetdb03:~# cp /var/lib/puppet/ssl/certs/ca.pem /etc/ssl/certs/Puppet_Internal_CA.pem && systemctl restart puppetdb.service # T296125
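If Puppet later rewrites the destination file, the manual copy above could silently be undone. A small sketch for checking that the two files still match; the helper name is hypothetical and the paths in the usage note are the ones from the SAL entry:

```python
import hashlib
from pathlib import Path


def same_content(src, dst):
    """Return True if both files exist and have identical SHA-256 digests."""
    src, dst = Path(src), Path(dst)
    if not (src.is_file() and dst.is_file()):
        return False
    src_digest = hashlib.sha256(src.read_bytes()).digest()
    dst_digest = hashlib.sha256(dst.read_bytes()).digest()
    return src_digest == dst_digest
```

For example, `same_content("/var/lib/puppet/ssl/certs/ca.pem", "/etc/ssl/certs/Puppet_Internal_CA.pem")` would be expected to return True right after the copy.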

Web requests to the Beta cluster (e.g. https://en.wikipedia.beta.wmflabs.org/) are currently broken with the same message, but were still working 2 hours ago (otherwise this CI build, which makes requests against Beta, would’ve failed).

Request ID: YZke7tJMQlo-5vnBpYjveAAAAEk
Uncaught ConfigException: Failed to load configuration from etcd: (curl error: 60) SSL peer certificate or SSH remote key was not OK in /srv/mediawiki/php-master/includes/config/EtcdConfig.php:205

Interestingly, upload.wikimedia.beta.wmflabs.org is still working. https://commons.wikimedia.beta.wmflabs.org/wiki/ and https://en.wikipedia.beta.wmflabs.org/ are broken though.

Is T296000 just a coincidence, or could there be a relation?

upload.wm.beta.wmflabs.org requests are not served by MW directly, so MW failing to load parts of its configuration from etcd does not affect it.

As for T296000: that is a coincidence.

Jdforrester-WMF triaged this task as Unbreak Now! priority. (Mon, Nov 22, 3:12 PM)
Jdforrester-WMF added a subscriber: Jdforrester-WMF.

Within the context of the Beta Cluster this is UBN as it's blocking all use.

Note that the root cause is T296127, which is already a UBN.

Mentioned in SAL (#wikimedia-releng) [2021-11-22T15:38:49Z] <majavah> update wmf-certificates on deployment-mediawiki11 T296125

This seems to have fixed page views. https://integration.wikimedia.org/ci/view/Beta/job/beta-scap-sync-world/ still needs to be re-enabled.

Mentioned in SAL (#wikimedia-releng) [2021-11-22T15:43:52Z] <hashar> Enabling beta-scap-sync-world # T296125