mediawiki scripts fail on new buster image in deployment-prep
Open, Needs Triage, Public


dumpsgen@deployment-snapshot03:/srv/mediawiki$ php /srv/mediawiki/multiversion/MWScript.php getReplicaServer.php --wiki='enwikinews'
Fatal error: Uncaught ConfigException: Failed to load configuration from etcd: (curl error: 60) SSL peer certificate or SSH remote key was not OK in /srv/mediawiki/php-master/includes/config/EtcdConfig.php:205
Stack trace:
#0 /srv/mediawiki/php-master/includes/config/EtcdConfig.php(126): EtcdConfig->load()
#1 /srv/mediawiki/wmf-config/CommonSettings.php(132): EtcdConfig->getModifiedIndex()
#2 /srv/mediawiki/php-master/LocalSettings.php(5): require('/srv/mediawiki/...')
#3 /srv/mediawiki/php-master/includes/Setup.php(143): require_once('/srv/mediawiki/...')
#4 /srv/mediawiki/php-master/maintenance/doMaintenance.php(90): require_once('/srv/mediawiki/...')
#5 /srv/mediawiki/php-master/maintenance/getReplicaServer.php(54): require_once('/srv/mediawiki/...')
#6 /srv/mediawiki/multiversion/MWScript.php(101): require_once('/srv/mediawiki/...')
#7 {main}

Note that this could be because it's a new image and there's something different, or it could be because it's buster. I do know that this same script and mw scripts in general work fine on snapshot02 (stretch), walking (apparently) the same mediawiki code path.

Note also that the reason we wind up with etcd in the mix at all is because CommonSettings.php and InitialiseSettings.php get processed first and then the lab-specific settings get added to override.

Also of interest: puppet has been broken on deployment-etcd-01 since October 30. It can't find its private key, which apparently is now supposed to go by a different name. Here is the full error from that:

Jan 27 14:32:48 deployment-etcd-01 puppet-agent[28446]: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Evaluation Error: Error while evaluating a Resource Statement, Evaluation Error: Error while evaluating a Function Call, secret(): invalid secret ssl/ (file: /etc/puppet/modules/sslcert/manifests/certificate.pp, line: 91, column: 26) (file: /etc/puppet/modules/profile/manifests/etcd/tlsproxy.pp, line: 45) on node

Event Timeline

A little more info. On the new instance, I run curl from the command line with the url mediawiki is trying to get:

curl -H 'Content-Type: application/json' 'https://deployment-etcd-01.deployment-prep.eqiad.wmflabs:2379/v2/keys/conftool/v1/mediawiki-config/?recursive=true'
curl: (60) SSL certificate problem: unable to get local issuer certificate
More details here:

curl failed to verify the legitimacy of the server and therefore could not
establish a secure connection to it. To learn more about this situation and
how to fix it, please visit the web page mentioned above.

On the old instance:

dumpsgen@deployment-snapshot02:~$ curl -H 'Content-Type: application/json' 'https://deployment-etcd-01.deployment-prep.eqiad.wmflabs:2379/v2/keys/conftool/v1/mediawiki-config/?recursive=true'
{"action":"get","node":{"key":"/conftool/v1/mediawiki-config","dir":true,"nodes":[{"key":"/conftool/v1/mediawiki-config/codfw","dir":true,"nodes":[{"key":"/conftool/v1/mediawiki-config/codfw/ReadOnly","value":"{\"val\": false}","modifiedIndex":93,"createdIndex":57}],"modifiedIndex":57,"createdIndex":57},{"key":"/conftool/v1/mediawiki-config/common","dir":true,"nodes":[{"key":"/conftool/v1/mediawiki-config/common/WMFMasterDatacenter","value":"{\"val\": \"eqiad\"}","modifiedIndex":96,"createdIndex":49}],"modifiedIndex":48,"createdIndex":48},{"key":"/conftool/v1/mediawiki-config/eqiad","dir":true,"nodes":[{"key":"/conftool/v1/mediawiki-config/eqiad/ReadOnly","value":"{\"val\": false}","modifiedIndex":94,"createdIndex":58},{"key":"/conftool/v1/mediawiki-config/eqiad/dbconfig","value":"{\"val\": null}\n","modifiedIndex":105,"createdIndex":105}],"modifiedIndex":58,"createdIndex":58}],"modifiedIndex":48,"createdIndex":48}}

so there's a missing or wrong cert someplace.

If I pass --cacert /etc/ssl/certs/ca-certificates.crt on sn03, the command works.
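That workaround can be wrapped in a small shell function. This is only a sketch: the function name `etcd_get` and the `CA_BUNDLE` override variable are made up for illustration, and the default bundle path is the one passed by hand above.

```shell
# Hypothetical wrapper: fetch a URL from etcd with an explicit CA bundle,
# for hosts where curl has no compiled-in default bundle.
etcd_get() {
    url="$1"
    # CA_BUNDLE is an illustrative override, not a variable curl itself reads.
    bundle="${CA_BUNDLE:-/etc/ssl/certs/ca-certificates.crt}"
    curl --silent --show-error --cacert "$bundle" \
        -H 'Content-Type: application/json' "$url"
}
```

Usage would look like `etcd_get 'https://deployment-etcd-01.deployment-prep.eqiad.wmflabs:2379/v2/keys/conftool/v1/mediawiki-config/?recursive=true'`. This only helps the curl command line; it does nothing for libcurl as loaded by PHP.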

On stretch, libcurl is compiled to find the ca-certificates.crt file, but not on buster. A comparison of the debian directories of the source packages for stretch and buster:

[ariel@bigtrouble debian]$ pwd
[ariel@bigtrouble debian]$ grep -r ca-certificates.crt .
./rules:		--with-ca-bundle=/etc/ssl/certs/ca-certificates.crt
./rules:		--with-ca-bundle=/etc/ssl/certs/ca-certificates.crt	\
./rules:		--with-ca-bundle=/etc/ssl/certs/ca-certificates.crt	\
[ariel@bigtrouble debian]$ cd ../../curl_buster/debian/
[ariel@bigtrouble debian]$ grep -r ca-certificates.crt .
[ariel@bigtrouble debian]$
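
The same difference should also be visible at runtime without the source packages, since curl's verbose output normally reports the CA file and CA path it is using. A rough sketch, assuming the verbose lines contain the labels `CAfile` and `CApath` (the exact wording varies between curl versions); the helper name `curl_ca_locations` is hypothetical:

```shell
# Hypothetical helper: print the CA locations curl reports in verbose mode.
curl_ca_locations() {
    url="$1"
    # --verbose writes connection details to stderr; the exact wording differs
    # between curl versions, so this just greps for the two labels.
    curl --silent --verbose --output /dev/null "$url" 2>&1 |
        grep -E 'CAfile|CApath'
}
```

Running this against the etcd URL on both snapshot hosts would be expected to show a bundle path on stretch and no bundle on buster, but that output is not captured in this task.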

I'll think of some nice thing to do about it for WMCS tomorrow.

And the reason the symlink in /etc/ssl/certs isn't picked up as a fallback:

dumpsgen@deployment-snapshot03:/etc/ssl/certs$ ls -l Puppet_Internal_CA.pem
lrwxrwxrwx 1 root root 55 Jan 27 10:41 Puppet_Internal_CA.pem -> /usr/local/share/ca-certificates/Puppet_Internal_CA.crt
dumpsgen@deployment-snapshot03:/etc/ssl/certs$ openssl x509 -noout -hash -in /usr/local/share/ca-certificates/Puppet_Internal_CA.crt

If I manually create a symlink named aeffde42.0 pointing at Puppet_Internal_CA.pem on snapshot03, curl works and so do the MediaWiki scripts. I'll find where that's set and fix it tomorrow.
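
The manual step can be sketched as a function that derives the link name from the certificate instead of hard-coding it. The name `link_ca_by_hash` is invented for illustration, and the sketch assumes the `.0` suffix is free (no other certificate in the directory shares the same subject hash):

```shell
# Hypothetical helper: link a CA certificate under its OpenSSL subject hash
# so that lookups in a CA directory can find it.
link_ca_by_hash() {
    cert="$1"        # path to the PEM certificate
    certs_dir="$2"   # directory searched via the CA path, e.g. /etc/ssl/certs
    hash=$(openssl x509 -noout -hash -in "$cert") || return 1
    # Assumes no hash collision; with one, the next free suffix (.1, .2, ...)
    # would be needed instead of overwriting .0.
    ln -sf "$cert" "$certs_dir/$hash.0" || return 1
    printf '%s\n' "$certs_dir/$hash.0"
}
```

For example, `link_ca_by_hash /etc/ssl/certs/Puppet_Internal_CA.pem /etc/ssl/certs` run as root. This is a stopgap for testing, not a replacement for whatever is supposed to maintain those links.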

This appears to be an intentional change:

I think we can simply set --cacert /etc/ssl/certs/ca-certificates.crt unconditionally for both Stretch and Buster hosts to fix this?

I don't know how one does that for libcurl as used by php, and in particular for MediaWiki php scripts.

As I was able to do testing by making the symlink mentioned in an earlier comment, this is no longer a buster migration blocking task and I'm removing it from the list.

I suppose this can be resolved by running /usr/sbin/update-ca-certificates on the instance; maybe this should be added to the docs for creating a new instance in deployment-prep, but I can't find those, only this task: T269500
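
To check whether an instance is affected before or after running update-ca-certificates, something like the following sketch could list certificates that lack a hash link. The function name `missing_hash_links` is hypothetical, and the check is crude: it only looks for any `<hash>.<n>` entry in the directory and does not verify which certificate the link points to.

```shell
# Hypothetical check: list *.pem certificates in a directory that have no
# OpenSSL subject-hash link (<hash>.<n>) alongside them.
missing_hash_links() {
    certs_dir="$1"
    for cert in "$certs_dir"/*.pem; do
        [ -e "$cert" ] || continue
        # Skip files that are not parseable certificates.
        hash=$(openssl x509 -noout -hash -in "$cert" 2>/dev/null) || continue
        found=no
        # The usual link is <hash>.0; accept any numeric suffix.
        for link in "$certs_dir/$hash".[0-9]*; do
            [ -e "$link" ] && found=yes
        done
        [ "$found" = yes ] || printf '%s\n' "$cert"
    done
}
```

An empty result from `missing_hash_links /etc/ssl/certs` would suggest the links are in place; any path printed is a candidate for the failure described in this task.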

Has this been resolved in the meantime? I'd imagine the snapshot hosts in Beta are no different from any other Beta host connecting to etcd, or do they differ in some way that causes this issue?

I've not spun up any more buster images, and the next one I create will likely be bullseye. Maybe someone else has done so though.