Connections fail with:
root@cumin1001:~# db-mysql db2110
ERROR 2026 (HY000): SSL connection error: self signed certificate in certificate chain
Subject | Repo | Branch | Lines +/- | |
---|---|---|---|---|
production.sql.erb: Add cumin1002 | operations/puppet | production | +7 -0 |
Status | Subtype | Assigned | Task | ||
---|---|---|---|---|---|
Open | None | T330490 Next steps for Puppet 7 | |||
Resolved | jbond | T340739 Create cookbook to migrate servers from the puppetmasters to puppetservers | |||
In Progress | None | T349619 Migrate roles to puppet7 | |||
Open | None | T256888 Add 'End investigation' button to Special:Investigate | |||
Open | Dreamy_Jazz | T190666 Add a check outcome feature | |||
Resolved | Marostegui | T354336 Add columns cul_result_id and cul_result_plaintext_id to cu_log | |||
Resolved | ABran-WMF | T352974 puppet7 on cumin breaks database connections | |||
Declined | BTullis | T354411 Revert dbstore migration from puppet7 to puppet5 | |||
Resolved | Ladsgroup | T354719 Deploy grants for cumin1002 |
cumin1001 has been reverted to Puppet 5, but cumin2002 is on Puppet 7 and can be used to reproduce.
db1124 can be used for testing. It is a test host running Puppet 7. It can be restarted, rebooted, or reimaged as needed.
Just took a quick look:
# db-mysql db1133
ERROR 2026 (HY000): SSL connection error: self signed certificate in certificate chain
mysql:root@localhost [(none)]> show global variables like 'ssl_ca%';
+---------------+---------------------------------------+
| Variable_name | Value                                 |
+---------------+---------------------------------------+
| ssl_ca        | /etc/ssl/certs/Puppet_Internal_CA.pem |
| ssl_capath    |                                       |
+---------------+---------------------------------------+
2 rows in set (0.001 sec)
However:
root@db1133:~# cat /etc/my.cnf | grep ssl-ca
ssl-ca=/etc/ssl/certs/wmf-ca-certificates.crt
I've tried changing that to:
ssl-ca=/etc/ssl/certs/Puppet_Internal_CA.pem
And after restarting mariadb, I can now connect from cumin2002 (puppet7) to db1133 (puppet7). However (of course) cumin1001 (puppet5) now fails to connect to db1133.
The same happens with db1124 (puppet7) and the above procedure.
This needs more investigation.
This also has wider implications: Orchestrator cannot see these hosts (db1124, db1133) with the changed cert, so this really needs to be handled carefully.
Dec 7 12:07:15 dborch1001 orchestrator[425]: 2023-12-07 12:07:15 ERROR ReadTopologyInstance(db1124.eqiad.wmnet:3306) show global status like 'Uptime': x509: issuer name does not match subject from issuing certificate
Dec 7 12:07:15 dborch1001 orchestrator[425]: 2023-12-07 12:07:15 WARNING DiscoverInstance(db1124.eqiad.wmnet:3306) instance is nil in 0.069s (Backend: 0.015s, Instance: 0.055s), error=x509: issuer name does not match subject from issuing certificate
Dec 7 12:07:25 dborch1001 orchestrator[425]: 2023-12-07 12:07:25 ERROR ReadTopologyInstance(db1133.eqiad.wmnet:3306) show global status like 'Uptime': x509: issuer name does not match subject from issuing certificate
Dec 7 12:07:25 dborch1001 orchestrator[425]: 2023-12-07 12:07:25 WARNING DiscoverInstance(db1133.eqiad.wmnet:3306) instance is nil in 0.098s (Backend: 0.017s, Instance: 0.082s), error=x509: issuer name does not match subject from issuing certificate
which could also be the root cause for this error from dborch as it uses that cert to connect:
One other interesting fact: a Puppet 7 host has the exact opposite situation, where both /etc/ssl/certs/Puppet_Internal_CA.pem and /etc/ssl/certs/wmf-ca-certificates.crt work to connect from cumin2002, whereas from cumin1001:
root@cumin1001:~# sudo db-mysql db1124 --ssl-ca /etc/ssl/certs/Puppet_Internal_CA.pem
ERROR 2026 (HY000): SSL connection error: self signed certificate in certificate chain
root@cumin1001:~# sudo db-mysql db1124 --ssl-ca /etc/ssl/certs/wmf-ca-certificates.crt
Welcome to the MariaDB monitor.  Commands end with ; or \g.
Your MariaDB connection id is 42037
Server version: 10.6.14-MariaDB-log MariaDB Server
Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.
Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
root@db1124.eqiad.wmnet[(none)]>
It appears that most of our hosts are still using /etc/ssl/certs/Puppet_Internal_CA.pem and should be migrated to use /etc/ssl/certs/wmf-ca-certificates.crt instead. What puzzles me is that orchestrator uses /etc/ssl/certs/wmf-ca-certificates.crt, which has consistently worked for connecting to hosts, as we've been able to see here:
and here:
so it does not explain why orchestrator fails to connect to some db hosts:
$ sudo journalctl -n 500 -u orchestrator | grep -i x509 | grep -iEo '\Sdb.*.*wmnet' | sort | uniq
(db1124.eqiad.wmnet
(db1133.eqiad.wmnet
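For tracking this over time, the same extraction can be sketched in Python. This is a minimal sketch; the regex and function name are ours, assuming the journald line format shown in the log excerpts above:

```python
import re

# Instances failing x509 verification appear in orchestrator journal lines as
# e.g. "ReadTopologyInstance(db1124.eqiad.wmnet:3306) ... x509: ..."
X509_RE = re.compile(r'\((?P<host>[a-z0-9]+\.[a-z]+\.wmnet):(?P<port>\d+)\).*x509')

def failing_instances(lines):
    """Return the sorted set of host:port pairs with x509 errors."""
    found = set()
    for line in lines:
        m = X509_RE.search(line)
        if m:
            found.add(f"{m.group('host')}:{m.group('port')}")
    return sorted(found)

sample = [
    "Dec 7 12:07:15 dborch1001 orchestrator[425]: 2023-12-07 12:07:15 ERROR "
    "ReadTopologyInstance(db1124.eqiad.wmnet:3306) show global status like 'Uptime': "
    "x509: issuer name does not match subject from issuing certificate",
    "Dec 7 12:07:25 dborch1001 orchestrator[425]: 2023-12-07 12:07:25 WARNING "
    "DiscoverInstance(db1133.eqiad.wmnet:3306) instance is nil in 0.098s, "
    "error=x509: issuer name does not match subject from issuing certificate",
]
print(failing_instances(sample))  # ['db1124.eqiad.wmnet:3306', 'db1133.eqiad.wmnet:3306']
```

Unlike the grep pipeline, this also keeps the port, which matters on multi-instance dbstore hosts.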
I believe the collab tag was added automatically from the parent task so removing it.
@ABran-WMF @MoritzMuehlenhoff we are going to have to give this more priority: dbstore1003 (s1) is now failing in orchestrator after being restarted during the Christmas break as part of: https://phabricator.wikimedia.org/T351921#9426477
The instance is back and running normally but orchestrator has been failing since the restart with:
Jan 2 06:06:22 dborch1001 orchestrator[3587041]: 2024-01-02 06:06:22 ERROR ReadTopologyInstance(dbstore1003.eqiad.wmnet:3311) show global status like 'Uptime': x509: issuer name does not match subject from issuing certificate
Jan 2 06:06:22 dborch1001 orchestrator[3587041]: ReadTopologyInstance(dbstore1003.eqiad.wmnet:3311) show global status like 'Uptime': x509: issuer name does not match subject from issuing certificate
Jan 2 06:06:22 dborch1001 orchestrator[3587041]: 2024-01-02 06:06:22 WARNING DiscoverInstance(dbstore1003.eqiad.wmnet:3311) instance is nil in 0.024s (Backend: 0.003s, Instance: 0.021s), error=x509: issuer name does not match subject from issuing certificate
Jan 2 06:06:22 dborch1001 orchestrator[3587041]: DiscoverInstance(dbstore1003.eqiad.wmnet:3311) instance is nil in 0.024s (Backend: 0.003s, Instance: 0.021s), error=x509: issuer name does not match subject from issuing certificate
And as expected, it also fails from cumin1001:
[06:09:43] marostegui@cumin1001:~$ sudo db-mysql dbstore1003:3311
ERROR 2026 (HY000): SSL connection error: self signed certificate in certificate chain
To confirm, this host runs puppet7.
dbstore1003 is a MariaDB server containing replicas of mediawiki databases for analytics & research usage (mariadb::analytics_replica)
DB section s1 (alias: mysql.s1)
DB section s5 (alias: mysql.s5)
DB section s7 (alias: mysql.s7)
Bare Metal host on site eqiad and rack A3
Host has been migrated to puppet7
This is likely the issue: something, somewhere, is likely still using Puppet_Internal_CA.pem, /var/lib/puppet/ssl/ca/ca.pem or $facts['puppet_config']['localcacert'] directly.
I think another factor is the long-running nature of mysqld processes since any config change would only get picked up after a daemon restart.
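That drift can be detected by comparing the running server's ssl_ca (from SHOW GLOBAL VARIABLES LIKE 'ssl_ca') with the value on disk in /etc/my.cnf. A minimal sketch; the helper names are ours, and real my.cnf parsing has more edge cases (includes, sections) than this handles:

```python
def cnf_ssl_ca(my_cnf_text):
    """Extract the ssl-ca value from my.cnf-style text (first match wins)."""
    for line in my_cnf_text.splitlines():
        key, sep, value = line.strip().partition("=")
        if sep and key.strip() in ("ssl-ca", "ssl_ca"):
            return value.strip()
    return None

def needs_restart(runtime_ssl_ca, my_cnf_text):
    """True when the running mysqld's ssl_ca differs from the on-disk config."""
    configured = cnf_ssl_ca(my_cnf_text)
    return configured is not None and configured != runtime_ssl_ca

# Example: config was switched to the bundle, but mysqld was not restarted.
cnf = "[mysqld]\nssl-ca=/etc/ssl/certs/wmf-ca-certificates.crt\n"
print(needs_restart("/etc/ssl/certs/Puppet_Internal_CA.pem", cnf))  # True
```

This is exactly the db1133/db1215 situation above: the config file already points at the bundle while the long-running daemon still serves the old Puppet CA.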
Could it be this reference in wmfdb that should be updated to /etc/ssl/certs/wmf-ca-certificates.crt?
https://gitlab.wikimedia.org/repos/sre/wmfdb/-/blob/main/wmfdb/mysql_cli.py?ref_type=heads#L6
I've not really had much to do with wmfdb yet, but it would seem that if we update this and deploy a new version of the wmfdb package, that would allow db-mysql to work with both puppet 5 and puppet 7 based hosts.
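One defensive variant of that change can be sketched as follows. This is a hypothetical illustration, not the actual wmfdb code: the constant and function names are ours, and in practice the simple constant swap in mysql_cli.py was enough since the wmf-ca-certificates.crt bundle lets db-mysql reach both Puppet 5 and Puppet 7 hosts:

```python
import os

# Hypothetical client-side default: prefer the unified WMF bundle, fall back
# to the old Puppet CA path if the bundle is absent on this host.
CA_CANDIDATES = [
    "/etc/ssl/certs/wmf-ca-certificates.crt",
    "/etc/ssl/certs/Puppet_Internal_CA.pem",
]

def pick_ssl_ca(candidates=CA_CANDIDATES):
    """Return the first CA file that exists on this host."""
    for path in candidates:
        if os.path.exists(path):
            return path
    raise FileNotFoundError("no usable CA file among: " + ", ".join(candidates))
```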
That's a very good point @BTullis. I'd leave this to @ABran-WMF and @MoritzMuehlenhoff.
Orchestrator is still an issue though (which has nothing to do with wmfdb)
it seems that orchestrator follows the same pattern as the one @Marostegui identified here:
I'll look for a fix. In the meantime, thank you @BTullis! Here is the merge request for your suggestion.
For the orchestrator part, it seems that the mariadb client should be using the proper wmf-ca-certificates file.
All certificates seem to be present on dborch1001:
# ls -l /usr/local/share/ca-certificates/Puppet_Internal_CA.crt
-r--r--r-- 1 root root 2977 Nov 23 11:02 /usr/local/share/ca-certificates/Puppet_Internal_CA.crt
root@dborch1001:~# ls -l /usr/share/ca-certificates/wikimedia/Wikimedia_Internal_Root_CA.crt
-rw-r--r-- 1 root root 1111 Nov 20 17:43 /usr/share/ca-certificates/wikimedia/Wikimedia_Internal_Root_CA.crt
root@dborch1001:/home/arnaudb# grep -i cafile /etc/orchestrator.conf.json
    "MySQLOrchestratorSSLCAFile": "/etc/ssl/certs/wmf-ca-certificates.crt",
    "MySQLTopologySSLCAFile": "/etc/ssl/certs/wmf-ca-certificates.crt",
root@dborch1001:/home/arnaudb# ls -lahrt /etc/ssl/certs|grep -iE 'wiki|pup|wmf'
lrwxrwxrwx 1 root root   22 Mar 22  2023 c5aaad6f.0 -> Puppet_Internal_CA.pem
lrwxrwxrwx 1 root root   53 Mar 22  2023 wmf_ca_2017_2020.pem -> /usr/local/share/ca-certificates/wmf_ca_2017_2020.crt
lrwxrwxrwx 1 root root   20 Mar 22  2023 e3f15d55.0 -> wmf_ca_2017_2020.pem
lrwxrwxrwx 1 root root   67 Nov 21 10:58 Wikimedia_Internal_Root_CA.pem -> /usr/share/ca-certificates/wikimedia/Wikimedia_Internal_Root_CA.crt
lrwxrwxrwx 1 root root   60 Nov 21 10:58 Puppet5_Internal_CA.pem -> /usr/share/ca-certificates/wikimedia/Puppet5_Internal_CA.crt
lrwxrwxrwx 1 root root   30 Nov 21 10:58 c0cdb94e.0 -> Wikimedia_Internal_Root_CA.pem
lrwxrwxrwx 1 root root   55 Nov 23 11:02 Puppet_Internal_CA.pem -> /usr/local/share/ca-certificates/Puppet_Internal_CA.crt
lrwxrwxrwx 1 root root   23 Nov 23 11:02 c5aaad6f.1 -> Puppet5_Internal_CA.pem
-rw-r--r-- 1 root root 3.0K Nov 23 11:02 wmf-ca-certificates.crt
Side note: I noticed that wmf-ca-certificates.crt is absent from /usr/local/share/ca-certificates
On db1215 (zarcillo/orchestrator database) side:
arnaudb@db1215:~ $ grep -i ssl-ca /etc/my.cnf
ssl-ca=/etc/ssl/certs/wmf-ca-certificates.crt
arnaudb@db1215:~ $ sudo mysql
Welcome to the MariaDB monitor.  Commands end with ; or \g.
Your MariaDB connection id is 175167991
Server version: 10.6.12-MariaDB-log MariaDB Server
Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.
Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
MariaDB [(none)]> show global variables like 'ssl_ca%'\G
*************************** 1. row ***************************
Variable_name: ssl_ca
        Value: /etc/ssl/certs/Puppet_Internal_CA.pem
*************************** 2. row ***************************
Variable_name: ssl_capath
        Value:
We have the old config running, and orchestrator is able to connect to its database. So it might be something different from this:
@MoritzMuehlenhoff what is the plan with cumin1002? @ABran-WMF has fixed db-mysql on cumin1001 so we can connect from there to both puppet5 and puppet7. If we could migrate cumin1001 to puppet7 and once confirmed it all works, drop cumin1002 that'd be ideal, otherwise we'd need to deploy grants for cumin1002 too across all the databases.
Unfortunately we'll have to update grants: The move towards cumin1002 (which is a VM) was only partly motivated by the split setup wrt Puppet 7: cumin1001 (which runs on hardware) is almost six years old and way out of warranty. Last year we decided to move one of the cumin hosts to a Ganeti VM (and keep the other one on baremetal so that we have a safety net in case ganeti is fully down), but hadn't found the time to work on that, so when the situation with Puppet 7 and DB admin access happened, I killed two birds with one stone. On the upside, now that cumin/eqiad is on VM, we'll no longer need to bother with hardware replacements in the future.
Got it, no problem - we'll add the cumin1002 grants and then once cumin1001 is gone, remove those.
Change 989144 had a related patch set uploaded (by Marostegui; author: Marostegui):
[operations/puppet@production] production.sql.erb: Add cumin1002
Change 989144 merged by Marostegui:
[operations/puppet@production] production.sql.erb: Add cumin1002
@Ladsgroup I have merged https://gerrit.wikimedia.org/r/c/operations/puppet/+/989144 - can you please deploy the user for cumin1002's IP with OMG?
@MoritzMuehlenhoff I'm currently trying to trace how orchestrator connects to databases to manage them, to identify which certificate is really used.
it's supposed to be:
grep -i cafile /etc/orchestrator.conf.json
    "MySQLOrchestratorSSLCAFile": "/etc/ssl/certs/wmf-ca-certificates.crt",
    "MySQLTopologySSLCAFile": "/etc/ssl/certs/wmf-ca-certificates.crt",
which seems to be valid:
md5sum /etc/ssl/certs/wmf-ca-certificates.crt
491c425507b080960b6ba8255d7cff46  /etc/ssl/certs/wmf-ca-certificates.crt
and working properly from cumin1001:
arnaudb@cumin1001:~ $ sudo db-mysql dbstore1008:3317
Welcome to the MariaDB monitor.  Commands end with ; or \g.
Your MariaDB connection id is 1736065
Server version: 10.6.16-MariaDB MariaDB Server
Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.
Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
but not from Orchestrator
That got configured via https://gerrit.wikimedia.org/r/c/operations/puppet/+/972367/
but not from Orchestrator
Can you elaborate on what works and what doesn't? Specific operations? To all hosts or just the ones also running Puppet 7? Orchestrator was switched to Puppet 7 on Nov 23, so I'd expect it can't be fully broken.
I've been able to connect with the following:
mysql --ssl-ca /etc/ssl/certs/wmf-ca-certificates.crt -h dbstore1008.eqiad.wmnet -P3311 -u orchestrator -p
Enter password:
Welcome to the MariaDB monitor.  Commands end with ; or \g.
Your MariaDB connection id is 2595824
Server version: 10.6.16-MariaDB MariaDB Server
Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.
Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
which is supposed to be the chain used by orchestrator
which makes me think that there may have been some mishap with the configuration or the orchestrator restart, maybe?
[edit] for a bit more context:
dbstore1008 is one of the hosts that is unreachable and therefore not yet referenced on Orchestrator
I see these entries in the logs from orchestrator.
Jan 09 14:43:15 dborch1001 orchestrator[3587041]: 2024-01-09 14:43:15 WARNING DiscoverInstance(dbstore1009.eqiad.wmnet:3318) instance is nil in 0.045s (Backend: 0.004s, Instance: 0.041s), error=x509: issuer name does not match subject from issuing certificate
Jan 09 14:43:18 dborch1001 orchestrator[3587041]: 2024-01-09 14:43:18 ERROR ReadTopologyInstance(dbstore1008.eqiad.wmnet:3311) show global status like 'Uptime': x509: issuer name does not match subject from issuing certificate
https://gerrit.wikimedia.org/r/c/operations/puppet/+/972367/ was deployed November 09, but dborch1001 has an uptime of 47 days, so the change must already be in effect restart-wise. Still a restart of Orchestrator won't hurt I guess.
One other option is that the TLS toolchain as used by Orchestrator does not handle a bundled certificate file correctly, so that when wmf-ca-certificates.crt is used, only Puppet_Internal_CA.pem is detected/passed to the DB server (which would explain why it fails to connect to hosts unknown to Puppet 7).
To confirm we could also either spin up a second Orchestrator instance (since we don't seem to have a test instance) or temporarily only configure the new cert (to confirm that in this case only Puppet7 servers would be reachable).
In case it helps, this is also a useful command for showing the certificate chain that is presented by the dbstore servers.
btullis@dborch1001:~$ openssl s_client -connect dbstore1008.eqiad.wmnet:3311 -starttls mysql -showcerts
CONNECTED(00000003)
depth=2 C = US, ST = California, L = San Francisco, O = "Wikimedia Foundation, Inc", OU = Cloud Services, CN = Wikimedia_Internal_Root_CA
verify return:1
depth=1 C = US, L = San Francisco, O = "Wikimedia Foundation, Inc", OU = SRE Foundations, CN = puppet_rsa
verify return:1
depth=0 CN = dbstore1008.eqiad.wmnet
verify return:1
---
Certificate chain
 0 s:CN = dbstore1008.eqiad.wmnet
   i:C = US, L = San Francisco, O = "Wikimedia Foundation, Inc", OU = SRE Foundations, CN = puppet_rsa
<snip snip snip>
Orchestrator does load the ca bundle:
Dec 14 08:51:50 dborch1001 orchestrator[3587041]: 2023-12-14 08:51:50 INFO Read in CA file: /etc/ssl/certs/wmf-ca-certificates.crt
Dec 14 08:51:50 dborch1001 orchestrator[3587041]: Read in CA file: /etc/ssl/certs/wmf-ca-certificates.crt
from this method, but I haven't been able to find information about CA bundle issues with Go's "crypto/tls" package. A GET call on this URL reproduces our issue with db1215's replica, which is impacted by this issue as well.
Maybe it also has something to do with:
I'm not sure spinning up a new orchestrator instance specifically for this would be needed, but in the longer term it could be a good tool to help us debug.
it does not!
$ sudo cp /etc/ssl/certs/wmf-ca-certificates.crt /usr/local/share/ca-certificates/
$ sudo update-ca-certificates
Updating certificates in /etc/ssl/certs...
rehash: warning: skipping wmf-ca-certificates.crt,it does not contain exactly one certificate or CRL
rehash: warning: skipping wmf-ca-certificates.pem,it does not contain exactly one certificate or CRL
rehash: warning: skipping Puppet_Internal_CA.pem,it does not contain exactly one certificate or CRL
2 added, 1 removed; done.
Running hooks in /etc/ca-certificates/update.d...
done.
still has the same outcome
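The rehash warnings above are expected, by the way: the rehash step wants exactly one certificate per file, while wmf-ca-certificates.crt is a bundle. A quick way to check how many certificates a bundle actually contains (a stdlib-only sketch; the function name is ours):

```python
import re

# Match each PEM certificate block in a bundle file.
PEM_CERT_RE = re.compile(
    r"-----BEGIN CERTIFICATE-----.*?-----END CERTIFICATE-----",
    re.DOTALL,
)

def split_bundle(bundle_text):
    """Return the individual PEM certificate blocks found in bundle_text."""
    return PEM_CERT_RE.findall(bundle_text)

# Placeholder payloads, not real certificates: only the PEM framing matters here.
bundle = (
    "-----BEGIN CERTIFICATE-----\nAAAA\n-----END CERTIFICATE-----\n"
    "-----BEGIN CERTIFICATE-----\nBBBB\n-----END CERTIFICATE-----\n"
)
print(len(split_bundle(bundle)))  # 2
```

If a TLS stack only read the first block of such a file, every server signed by the second CA would fail verification, which is the bundle-handling hypothesis raised above.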
I've moved a bit further on the testing part. @MoritzMuehlenhoff showed me this repo, which was a bit outdated; I've built the Go binary with Go 1.15.4 to be able to use the x509ignoreCN=0 value of the GODEBUG environment variable, as using CN instead of SANs has been deprecated since Go 1.15:
$ gox509verify db1139.eqiad.wmnet /etc/ssl/certs/wmf-ca-certificates.crt db1139.crt
panic: failed to verify certificate: x509: certificate relies on legacy Common Name field, use SANs instead
This is built into the binary on cumin1001:
$ git diff
diff --git a/gox509verify/main.go b/gox509verify/main.go
index aed8abc..e7b5580 100644
--- a/gox509verify/main.go
+++ b/gox509verify/main.go
@@ -51,5 +51,6 @@ func main() {
 		fmt.Println("Usage: " + os.Args[0] + " dns-name-to-verify.example.org ca.crt server_cert.crt")
 		os.Exit(1)
 	}
+	os.Setenv("GODEBUG", "x509ignoreCN=0")
 	VerifyCert(os.Args[1], os.Args[2], os.Args[3])
 }
I fetched the certificates using:
$ openssl s_client -connect db1133.eqiad.wmnet:3306 -starttls mysql -showcerts > db1133.crt
$ openssl s_client -connect db1139.eqiad.wmnet:3311 -starttls mysql -showcerts > db1139.crt
And was then able to test it against our ca bundle:
$ gox509verify db1139.eqiad.wmnet /etc/ssl/certs/wmf-ca-certificates.crt db1139.crt
OK
$ gox509verify db1133.eqiad.wmnet /etc/ssl/certs/wmf-ca-certificates.crt db1133.crt
panic: failed to verify certificate: x509: certificate signed by unknown authority

goroutine 1 [running]:
main.VerifyCert(0x7fff45ddfad3, 0x12, 0x7fff45ddfae6, 0x26, 0x7fff45ddfb0d, 0xa)
        /app/gox509verify/main.go:43 +0x68c
main.main()
        /app/gox509verify/main.go:55 +0x214
goroutine 17 [syscall, locked to thread]:
runtime.goexit()
        /usr/local/go/src/runtime/asm_amd64.s:1721 +0x1
Which is reproducible with openssl:
$ openssl verify -verbose -CAfile /etc/ssl/certs/wmf-ca-certificates.crt db1133.crt
CN = db1133.eqiad.wmnet
error 20 at 0 depth lookup: unable to get local issuer certificate
error db1133.crt: verification failed
vs
$ openssl verify -verbose -CAfile /etc/ssl/certs/wmf-ca-certificates.crt db1139.crt
db1139.crt: OK
I've also tried restarting MariaDB on db1133 as it was using Puppet's ssl_ca:
root@db1139.eqiad.wmnet[(none)]> select @@GLOBAL.ssl_ca;
+---------------------------------------+
| @@GLOBAL.ssl_ca                       |
+---------------------------------------+
| /etc/ssl/certs/Puppet_Internal_CA.pem |
+---------------------------------------+
Which was then modified to the value matching the config file:
root@db1133.eqiad.wmnet[(none)]> select @@GLOBAL.ssl_ca;
+----------------------------------------+
| @@GLOBAL.ssl_ca                        |
+----------------------------------------+
| /etc/ssl/certs/wmf-ca-certificates.crt |
+----------------------------------------+
and then tried again:
$ openssl s_client -connect db1133.eqiad.wmnet:3306 -starttls mysql -showcerts > db1133_new.crt
$ gox509verify db1133.eqiad.wmnet /etc/ssl/certs/wmf-ca-certificates.crt db1133_new.crt
panic: failed to verify certificate: x509: certificate signed by unknown authority

goroutine 1 [running]:
main.VerifyCert(0x7fffd3511acf, 0x12, 0x7fffd3511ae2, 0x26, 0x7fffd3511b09, 0xe)
        /app/gox509verify/main.go:43 +0x68c
main.main()
        /app/gox509verify/main.go:55 +0x214
goroutine 17 [syscall, locked to thread]:
runtime.goexit()
        /usr/local/go/src/runtime/asm_amd64.s:1721 +0x1
to get a similar result. Which makes me think it might come from something a bit more exotic than just an outdated SSL CA load on Orchestrator's side.
To sum it up, as it's a bit confusing to re-read everything:
 | puppet5 (db1139) | puppet 7 (db1133)
---|---|---
mysql --ssl-ca wmf-ca-certificates.crt | 🟩 | 🟩
db-mysql using wmf-ca-certificates.crt | 🟩 | 🟩
openssl verify | 🟩 | 🟥
gox509verify | 🟩 | 🟥
as for the certificates side:
 | Puppet 7 certificate | Puppet 5 certificate | wmf certificate
---|---|---|---
Puppet7 ca.crt content | 🟩 | 🟥 | 🟩
Puppet5 ca.crt content | 🟥 | 🟩 | 🟥
WMF ca.crt content | 🟥 | 🟩 | 🟩
No, good catch! I forgot to add those results as well. Previous results were from the previously described tests.
From orchestrator the status is:
 | Puppet 7 ca.crt (puppet_rsa) | Puppet 5 ca.crt (palladium.eqiad.wmnet) | wmf-ca.crt (Wikimedia_Internal_Root_CA)
---|---|---|---
Orchestrator using wmf-ca.crt (Wikimedia_Internal_Root_CA) | 🟥 | 🟩 | 🟥
I'll try to restart orchestrator with MySQLOrchestratorSSLCAFile and MySQLTopologySSLCAFile pointing to puppet_rsa from Puppet 7 CA to see if it fixes the situation with db1133
If db1133 gets fixed, that should mean that the new dbstores (1008, 1009) should pop up and get discovered automatically too.
root@dborch1001:/etc/ssl/certs# grep -i ca-certificates /etc/orchestrator.conf.json
    "MySQLOrchestratorSSLCAFile": "/etc/ssl/certs/wmf-ca-certificates.crt",
    "MySQLTopologySSLCAFile": "/etc/ssl/certs/wmf-ca-certificates.crt",
root@dborch1001:/etc/ssl/certs# ls /etc/ssl/certs/puppet7.crt -l
-rw-r--r-- 1 root root 1867 Jan 15 10:42 /etc/ssl/certs/puppet7.crt
root@dborch1001:/etc/ssl/certs# sed -i s/wmf-ca-certificates/puppet7/g /etc/orchestrator.conf.json
root@dborch1001:/etc/ssl/certs# grep -i ca-certificates /etc/orchestrator.conf.json
root@dborch1001:/etc/ssl/certs# journalctl -fln5000 -u orchestrator |grep puppet
Jan 15 10:44:34 dborch1001 orchestrator[614425]: 2024-01-15 10:44:34 INFO Read in CA file: /etc/ssl/certs/puppet7.crt
Jan 15 10:44:34 dborch1001 orchestrator[614425]: Read in CA file: /etc/ssl/certs/puppet7.crt
which still triggers:
Jan 15 10:45:41 dborch1001 orchestrator[614425]: ReadTopologyInstance(dbstore1008.eqiad.wmnet:3317) show global status like 'Uptime': x509: issuer name does not match subject from issuing certificate
Jan 15 10:45:41 dborch1001 orchestrator[614425]: ReadTopologyInstance(dbstore1009.eqiad.wmnet:3316) show global status like 'Uptime': x509: issuer name does not match subject from issuing certificate
Jan 15 10:45:41 dborch1001 orchestrator[614425]: ReadTopologyInstance(dbstore1009.eqiad.wmnet:3318) show global status like 'Uptime': x509: issuer name does not match subject from issuing certificate
Jan 15 10:45:41 dborch1001 orchestrator[614425]: ReadTopologyInstance(dbstore1009.eqiad.wmnet:3320) show global status like 'Uptime': x509: issuer name does not match subject from issuing certificate
Jan 15 10:45:41 dborch1001 orchestrator[614425]: 2024-01-15 10:45:41 WARNING DiscoverInstance(dbstore1008.eqiad.wmnet:3317) instance is nil in 0.045s (Backend: 0.006s, Instance: 0.039s), error=x509: issuer name does not match subject from issuing certificate
Jan 15 10:45:41 dborch1001 orchestrator[614425]: DiscoverInstance(dbstore1008.eqiad.wmnet:3317) instance is nil in 0.045s (Backend: 0.006s, Instance: 0.039s), error=x509: issuer name does not match subject from issuing certificate
Jan 15 10:45:41 dborch1001 orchestrator[614425]: 2024-01-15 10:45:41 WARNING DiscoverInstance(dbstore1008.eqiad.wmnet:3315) instance is nil in 0.047s (Backend: 0.008s, Instance: 0.038s), error=x509: issuer name does not match subject from issuing certificate
Jan 15 10:45:41 dborch1001 orchestrator[614425]: DiscoverInstance(dbstore1008.eqiad.wmnet:3315) instance is nil in 0.047s (Backend: 0.008s, Instance: 0.038s), error=x509: issuer name does not match subject from issuing certificate
Jan 15 10:45:41 dborch1001 orchestrator[614425]: 2024-01-15 10:45:41 WARNING DiscoverInstance(dbstore1009.eqiad.wmnet:3320) instance is nil in 0.050s (Backend: 0.009s, Instance: 0.041s), error=x509: issuer name does not match subject from issuing certificate
Jan 15 10:45:41 dborch1001 orchestrator[614425]: DiscoverInstance(dbstore1009.eqiad.wmnet:3320) instance is nil in 0.050s (Backend: 0.009s, Instance: 0.041s), error=x509: issuer name does not match subject from issuing certificate
Jan 15 10:45:41 dborch1001 orchestrator[614425]: 2024-01-15 10:45:41 WARNING DiscoverInstance(dbstore1009.eqiad.wmnet:3318) instance is nil in 0.045s (Backend: 0.008s, Instance: 0.037s), error=x509: issuer name does not match subject from issuing certificate
Jan 15 10:45:41 dborch1001 orchestrator[614425]: DiscoverInstance(dbstore1009.eqiad.wmnet:3318) instance is nil in 0.045s (Backend: 0.008s, Instance: 0.037s), error=x509: issuer name does not match subject from issuing certificate
Jan 15 10:45:41 dborch1001 orchestrator[614425]: 2024-01-15 10:45:41 WARNING DiscoverInstance(dbstore1009.eqiad.wmnet:3316) instance is nil in 0.062s (Backend: 0.017s, Instance: 0.044s), error=x509: issuer name does not match subject from issuing certificate
Jan 15 10:45:41 dborch1001 orchestrator[614425]: DiscoverInstance(dbstore1009.eqiad.wmnet:3316) instance is nil in 0.062s (Backend: 0.017s, Instance: 0.044s), error=x509: issuer name does not match subject from issuing certificate
Jan 15 10:45:43 dborch1001 orchestrator[614425]: 2024-01-15 10:45:43 ERROR ReadTopologyInstance(dbstore1003.eqiad.wmnet:3315) show global status like 'Uptime': dial tcp 10.64.0.137:3315: connect: connection refused
Jan 15 10:45:43 dborch1001 orchestrator[614425]: ReadTopologyInstance(dbstore1003.eqiad.wmnet:3315) show global status like 'Uptime': dial tcp 10.64.0.137:3315: connect: connection refused
Jan 15 10:45:43 dborch1001 orchestrator[614425]: DiscoverInstance(dbstore1003.eqiad.wmnet:3315) instance is nil in 0.016s (Backend: 0.007s, Instance: 0.008s), error=dial tcp 10.64.0.137:3315: connect: connection refused
Jan 15 10:45:43 dborch1001 orchestrator[614425]: 2024-01-15 10:45:43 WARNING DiscoverInstance(dbstore1003.eqiad.wmnet:3315) instance is nil in 0.016s (Backend: 0.007s, Instance: 0.008s), error=dial tcp 10.64.0.137:3315: connect: connection refused
despite:
root@dbstore1008:s1[(none)]> select @@global.ssl_ca;
+----------------------------------------+
| @@global.ssl_ca                        |
+----------------------------------------+
| /etc/ssl/certs/wmf-ca-certificates.crt |
+----------------------------------------+
1 row in set (0.000 sec)
I've reverted orchestrator to the previous config and restarted it
 | Puppet 7 ca.crt (puppet_rsa) | Puppet 5 ca.crt (palladium.eqiad.wmnet) | wmf-ca.crt (Wikimedia_Internal_Root_CA)
---|---|---|---
Orchestrator using wmf-ca.crt (Wikimedia_Internal_Root_CA) | 🟥 | 🟩 | 🟥
Orchestrator using Puppet 7 ca.crt puppet_rsa (without wmf-ca.crt) | 🟥 | 🟩 | 🟥
I ran the following test: with a custom PKI, a server certificate generated with an intermediate CA and the CA bundle fed to Orchestrator's MySQLOrchestratorSSLCAFile and MySQLTopologySSLCAFile, a quick restart and an API Call on /api/discover/db1133.eqiad.wmnet/3306 gives us that json output:
{"Code":"OK","Message":"Instance discovered: db1133.eqiad.wmnet:3306","Details":{"Key":{"Hostname":"db1133.eqiad.wmnet","Port":3306},"InstanceAlias":"","Uptime":59,"ServerID":171974728,"ServerUUID":"","Version":"10.6.12-MariaDB-log","VersionComment":"MariaDB Server","FlavorName":"","ReadOnly":true,"Binlog_format":"STATEMENT","BinlogRowImage":"","LogBinEnabled":true,"LogSlaveUpdatesEnabled":true,"LogReplicationUpdatesEnabled":true,"SelfBinlogCoordinates":{"LogFile":"db1133-bin.000015","LogPos":375,"Type":0},"MasterKey":{"Hostname":"db1125.eqiad.wmnet","Port":3306},"MasterUUID":"No","AncestryUUID":"","IsDetachedMaster":false,"Slave_SQL_Running":false,"ReplicationSQLThreadRuning":false,"Slave_IO_Running":false,"ReplicationIOThreadRuning":false,"ReplicationSQLThreadState":0,"ReplicationIOThreadState":0,"HasReplicationFilters":false,"GTIDMode":"","SupportsOracleGTID":false,"UsingOracleGTID":false,"UsingMariaDBGTID":false,"UsingPseudoGTID":false,"ReadBinlogCoordinates":{"LogFile":"db1125-bin.000007","LogPos":142815,"Type":0},"ExecBinlogCoordinates":{"LogFile":"db1125-bin.000007","LogPos":142815,"Type":0},"IsDetached":false,"RelaylogCoordinates":{"LogFile":"db1133-relay-bin.000001","LogPos":4,"Type":1},"LastSQLError":"","LastIOError":"","SecondsBehindMaster":{"Int64":0,"Valid":false},"SQLDelay":0,"ExecutedGtidSet":"","GtidPurged":"","GtidErrant":"","SlaveLagSeconds":{"Int64":3020400,"Valid":true},"ReplicationLagSeconds":{"Int64":3020400,"Valid":true},"SlaveHosts":[],"Replicas":[],"ClusterName":"db1125.eqiad.wmnet:3306","SuggestedClusterAlias":"test-s4","DataCenter":"eqiad","Region":"","PhysicalEnvironment":"","ReplicationDepth":1,"IsCoMaster":false,"HasReplicationCredentials":true,"ReplicationCredentialsAvailable":false,"SemiSyncAvailable":true,"SemiSyncPriority":0,"SemiSyncMasterEnabled":false,"SemiSyncReplicaEnabled":true,"SemiSyncMasterTimeout":100,"SemiSyncMasterWaitForReplicaCount":0,"SemiSyncMasterStatus":false,"SemiSyncMasterClients":0,"SemiSyncReplicaStatus":false,
"LastSeenTimestamp":"","IsLastCheckValid":true,"IsUpToDate":true,"IsRecentlyChecked":true,"SecondsSinceLastSeen":{"Int64":0,"Valid":false},"CountMySQLSnapshots":0,"IsCandidate":false,"PromotionRule":"neutral","IsDowntimed":false,"DowntimeReason":"","DowntimeOwner":"","DowntimeEndTimestamp":"","ElapsedDowntime":0,"UnresolvedHostname":"","AllowTLS":true,"Problems":[],"LastDiscoveryLatency":81722265,"ReplicationGroupName":"","ReplicationGroupIsSinglePrimary":false,"ReplicationGroupMemberState":"","ReplicationGroupMemberRole":"","ReplicationGroupMembers":[],"ReplicationGroupPrimaryInstanceKey":{"Hostname":"","Port":0}}}
Those files are available for testing:
ls -l /etc/mysql/ssl/test_* /etc/ssl/certs/test_chain.crt
-rw-r--r-- 1 root root 1725 Jan 15 16:10 /etc/mysql/ssl/test_cert.pem
-rw-r--r-- 1 root root 1705 Jan 15 16:11 /etc/mysql/ssl/test_server.key
-rw-r--r-- 1 root root 4172 Jan 15 15:36 /etc/ssl/certs/test_chain.crt
/etc/ssl/certs/test_chain.crt is also available on the orchestrator host.
Nice! Out of interest, which PKI tool did you use for your tests? As a next step we could test this with our PKI (and introduce a Hiera flag for opt-in tests)
I created a basic PKI with plain OpenSSL commands. How do you think we should proceed?
Let's create a separate task for switching Orchestrator to the PKI-issued cert, we're way beyond the initial scope of the current ask anyway :-)
https://phabricator.wikimedia.org/T350686 can be used as a rough template (that was used to move Ganeti to the PKI) and https://wikitech.wikimedia.org/wiki/PKI/Clients (and https://wikitech.wikimedia.org/wiki/PKI in general) had some more docs. One issue that needs to be handled is the cert rollover. If the PKI issues a new cert we can't simply trigger mysqld restarts to pick up the new cert.
Nice finding Arnaud!
+1.
Feel free to close this and we can follow up on that new task.
Good stuff!
Does this mean that we can decline T354411: Revert dbstore migration from puppet7 to puppet5 now, or would you still prefer that role to be migrated back to puppet 5?
Let's leave stalled for now. If we can make orchestrator see dbstore hosts, we can decline it. If not, we'll need to revert to puppet 5.