
puppet7 on cumin breaks database connections
Closed, Resolved · Public

Description

Connections fail with:

root@cumin1001:~# db-mysql db2110
ERROR 2026 (HY000): SSL connection error: self signed certificate in certificate chain

Event Timeline

Marostegui triaged this task as Medium priority. Dec 7 2023, 11:14 AM
Marostegui created this task.
Marostegui moved this task from Triage to In progress on the DBA board.

cumin1001 has been reverted to Puppet 5, but cumin2002 is on Puppet 7 and can be used to reproduce.

db1124 can be used for testing. It is a test host running puppet 7. It can be restarted, rebooted, reimaged, whatever is needed

Just took a quick look:

# db-mysql db1133
ERROR 2026 (HY000): SSL connection error: self signed certificate in certificate chain
mysql:root@localhost [(none)]> show global variables like 'ssl_ca%';
+---------------+---------------------------------------+
| Variable_name | Value                                 |
+---------------+---------------------------------------+
| ssl_ca        | /etc/ssl/certs/Puppet_Internal_CA.pem |
| ssl_capath    |                                       |
+---------------+---------------------------------------+
2 rows in set (0.001 sec)

However:

root@db1133:~# cat /etc/my.cnf | grep ssl-ca
ssl-ca=/etc/ssl/certs/wmf-ca-certificates.crt

I've tried changing that to:

ssl-ca=/etc/ssl/certs/Puppet_Internal_CA.pem

And after restarting MariaDB, I can now connect from cumin2002 (puppet7) to db1133 (puppet7). However, cumin1001 (puppet5) now (of course) fails to connect to db1133.
The same happens with db1124 (puppet7) using the above procedure.
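
Concretely, the test above boils down to something like the following on the db host. This is only a sketch: the config path and service unit are the ones used on these single-instance test hosts, and the original value should be restored afterwards.

# On the db host: point the server-side CA at the Puppet 7 CA instead of the wmf bundle
sudo sed -i 's|^ssl-ca=.*|ssl-ca=/etc/ssl/certs/Puppet_Internal_CA.pem|' /etc/my.cnf
# mysqld only picks up the new ssl-ca after a restart
sudo systemctl restart mariadb
# then retry db-mysql against the host from both cumin1001 (puppet5) and cumin2002 (puppet7)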

This needs more investigation

This has wider implications: orchestrator cannot see these hosts (db1124, db1133) with the changed cert, so this really needs to be handled carefully.

Dec  7 12:07:15 dborch1001 orchestrator[425]: 2023-12-07 12:07:15 ERROR ReadTopologyInstance(db1124.eqiad.wmnet:3306) show global status like 'Uptime': x509: issuer name does not match subject from issuing certificate
Dec  7 12:07:15 dborch1001 orchestrator[425]: 2023-12-07 12:07:15 WARNING DiscoverInstance(db1124.eqiad.wmnet:3306) instance is nil in 0.069s (Backend: 0.015s, Instance: 0.055s), error=x509: issuer name does not match subject from issuing certificate
Dec  7 12:07:25 dborch1001 orchestrator[425]: 2023-12-07 12:07:25 ERROR ReadTopologyInstance(db1133.eqiad.wmnet:3306) show global status like 'Uptime': x509: issuer name does not match subject from issuing certificate
Dec  7 12:07:25 dborch1001 orchestrator[425]: 2023-12-07 12:07:25 WARNING DiscoverInstance(db1133.eqiad.wmnet:3306) instance is nil in 0.098s (Backend: 0.017s, Instance: 0.082s), error=x509: issuer name does not match subject from issuing certificate
ABran-WMF changed the task status from Open to In Progress. Dec 7 2023, 3:49 PM
ABran-WMF claimed this task.

Testing db-mysql commands directly with the two CAs reproduces this issue; it is possible that there is a problem with the Puppet 7 CA.pem.

This could also be the root cause for the errors from dborch1001 quoted above, as it uses that cert to connect.


One other interesting fact: db1124, the puppet 7 test host mentioned above, shows the exact opposite situation: both /etc/ssl/certs/Puppet_Internal_CA.pem and /etc/ssl/certs/wmf-ca-certificates.crt work when connecting from cumin2002, whereas from cumin1001:

root@cumin1001:~# sudo db-mysql db1124 --ssl-ca /etc/ssl/certs/Puppet_Internal_CA.pem
ERROR 2026 (HY000): SSL connection error: self signed certificate in certificate chain
root@cumin1001:~# sudo db-mysql db1124 --ssl-ca /etc/ssl/certs/wmf-ca-certificates.crt
Welcome to the MariaDB monitor.  Commands end with ; or \g.
Your MariaDB connection id is 42037
Server version: 10.6.14-MariaDB-log MariaDB Server

Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

root@db1124.eqiad.wmnet[(none)]>

It appears that most of our hosts are still using /etc/ssl/certs/Puppet_Internal_CA.pem and should be migrated to use /etc/ssl/certs/wmf-ca-certificates.crt instead. What puzzles me is that orchestrator uses /etc/ssl/certs/wmf-ca-certificates.crt, which has consistently been OK for connecting to hosts, as we have been able to see above, so it does not explain why orchestrator fails to connect to some db hosts:

$ sudo journalctl -n 500 -u orchestrator|grep -i x509|grep -iEo '\Sdb.*.*wmnet'|sort|uniq
(db1124.eqiad.wmnet
(db1133.eqiad.wmnet
LSobanski subscribed.

I believe the collab tag was added automatically from the parent task, so I'm removing it.

@ABran-WMF @MoritzMuehlenhoff we are going to have to give this more priority: dbstore1003 (s1) is now failing in orchestrator after being restarted during the Christmas break as part of: https://phabricator.wikimedia.org/T351921#9426477

The instance is back and running normally but orchestrator has been failing since the restart with:

Jan  2 06:06:22 dborch1001 orchestrator[3587041]: 2024-01-02 06:06:22 ERROR ReadTopologyInstance(dbstore1003.eqiad.wmnet:3311) show global status like 'Uptime': x509: issuer name does not match subject from issuing certificate
Jan  2 06:06:22 dborch1001 orchestrator[3587041]: ReadTopologyInstance(dbstore1003.eqiad.wmnet:3311) show global status like 'Uptime': x509: issuer name does not match subject from issuing certificate
Jan  2 06:06:22 dborch1001 orchestrator[3587041]: 2024-01-02 06:06:22 WARNING DiscoverInstance(dbstore1003.eqiad.wmnet:3311) instance is nil in 0.024s (Backend: 0.003s, Instance: 0.021s), error=x509: issuer name does not match subject from issuing certificate
Jan  2 06:06:22 dborch1001 orchestrator[3587041]: DiscoverInstance(dbstore1003.eqiad.wmnet:3311) instance is nil in 0.024s (Backend: 0.003s, Instance: 0.021s), error=x509: issuer name does not match subject from issuing certificate

And as expected, it also fails from cumin1001:

[06:09:43] marostegui@cumin1001:~$ sudo db-mysql dbstore1003:3311
ERROR 2026 (HY000): SSL connection error: self signed certificate in certificate chain

To confirm, this host runs puppet7.

dbstore1003 is a MariaDB server containing replicas of mediawiki databases for analytics & research usage (mariadb::analytics_replica)
DB section s1 (alias: mysql.s1)
DB section s5 (alias: mysql.s5)
DB section s7 (alias: mysql.s7)
Bare Metal host on site eqiad and rack A3
Host has been migrated to puppet7
Marostegui raised the priority of this task from Medium to High. Jan 5 2024, 10:38 AM

This can also prevent schema changes from being fully applied to all the replicas.

it appears that most of our hosts are still using /etc/ssl/certs/Puppet_Internal_CA.pem and should be migrated to use /etc/ssl/certs/wmf-ca-certificates.crt

This is likely the issue: something, somewhere is likely still using Puppet_Internal_CA.pem, /var/lib/puppet/ssl/ca/ca.pem or $facts['puppet_config']['localcacert'] directly.

I think another factor is the long-running nature of mysqld processes since any config change would only get picked up after a daemon restart.
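
A quick way to spot both problems on a given db host is to compare the on-disk configuration with what the long-running mysqld actually loaded at startup (the same check that was done manually for db1133 above); a small sketch:

# what the config file currently says
grep -i '^ssl-ca' /etc/my.cnf
# versus what the running daemon loaded when it was last (re)started
sudo mysql -e "SHOW GLOBAL VARIABLES LIKE 'ssl_ca%'"
# and any other config under /etc that still references the old CA file
grep -rl 'Puppet_Internal_CA.pem' /etc 2>/dev/null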

This is also happening to brand new installed hosts (like dbstore1008).

Could it be this reference in wmfdb that should be updated to /etc/ssl/certs/wmf-ca-certificates.crt?
https://gitlab.wikimedia.org/repos/sre/wmfdb/-/blob/main/wmfdb/mysql_cli.py?ref_type=heads#L6


I've not really had much to do with wmfdb yet, but it would seem that if we update this and deploy a new version of the wmfdb package, that would allow db-mysql to work with both puppet 5 and puppet 7 based hosts.
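
One quick way to confirm what the installed wmfdb package actually references on a cumin host (assuming the wmfdb.mysql_cli module is importable there, since db-mysql uses it) would be something like:

# locate the installed module and show its CA-related settings
python3 -c 'import wmfdb.mysql_cli as m; print(m.__file__)'
grep -n 'ssl' "$(python3 -c 'import wmfdb.mysql_cli as m; print(m.__file__)')"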

That's a very good point @BTullis. I'd leave this to @ABran-WMF and @MoritzMuehlenhoff.
Orchestrator is still an issue though (which has nothing to do with wmfdb)

it seems that orchestrator follows the same pattern as the one @Marostegui identified above.

I'll look for a fix. In the meantime, thank you @BTullis! Here is the merge request implementing your suggestion.

For the orchestrator part, it seems that the mariadb client should be using the proper wmf-ca-certificates file.

All certificates seem to be present on dborch1001:

# ls -l /usr/local/share/ca-certificates/Puppet_Internal_CA.crt
-r--r--r-- 1 root root 2977 Nov 23 11:02 /usr/local/share/ca-certificates/Puppet_Internal_CA.crt
root@dborch1001:~# ls -l /usr/share/ca-certificates/wikimedia/Wikimedia_Internal_Root_CA.crt
-rw-r--r-- 1 root root 1111 Nov 20 17:43 /usr/share/ca-certificates/wikimedia/Wikimedia_Internal_Root_CA.crt
root@dborch1001:/home/arnaudb# grep -i cafile /etc/orchestrator.conf.json
  "MySQLOrchestratorSSLCAFile": "/etc/ssl/certs/wmf-ca-certificates.crt",
  "MySQLTopologySSLCAFile": "/etc/ssl/certs/wmf-ca-certificates.crt",
root@dborch1001:/home/arnaudb# ls -lahrt /etc/ssl/certs|grep -iE 'wiki|pup|wmf'
lrwxrwxrwx 1 root root   22 Mar 22  2023 c5aaad6f.0 -> Puppet_Internal_CA.pem
lrwxrwxrwx 1 root root   53 Mar 22  2023 wmf_ca_2017_2020.pem -> /usr/local/share/ca-certificates/wmf_ca_2017_2020.crt
lrwxrwxrwx 1 root root   20 Mar 22  2023 e3f15d55.0 -> wmf_ca_2017_2020.pem
lrwxrwxrwx 1 root root   67 Nov 21 10:58 Wikimedia_Internal_Root_CA.pem -> /usr/share/ca-certificates/wikimedia/Wikimedia_Internal_Root_CA.crt
lrwxrwxrwx 1 root root   60 Nov 21 10:58 Puppet5_Internal_CA.pem -> /usr/share/ca-certificates/wikimedia/Puppet5_Internal_CA.crt
lrwxrwxrwx 1 root root   30 Nov 21 10:58 c0cdb94e.0 -> Wikimedia_Internal_Root_CA.pem
lrwxrwxrwx 1 root root   55 Nov 23 11:02 Puppet_Internal_CA.pem -> /usr/local/share/ca-certificates/Puppet_Internal_CA.crt
lrwxrwxrwx 1 root root   23 Nov 23 11:02 c5aaad6f.1 -> Puppet5_Internal_CA.pem
-rw-r--r-- 1 root root 3.0K Nov 23 11:02 wmf-ca-certificates.crt

Side note: I noticed that wmf-ca-certificates.crt is absent from /usr/local/share/ca-certificates

On db1215 (zarcillo/orchestrator database) side:

arnaudb@db1215:~ $ grep -i ssl-ca /etc/my.cnf
ssl-ca=/etc/ssl/certs/wmf-ca-certificates.crt
arnaudb@db1215:~ $ sudo mysql
Welcome to the MariaDB monitor.  Commands end with ; or \g.
Your MariaDB connection id is 175167991
Server version: 10.6.12-MariaDB-log MariaDB Server

Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

MariaDB [(none)]>  show global variables like 'ssl_ca%'\G
*************************** 1. row ***************************
Variable_name: ssl_ca
        Value: /etc/ssl/certs/Puppet_Internal_CA.pem
*************************** 2. row ***************************
Variable_name: ssl_capath
        Value:

we have the old config running, and orchestrator is able to connect to its database. So it might be something else.

wmfdb has been released; I'll move on to orchestrator testing/debugging.

@ABran-WMF would you deploy that new version to cumin1001?

Oh shoot, I have to build it for bullseye as well! Let me check.

@MoritzMuehlenhoff what is the plan with cumin1002? @ABran-WMF has fixed db-mysql on cumin1001 so we can connect from there to both puppet5 and puppet7. If we could migrate cumin1001 to puppet7 and once confirmed it all works, drop cumin1002 that'd be ideal, otherwise we'd need to deploy grants for cumin1002 too across all the databases.

Unfortunately we'll have to update grants: The move towards cumin1002 (which is a VM) was only partly motivated by the split setup wrt Puppet 7: cumin1001 (which runs on hardware) is almost six years old and way out of warranty. Last year we decided to move one of the cumin hosts to a Ganeti VM (and keep the other one on baremetal so that we have a safety net in case ganeti is fully down), but hadn't found the time to work on that, so when the situation with Puppet 7 and DB admin access happened, I killed two birds with one stone. On the upside, now that cumin/eqiad is on VM, we'll no longer need to bother with hardware replacements in the future.

Got it, no problem - we'll add the cumin1002 grants and then once cumin1001 is gone, remove those.

Change 989144 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] production.sql.erb: Add cumin1002

https://gerrit.wikimedia.org/r/989144

Change 989144 merged by Marostegui:

[operations/puppet@production] production.sql.erb: Add cumin1002

https://gerrit.wikimedia.org/r/989144

@MoritzMuehlenhoff I'm currently trying to trace how orchestrator connects to databases to manage them, to identify which certificate is really used.

it's supposed to be:

grep -i cafile /etc/orchestrator.conf.json
  "MySQLOrchestratorSSLCAFile": "/etc/ssl/certs/wmf-ca-certificates.crt",
  "MySQLTopologySSLCAFile": "/etc/ssl/certs/wmf-ca-certificates.crt",

which seems to be valid:

md5sum /etc/ssl/certs/wmf-ca-certificates.crt
491c425507b080960b6ba8255d7cff46  /etc/ssl/certs/wmf-ca-certificates.crt

and working properly from cumin1001:

arnaudb@cumin1001:~ $ sudo db-mysql dbstore1008:3317
Welcome to the MariaDB monitor.  Commands end with ; or \g.
Your MariaDB connection id is 1736065
Server version: 10.6.16-MariaDB MariaDB Server

Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

but not from Orchestrator

@MoritzMuehlenhoff I'm currently trying to trace how orchestrator connects to databases to manage them, to identify which certificate is really used.

it's supposed to be:

grep -i cafile /etc/orchestrator.conf.json
  "MySQLOrchestratorSSLCAFile": "/etc/ssl/certs/wmf-ca-certificates.crt",
  "MySQLTopologySSLCAFile": "/etc/ssl/certs/wmf-ca-certificates.crt",

That got configured via https://gerrit.wikimedia.org/r/c/operations/puppet/+/972367/

but not from Orchestrator

Can you elaborate on what works and what does not? Specific operations? To all hosts or just the ones also running Puppet 7? Orchestrator was switched to Puppet 7 on Nov 23, so I'd expect it can't be fully broken.

I've been able to connect with the following:

mysql --ssl-ca /etc/ssl/certs/wmf-ca-certificates.crt -h dbstore1008.eqiad.wmnet -P3311 -u orchestrator -p
Enter password: 
Welcome to the MariaDB monitor.  Commands end with ; or \g.
Your MariaDB connection id is 2595824
Server version: 10.6.16-MariaDB MariaDB Server

Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

which is supposed to be the chain used by orchestrator,

which makes me think that there may have been some mishap with the configuration, or perhaps a missed orchestrator restart?

[edit] for a bit more context:
dbstore1008 is one of the hosts that is unreachable and therefore not yet referenced on Orchestrator
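
One way to check the "missed restart" theory is to compare when the config file last changed with when the orchestrator process last started, and which CA file it says it read; a sketch for dborch1001:

# when the on-disk config was last modified
stat -c '%y' /etc/orchestrator.conf.json
# versus when the orchestrator service last (re)started
systemctl show -p ActiveEnterTimestamp orchestrator
# and which CA file the running process reported reading at startup
journalctl -u orchestrator | grep -i 'Read in CA file' | tail -n 2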

I see these entries in the logs from orchestrator.

Jan 09 14:43:15 dborch1001 orchestrator[3587041]: 2024-01-09 14:43:15 WARNING DiscoverInstance(dbstore1009.eqiad.wmnet:3318) instance is nil in 0.045s (Backend: 0.004s, Instance: 0.041s), error=x509: issuer name does not match subject from issuing certificate

Jan 09 14:43:18 dborch1001 orchestrator[3587041]: 2024-01-09 14:43:18 ERROR ReadTopologyInstance(dbstore1008.eqiad.wmnet:3311) show global status like 'Uptime': x509: issuer name does not match subject from issuing certificate

https://gerrit.wikimedia.org/r/c/operations/puppet/+/972367/ was deployed November 09, but dborch1001 has an uptime of 47 days, so the change must already be in effect restart-wise. Still, a restart of Orchestrator won't hurt, I guess.

One other option is that the TLS toolchain used by Orchestrator does not handle a bundled certificate file correctly, so that when wmf-ca-certificates.crt is used, only Puppet_Internal_CA.pem is detected/passed to the DB server (which would explain why it fails to connect to hosts unknown to Puppet 7).

To confirm, we could also either spin up a second Orchestrator instance (since we don't seem to have a test instance) or temporarily only configure the new cert (to confirm that in this case only Puppet 7 servers would be reachable).
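
Independently of any Go-specific behaviour, it may also be worth double-checking exactly which CAs the bundle actually contains; a quick sketch with openssl (the PKCS#7 conversion is just a trick to print every certificate in a multi-cert file):

# list subject/issuer of every certificate inside the bundle
openssl crl2pkcs7 -nocrl -certfile /etc/ssl/certs/wmf-ca-certificates.crt \
  | openssl pkcs7 -print_certs -noout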

In case it helps, this is also a useful command for showing the certificate chain that is presented by the dbstore servers.

btullis@dborch1001:~$ openssl s_client -connect dbstore1008.eqiad.wmnet:3311 -starttls mysql -showcerts
CONNECTED(00000003)
depth=2 C = US, ST = California, L = San Francisco, O = "Wikimedia Foundation, Inc", OU = Cloud Services, CN = Wikimedia_Internal_Root_CA
verify return:1
depth=1 C = US, L = San Francisco, O = "Wikimedia Foundation, Inc", OU = SRE Foundations, CN = puppet_rsa
verify return:1
depth=0 CN = dbstore1008.eqiad.wmnet
verify return:1
---
Certificate chain
 0 s:CN = dbstore1008.eqiad.wmnet
   i:C = US, L = San Francisco, O = "Wikimedia Foundation, Inc", OU = SRE Foundations, CN = puppet_rsa
<snip snip snip>

Orchestrator does load the ca bundle:

Dec 14 08:51:50 dborch1001 orchestrator[3587041]: 2023-12-14 08:51:50 INFO Read in CA file: /etc/ssl/certs/wmf-ca-certificates.crt
Dec 14 08:51:50 dborch1001 orchestrator[3587041]: Read in CA file: /etc/ssl/certs/wmf-ca-certificates.crt

from this method, but I haven't been able to find information about CA bundle issues with Go's "crypto/tls" package. A GET call on this URL reproduces our issue with db1215's replica, which is impacted by this issue as well.
Maybe it also has something to do with:

Side note: I noticed that wmf-ca-certificates.crt is absent from /usr/local/share/ca-certificates

I'm not sure spinning up a new orchestrator instance specifically for this would be needed, but in the longer term it could be a good tool to help us debug.

Maybe it also has something to do with:

Side note: I noticed that wmf-ca-certificates.crt is absent from /usr/local/share/ca-certificates

it does not!

$ sudo cp /etc/ssl/certs/wmf-ca-certificates.crt /usr/local/share/ca-certificates/ 
$ sudo update-ca-certificates
Updating certificates in /etc/ssl/certs...
rehash: warning: skipping wmf-ca-certificates.crt,it does not contain exactly one certificate or CRL
rehash: warning: skipping wmf-ca-certificates.pem,it does not contain exactly one certificate or CRL
rehash: warning: skipping Puppet_Internal_CA.pem,it does not contain exactly one certificate or CRL
2 added, 1 removed; done.
Running hooks in /etc/ca-certificates/update.d...
done.

still has the same outcome

I've moved a bit further on the testing part. @MoritzMuehlenhoff showed me this repo, which was a bit outdated; I've built the Go binary with Go 1.15.4 to be able to use the x509ignoreCN=0 value of the GODEBUG environment variable, as using CN instead of SANs is deprecated since 1.15:

$ gox509verify db1139.eqiad.wmnet  /etc/ssl/certs/wmf-ca-certificates.crt db1139.crt 
panic: failed to verify certificate: x509: certificate relies on legacy Common Name field, use SANs instead

This is built into the binary on cumin1001:

$ git diff
diff --git a/gox509verify/main.go b/gox509verify/main.go
index aed8abc..e7b5580 100644
--- a/gox509verify/main.go
+++ b/gox509verify/main.go
@@ -51,5 +51,6 @@ func main() {
                fmt.Println("Usage: " + os.Args[0] + " dns-name-to-verify.example.org ca.crt server_cert.crt")
                os.Exit(1)
        }
+       os.Setenv("GODEBUG", "x509ignoreCN=0")
        VerifyCert(os.Args[1], os.Args[2], os.Args[3])
 }
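
For what it's worth, GODEBUG is also read from the environment at run time, so the same effect can presumably be obtained without patching the source, by setting the variable when invoking the unpatched binary:

# re-enable legacy CN matching for this invocation only (Go 1.15/1.16 behaviour)
GODEBUG=x509ignoreCN=0 gox509verify db1139.eqiad.wmnet /etc/ssl/certs/wmf-ca-certificates.crt db1139.crt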

I fetched the certificates using:

$ openssl s_client -connect db1133.eqiad.wmnet:3306 -starttls mysql -showcerts > db1133.crt
 $ openssl s_client -connect db1139.eqiad.wmnet:3311 -starttls mysql -showcerts > db1139.crt

And was then able to test it against our ca bundle:

$ gox509verify db1139.eqiad.wmnet  /etc/ssl/certs/wmf-ca-certificates.crt db1139.crt 
OK
$ gox509verify db1133.eqiad.wmnet  /etc/ssl/certs/wmf-ca-certificates.crt db1133.crt 
panic: failed to verify certificate: x509: certificate signed by unknown authority

goroutine 1 [running]:
main.VerifyCert(0x7fff45ddfad3, 0x12, 0x7fff45ddfae6, 0x26, 0x7fff45ddfb0d, 0xa)
	/app/gox509verify/main.go:43 +0x68c
main.main()
	/app/gox509verify/main.go:55 +0x214

goroutine 17 [syscall, locked to thread]:
runtime.goexit()
	/usr/local/go/src/runtime/asm_amd64.s:1721 +0x1

Which is reproducible with openssl:

$ openssl verify -verbose -CAfile /etc/ssl/certs/wmf-ca-certificates.crt db1133.crt
CN = db1133.eqiad.wmnet
error 20 at 0 depth lookup: unable to get local issuer certificate
error db1133.crt: verification failed

vs

$ openssl verify -verbose -CAfile /etc/ssl/certs/wmf-ca-certificates.crt db1139.crt
db1139.crt: OK
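
The "unable to get local issuer certificate" at depth 0 suggests the verifier cannot find the CA that issued db1133's leaf certificate (the Puppet 7 intermediate). One way to test that hypothesis is to hand any intermediates the server presents back to openssl explicitly; a sketch (file names are arbitrary):

# leaf certificate only
openssl s_client -connect db1133.eqiad.wmnet:3306 -starttls mysql </dev/null \
  | openssl x509 > db1133_leaf.pem
# everything after the leaf in the presented chain, treated as untrusted intermediates
# (if this file ends up empty, the server is not sending any intermediate at all,
# which would itself be a useful finding)
openssl s_client -connect db1133.eqiad.wmnet:3306 -starttls mysql -showcerts </dev/null \
  | awk '/BEGIN CERT/{n++} n>=2' > db1133_intermediates.pem
# if only the intermediate was missing from the bundle, this variant should now verify
openssl verify -CAfile /etc/ssl/certs/wmf-ca-certificates.crt \
  -untrusted db1133_intermediates.pem db1133_leaf.pem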

I've also tried restarting MariaDB on db1133 as it was using Puppet's ssl_ca:

root@db1139.eqiad.wmnet[(none)]> select @@GLOBAL.ssl_ca;
+---------------------------------------+
| @@GLOBAL.ssl_ca                       |
+---------------------------------------+
| /etc/ssl/certs/Puppet_Internal_CA.pem |
+---------------------------------------+

Which was then modified to the value matching the config file:

root@db1133.eqiad.wmnet[(none)]> select @@GLOBAL.ssl_ca;
+----------------------------------------+
| @@GLOBAL.ssl_ca                        |
+----------------------------------------+
| /etc/ssl/certs/wmf-ca-certificates.crt |
+----------------------------------------+

and then tried again:

$ openssl s_client -connect db1133.eqiad.wmnet:3306 -starttls mysql -showcerts > db1133_new.crt
$ gox509verify db1133.eqiad.wmnet  /etc/ssl/certs/wmf-ca-certificates.crt db1133_new.crt 
panic: failed to verify certificate: x509: certificate signed by unknown authority

goroutine 1 [running]:
main.VerifyCert(0x7fffd3511acf, 0x12, 0x7fffd3511ae2, 0x26, 0x7fffd3511b09, 0xe)
	/app/gox509verify/main.go:43 +0x68c
main.main()
	/app/gox509verify/main.go:55 +0x214

goroutine 17 [syscall, locked to thread]:
runtime.goexit()
	/usr/local/go/src/runtime/asm_amd64.s:1721 +0x1

to get a similar result, which makes me think that it might come from something a bit more exotic than just an outdated SSL CA loaded on Orchestrator's side.

To sum it up, as it's a bit confusing to re-read everything:

                                          puppet 5 (db1139)   puppet 7 (db1133)
mysql --ssl-ca wmf-ca-certificates.crt    🟩                  🟩
db-mysql using wmf-ca-certificates.crt    🟩                  🟩
openssl verify                            🟩                  🟥
gox509verify                              🟩                  🟥

as for the certificates side:

                          Puppet 7 certificate   Puppet 5 certificate   wmf certificate
Puppet7 ca.crt content    🟩                     🟥                     🟩
Puppet5 ca.crt content    🟥                     🟩                     🟥
WMF ca.crt content        🟥                     🟩                     🟩


Those are tests from the orchestrator server I assume?

No, good catch! I forgot to add those results as well. The previous results were from the tests described above.
From orchestrator the status is:

                                                             Puppet 7 ca.crt (puppet_rsa)   Puppet 5 ca.crt (palladium.eqiad.wmnet)   wmf-ca.crt (Wikimedia_Internal_Root_CA)
Orchestrator using wmf-ca.crt Wikimedia_Internal_Root_CA     🟥                             🟩                                        🟥

I'll try to restart orchestrator with MySQLOrchestratorSSLCAFile and MySQLTopologySSLCAFile pointing to puppet_rsa from Puppet 7 CA to see if it fixes the situation with db1133

If db1133 gets fixed, that should mean that the new dbstores (1008, 1009) should pop up and get discovered automatically too.

root@dborch1001:/etc/ssl/certs# grep -i ca-certificates /etc/orchestrator.conf.json 
  "MySQLOrchestratorSSLCAFile": "/etc/ssl/certs/wmf-ca-certificates.crt",
  "MySQLTopologySSLCAFile": "/etc/ssl/certs/wmf-ca-certificates.crt",
root@dborch1001:/etc/ssl/certs# ls /etc/ssl/certs/puppet7.crt -l
-rw-r--r-- 1 root root 1867 Jan 15 10:42 /etc/ssl/certs/puppet7.crt
root@dborch1001:/etc/ssl/certs# sed -i s/wmf-ca-certificates/puppet7/g /etc/orchestrator.conf.json
root@dborch1001:/etc/ssl/certs# grep -i ca-certificates /etc/orchestrator.conf.json
root@dborch1001:/etc/ssl/certs# journalctl -fln5000 -u orchestrator |grep puppet
Jan 15 10:44:34 dborch1001 orchestrator[614425]: 2024-01-15 10:44:34 INFO Read in CA file: /etc/ssl/certs/puppet7.crt
Jan 15 10:44:34 dborch1001 orchestrator[614425]: Read in CA file: /etc/ssl/certs/puppet7.crt

which still triggers:

Jan 15 10:45:41 dborch1001 orchestrator[614425]: ReadTopologyInstance(dbstore1008.eqiad.wmnet:3317) show global status like 'Uptime': x509: issuer name does not match subject from issuing certificate
Jan 15 10:45:41 dborch1001 orchestrator[614425]: ReadTopologyInstance(dbstore1009.eqiad.wmnet:3316) show global status like 'Uptime': x509: issuer name does not match subject from issuing certificate
Jan 15 10:45:41 dborch1001 orchestrator[614425]: ReadTopologyInstance(dbstore1009.eqiad.wmnet:3318) show global status like 'Uptime': x509: issuer name does not match subject from issuing certificate
Jan 15 10:45:41 dborch1001 orchestrator[614425]: ReadTopologyInstance(dbstore1009.eqiad.wmnet:3320) show global status like 'Uptime': x509: issuer name does not match subject from issuing certificate
Jan 15 10:45:41 dborch1001 orchestrator[614425]: 2024-01-15 10:45:41 WARNING DiscoverInstance(dbstore1008.eqiad.wmnet:3317) instance is nil in 0.045s (Backend: 0.006s, Instance: 0.039s), error=x509: issuer name does not match subject from issuing certificate
Jan 15 10:45:41 dborch1001 orchestrator[614425]: DiscoverInstance(dbstore1008.eqiad.wmnet:3317) instance is nil in 0.045s (Backend: 0.006s, Instance: 0.039s), error=x509: issuer name does not match subject from issuing certificate
Jan 15 10:45:41 dborch1001 orchestrator[614425]: 2024-01-15 10:45:41 WARNING DiscoverInstance(dbstore1008.eqiad.wmnet:3315) instance is nil in 0.047s (Backend: 0.008s, Instance: 0.038s), error=x509: issuer name does not match subject from issuing certificate
Jan 15 10:45:41 dborch1001 orchestrator[614425]: DiscoverInstance(dbstore1008.eqiad.wmnet:3315) instance is nil in 0.047s (Backend: 0.008s, Instance: 0.038s), error=x509: issuer name does not match subject from issuing certificate
Jan 15 10:45:41 dborch1001 orchestrator[614425]: 2024-01-15 10:45:41 WARNING DiscoverInstance(dbstore1009.eqiad.wmnet:3320) instance is nil in 0.050s (Backend: 0.009s, Instance: 0.041s), error=x509: issuer name does not match subject from issuing certificate
Jan 15 10:45:41 dborch1001 orchestrator[614425]: DiscoverInstance(dbstore1009.eqiad.wmnet:3320) instance is nil in 0.050s (Backend: 0.009s, Instance: 0.041s), error=x509: issuer name does not match subject from issuing certificate
Jan 15 10:45:41 dborch1001 orchestrator[614425]: 2024-01-15 10:45:41 WARNING DiscoverInstance(dbstore1009.eqiad.wmnet:3318) instance is nil in 0.045s (Backend: 0.008s, Instance: 0.037s), error=x509: issuer name does not match subject from issuing certificate
Jan 15 10:45:41 dborch1001 orchestrator[614425]: DiscoverInstance(dbstore1009.eqiad.wmnet:3318) instance is nil in 0.045s (Backend: 0.008s, Instance: 0.037s), error=x509: issuer name does not match subject from issuing certificate
Jan 15 10:45:41 dborch1001 orchestrator[614425]: 2024-01-15 10:45:41 WARNING DiscoverInstance(dbstore1009.eqiad.wmnet:3316) instance is nil in 0.062s (Backend: 0.017s, Instance: 0.044s), error=x509: issuer name does not match subject from issuing certificate
Jan 15 10:45:41 dborch1001 orchestrator[614425]: DiscoverInstance(dbstore1009.eqiad.wmnet:3316) instance is nil in 0.062s (Backend: 0.017s, Instance: 0.044s), error=x509: issuer name does not match subject from issuing certificate
Jan 15 10:45:43 dborch1001 orchestrator[614425]: 2024-01-15 10:45:43 ERROR ReadTopologyInstance(dbstore1003.eqiad.wmnet:3315) show global status like 'Uptime': dial tcp 10.64.0.137:3315: connect: connection refused
Jan 15 10:45:43 dborch1001 orchestrator[614425]: ReadTopologyInstance(dbstore1003.eqiad.wmnet:3315) show global status like 'Uptime': dial tcp 10.64.0.137:3315: connect: connection refused
Jan 15 10:45:43 dborch1001 orchestrator[614425]: DiscoverInstance(dbstore1003.eqiad.wmnet:3315) instance is nil in 0.016s (Backend: 0.007s, Instance: 0.008s), error=dial tcp 10.64.0.137:3315: connect: connection refused
Jan 15 10:45:43 dborch1001 orchestrator[614425]: 2024-01-15 10:45:43 WARNING DiscoverInstance(dbstore1003.eqiad.wmnet:3315) instance is nil in 0.016s (Backend: 0.007s, Instance: 0.008s), error=dial tcp 10.64.0.137:3315: connect: connection refused

despite:

root@dbstore1008:s1[(none)]> select @@global.ssl_ca;
+----------------------------------------+
| @@global.ssl_ca                        |
+----------------------------------------+
| /etc/ssl/certs/wmf-ca-certificates.crt |
+----------------------------------------+
1 row in set (0.000 sec)

I've reverted orchestrator to the previous config and restarted it

                                                                     Puppet 7 ca.crt (puppet_rsa)   Puppet 5 ca.crt (palladium.eqiad.wmnet)   wmf-ca.crt (Wikimedia_Internal_Root_CA)
Orchestrator using wmf-ca.crt Wikimedia_Internal_Root_CA             🟥                             🟩                                        🟥
Orchestrator using Puppet 7 ca.crt puppet_rsa (without wmf-ca.crt)   🟥                             🟩                                        🟥
This comment was removed by ABran-WMF.

I ran the following test: with a custom PKI, a server certificate generated with an intermediate CA, and the CA bundle fed to Orchestrator's MySQLOrchestratorSSLCAFile and MySQLTopologySSLCAFile, a quick restart and an API call on /api/discover/db1133.eqiad.wmnet/3306 gives us this JSON output:

{"Code":"OK","Message":"Instance discovered: db1133.eqiad.wmnet:3306","Details":{"Key":{"Hostname":"db1133.eqiad.wmnet","Port":3306},"InstanceAlias":"","Uptime":59,"ServerID":171974728,"ServerUUID":"","Version":"10.6.12-MariaDB-log","VersionComment":"MariaDB Server","FlavorName":"","ReadOnly":true,"Binlog_format":"STATEMENT","BinlogRowImage":"","LogBinEnabled":true,"LogSlaveUpdatesEnabled":true,"LogReplicationUpdatesEnabled":true,"SelfBinlogCoordinates":{"LogFile":"db1133-bin.000015","LogPos":375,"Type":0},"MasterKey":{"Hostname":"db1125.eqiad.wmnet","Port":3306},"MasterUUID":"No","AncestryUUID":"","IsDetachedMaster":false,"Slave_SQL_Running":false,"ReplicationSQLThreadRuning":false,"Slave_IO_Running":false,"ReplicationIOThreadRuning":false,"ReplicationSQLThreadState":0,"ReplicationIOThreadState":0,"HasReplicationFilters":false,"GTIDMode":"","SupportsOracleGTID":false,"UsingOracleGTID":false,"UsingMariaDBGTID":false,"UsingPseudoGTID":false,"ReadBinlogCoordinates":{"LogFile":"db1125-bin.000007","LogPos":142815,"Type":0},"ExecBinlogCoordinates":{"LogFile":"db1125-bin.000007","LogPos":142815,"Type":0},"IsDetached":false,"RelaylogCoordinates":{"LogFile":"db1133-relay-bin.000001","LogPos":4,"Type":1},"LastSQLError":"","LastIOError":"","SecondsBehindMaster":{"Int64":0,"Valid":false},"SQLDelay":0,"ExecutedGtidSet":"","GtidPurged":"","GtidErrant":"","SlaveLagSeconds":{"Int64":3020400,"Valid":true},"ReplicationLagSeconds":{"Int64":3020400,"Valid":true},"SlaveHosts":[],"Replicas":[],"ClusterName":"db1125.eqiad.wmnet:3306","SuggestedClusterAlias":"test-s4","DataCenter":"eqiad","Region":"","PhysicalEnvironment":"","ReplicationDepth":1,"IsCoMaster":false,"HasReplicationCredentials":true,"ReplicationCredentialsAvailable":false,"SemiSyncAvailable":true,"SemiSyncPriority":0,"SemiSyncMasterEnabled":false,"SemiSyncReplicaEnabled":true,"SemiSyncMasterTimeout":100,"SemiSyncMasterWaitForReplicaCount":0,"SemiSyncMasterStatus":false,"SemiSyncMasterClients":0,"SemiSyncReplicaStatus":false,"LastSeenTimestamp":"","IsLastCheckValid":true,"IsUpToDate":true,"IsRecentlyChecked":true,"SecondsSinceLastSeen":{"Int64":0,"Valid":false},"CountMySQLSnapshots":0,"IsCandidate":false,"PromotionRule":"neutral","IsDowntimed":false,"DowntimeReason":"","DowntimeOwner":"","DowntimeEndTimestamp":"","ElapsedDowntime":0,"UnresolvedHostname":"","AllowTLS":true,"Problems":[],"LastDiscoveryLatency":81722265,"ReplicationGroupName":"","ReplicationGroupIsSinglePrimary":false,"ReplicationGroupMemberState":"","ReplicationGroupMemberRole":"","ReplicationGroupMembers":[],"ReplicationGroupPrimaryInstanceKey":{"Hostname":"","Port":0}}}

Those files are available for testing:

ls -l /etc/mysql/ssl/test_* /etc/ssl/certs/test_chain.crt 
-rw-r--r-- 1 root root 1725 Jan 15 16:10 /etc/mysql/ssl/test_cert.pem
-rw-r--r-- 1 root root 1705 Jan 15 16:11 /etc/mysql/ssl/test_server.key
-rw-r--r-- 1 root root 4172 Jan 15 15:36 /etc/ssl/certs/test_chain.crt

The /etc/ssl/certs/test_chain.crt file is also available on the orchestrator host.
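
For reference, a rough sketch of how such a throwaway PKI (root CA, intermediate CA, server certificate with a SAN, plus a client-side chain bundle) can be built with plain openssl; the subjects, validity periods and file names below are illustrative, not the exact commands that were used:

# self-signed root CA
openssl req -x509 -newkey rsa:2048 -nodes -days 30 \
  -keyout root.key -out root.crt -subj "/CN=Test Root CA"
# intermediate CA, signed by the root
openssl req -newkey rsa:2048 -nodes -keyout intermediate.key \
  -out intermediate.csr -subj "/CN=Test Intermediate CA"
openssl x509 -req -in intermediate.csr -CA root.crt -CAkey root.key -CAcreateserial \
  -days 30 -out intermediate.crt \
  -extfile <(printf 'basicConstraints=critical,CA:TRUE\nkeyUsage=critical,keyCertSign,cRLSign')
# server certificate with a SAN, signed by the intermediate
openssl req -newkey rsa:2048 -nodes -keyout test_server.key \
  -out server.csr -subj "/CN=db1133.eqiad.wmnet"
openssl x509 -req -in server.csr -CA intermediate.crt -CAkey intermediate.key -CAcreateserial \
  -days 30 -out test_cert.pem \
  -extfile <(printf 'subjectAltName=DNS:db1133.eqiad.wmnet')
# chain bundle for the client side (what gets fed to the *SSLCAFile settings)
cat intermediate.crt root.crt > test_chain.crt
# sanity check before deploying the files to the server and to dborch1001
openssl verify -CAfile test_chain.crt test_cert.pem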

I ran the following test: with a custom PKI,

Nice! Out of interest, which PKI tool did you use for your tests? As a next step we could test this with our PKI (and introduce a Hiera flag for opt-in tests)

I created a basic PKI with plain OpenSSL commands. How do you think we should proceed?

Let's create a separate task for switching Orchestrator to the PKI-issued cert, we're way beyond the initial scope of the current ask anyway :-)

https://phabricator.wikimedia.org/T350686 can be used as a rough template (that was used to move Ganeti to the PKI) and https://wikitech.wikimedia.org/wiki/PKI/Clients (and https://wikitech.wikimedia.org/wiki/PKI in general) had some more docs. One issue that needs to be handled is the cert rollover. If the PKI issues a new cert we can't simply trigger mysqld restarts to pick up the new cert.

Nice finding Arnaud!

Let's create a separate task for switching Orchestrator to the PKI-issued cert, we're way beyond the initial scope of the current ask anyway :-)

+1.
Feel free to close this and we can follow up on that new task.

Good stuff!
Does this mean that we can decline T354411: Revert dbstore migration from puppet7 to puppet5 now, or would you still prefer that role to be migrated back to puppet 5?

Let's leave it stalled for now. If we can make orchestrator see the dbstore hosts, we can decline it. If not, we'll need to revert to puppet 5.