Probes for centrallog hosts fail to validate with "x509: issuer name does not match subject from issuing certificate"
Open, Medium, Public

Description

This is a well-known issue (e.g. https://puppet.atlassian.net/browse/SERVER-2518) and the centrallog certs are currently affected (see the golang-side issue, with comments from @jbond too: https://github.com/golang/go/issues/31440).

Opening this task for tracking; if we can band-aid it in the meantime, that'd be great. The proverbial nail in the coffin is T324623: Switch rsyslog from gtls to ossl, though.

Event Timeline

@fgiunchedi what is the probing software? We do have a bit of a workaround for this, which may work here as well. Also, if it is the issue you mention, I'm not sure that switching to ossl will help. However, I suspect T347565: Switch rsyslog to use the new PKI infrastructure would.

The software in this case is prometheus blackbox exporter @jbond. AFAICT ossl doesn't suffer from this problem though I might be wrong as I've only glanced at the issue!

Change 975791 had a related patch set uploaded (by Jbond; author: jbond):

[operations/puppet@production] base: switch rsyslog tls_netstream_driver to ossl

https://gerrit.wikimedia.org/r/975791

When I have seen this issue before, it has been from the golang client. I suspect the issue is that the centrallog servers have now been switched to puppet7 and are using the new CA, so these probes are failing. My assumption is that these probes are only checking the centrallog servers (not clients), and I didn't think we had any of them using ossl? If we are seeing this fixed by using ossl, then I'd definitely be curious to dig into that, as it's a different variant from what we have seen so far.

Anyway, all that said, it seems like it should be fairly simple to switch everything to ossl now. I'll add a comment on T324623.

LSobanski subscribed.

Removing collaboration-services as I don't see any clear activity for us here.

Change 975861 had a related patch set uploaded (by Jbond; author: jbond):

[operations/puppet@production] centrallog: update tls_netstream_driver to use ossl

https://gerrit.wikimedia.org/r/975861

Change 975791 merged by Jbond:

[operations/puppet@production] base: switch rsyslog tls_netstream_driver to ossl

https://gerrit.wikimedia.org/r/975791

Change 975861 merged by Jbond:

[operations/puppet@production] centrallog: update tls_netstream_driver to use ossl

https://gerrit.wikimedia.org/r/975861

@fgiunchedi Everything is using openssl now, do you still see the errors?

Yes I still see the errors:

Nov 21, 2023 @ 14:45:47.621	prometheus1005	target=[2620:0:861:102:10:64:16:86]:6514 msg="Error dialing TCP" err="x509: issuer name does not match subject from issuing certificate"

Change 976267 had a related patch set uploaded (by Jbond; author: jbond):

[operations/puppet@production] pki: add mtls profile

https://gerrit.wikimedia.org/r/976267

Change 976267 merged by Jbond:

[operations/puppet@production] pki: add mtls profile

https://gerrit.wikimedia.org/r/976267

Change 976273 had a related patch set uploaded (by Jbond; author: jbond):

[operations/puppet@production] prometheus: update to request testing certs from pki

https://gerrit.wikimedia.org/r/976273

@fgiunchedi I have created a CR to use pki.discovery.wmnet to request a Puppet agent certificate instead of using expos_puppet_certs. This should work around the issue.

jbond triaged this task as Medium priority. Nov 21 2023, 4:53 PM

Change 976273 merged by Jbond:

[operations/puppet@production] prometheus: update to request testing certs from pki

https://gerrit.wikimedia.org/r/976273

Change 976575 had a related patch set uploaded (by Jbond; author: Jbond):

[operations/puppet@production] prometheus: update to request testing certs from pki

https://gerrit.wikimedia.org/r/976575

The CR to request a Puppet agent certificate from pki.discovery.wmnet (instead of using expos_puppet_certs) didn't fix the issue. I think we will need to pursue T347565: Switch rsyslog to use the new PKI infrastructure. However, we may still need a change similar to this one as well.