
[openstack] LDAP is broken in codfw
Closed, Resolved (Public)

Description

slapd is running on cloudservices2004-dev and cloudservices2005-dev, and is configured to sync information bidirectionally between the two servers.

According to the slapd logs (journalctl -u slapd), synchronization appears to have been broken since before I reimaged 2004 to bookworm (the reimage happened on Sep 18):

Sep 07 13:13:33 cloudservices2005-dev slapd[2400376]: slap_client_connect: URI=ldap://cloudservices2004-dev.codfw.wmnet:389 Error, ldap_start_tls failed (-11)

Disabling TLS in /etc/ldap/slapd.conf (starttls=no) results in a different error that seems to indicate invalid credentials:

Sep 28 09:22:46 cloudservices2005-dev slapd[3980547]: slap_client_connect: URI=ldap://cloudservices2004-dev.codfw.wmnet:389 DN="cn=repluser,dc=wikimedia,dc=org" ldap_sasl_bind_s failed (49)
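To compare the two failures side by side, the log lines above can be parsed with a small script. This is a minimal sketch, not part of the original investigation: the function name `parse_failure` and the `KNOWN_CODES` table are illustrative assumptions. The code meanings are taken from the LDAP result codes (49 is invalidCredentials) and the OpenLDAP client API error codes (-11 is LDAP_CONNECT_ERROR).

```python
import re

# Log lines copied verbatim from this task.
LOG_LINES = [
    "Sep 07 13:13:33 cloudservices2005-dev slapd[2400376]: slap_client_connect: "
    "URI=ldap://cloudservices2004-dev.codfw.wmnet:389 Error, ldap_start_tls failed (-11)",
    "Sep 28 09:22:46 cloudservices2005-dev slapd[3980547]: slap_client_connect: "
    "URI=ldap://cloudservices2004-dev.codfw.wmnet:389 "
    'DN="cn=repluser,dc=wikimedia,dc=org" ldap_sasl_bind_s failed (49)',
]

# Only the two codes seen in this task; anything else is reported as unknown.
KNOWN_CODES = {
    -11: "LDAP_CONNECT_ERROR (OpenLDAP client API error)",
    49: "invalidCredentials (LDAP result code)",
}

# Matches the target URI, the failing client call, and the numeric code.
PATTERN = re.compile(
    r"slap_client_connect: URI=(?P<uri>\S+) .*?"
    r"(?P<call>ldap_\w+) failed \((?P<code>-?\d+)\)"
)


def parse_failure(line):
    """Return a dict describing a slap_client_connect failure, or None."""
    match = PATTERN.search(line)
    if match is None:
        return None
    code = int(match.group("code"))
    return {
        "uri": match.group("uri"),
        "call": match.group("call"),
        "code": code,
        "meaning": KNOWN_CODES.get(code, "unknown"),
    }


if __name__ == "__main__":
    for line in LOG_LINES:
        print(parse_failure(line))
```

The first line fails during the StartTLS step, before any bind is attempted, while the second gets as far as the bind and is rejected there. That matches the observation that disabling TLS only moves the failure to the credentials check.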

Event Timeline

fnegri changed the task status from Open to In Progress. (Sep 28 2023, 9:27 AM)
fnegri created this task.

We may want to check the certificate in use; it may be missing a subject alternative name or similar. The certificate should be defined in ops/puppet, possibly issued via acmechief, which wouldn't make much sense for that .codfw.wmnet address.

Maybe they should be contacting each other using the .private.codfw.wikimedia.cloud address.

In hieradata/role/common/acme_chief.yaml we have:

ldap-codfw1dev:
    CN: 'ns-recursor.openstack.codfw1dev.wikimediacloud.org'
    SNI:
    - 'ns0.openstack.codfw1dev.wikimediacloud.org'
    - 'ns1.openstack.codfw1dev.wikimediacloud.org'
    - 'cloudservices2004-dev.private.codfw.wikimedia.cloud'
    - 'cloudservices2005-dev.private.codfw.wikimedia.cloud'
    challenge: dns-01
    authorized_hosts:
    - 'cloudservices2004-dev.codfw.wmnet'
    - 'cloudservices2005-dev.codfw.wmnet'
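A quick way to test the missing alt-name hypothesis is to check which hostnames the certificate entry above actually covers. The sketch below is only illustrative: the name list is copied from the YAML, the function name `covered` is hypothetical, and the comparison is a plain case-insensitive exact match (no wildcard handling, since the entry contains none). It does not prove that slapd rejected the peer for this reason.

```python
# CN and SNI values copied from the ldap-codfw1dev entry shown above.
CERT_NAMES = [
    "ns-recursor.openstack.codfw1dev.wikimediacloud.org",
    "ns0.openstack.codfw1dev.wikimediacloud.org",
    "ns1.openstack.codfw1dev.wikimediacloud.org",
    "cloudservices2004-dev.private.codfw.wikimedia.cloud",
    "cloudservices2005-dev.private.codfw.wikimedia.cloud",
]


def covered(hostname, names=CERT_NAMES):
    """Return True if hostname exactly matches one of the certificate names."""
    wanted = hostname.rstrip(".").lower()
    return any(wanted == name.rstrip(".").lower() for name in names)


if __name__ == "__main__":
    for host in (
        "cloudservices2004-dev.codfw.wmnet",
        "cloudservices2004-dev.private.codfw.wikimedia.cloud",
    ):
        print(host, covered(host))
```

Under these assumptions the .codfw.wmnet name used in the replication URI is not among the certificate names, while the .private.codfw.wikimedia.cloud name is. The .codfw.wmnet names appear only under authorized_hosts.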

Change 961780 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] cloudservices[2004,2005]-dev: refresh their counterpart FQDN

https://gerrit.wikimedia.org/r/961780

Change 961780 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] cloudservices[2004,2005]-dev: refresh their counterpart FQDN

https://gerrit.wikimedia.org/r/961780

The patch seems to solve the TLS problem.

Now it seems we are missing something related to https://gerrit.wikimedia.org/r/c/operations/puppet/+/961066

I'll let @Andrew handle that one.

fnegri triaged this task as High priority. (Sep 28 2023, 1:12 PM)

Change 961843 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] codfw1dev: disable cloudservices2004-dev LDAP server

https://gerrit.wikimedia.org/r/961843

This is now resolved. I rebuilt cloudservices2005-dev with bookworm and slapd seems to be working there too.