Add two read-only LDAP servers in eqiad
Closed, Resolved · Public

Description

We'd like to avoid latency from LDAP calls across datacenters, but we need to be able to fail LDAP over to a secondary if a primary fails. Adding a second LDAP server per datacenter will solve this.

Current LDAP servers (running on Ganeti clusters outside Cloud VPS and managed cooperatively with Prod SREs):

  • seaborgium.wikimedia.org - EQIAD
  • serpens.wikimedia.org - CODFW

Details

Reference
bz44722

Event Timeline

bzimport raised the priority of this task from to Low. · Nov 22 2014, 1:38 AM
bzimport added a project: Cloud-VPS.
bzimport set Reference to bz44722.
RyanLane created this task. · Feb 6 2013, 5:07 PM
Aklapper removed RyanLane as the assignee of this task. · Apr 26 2015, 12:11 PM
Krenair added a subscriber: Krenair.
Restricted Application added a subscriber: Aklapper. · Oct 13 2015, 12:59 AM
GTirloni raised the priority of this task from Low to High. · Mar 7 2019, 10:56 AM
GTirloni edited projects, added cloud-services-team (Kanban); removed Cloud-Services.

We're having constant LDAP issues now that load has been increasing on the servers (T217280)

It would be nice if we had LDAP replicas inside Cloud VPS, both for lower latency and to spread the workload more. It seems OpenLDAP/slapd has some inherent issues that will be hard to fix properly, and we might need to throw more hardware resources at it (or look for a different LDAP implementation, deep dive into the slapd source code, etc.).

GTirloni updated the task description. · Mar 7 2019, 12:20 PM
Paladox added a subscriber: Paladox. · Mar 7 2019, 2:04 PM

It would be nice if we had LDAP replicas inside Cloud VPS, both for lower latency and to spread the workload more. It seems OpenLDAP/slapd has some inherent issues that will be hard to fix properly, and we might need to throw more hardware resources at it (or look for a different LDAP implementation, deep dive into the slapd source code, etc.).

Can LDAP replication exclude password hashes etc. and just provide access to copy public information?

bd808 added a subscriber: bd808. · Mar 7 2019, 10:06 PM

Can LDAP replication exclude password hashes etc. and just provide access to copy public information?

This would be a requirement of bringing a replica all the way into Cloud VPS address space I think. If the hashes come into the less trusted space then that would be problematic.

I have done some daydreaming before about configuring the existing LDAP directories to always fail authentication attempts, to discourage Cloud VPS users from accidentally or intentionally collecting and validating passwords. If we could find a way to replicate the directory in read-only mode and exclude all the password hashes, that would seem like a net win.

I'd like to set aside the issue of ldap-on-cloud for now and just get a couple more servers up on Ganeti. I don't immediately know how to do that, but I bet @GTirloni knows how.

Note that the current setup uses MirrorMode for the LDAP servers, not N-Way Multi-Master [1], and MirrorMode does not support more than 2 LDAP servers in a write/write topology. If more servers are to be added as masters, the current puppetization would first have to be migrated to N-Way Multi-Master, and of course tested and validated. This was not chosen back when the service was set up because the write access patterns did not justify the investment. My numbers say that this still stands [2].

That being said, read access patterns seem to have changed: more applications talk to our LDAP servers now, and they are not always well written. To address this, OpenLDAP (like any decent LDAP implementation) supports read-only replicas, which can refer clients that want to perform a write to a master.
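As a rough illustration of that behaviour, here is a toy Python model of a read-only replica that answers searches locally and answers any write with a referral to a master. It is only a sketch of the idea, not OpenLDAP's implementation; the class name, the master URI, and the sample entry are all made up for the example.

```python
from dataclasses import dataclass, field

# Operation names follow the Add/Delete/Modify/Modrdn operations discussed in this task.
WRITE_OPS = {"add", "delete", "modify", "modrdn"}


@dataclass
class ReadOnlyReplica:
    """Toy model of a read-only replica; not OpenLDAP's implementation."""
    master_uri: str                      # hypothetical URI of a writable master
    entries: dict = field(default_factory=dict)

    def handle(self, op, dn):
        if op in WRITE_OPS:
            # Writes are never applied locally; the client is pointed at a master.
            return ("referral", self.master_uri)
        if op == "search":
            # Reads are served from the replica's own copy of the directory.
            return ("result", self.entries.get(dn))
        raise ValueError(f"unsupported operation: {op}")


replica = ReadOnlyReplica(
    master_uri="ldap://master.example.org",  # placeholder, not a real host
    entries={"uid=alice,ou=people,dc=example,dc=org": {"uid": "alice"}},
)
print(replica.handle("search", "uid=alice,ou=people,dc=example,dc=org"))
print(replica.handle("modify", "uid=alice,ou=people,dc=example,dc=org"))
```

Whether a client actually follows the referral is up to the client; a tool that does not chase referrals will simply see its write fail against the replica.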

Setting up 2 read only replicas behind an LVS service would be feasible (doing so for the masters is a no-go as they are in different DCs) and should increase the reliability of the service to the clients.

[1] https://www.openldap.org/doc/admin24/replication.html
[2] https://grafana.wikimedia.org/d/000000181/openldap-labs?panelId=2&fullscreen&orgId=1&from=now-30d&to=now shows exceptionally low rates for Add/Delete/Modify/Modrdn operations. Playing a bit with the queries (sum(openldap_monitored_op{dn=~"cn=(Add|Delete|Modify|Modrdn),cn=Operations,cn=Monitor"})), I get a total of 94 write requests over the course of 90 days.
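For readers unfamiliar with the query in [2], the following Python sketch mimics what it does: keep only the counters whose dn label matches one of the four write operations, then add them up. The regular expression is the one from the query; the function name and the sample counter values are invented for illustration and are not real measurements.

```python
import re

# Same label pattern as the query in [2].
WRITE_OPS = re.compile(r"cn=(Add|Delete|Modify|Modrdn),cn=Operations,cn=Monitor")


def total_write_ops(counters):
    """Sum the counters whose dn label fully matches a write operation."""
    return sum(value for dn, value in counters.items() if WRITE_OPS.fullmatch(dn))


# Hypothetical sample values, not real measurements.
sample = {
    "cn=Add,cn=Operations,cn=Monitor": 3,
    "cn=Modify,cn=Operations,cn=Monitor": 5,
    "cn=Search,cn=Operations,cn=Monitor": 1000,
    "cn=Bind,cn=Operations,cn=Monitor": 200,
}
print(total_write_ops(sample))
```

The sketch uses `fullmatch` because the `=~` matcher in the query is anchored to the whole label value; Search and Bind counters are therefore left out of the sum.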

bd808 added a comment. · Mar 13 2019, 3:28 PM

Setting up 2 read only replicas behind an LVS service would be feasible (doing so for the masters is a no-go as they are in different DCs) and should increase the reliability of the service to the clients.

All LDAP interactions coming from the Cloud VPS address space (172.16.0.0/21) should be read-only. In this environment we are primarily using LDAP as a lookup system for NSS data (/etc/passwd, /etc/group) with a secondary use case of managing authn/z for ssh to Cloud VPS instances (via ssh pub key storage & group membership) and sudoer data. Edits to the LDAP entries are always done from outside the Cloud VPS space.

If we can add 2+ read-only replicas in eqiad and point the Cloud VPS traffic to them we can a) isolate Cloud VPS traffic from other LDAP usage (Gerrit, Phabricator, HTTP Basic Auth), and b) scale the pool to accommodate disruption caused by the memory leak until we can get the leak fixed.
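A minimal sketch of the traffic split described above, assuming the decision is made purely on the client's source address. The network range is the one quoted in this comment; the function name and the pool labels are hypothetical, and in practice the split would more likely be done through client configuration than by inspecting addresses.

```python
import ipaddress

# Cloud VPS address space as quoted above.
CLOUD_VPS_NET = ipaddress.ip_network("172.16.0.0/21")


def pick_ldap_pool(client_ip):
    """Return which (hypothetical) pool a client should talk to."""
    if ipaddress.ip_address(client_ip) in CLOUD_VPS_NET:
        # Cloud VPS clients only do lookups, so the read-only replicas suffice.
        return "read-only"
    # Everything else (Gerrit, Phabricator, ...) keeps using the writable servers.
    return "read-write"


print(pick_ldap_pool("172.16.1.10"))
print(pick_ldap_pool("10.64.0.1"))
```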

Andrew renamed this task from Have two LDAP servers per datacenter to Add two read-only LDAP servers in eqiad. · Mar 13 2019, 3:39 PM

That sounds like a plan! I've re-titled this task to fit that plan. We'll need two more Ganeti VMs. The current LDAP servers have 8 cores and 4 GB of RAM. More RAM would be great, and fewer CPUs would probably be tolerable.

Change 496241 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] Initial site.pp entries for ldap-eqiad-replica01 and 02

https://gerrit.wikimedia.org/r/496241

Change 496241 merged by Andrew Bogott:
[operations/puppet@production] Initial site.pp entries for ldap-eqiad-replica01 and 02

https://gerrit.wikimedia.org/r/496241

Change 496245 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] install_server settings for ldap-eqiad-replica0[12].wikimedia.org

https://gerrit.wikimedia.org/r/496245

Change 496245 merged by Andrew Bogott:
[operations/puppet@production] install_server settings for ldap-eqiad-replica0[12].wikimedia.org

https://gerrit.wikimedia.org/r/496245

Change 496335 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] openldap: role/profile refactor for 'labs' and 'labtest' roles

https://gerrit.wikimedia.org/r/496335

Change 496335 merged by Andrew Bogott:
[operations/puppet@production] openldap: role/profile refactor for 'labs' and 'labtest' roles

https://gerrit.wikimedia.org/r/496335

Change 496503 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] openldap: make ldap-eqiad-replica01/02 ldap replicas

https://gerrit.wikimedia.org/r/496503

Change 496552 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] openldap: add read_only switch for ldap servers

https://gerrit.wikimedia.org/r/496552

Change 496552 merged by Andrew Bogott:
[operations/puppet@production] openldap: add read_only switch for ldap servers

https://gerrit.wikimedia.org/r/496552

Change 496503 merged by Andrew Bogott:
[operations/puppet@production] openldap: make ldap-eqiad-replica01/02 ldap replicas

https://gerrit.wikimedia.org/r/496503

Change 496615 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] ldap: added certs for ldap-ro

https://gerrit.wikimedia.org/r/496615

Change 496615 merged by Andrew Bogott:
[operations/puppet@production] ldap: added certs for ldap-ro

https://gerrit.wikimedia.org/r/496615

GTirloni removed a subscriber: GTirloni. · Mar 21 2019, 9:06 PM
Andrew closed this task as Resolved. · Mar 22 2019, 3:47 AM
Andrew claimed this task.

ldap-eqiad-replica01.wikimedia.org and ldap-eqiad-replica02.wikimedia.org are online now and seem to be working.

Andrew reopened this task as Open. · Mar 22 2019, 3:55 AM

this might need a bit of indexing work -- I see this in syslog:

bdb_equality_candidates: (sudoHost) not indexed
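That message indicates that an equality filter on sudoHost could not use an index. The Python sketch below is a conceptual model only (it says nothing about how the bdb backend stores its indexes): without an index every entry has to be inspected, while an equality index maps each attribute value straight to the matching entries. All function names, DNs, and values are made up for the example.

```python
from collections import defaultdict


def scan(entries, attr, value):
    """Unindexed lookup: inspect every entry."""
    return sorted(dn for dn, attrs in entries.items() if value in attrs.get(attr, []))


def build_equality_index(entries, attr):
    """Map each value of attr to the set of DNs that carry it."""
    index = defaultdict(set)
    for dn, attrs in entries.items():
        for value in attrs.get(attr, []):
            index[value].add(dn)
    return index


# Hypothetical sudoers entries.
entries = {
    "cn=rule1,ou=sudoers,dc=example,dc=org": {"sudoHost": ["ALL"]},
    "cn=rule2,ou=sudoers,dc=example,dc=org": {"sudoHost": ["host1.example.org"]},
    "cn=rule3,ou=sudoers,dc=example,dc=org": {"sudoUser": ["alice"]},
}

index = build_equality_index(entries, "sudoHost")
# Both lookups return the same DNs; the indexed one avoids touching every entry.
print(scan(entries, "sudoHost", "ALL"))
print(sorted(index["ALL"]))
```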

Change 498396 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] ldap: add an index for 'sudoHost'

https://gerrit.wikimedia.org/r/498396

Change 498439 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] ldap replicas: enable alerting

https://gerrit.wikimedia.org/r/498439

Change 498439 merged by Andrew Bogott:
[operations/puppet@production] ldap replicas: enable alerting

https://gerrit.wikimedia.org/r/498439

Change 498396 merged by Andrew Bogott:
[operations/puppet@production] ldap: add an index for 'sudoHost'

https://gerrit.wikimedia.org/r/498396

Andrew closed this task as Resolved. · Mar 25 2019, 2:37 PM

This is done. There are now two read-only replicas:

ldap-eqiad-replica01
ldap-eqiad-replica02

They're behind an LVS endpoint, ldap-ro.eqiad.wikimedia.org. Toolforge hosts are using that endpoint and seem happier.

Change 521518 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] cloud-vps hiera: move more ldap lookups to the ro replicas

https://gerrit.wikimedia.org/r/521518

Change 521518 merged by Andrew Bogott:
[operations/puppet@production] Move all cloud VMs to the read-only ldap replicas

https://gerrit.wikimedia.org/r/521518

Dzahn added a subscriber: Dzahn. · Tue, Jul 9, 11:26 PM

please see https://phabricator.wikimedia.org/T224110#5319798

I can't make LDAP changes anymore and it took us a while to see what's going on. How/where do we make changes now please?

Change 524201 had a related patch set uploaded (by Muehlenhoff; owner: Muehlenhoff):
[operations/puppet@production] Switch profile::openldap::management to obtain the LDAP server from Hiera

https://gerrit.wikimedia.org/r/524201

Change 524201 merged by Muehlenhoff:
[operations/puppet@production] Switch profile::openldap::management to obtain the LDAP server from Hiera

https://gerrit.wikimedia.org/r/524201