
Investigate failover failure of LDAP servers
Closed, Resolved (Public)

Description

When seaborgium died, most things (nscd, etc) did not failover appropriately to serpens. Investigate why, and also make a list of things that did not failover properly.

Event Timeline

There is no failover in that sense: various LDAP clients can be configured with multiple servers, and depending on their configuration they either use round-robin access or use a primary server with varying methods of switching to a secondary one. The failover itself is perfectly fine for all clients in labs; I have restarted slapd many times for all kinds of updates and the Grafana dashboard has always shown connections moving to the secondary server. My guess is that after the OOM slapd still accepts connections, but the worker thread to serve the request cannot be spawned due to lack of memory.
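
For illustration, a minimal sketch of the two client strategies described above (round-robin versus a primary with fallback), assuming the Python ldap3 library; the hostnames are the two servers from this task and the timeouts are made up:

```
# Sketch only, not taken from any production config.
from ldap3 import Server, ServerPool, Connection, ROUND_ROBIN, FIRST

seaborgium = Server('ldap://seaborgium.wikimedia.org', connect_timeout=5)
serpens = Server('ldap://serpens.wikimedia.org', connect_timeout=5)

# Round-robin: new connections alternate between the two slapds.
rr_pool = ServerPool([seaborgium, serpens], ROUND_ROBIN, active=True, exhaust=False)

# Primary with fallback: always try the first server, move on only if it is unreachable.
primary_pool = ServerPool([seaborgium, serpens], FIRST, active=True, exhaust=False)

# Both strategies fail over cleanly when a server refuses connections (e.g. slapd
# stopped for an upgrade). If slapd still accepts the TCP connection but never
# answers the request -- the suspected OOM behaviour -- the pool sees a "healthy"
# server and the client hangs instead of failing over.
conn = Connection(rr_pool, auto_bind=True)  # anonymous bind, for illustration
```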

@Andrew and I were discussing whether LVS would make sense in front of LDAP with the ability to more intelligently depool/handle complex failure cases.

LVS is probably doable, but will have its own share of tricky problems (e.g. SASL). But as mentioned before, the load balancing itself works fine apart from the incident on Friday. Right after we migrated from opendj we fixed the client configs to use both slapds, and since then I have restarted openldap at least 15-20 times on each host (for kernel updates, openldap upgrades, restarts for updated libs, etc.) without any user impact.

I wasn't around on Friday and it's not fully clear to me what went wrong. Daniel mentioned in the Ops meeting that one of the VMs was also stuck at the Ganeti level? Are we sure we aren't mixing two different problems here?

I think with the interim kludge of weekly updates (and with an eventual code fix in openldap) we won't see that problem again; this was very likely just fallout of the OOM and does not happen for slapd unavailability in general. If anyone wants to test, just stop slapd on one of the servers for a few minutes and the openldap-labs dashboard will show the connections moving to the secondary host.

If what happened the other day was that one VM was overloaded and stopped answering LDAP queries while still accepting connections (which, given the described procedure, seems like a likely explanation), having LVS check LDAP and depool one server would have effectively reset all client connections to it and solved the user-facing issue. It adds a certain level of complexity though, and I'd take the other approach: restart the openLDAP service smartly while we find the source of the memleak.
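
As a rough sketch of what such a check could look like (not an actual LVS/PyBal monitor; hostname, query and timeouts are illustrative), the important part is that the check waits for a real LDAP answer rather than just a successful TCP connect, so a slapd that accepts connections but never serves requests would still be depooled:

```
#!/usr/bin/env python3
# Illustrative health check: exit 0 only if the backend answers an LDAP query
# within the deadline, otherwise exit 1 so the monitor can depool it.
import sys
from ldap3 import Server, Connection, BASE
from ldap3.core.exceptions import LDAPException

HOST = sys.argv[1] if len(sys.argv) > 1 else 'seaborgium.wikimedia.org'

try:
    conn = Connection(Server(HOST, connect_timeout=3), auto_bind=True, receive_timeout=5)
    ok = conn.search('', '(objectClass=*)', search_scope=BASE, attributes=['namingContexts'])
    conn.unbind()
    sys.exit(0 if ok else 1)
except LDAPException:
    sys.exit(1)
```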

Dependent task T130593 has had no update since Nov 2016, so this is probably solved. I am going to resolve this; feel free to reopen.

akosiaris claimed this task.