
Put our ldap servers behind LVS
Closed, Resolved, Public

Description

T217280 has uncovered a fair number of sub-issues. One of the most pressing ones is that sometimes when an ldap server restarts, the grid engine node using that server freaks out and gets depooled.

As far as I can tell, the traditional way to provide redundancy for ldap is on the client side -- ldap.conf contains URLs for multiple ldap servers and the client is meant to deal with fail-overs. Experience (in the grid engine and elsewhere) shows that this doesn't actually work very well... it only fails over after timeouts, errors, and other messes.
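To illustrate why client-side fail-over tends to be slow, here is a rough sketch of the "try each server in order" behaviour. It is not how any particular ldap client is implemented; `pick_server`, the `try_connect` callback and the example hostnames are all hypothetical, and it assumes a dead server costs the client the full timeout before it moves on.

```python
from typing import Callable, Optional, Sequence, Tuple


def pick_server(
    uris: Sequence[str],
    try_connect: Callable[[str, float], bool],
    timeout: float,
) -> Tuple[Optional[str], float]:
    """Try each URI in order, as a client with several configured servers would.

    Returns the first URI that answered (or None) and the worst-case number
    of seconds spent waiting on servers that did not answer.
    """
    waited = 0.0
    for uri in uris:
        if try_connect(uri, timeout):
            return uri, waited
        # Assumption: an unresponsive server consumes the whole timeout.
        waited += timeout
    return None, waited


# Hypothetical example: the first configured server is down.
down = {"ldap://ldap-a.example"}
servers = ["ldap://ldap-a.example", "ldap://ldap-b.example"]
chosen, delay = pick_server(servers, lambda uri, t: uri not in down, 10.0)
```

With a 10 second timeout, every new connection made while the first server is down can stall for up to 10 seconds before reaching the second one, which is the kind of delay a single service IP is meant to hide.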

So, let's take this out of the clients' hands and put all ldap access behind a single service name and service IP. Then if we need to keep restarting ldap servers due to the memory leak, that instability will be less obvious to clients.

Event Timeline

Change 496007 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/dns@master] Service name and IPs for ldap-behind-lvs

https://gerrit.wikimedia.org/r/496007

Change 496065 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] Add an lvs service in front of our two ldap servers

https://gerrit.wikimedia.org/r/496065

The current plan is to add two new read-only hosts (on internal IPs) and put LVS in front of them, then use that endpoint exclusively for access from cloud VMs.

Change 496007 merged by Andrew Bogott:
[operations/dns@master] Service name and IPs for ldap-behind-lvs

https://gerrit.wikimedia.org/r/496007

Change 496065 merged by Andrew Bogott:
[operations/puppet@production] Add lvs to the read-only ldap replicas

https://gerrit.wikimedia.org/r/496065

Change 496858 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] Add lvs to the read-only ldap replicas

https://gerrit.wikimedia.org/r/496858

Change 496858 merged by Andrew Bogott:
[operations/puppet@production] Add lvs to the read-only ldap replicas

https://gerrit.wikimedia.org/r/496858

There are now two read-only replicas in eqiad behind the endpoint ldap-ro.eqiad.wikimedia.org

Change 498343 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] lvs: fix ldap-ro and ldap-ro-ssl depool thresholds

https://gerrit.wikimedia.org/r/498343

Change 498343 merged by Ema:
[operations/puppet@production] lvs: fix ldap-ro and ldap-ro-ssl depool thresholds

https://gerrit.wikimedia.org/r/498343

Mentioned in SAL (#wikimedia-operations) [2019-03-22T11:18:51Z] <ema> lvs1005: bounce pybal to clear backends health icinga warning T218133

Mentioned in SAL (#wikimedia-operations) [2019-03-22T11:22:07Z] <ema> lvs1002: bounce pybal to clear backends health icinga warning T218133

This is running, and working OK. Our anti-memory-leak cron is still firing pretty often; maybe on the replicas it can depool before killing to prevent clients from getting unexpected disconnects...
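A rough sketch of what "depool before killing" could look like, just to pin down the ordering. Everything here is hypothetical: the `depool`, `restart`, `repool` and `is_healthy` callables stand in for whatever the cron job and load balancer tooling actually provide, and the drain time and retry count are made-up defaults.

```python
import time
from typing import Callable


def restart_with_depool(
    depool: Callable[[], None],
    restart: Callable[[], None],
    repool: Callable[[], None],
    is_healthy: Callable[[], bool],
    drain_seconds: float = 0.0,
    checks: int = 5,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Take a replica out of rotation, restart it, and repool it once healthy.

    Returns True if the replica was repooled, False if it never became
    healthy (in which case it is left depooled for someone to look at).
    """
    depool()
    # Give existing client connections a chance to finish.
    sleep(drain_seconds)
    restart()
    for _ in range(checks):
        if is_healthy():
            repool()
            return True
        sleep(1.0)
    return False
```

The point of the ordering is that clients stop being sent to the replica before it goes away, so the restart shows up as a slightly smaller pool rather than as dropped connections.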

Andrew claimed this task.

I can't remember why I didn't close this before.