Audit the labs infrastructure scripts that depend on LDAP to make sure they are resilient to failover
Closed, Resolved (Public)

Description

Since we're restarting the LDAP servers often now, we need to make sure failover and failure handling work properly.

Event Timeline

Restricted Application added a subscriber: Aklapper.

Change 303565 had a related patch set uploaded (by Yuvipanda):
labstore: Configure LDAP failover timeout for create-dbuser

https://gerrit.wikimedia.org/r/303565

Change 303565 merged by Yuvipanda:
labstore: Configure LDAP failover timeout for create-dbuser

https://gerrit.wikimedia.org/r/303565

Change 303607 had a related patch set uploaded (by Yuvipanda):
tools: Use LDAP servers in HA manner for maintain-kubeusers

https://gerrit.wikimedia.org/r/303607

Change 303607 merged by Yuvipanda:
tools: Use LDAP servers in HA manner for maintain-kubeusers

https://gerrit.wikimedia.org/r/303607

maintain-kubeusers was stuck connecting to seaborgium today for minutes, despite there being a 1s connection timeout :(

maintain- 3414 root    4u  IPv4           10668954      0t0     TCP tools-k8s-master-01.tools.eqiad.wmflabs:33368->seaborgium.wikimedia.org:ldap (CLOSE_WAIT)
maintain- 3414 root    5u  IPv4           10670197      0t0     TCP tools-k8s-master-01.tools.eqiad.wmflabs:33432->seaborgium.wikimedia.org:ldap (CLOSE_WAIT)
maintain- 3414 root    7u  IPv4           10668828      0t0     TCP tools-k8s-master-01.tools.eqiad.wmflabs:33302->seaborgium.wikimedia.org:ldap (CLOSE_WAIT)

Adding a time_limit=N to conn.search in get_tools_from_ldap might help. According to what @yuvipanda said on IRC, the socket was already connected and in CLOSE_WAIT, so the 1s connect timeout wouldn't apply.

> time_limit: number of seconds allowed for the search (defaults to None). If None the search can take an unlimited amount of time, unless the server has a more restrictive rule.
-- http://ldap3.readthedocs.io/searches.html
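
A rough sketch of what that could look like, not the code from the patches on this task. It passes time_limit to every search and moves on to the next server when one fails or hangs. The names make_ldap3_connection and search_with_failover, the anonymous bind, and the timeout values are assumptions made for illustration. The receive_timeout option on the ldap3 Connection is included because it is meant to bound reads on a socket that is connected but no longer answering, which the connect timeout does not cover.

```python
def make_ldap3_connection(host, connect_timeout=1, receive_timeout=5):
    """Open an ldap3 connection with a connect and a receive timeout."""
    # Imported lazily so the failover helper below does not itself need ldap3.
    from ldap3 import Connection, Server

    server = Server(host, connect_timeout=connect_timeout)
    # Anonymous bind, for illustration only; real scripts would pass credentials.
    return Connection(server, auto_bind=True, read_only=True,
                      receive_timeout=receive_timeout)


def search_with_failover(hosts, base, ldap_filter, attributes,
                         connect=make_ldap3_connection, time_limit=5):
    """Run one search, trying each host in order until one succeeds."""
    last_error = None
    for host in hosts:
        conn = None
        try:
            conn = connect(host)
            # time_limit asks for the search to be abandoned after N seconds.
            conn.search(base, ldap_filter, attributes=attributes,
                        time_limit=time_limit)
            return list(conn.entries)
        except Exception as error:  # broad on purpose in this sketch
            last_error = error
        finally:
            if conn is not None:
                try:
                    conn.unbind()
                except Exception:
                    pass  # closing a dead connection may fail as well
    raise RuntimeError("all LDAP servers failed") from last_error
```

The connect argument is only there so the failover loop can be exercised with a fake connection object, without a live LDAP server.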

Change 305616 had a related patch set uploaded (by Yuvipanda):
labs: Set timeout for ldap3 using scripts

https://gerrit.wikimedia.org/r/305616

Change 305616 merged by Yuvipanda:
labs: Set timeout for ldap3 using scripts

https://gerrit.wikimedia.org/r/305616

Bstorm claimed this task.
Bstorm subscribed.

resolved by load balancing