Page MenuHomePhabricator

Icinga/Getent speed check (cloudstore1008/1009)
Closed, ResolvedPublic

Description

From page/alertmanager:

Icinga/Getent speed check

3 minutes ago instance: cloudstore1009
4 minutes ago instance: cloudstore1008

summary: CRITICAL: getent group tools.admin failed
team: wmcs
@cluster: wikimedia.org

Event Timeline

dcaro triaged this task as High priority.Sep 30 2021, 2:12 PM
dcaro created this task.
Restricted Application added a subscriber: Aklapper. · View Herald Transcript

It seems resolved now, but for a moment we were unable to login into VMs either (very brief)

Some logs from cloudstore1009 suggest ldap was down for a bit:

Sep 30 14:15:55 cloudstore1009 nslcd[2274]: [24140c] <group="tools.admin"> ldap_start_tls_s() failed (uri=ldap://ldap-labs.eqiad.wikimedia.org:389): Connect error: (unknown error code)
Sep 30 14:15:55 cloudstore1009 nslcd[2274]: [24140c] <group="tools.admin"> failed to bind to LDAP server ldap://ldap-labs.eqiad.wikimedia.org:389: Connect error: (unknown error code)
Sep 30 14:15:55 cloudstore1009 nslcd[2274]: [24140c] <group="tools.admin"> no available LDAP server found, sleeping 1 seconds
Sep 30 14:15:56 cloudstore1009 nslcd[2274]: [24140c] <group="tools.admin"> ldap_start_tls_s() failed (uri=ldap://ldap-labs.eqiad.wikimedia.org:389): Connect error: (unknown error code)
Sep 30 14:15:56 cloudstore1009 nslcd[2274]: [24140c] <group="tools.admin"> failed to bind to LDAP server ldap://ldap-labs.eqiad.wikimedia.org:389: Connect error: (unknown error code)
Sep 30 14:15:56 cloudstore1009 nslcd[2274]: [24140c] <group="tools.admin"> no available LDAP server found, sleeping 1 seconds
Sep 30 14:15:57 cloudstore1009 nslcd[2274]: [24140c] <group="tools.admin"> ldap_start_tls_s() failed (uri=ldap://ldap-labs.eqiad.wikimedia.org:389): Connect error: (unknown error code)
Sep 30 14:15:57 cloudstore1009 nslcd[2274]: [24140c] <group="tools.admin"> failed to bind to LDAP server ldap://ldap-labs.eqiad.wikimedia.org:389: Connect error: (unknown error code)
Sep 30 14:15:57 cloudstore1009 nslcd[2274]: [24140c] <group="tools.admin"> no available LDAP server found, sleeping 1 seconds
Sep 30 14:15:58 cloudstore1009 nslcd[2274]: [24140c] <group="tools.admin"> ldap_start_tls_s() failed (uri=ldap://ldap-labs.eqiad.wikimedia.org:389): Connect error: (unknown error code)
Sep 30 14:15:58 cloudstore1009 nslcd[2274]: [24140c] <group="tools.admin"> failed to bind to LDAP server ldap://ldap-labs.eqiad.wikimedia.org:389: Connect error: (unknown error code)
Sep 30 14:15:58 cloudstore1009 nslcd[2274]: [24140c] <group="tools.admin"> no available LDAP server found, sleeping 1 seconds
Sep 30 14:15:59 cloudstore1009 nslcd[2274]: [24140c] <group="tools.admin"> ldap_start_tls_s() failed (uri=ldap://ldap-labs.eqiad.wikimedia.org:389): Connect error: (unknown error code)
Sep 30 14:15:59 cloudstore1009 nslcd[2274]: [24140c] <group="tools.admin"> failed to bind to LDAP server ldap://ldap-labs.eqiad.wikimedia.org:389: Connect error: (unknown error code)
Sep 30 14:15:59 cloudstore1009 nslcd[2274]: [24140c] <group="tools.admin"> no available LDAP server found, sleeping 1 seconds
Sep 30 14:16:00 cloudstore1009 nslcd[2274]: [24140c] <group="tools.admin"> ldap_start_tls_s() failed (uri=ldap://ldap-labs.eqiad.wikimedia.org:389): Connect error: (unknown error code)
Sep 30 14:16:00 cloudstore1009 nslcd[2274]: [24140c] <group="tools.admin"> failed to bind to LDAP server ldap://ldap-labs.eqiad.wikimedia.org:389: Connect error: (unknown error code)
Sep 30 14:16:00 cloudstore1009 nslcd[2274]: [24140c] <group="tools.admin"> no available LDAP server found, sleeping 1 seconds
Sep 30 14:16:01 cloudstore1009 nslcd[2274]: [24140c] <group="tools.admin"> ldap_start_tls_s() failed (uri=ldap://ldap-labs.eqiad.wikimedia.org:389): Connect error: (unknown error code)
Sep 30 14:16:01 cloudstore1009 nslcd[2274]: [24140c] <group="tools.admin"> failed to bind to LDAP server ldap://ldap-labs.eqiad.wikimedia.org:389: Connect error: (unknown error code)
Sep 30 14:16:01 cloudstore1009 nslcd[2274]: [24140c] <group="tools.admin"> no available LDAP server found, sleeping 1 seconds
Sep 30 14:16:01 cloudstore1009 smartd[1889]: Device: /dev/bus/0 [megaraid_disk_03] [SAT], 32 Currently unreadable (pending) sectors
Sep 30 14:16:01 cloudstore1009 smartd[1889]: Device: /dev/bus/0 [megaraid_disk_04] [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 206 to 200
Sep 30 14:16:01 cloudstore1009 smartd[1889]: Device: /dev/bus/0 [megaraid_disk_08] [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 206 to 200
Sep 30 14:16:02 cloudstore1009 systemd[1]: Started Regular job to collect puppet agent stats.
Sep 30 14:16:02 cloudstore1009 nslcd[2274]: [24140c] <group="tools.admin"> ldap_start_tls_s() failed (uri=ldap://ldap-labs.eqiad.wikimedia.org:389): Connect error: (unknown error code)
Sep 30 14:16:02 cloudstore1009 nslcd[2274]: [24140c] <group="tools.admin"> failed to bind to LDAP server ldap://ldap-labs.eqiad.wikimedia.org:389: Connect error: (unknown error code)
Sep 30 14:16:02 cloudstore1009 nslcd[2274]: [24140c] <group="tools.admin"> no available LDAP server found, sleeping 1 seconds
Sep 30 14:16:03 cloudstore1009 ulogd[1901]: [fw-in-drop] IN=eno1 OUT= MAC=d0:94:66:2d:94:3b:5c:5e:ab:3d:87:c4:08:00 SRC=92.63.197.71 DST=208.80.155.126 LEN=40 TOS=00 PREC=0x00 TTL=247 ID=60816 PROTO=TCP SPT=50169 DPT=3497 SEQ=4155865927 ACK=0 WINDOW=1024 SYN URGP=0 MARK=0
Sep 30 14:16:03 cloudstore1009 nslcd[2274]: [24140c] <group="tools.admin"> ldap_start_tls_s() failed (uri=ldap://ldap-labs.eqiad.wikimedia.org:389): Connect error: (unknown error code)
Sep 30 14:16:03 cloudstore1009 nslcd[2274]: [24140c] <group="tools.admin"> failed to bind to LDAP server ldap://ldap-labs.eqiad.wikimedia.org:389: Connect error: (unknown error code)
Sep 30 14:16:03 cloudstore1009 nslcd[2274]: [24140c] <group="tools.admin"> no available LDAP server found, sleeping 1 seconds
Sep 30 14:16:04 cloudstore1009 nslcd[2274]: [24140c] <group="tools.admin"> ldap_start_tls_s() failed (uri=ldap://ldap-labs.eqiad.wikimedia.org:389): Connect error: (unknown error code)
Sep 30 14:16:04 cloudstore1009 nslcd[2274]: [24140c] <group="tools.admin"> failed to bind to LDAP server ldap://ldap-labs.eqiad.wikimedia.org:389: Connect error: (unknown error code)
Sep 30 14:16:04 cloudstore1009 nslcd[2274]: [24140c] <group="tools.admin"> no available LDAP server found: Connect error
Sep 30 14:16:12 cloudstore1009 sshd[3462]: Connection from 208.80.154.88 port 35792 on 208.80.155.126 port 22
Sep 30 14:16:12 cloudstore1009 sshd[3462]: Connection closed by 208.80.154.88 port 35792 [preauth]
Sep 30 14:16:13 cloudstore1009 ulogd[1901]: [fw-in-drop] IN=eno1 OUT= MAC=d0:94:66:2d:94:3b:84:18:88:0d:df:c9:08:00 SRC=106.75.141.223 DST=208.80.155.126 LEN=44 TOS=00 PREC=0x00 TTL=239 ID=58761 PROTO=TCP SPT=58914 DPT=7779 SEQ=2875220116 ACK=0 WINDOW=1024 SYN URGP=0 MARK=0
Sep 30 14:16:26 cloudstore1009 systemd[1]: Started Update NIC firmware stats exported by node_exporter.
Sep 30 14:16:46 cloudstore1009 nslcd[2274]: [df1c20] <group="tools.admin"> ldap_start_tls_s() failed (uri=ldap://ldap-labs.eqiad.wikimedia.org:389): Connect error: (unknown error code)
Sep 30 14:16:46 cloudstore1009 nslcd[2274]: [df1c20] <group="tools.admin"> failed to bind to LDAP server ldap://ldap-labs.eqiad.wikimedia.org:389: Connect error: (unknown error code)
Sep 30 14:16:46 cloudstore1009 nslcd[2274]: [df1c20] <group="tools.admin"> no available LDAP server found: Connect error

Still having some issues:

Sep 30 14:20:21 cloudstore1009 nslcd[2274]: [c06a13] <group="tools.admin"> ldap_start_tls_s() failed (uri=ldap://ldap-labs.eqiad.wikimedia.org:389): Connect error: (unknown error code)
Sep 30 14:20:21 cloudstore1009 nslcd[2274]: [c06a13] <group="tools.admin"> failed to bind to LDAP server ldap://ldap-labs.eqiad.wikimedia.org:389: Connect error: (unknown error code)
Sep 30 14:20:21 cloudstore1009 nslcd[2274]: [c06a13] <group="tools.admin"> no available LDAP server found: Connect error
Sep 30 14:20:24 cloudstore1009 ulogd[1901]: [fw-in-drop] IN=eno1 OUT= MAC=d0:94:66:2d:94:3b:5c:5e:ab:3d:87:c4:08:00 SRC=89.248.165.61 DST=208.80.155.126 LEN=40 TOS=00 PREC=0x00 TTL=246 ID=33473 PROTO=TCP SPT=59041 DPT=54236 SEQ=3127757411 ACK=0 WINDOW=1024 SYN URGP=0 MARK=0
Sep 30 14:20:26 cloudstore1009 systemd[1]: Started Update Debian version stat exported by node_exporter.
Sep 30 14:20:28 cloudstore1009 ulogd[1901]: [fw-in-drop] IN=eno1 OUT= MAC=d0:94:66:2d:94:3b:84:18:88:0d:df:c9:08:00 SRC=79.124.62.234 DST=208.80.155.126 LEN=40 TOS=00 PREC=0x00 TTL=248 ID=14066 PROTO=TCP SPT=54235 DPT=29614 SEQ=1846992418 ACK=0 WINDOW=1024 SYN URGP=0 MARK=0
Sep 30 14:20:32 cloudstore1009 sudo[3680]:   nagios : TTY=unknown ; PWD=/ ; USER=root ; COMMAND=/usr/local/lib/nagios/plugins/check_puppetrun -w 10800 -c 21600
Sep 30 14:20:32 cloudstore1009 sudo[3680]: pam_unix(sudo:session): session opened for user root by (uid=0)
Sep 30 14:20:33 cloudstore1009 sudo[3680]: pam_unix(sudo:session): session closed for user root
Sep 30 14:20:33 cloudstore1009 ulogd[1901]: [fw-in-drop] IN=eno1 OUT= MAC=d0:94:66:2d:94:3b:84:18:88:0d:df:c9:08:00 SRC=5.188.86.18 DST=208.80.155.126 LEN=40 TOS=00 PREC=0x00 TTL=249 ID=53886 PROTO=TCP SPT=8080 DPT=3038 SEQ=2281194206 ACK=0 WINDOW=1024 SYN URGP=0 MARK=0
Sep 30 14:20:46 cloudstore1009 nslcd[2274]: [5474d4] <group="tools.admin"> ldap_start_tls_s() failed (uri=ldap://ldap-labs.eqiad.wikimedia.org:389): Connect error: (unknown error code)
Sep 30 14:20:46 cloudstore1009 nslcd[2274]: [5474d4] <group="tools.admin"> failed to bind to LDAP server ldap://ldap-labs.eqiad.wikimedia.org:389: Connect error: (unknown error code)
Sep 30 14:20:46 cloudstore1009 nslcd[2274]: [5474d4] <group="tools.admin"> no available LDAP server found: Connect error

Here is what I think happened:

  1. ldap certs rotated at 13:00 UTC
  2. as nslcd caches expired, nslcd found itself unable to reconnect. That resulted in the steady drop of open ldap connections that we saw on grafana.
  3. ldap broke gradually over the course of an hour or so on all hosts using nslcd (e.g. bastion-eqiad1-01 in the cloud as well as cloudstores)
  4. sssd handled the rotation just fine, so no symptoms appeared on newer/sssd-using hosts

I restarted nslcd on all cloud hosts with

sudo cumin --force --timeout 500 -o json  "A:all" "ps -ef | grep nslcd && service nslcd restart"

I'm about to do something similar for metal servers.

Mentioned in SAL (#wikimedia-cloud) [2021-09-30T14:43:02Z] <andrewbogott> ran sudo cumin --force --timeout 500 -o json "A:all" "ps -ef | grep nslcd && service nslcd restart" to get nslcd happy again T292202

Mentioned in SAL (#wikimedia-cloud) [2021-09-30T14:50:16Z] <andrewbogott> sudo cumin "cloud*" "ps -ef | grep nslcd && service nslcd restart" and sudo cumin "lab*" "ps -ef | grep nslcd && service nslcd restart" T292202

We didn't find the cause, other than the server misbehaving for whatever reason.

I think we have an actionable anyway:

  • in each LDAP client add secondary codfw LDAP (if that doesn't violate some cross-DC network policy)
    • ldap-ro.codfw.wikimedia.org
    • ldap-labs.codfw.wikimedia.org

Closing this as the alert is over, @aborrero suggestion sounds good to me, we can open a task for it if that's something we want to get done.