From page/alertmanager:
Icinga/Getent speed check 3 minutes ago instance: cloudstore1009 4 minutes ago instance: cloudstore1008 summary: CRITICAL: getent group tools.admin failed team: wmcs @cluster: wikimedia.org
From page/alertmanager:
Icinga/Getent speed check 3 minutes ago instance: cloudstore1009 4 minutes ago instance: cloudstore1008 summary: CRITICAL: getent group tools.admin failed team: wmcs @cluster: wikimedia.org
Status | Subtype | Assigned | Task | ||
---|---|---|---|---|---|
Resolved | dcaro | T292202 Icinga/Getent speed check (cloudstore1008/1009) | |||
Open | None | T292205 monitor ldap functionality from within cloud-vps | |||
Open | None | T292208 Audit & standardize our use of ldap on metal servers |
It seems resolved now, but for a moment we were unable to login into VMs either (very brief)
Some logs from cloudstore1009 suggest ldap was down for a bit:
Sep 30 14:15:55 cloudstore1009 nslcd[2274]: [24140c] <group="tools.admin"> ldap_start_tls_s() failed (uri=ldap://ldap-labs.eqiad.wikimedia.org:389): Connect error: (unknown error code) Sep 30 14:15:55 cloudstore1009 nslcd[2274]: [24140c] <group="tools.admin"> failed to bind to LDAP server ldap://ldap-labs.eqiad.wikimedia.org:389: Connect error: (unknown error code) Sep 30 14:15:55 cloudstore1009 nslcd[2274]: [24140c] <group="tools.admin"> no available LDAP server found, sleeping 1 seconds Sep 30 14:15:56 cloudstore1009 nslcd[2274]: [24140c] <group="tools.admin"> ldap_start_tls_s() failed (uri=ldap://ldap-labs.eqiad.wikimedia.org:389): Connect error: (unknown error code) Sep 30 14:15:56 cloudstore1009 nslcd[2274]: [24140c] <group="tools.admin"> failed to bind to LDAP server ldap://ldap-labs.eqiad.wikimedia.org:389: Connect error: (unknown error code) Sep 30 14:15:56 cloudstore1009 nslcd[2274]: [24140c] <group="tools.admin"> no available LDAP server found, sleeping 1 seconds Sep 30 14:15:57 cloudstore1009 nslcd[2274]: [24140c] <group="tools.admin"> ldap_start_tls_s() failed (uri=ldap://ldap-labs.eqiad.wikimedia.org:389): Connect error: (unknown error code) Sep 30 14:15:57 cloudstore1009 nslcd[2274]: [24140c] <group="tools.admin"> failed to bind to LDAP server ldap://ldap-labs.eqiad.wikimedia.org:389: Connect error: (unknown error code) Sep 30 14:15:57 cloudstore1009 nslcd[2274]: [24140c] <group="tools.admin"> no available LDAP server found, sleeping 1 seconds Sep 30 14:15:58 cloudstore1009 nslcd[2274]: [24140c] <group="tools.admin"> ldap_start_tls_s() failed (uri=ldap://ldap-labs.eqiad.wikimedia.org:389): Connect error: (unknown error code) Sep 30 14:15:58 cloudstore1009 nslcd[2274]: [24140c] <group="tools.admin"> failed to bind to LDAP server ldap://ldap-labs.eqiad.wikimedia.org:389: Connect error: (unknown error code) Sep 30 14:15:58 cloudstore1009 nslcd[2274]: [24140c] <group="tools.admin"> no available LDAP server found, sleeping 1 seconds Sep 30 14:15:59 cloudstore1009 nslcd[2274]: [24140c] <group="tools.admin"> ldap_start_tls_s() failed (uri=ldap://ldap-labs.eqiad.wikimedia.org:389): Connect error: (unknown error code) Sep 30 14:15:59 cloudstore1009 nslcd[2274]: [24140c] <group="tools.admin"> failed to bind to LDAP server ldap://ldap-labs.eqiad.wikimedia.org:389: Connect error: (unknown error code) Sep 30 14:15:59 cloudstore1009 nslcd[2274]: [24140c] <group="tools.admin"> no available LDAP server found, sleeping 1 seconds Sep 30 14:16:00 cloudstore1009 nslcd[2274]: [24140c] <group="tools.admin"> ldap_start_tls_s() failed (uri=ldap://ldap-labs.eqiad.wikimedia.org:389): Connect error: (unknown error code) Sep 30 14:16:00 cloudstore1009 nslcd[2274]: [24140c] <group="tools.admin"> failed to bind to LDAP server ldap://ldap-labs.eqiad.wikimedia.org:389: Connect error: (unknown error code) Sep 30 14:16:00 cloudstore1009 nslcd[2274]: [24140c] <group="tools.admin"> no available LDAP server found, sleeping 1 seconds Sep 30 14:16:01 cloudstore1009 nslcd[2274]: [24140c] <group="tools.admin"> ldap_start_tls_s() failed (uri=ldap://ldap-labs.eqiad.wikimedia.org:389): Connect error: (unknown error code) Sep 30 14:16:01 cloudstore1009 nslcd[2274]: [24140c] <group="tools.admin"> failed to bind to LDAP server ldap://ldap-labs.eqiad.wikimedia.org:389: Connect error: (unknown error code) Sep 30 14:16:01 cloudstore1009 nslcd[2274]: [24140c] <group="tools.admin"> no available LDAP server found, sleeping 1 seconds Sep 30 14:16:01 cloudstore1009 smartd[1889]: Device: /dev/bus/0 [megaraid_disk_03] [SAT], 32 Currently unreadable (pending) sectors Sep 30 14:16:01 cloudstore1009 smartd[1889]: Device: /dev/bus/0 [megaraid_disk_04] [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 206 to 200 Sep 30 14:16:01 cloudstore1009 smartd[1889]: Device: /dev/bus/0 [megaraid_disk_08] [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 206 to 200 Sep 30 14:16:02 cloudstore1009 systemd[1]: Started Regular job to collect puppet agent stats. Sep 30 14:16:02 cloudstore1009 nslcd[2274]: [24140c] <group="tools.admin"> ldap_start_tls_s() failed (uri=ldap://ldap-labs.eqiad.wikimedia.org:389): Connect error: (unknown error code) Sep 30 14:16:02 cloudstore1009 nslcd[2274]: [24140c] <group="tools.admin"> failed to bind to LDAP server ldap://ldap-labs.eqiad.wikimedia.org:389: Connect error: (unknown error code) Sep 30 14:16:02 cloudstore1009 nslcd[2274]: [24140c] <group="tools.admin"> no available LDAP server found, sleeping 1 seconds Sep 30 14:16:03 cloudstore1009 ulogd[1901]: [fw-in-drop] IN=eno1 OUT= MAC=d0:94:66:2d:94:3b:5c:5e:ab:3d:87:c4:08:00 SRC=92.63.197.71 DST=208.80.155.126 LEN=40 TOS=00 PREC=0x00 TTL=247 ID=60816 PROTO=TCP SPT=50169 DPT=3497 SEQ=4155865927 ACK=0 WINDOW=1024 SYN URGP=0 MARK=0 Sep 30 14:16:03 cloudstore1009 nslcd[2274]: [24140c] <group="tools.admin"> ldap_start_tls_s() failed (uri=ldap://ldap-labs.eqiad.wikimedia.org:389): Connect error: (unknown error code) Sep 30 14:16:03 cloudstore1009 nslcd[2274]: [24140c] <group="tools.admin"> failed to bind to LDAP server ldap://ldap-labs.eqiad.wikimedia.org:389: Connect error: (unknown error code) Sep 30 14:16:03 cloudstore1009 nslcd[2274]: [24140c] <group="tools.admin"> no available LDAP server found, sleeping 1 seconds Sep 30 14:16:04 cloudstore1009 nslcd[2274]: [24140c] <group="tools.admin"> ldap_start_tls_s() failed (uri=ldap://ldap-labs.eqiad.wikimedia.org:389): Connect error: (unknown error code) Sep 30 14:16:04 cloudstore1009 nslcd[2274]: [24140c] <group="tools.admin"> failed to bind to LDAP server ldap://ldap-labs.eqiad.wikimedia.org:389: Connect error: (unknown error code) Sep 30 14:16:04 cloudstore1009 nslcd[2274]: [24140c] <group="tools.admin"> no available LDAP server found: Connect error Sep 30 14:16:12 cloudstore1009 sshd[3462]: Connection from 208.80.154.88 port 35792 on 208.80.155.126 port 22 Sep 30 14:16:12 cloudstore1009 sshd[3462]: Connection closed by 208.80.154.88 port 35792 [preauth] Sep 30 14:16:13 cloudstore1009 ulogd[1901]: [fw-in-drop] IN=eno1 OUT= MAC=d0:94:66:2d:94:3b:84:18:88:0d:df:c9:08:00 SRC=106.75.141.223 DST=208.80.155.126 LEN=44 TOS=00 PREC=0x00 TTL=239 ID=58761 PROTO=TCP SPT=58914 DPT=7779 SEQ=2875220116 ACK=0 WINDOW=1024 SYN URGP=0 MARK=0 Sep 30 14:16:26 cloudstore1009 systemd[1]: Started Update NIC firmware stats exported by node_exporter. Sep 30 14:16:46 cloudstore1009 nslcd[2274]: [df1c20] <group="tools.admin"> ldap_start_tls_s() failed (uri=ldap://ldap-labs.eqiad.wikimedia.org:389): Connect error: (unknown error code) Sep 30 14:16:46 cloudstore1009 nslcd[2274]: [df1c20] <group="tools.admin"> failed to bind to LDAP server ldap://ldap-labs.eqiad.wikimedia.org:389: Connect error: (unknown error code) Sep 30 14:16:46 cloudstore1009 nslcd[2274]: [df1c20] <group="tools.admin"> no available LDAP server found: Connect error
Still having some issues:
Sep 30 14:20:21 cloudstore1009 nslcd[2274]: [c06a13] <group="tools.admin"> ldap_start_tls_s() failed (uri=ldap://ldap-labs.eqiad.wikimedia.org:389): Connect error: (unknown error code) Sep 30 14:20:21 cloudstore1009 nslcd[2274]: [c06a13] <group="tools.admin"> failed to bind to LDAP server ldap://ldap-labs.eqiad.wikimedia.org:389: Connect error: (unknown error code) Sep 30 14:20:21 cloudstore1009 nslcd[2274]: [c06a13] <group="tools.admin"> no available LDAP server found: Connect error Sep 30 14:20:24 cloudstore1009 ulogd[1901]: [fw-in-drop] IN=eno1 OUT= MAC=d0:94:66:2d:94:3b:5c:5e:ab:3d:87:c4:08:00 SRC=89.248.165.61 DST=208.80.155.126 LEN=40 TOS=00 PREC=0x00 TTL=246 ID=33473 PROTO=TCP SPT=59041 DPT=54236 SEQ=3127757411 ACK=0 WINDOW=1024 SYN URGP=0 MARK=0 Sep 30 14:20:26 cloudstore1009 systemd[1]: Started Update Debian version stat exported by node_exporter. Sep 30 14:20:28 cloudstore1009 ulogd[1901]: [fw-in-drop] IN=eno1 OUT= MAC=d0:94:66:2d:94:3b:84:18:88:0d:df:c9:08:00 SRC=79.124.62.234 DST=208.80.155.126 LEN=40 TOS=00 PREC=0x00 TTL=248 ID=14066 PROTO=TCP SPT=54235 DPT=29614 SEQ=1846992418 ACK=0 WINDOW=1024 SYN URGP=0 MARK=0 Sep 30 14:20:32 cloudstore1009 sudo[3680]: nagios : TTY=unknown ; PWD=/ ; USER=root ; COMMAND=/usr/local/lib/nagios/plugins/check_puppetrun -w 10800 -c 21600 Sep 30 14:20:32 cloudstore1009 sudo[3680]: pam_unix(sudo:session): session opened for user root by (uid=0) Sep 30 14:20:33 cloudstore1009 sudo[3680]: pam_unix(sudo:session): session closed for user root Sep 30 14:20:33 cloudstore1009 ulogd[1901]: [fw-in-drop] IN=eno1 OUT= MAC=d0:94:66:2d:94:3b:84:18:88:0d:df:c9:08:00 SRC=5.188.86.18 DST=208.80.155.126 LEN=40 TOS=00 PREC=0x00 TTL=249 ID=53886 PROTO=TCP SPT=8080 DPT=3038 SEQ=2281194206 ACK=0 WINDOW=1024 SYN URGP=0 MARK=0 Sep 30 14:20:46 cloudstore1009 nslcd[2274]: [5474d4] <group="tools.admin"> ldap_start_tls_s() failed (uri=ldap://ldap-labs.eqiad.wikimedia.org:389): Connect error: (unknown error code) Sep 30 14:20:46 cloudstore1009 nslcd[2274]: [5474d4] <group="tools.admin"> failed to bind to LDAP server ldap://ldap-labs.eqiad.wikimedia.org:389: Connect error: (unknown error code) Sep 30 14:20:46 cloudstore1009 nslcd[2274]: [5474d4] <group="tools.admin"> no available LDAP server found: Connect error
Here is what I think happened:
I restarted nslcd on all cloud hosts with
sudo cumin --force --timeout 500 -o json "A:all" "ps -ef | grep nslcd && service nslcd restart"
I'm about to do something similar for metal servers.
Mentioned in SAL (#wikimedia-cloud) [2021-09-30T14:43:02Z] <andrewbogott> ran sudo cumin --force --timeout 500 -o json "A:all" "ps -ef | grep nslcd && service nslcd restart" to get nslcd happy again T292202
Mentioned in SAL (#wikimedia-cloud) [2021-09-30T14:50:16Z] <andrewbogott> sudo cumin "cloud*" "ps -ef | grep nslcd && service nslcd restart" and sudo cumin "lab*" "ps -ef | grep nslcd && service nslcd restart" T292202
We didn't find the cause, other than the server misbehaving for whatever reason.
I think we have an actionable anyway:
Closing this as the alert is over, @aborrero suggestion sounds good to me, we can open a task for it if that's something we want to get done.