
gNMIc connection not working for cloudsw2-d5-eqiad
Closed, Resolved (Public)

Description

For some reason gnmic is unable to connect to cloudsw2-d5-eqiad. It appears it has never been able to do so.

When run in debug mode, gnmic repeats these errors every time it tries to connect:

2025/02/21 13:04:35.338810 /home/runner/go/pkg/mod/google.golang.org/grpc@v1.63.2/grpclog/logger.go:53: [gnmic] [core] [Channel #23 SubChannel #42]Subchannel picks a new address "cloudsw2-d5-eqiad.mgmt.eqiad.wmnet:32767" to connect

2025/02/21 13:04:35.343628 /home/runner/go/pkg/mod/google.golang.org/grpc@v1.63.2/grpclog/logger.go:53: [gnmic] [core] Creating new client transport to "{Addr: \"cloudsw2-d5-eqiad.mgmt.eqiad.wmnet:32767\", ServerName: \"cloudsw2-d5-eqiad.mgmt.eqiad.wmnet:32767\", }": connection error: desc = "transport: authentication handshake failed: EOF"

2025/02/21 13:04:35.343693 /home/runner/go/pkg/mod/google.golang.org/grpc@v1.63.2/grpclog/logger.go:65: [gnmic] [core] [Channel #23 SubChannel #42]grpc: addrConn.createTransport failed to connect to {Addr: "cloudsw2-d5-eqiad.mgmt.eqiad.wmnet:32767", ServerName: "cloudsw2-d5-eqiad.mgmt.eqiad.wmnet:32767", }. Err: connection error: desc = "transport: authentication handshake failed: EOF"

2025/02/21 13:04:35.343980 /home/runner/work/gnmic/gnmic/pkg/app/collector.go:123: [gnmic] target "cloudsw2-d5-eqiad.mgmt.eqiad.wmnet:32767": subscription bgp rcv error: failed to create a subscribe client, target='cloudsw2-d5-eqiad.mgmt.eqiad.wmnet:32767', retry in 10000000000. err=rpc error: code = Unavailable desc = connection error: desc = "transport: authentication handshake failed: EOF"

2025/02/21 13:04:35.344105 /home/runner/work/gnmic/gnmic/pkg/app/collector.go:123: [gnmic] target "cloudsw2-d5-eqiad.mgmt.eqiad.wmnet:32767": subscription interfaces-states rcv error: failed to create a subscribe client, target='cloudsw2-d5-eqiad.mgmt.eqiad.wmnet:32767', retry in 10000000000. err=rpc error: code = Unavailable desc = connection error: desc = "transport: authentication handshake failed: EOF"

If I connect to it with openssl to check that the cert is OK, the TCP connection opens but the TLS handshake fails straight after (0 bytes read from the switch):

cmooney@cumin1002:~$ openssl s_client -showcerts -connect cloudsw2-d5-eqiad.mgmt.eqiad.wmnet:32767 
CONNECTED(00000003)
write:errno=0
---
no peer certificate available
---
No client certificate CA names sent
---
SSL handshake has read 0 bytes and written 326 bytes
Verification: OK
---
New, (NONE), Cipher is (NONE)
Secure Renegotiation IS NOT supported
Compression: NONE
Expansion: NONE
No ALPN negotiated
Early data was not sent
Verify return code: 0 (ok)
---

However, after manually copying the cert file off the switch and checking it with openssl, I can see it is valid (it was re-generated a few days back in testing):

cmooney@cumin1002:~$ openssl x509 -in cloudsw2.crt -noout -dates 
notBefore=Feb 13 16:49:00 2025 GMT
notAfter=Feb 13 16:49:00 2026 GMT
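
For scripting the same validity check, here is a minimal Python sketch that parses the `-dates` output above and tests whether a given moment falls inside the window. The helper names (`parse_openssl_dates`, `cert_is_current`) are made up for illustration, and the timestamp used is that of the first gnmic error in the log above.

```python
from datetime import datetime, timezone


def parse_openssl_dates(text):
    """Parse `openssl x509 -noout -dates` output into UTC datetimes."""
    dates = {}
    for line in text.splitlines():
        key, sep, value = line.strip().partition("=")
        if sep and key in ("notBefore", "notAfter"):
            # openssl prints these as e.g. "Feb 13 16:49:00 2025 GMT"
            parsed = datetime.strptime(value.strip(), "%b %d %H:%M:%S %Y GMT")
            dates[key] = parsed.replace(tzinfo=timezone.utc)
    return dates


def cert_is_current(dates, now):
    """True if `now` lies within the notBefore/notAfter window."""
    return dates["notBefore"] <= now <= dates["notAfter"]


output = """notBefore=Feb 13 16:49:00 2025 GMT
notAfter=Feb 13 16:49:00 2026 GMT"""

dates = parse_openssl_dates(output)
# Timestamp of the first gnmic error shown in the task description.
when = datetime(2025, 2, 21, 13, 4, 35, tzinfo=timezone.utc)
print(cert_is_current(dates, when))  # True
```

This only confirms the dates; it says nothing about whether the switch is actually presenting this certificate during the handshake.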

I can't spot any config difference on it either. Comparing the JunOS versions on all the cloudsw devices, it runs the oldest, though it is only slightly behind E4/F4, which are working:

20.2R2-S3.5  cloudsw2-d5-eqiad
20.4R3.8     cloudsw1-e4-eqiad
20.4R3.8     cloudsw1-f4-eqiad
21.4R3.16    cloudsw1-c8-eqiad
21.4R3.16    cloudsw1-d5-eqiad
22.2R3.15    cloudsw1-b1-codfw
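
To make the "oldest version" comparison repeatable, a rough Python sketch that sorts the versions listed above. The regex is an assumption about the version string layout (major.minor, release type and number, optional `-S` service release, optional build number) based only on the six strings in this task, and `junos_key` is a hypothetical helper, not part of any Juniper tooling.

```python
import re

# Assumed layout: <major>.<minor><type><release>[-S<service>][.<build>]
VERSION_RE = re.compile(r"^(\d+)\.(\d+)([A-Z])(\d+)(?:-S(\d+))?(?:\.(\d+))?$")


def junos_key(version):
    """Turn a version string such as '20.2R2-S3.5' into a sortable tuple."""
    match = VERSION_RE.match(version)
    if match is None:
        raise ValueError(f"unrecognised JunOS version: {version!r}")
    major, minor, _rel_type, release, service, build = match.groups()
    # The release type letter is ignored here; all versions above are 'R'.
    return (int(major), int(minor), int(release), int(service or 0), int(build or 0))


inventory = {
    "cloudsw2-d5-eqiad": "20.2R2-S3.5",
    "cloudsw1-e4-eqiad": "20.4R3.8",
    "cloudsw1-f4-eqiad": "20.4R3.8",
    "cloudsw1-c8-eqiad": "21.4R3.16",
    "cloudsw1-d5-eqiad": "21.4R3.16",
    "cloudsw1-b1-codfw": "22.2R3.15",
}

oldest = min(inventory, key=lambda host: junos_key(inventory[host]))
print(oldest, inventory[oldest])  # cloudsw2-d5-eqiad 20.2R2-S3.5
```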

@ayounsi any ideas on what to try here? If nothing jumps out at us, a JunOS upgrade is unfortunately what we need.

Event Timeline

cmooney triaged this task as Low priority.

Mentioned in SAL (#wikimedia-operations) [2025-02-24T09:24:17Z] <XioNoX> cloudsw2-d5-eqiad> restart analytics-agent gracefully - T387018

The switch is running a JunOS version that is too old for analytics-agent. I tried cloudsw2-d5-eqiad> restart SDN-Telemetry gracefully instead, but that didn't work.

> The switch is running a JunOS version that is too old for analytics-agent. I tried cloudsw2-d5-eqiad> restart SDN-Telemetry gracefully instead, but that didn't work.

Makes sense, thanks for confirming. I'll talk to the cloud team to see when we might be able to upgrade.

Enabling traceoptions shows a "no shared cipher" error on the switch:

Feb 24 09:33:58 ssl_transport_security.c:948: Handshake failed with fatal error SSL_ERROR_SSL: error:1408A0C1:SSL routines:ssl3_get_client_hello:no shared cipher.
Feb 24 09:33:58 chttp2_server.c:83: Handshaking failed: {"created":"@1740389638.118250387","description":"Handshake failed","file":"../../../../../../../src/external/bsd/grpc/dist/src/core/lib/security/transport/security_handshaker.c","file_line":276,"tsi_code":10,"tsi_error":"TSI_PROTOCOL_FAILURE"}
Feb 24 09:34:02 ssl_transport_security.c:948: Handshake failed with fatal error SSL_ERROR_SSL: error:1408A0C1:SSL routines:ssl3_get_client_hello:no shared cipher.
Feb 24 09:34:02 chttp2_server.c:83: Handshaking failed: {"created":"@1740389642.111017067","description":"Handshake failed","file":"../../../../../../../src/external/bsd/grpc/dist/src/core/lib/security/transport/security_handshaker.c","file_line":276,"tsi_code":10,"tsi_error":"TSI_PROTOCOL_FAILURE"}
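
As a side note on reading that error string, here is a small Python sketch that splits the packed code `1408A0C1` into its fields. It assumes the pre-3.0 OpenSSL layout (8 bits library, 12 bits function, 12 bits reason), which matches the `lib:func:reason` text in the trace; the function names are made up for this example.

```python
import re

# Matches the "error:<hex>:<lib>:<func>:<reason>" form seen in the trace.
ERROR_RE = re.compile(r"error:([0-9A-Fa-f]{8}):([^:]+):([^:]+):(.+)")


def unpack_openssl_error(code_hex):
    """Split a packed error code, assuming the OpenSSL 1.x bit layout."""
    code = int(code_hex, 16)
    return {
        "lib": (code >> 24) & 0xFF,     # 20 is the SSL library
        "func": (code >> 12) & 0xFFF,
        "reason": code & 0xFFF,         # 193 is "no shared cipher"
    }


def parse_handshake_error(line):
    """Extract the numeric and textual error fields from one trace line."""
    match = ERROR_RE.search(line)
    if match is None:
        return None
    code_hex, lib_name, func_name, reason_text = match.groups()
    fields = unpack_openssl_error(code_hex)
    fields.update(lib_name=lib_name, func_name=func_name,
                  reason_text=reason_text.rstrip("."))
    return fields


line = ("Feb 24 09:33:58 ssl_transport_security.c:948: Handshake failed with fatal error "
        "SSL_ERROR_SSL: error:1408A0C1:SSL routines:ssl3_get_client_hello:no shared cipher.")
info = parse_handshake_error(line)
print(info["lib"], info["func"], info["reason"], info["reason_text"])
# 20 138 193 no shared cipher
```

The failure is raised while the switch processes the ClientHello, so it rejects the client's offer before sending anything back, which fits the "read 0 bytes" seen from openssl.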

Using the script on https://superuser.com/questions/109213/how-do-i-list-the-ssl-tls-cipher-suites-a-particular-website-offers shows no common cipher, unlike for example cloudsw1-d5 which shows at least ECDHE-ECDSA-AES256-GCM-SHA384 (I stopped at the first match).
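
For reference, a rough Python equivalent of that per-cipher check, stopping at the first match. This is a sketch and not the script from the link: `try_cipher` and `first_shared_cipher` are made-up names, certificate verification is turned off on purpose because only cipher negotiation is being tested, and the demo at the bottom uses a stand-in probe (so it runs without network access) that mimics the cloudsw1-d5 result reported above.

```python
import socket
import ssl


def try_cipher(host, port, cipher, timeout=5.0):
    """Return True if a TLS handshake offering only `cipher` succeeds."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE  # only testing negotiation, not trust
    # set_ciphers() does not control TLS 1.3 suites, so cap at TLS 1.2.
    ctx.maximum_version = ssl.TLSVersion.TLSv1_2
    try:
        ctx.set_ciphers(cipher)
    except ssl.SSLError:
        return False  # the local OpenSSL build does not offer this cipher
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            with ctx.wrap_socket(sock, server_hostname=host) as tls:
                return tls.cipher() is not None
    except OSError:  # ssl.SSLError is a subclass of OSError
        return False


def first_shared_cipher(host, port, candidates, probe=try_cipher):
    """Return the first candidate the target accepts, or None."""
    for cipher in candidates:
        if probe(host, port, cipher):
            return cipher
    return None


candidates = ["AES128-SHA", "ECDHE-ECDSA-AES256-GCM-SHA384"]


def fake_probe(host, port, cipher):
    # Stand-in for a real connection; accepts the one cipher reported
    # as working on cloudsw1-d5 in this task.
    return cipher == "ECDHE-ECDSA-AES256-GCM-SHA384"


print(first_shared_cipher("cloudsw1-d5-eqiad.mgmt.eqiad.wmnet", 32767,
                          candidates, probe=fake_probe))
```

Dropping the `probe=` argument makes it attempt real connections with `try_cipher`.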

An upgrade is most likely the best path forward. I'm not sure if we can (or if it's worth it to) "play" with gNMIc's cipher-suites or tls-min-version (see https://gnmic.openconfig.net/user_guide/targets/targets/#controlling-the-advertised-cipher-suites).

> Enabling traceoptions shows a "no shared cipher" error on the switch:
>
> Feb 24 09:33:58 ssl_transport_security.c:948: Handshake failed with fatal error SSL_ERROR_SSL: error:1408A0C1:SSL routines:ssl3_get_client_hello:no shared cipher.

Ah nice find.

> Using the script on https://superuser.com/questions/109213/how-do-i-list-the-ssl-tls-cipher-suites-a-particular-website-offers shows no common cipher, unlike for example cloudsw1-d5 which shows at least ECDHE-ECDSA-AES256-GCM-SHA384 (I stopped at the first match).

Yeah, I ran that myself and the results are odd. I suspect some issue/bug with the TLS implementation on that JunOS version: every single openssl cipher check reports an error (similar to the one in the task description) if you try to connect manually, even with "AES128-SHA" or others you'd figure an old version should definitely support.

> An upgrade is most likely the best path forward. I'm not sure if we can (or if it's worth it to) "play" with gNMIc's cipher-suites or tls-min-version (see https://gnmic.openconfig.net/user_guide/targets/targets/#controlling-the-advertised-cipher-suites).

Yeah, my read of this is that it's not just a matter of compatible ciphers; there is a bug or issue with the TLS implementation in this JunOS.

ayounsi claimed this task.

cloudsw2-d5-eqiad is now gone.