After Luca fixed our certificate chain issues with the sre.network.tls cookbook (see T355750), certs were generated for certain new network devices. Unfortunately, stats collection has not worked as expected on all of them.
The three devices I know had problems are listed below, but I have not checked every device:
ssw1-d8-codfw
asw1-b3-magru
asw1-b4-magru
The spine in codfw caught my attention, partly because stats from it would be useful for the upcoming migration, and partly because the similar device ssw1-d1-codfw (same hardware, same JunOS version) started working fine once a cert was added for it.
Running gnmic manually in debug mode showed the following errors when it tried to connect and subscribe for stats:
2024/07/17 19:20:07.129021 /home/runner/go/pkg/mod/google.golang.org/grpc@v1.56.1/grpclog/logger.go:53: [gnmic] [core] [Channel #10 SubChannel #15] Subchannel picks a new address "ssw1-d8-codfw.mgmt.codfw.wmnet:32767" to connect
2024/07/17 19:20:07.129198 /home/runner/go/pkg/mod/google.golang.org/grpc@v1.56.1/grpclog/logger.go:53: [gnmic] [core] pickfirstBalancer: UpdateSubConnState: 0xc0004dfd70, {CONNECTING <nil>}
2024/07/17 19:20:07.129253 /home/runner/go/pkg/mod/google.golang.org/grpc@v1.56.1/grpclog/logger.go:53: [gnmic] [core] [Channel #10] Channel Connectivity change to CONNECTING
2024/07/17 19:20:07.162484 /home/runner/go/pkg/mod/google.golang.org/grpc@v1.56.1/grpclog/logger.go:53: [gnmic] [core] [Channel #10 SubChannel #15] Subchannel Connectivity change to READY
2024/07/17 19:20:07.162554 /home/runner/go/pkg/mod/google.golang.org/grpc@v1.56.1/grpclog/logger.go:53: [gnmic] [core] pickfirstBalancer: UpdateSubConnState: 0xc0004dfd70, {READY <nil>}
2024/07/17 19:20:07.162578 /home/runner/go/pkg/mod/google.golang.org/grpc@v1.56.1/grpclog/logger.go:53: [gnmic] [core] [Channel #10] Channel Connectivity change to READY
2024/07/17 19:20:07.209090 /home/runner/go/pkg/mod/google.golang.org/grpc@v1.56.1/grpclog/logger.go:53: [gnmic] [transport] [client-transport 0xc000e8d440] Closing: connection error: desc = "error reading from server: EOF"
2024/07/17 19:20:07.209578 /home/runner/go/pkg/mod/google.golang.org/grpc@v1.56.1/grpclog/logger.go:53: [gnmic] [core] [Channel #10 SubChannel #15] Subchannel Connectivity change to IDLE
2024/07/17 19:20:07.210044 /home/runner/go/pkg/mod/google.golang.org/grpc@v1.56.1/grpclog/logger.go:53: [gnmic] [core] pickfirstBalancer: UpdateSubConnState: 0xc0004dfd70, {IDLE <nil>}
2024/07/17 19:20:07.210109 /home/runner/go/pkg/mod/google.golang.org/grpc@v1.56.1/grpclog/logger.go:53: [gnmic] [core] [Channel #10] Channel Connectivity change to IDLE
2024/07/17 19:20:07.210236 /home/runner/go/pkg/mod/google.golang.org/grpc@v1.56.1/grpclog/logger.go:53: [gnmic] [transport] [client-transport 0xc000e8d440] loopyWriter exiting with error: transport closed by client
2024/07/17 19:20:07.210834 /home/runner/work/gnmic/gnmic/app/collector.go:111: [gnmic] target "ssw1-d8-codfw.mgmt.codfw.wmnet:32767": subscription interfaces-states rcv error: rpc error: code = Unavailable desc = error reading from server: EOF
2024/07/17 19:20:07.210930 /home/runner/work/gnmic/gnmic/app/collector.go:111: [gnmic] target "ssw1-d8-codfw.mgmt.codfw.wmnet:32767": subscription interfaces-states rcv error: retrying in 10s
A tcpdump showed the TCP handshake was ok and there were several packets back and forth between the two devices. Thinking that perhaps the issue was related to the TLS handshake and certificates, I tried to verify the cert on the switch was ok, which it appeared to be:
cmooney@netflow2003:/etc# openssl s_client -showcerts -connect ssw1-d8-codfw.mgmt.codfw.wmnet:32767 2>/dev/null | openssl x509 > /tmp/ssw1-d8.cert
cmooney@netflow2003:/etc#
cmooney@netflow2003:/etc# sudo openssl verify -verbose -show_chain -CAfile /etc/ssl/localcerts/network_devices_bundle.pem /tmp/ssw1-d8.cert
/tmp/ssw1-d8.cert: OK
Chain:
depth=0: CN = ssw1-d8-codfw.mgmt.codfw.wmnet (untrusted)
depth=1: C = US, L = San Francisco, O = "Wikimedia Foundation, Inc", OU = SRE Foundations, CN = network_devices
depth=2: C = US, ST = California, L = San Francisco, O = "Wikimedia Foundation, Inc", OU = Cloud Services, CN = Wikimedia_Internal_Root_CA
To troubleshoot further, I tried establishing a gnmic subscription from my laptop with a similar config, passing "--skip-verify" so that my local instance would not validate the TLS chain. This produced the same error, so the problem does not appear to be with the certs/PKI.
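The two TLS checks above (validating against the CA bundle, then deliberately skipping validation) can also be scripted. The following is a minimal sketch using only Python's standard `ssl` and `socket` modules. The function names `make_context` and `probe_tls` are made up for illustration, and the code only exercises the TLS handshake, not gNMI itself.

```python
import socket
import ssl


def make_context(cafile=None, skip_verify=False):
    """Build a client-side TLS context.

    cafile: path to a CA bundle (for example the network devices bundle);
            None falls back to the system trust store.
    skip_verify: disable chain and hostname checks, similar in spirit to
            gnmic's --skip-verify.
    """
    ctx = ssl.create_default_context(cafile=cafile)
    if skip_verify:
        # check_hostname must be turned off before verify_mode can be CERT_NONE.
        ctx.check_hostname = False
        ctx.verify_mode = ssl.CERT_NONE
    return ctx


def probe_tls(host, port, ctx, timeout=5.0):
    """Complete a TLS handshake and return (protocol version, DER cert length)."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            der = tls.getpeercert(binary_form=True)
            return tls.version(), len(der) if der else 0
```

Calling `probe_tls("ssw1-d8-codfw.mgmt.codfw.wmnet", 32767, make_context(cafile="/etc/ssl/localcerts/network_devices_bundle.pem"))` should raise `ssl.SSLCertVerificationError` if the chain did not validate. A handshake that succeeds both with and without verification points away from the PKI, which is consistent with the conclusion above.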
I also checked and there was no issue with the user/password gnmic was set up to use.
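The gnmic debug log above already hints that the failure sits above TCP and TLS: the channel reports READY before the stream dies with EOF. In grpc-go, READY is normally only reported once the transport (including the TLS handshake) is up, though that is worth confirming for the version in use. Below is a small sketch for pulling that signal out of a longer capture, assuming the log format shown above. The regexes and function names are illustrative and not part of gnmic.

```python
import re

# Top-level channel transitions, e.g.
# "[Channel #10] Channel Connectivity change to READY".
# Subchannel lines ("[Channel #10 SubChannel #15] ...") deliberately do not match.
STATE_RE = re.compile(r"\[Channel #\d+\] Channel Connectivity change to (\w+)")

# Subscription failures reported by gnmic's collector.
RCV_ERR_RE = re.compile(
    r"subscription (\S+) rcv error: rpc error: code = (\w+) desc = (.+)$"
)


def summarize_gnmic_log(lines):
    """Return (channel states, subscription errors) in log order."""
    states, errors = [], []
    for line in lines:
        match = STATE_RE.search(line)
        if match:
            states.append(match.group(1))
            continue
        match = RCV_ERR_RE.search(line)
        if match:
            errors.append(
                {
                    "subscription": match.group(1),
                    "code": match.group(2),
                    "desc": match.group(3).strip(),
                }
            )
    return states, errors


def failed_after_ready(lines):
    """True if the channel reached READY and a subscription error was logged."""
    states, errors = summarize_gnmic_log(lines)
    return "READY" in states and bool(errors)
```

Run against the capture above, this would report the states CONNECTING, READY, IDLE and a single `Unavailable` error for the `interfaces-states` subscription.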
I tried to reset the service by running "delete system services extension-service". This did change the errors to a more typical connection reset (i.e. the service was no longer running), but the original failure returned after re-enabling the service.
Troubleshooting further, I found one difference between the working and non-working devices. Running "show agent sensors" on a working device gave the following output:
cmooney@ssw1-d1-codfw> show agent sensors
Sensor Information :
Name : sensor_1000
Resource : /junos/system/cmerror/configuration/
Version : 1.0
Sensor-id : 539528115
Subscription-ID : 1000
Parent-Sensor-Name : Not applicable
Component(s) : PFE
Profile Information :
Name : export_1000
Reporting-interval : 6
Payload-size : 5000
Format : GPB
Sensor Information :
Name : sensor_1001
Resource : /junos/system/cmerror/counters/
Version : 1.0
Sensor-id : 539528114
Subscription-ID : 1001
Parent-Sensor-Name : Not applicable
Component(s) : PFE
Profile Information :
Name : export_1001
Reporting-interval : 6
Payload-size : 5000
Format : GPB
Sensor Information :
Name : sensor_1002
Resource : /components/component/properties/property[name='state']/
Version : 1.0
Sensor-id : 539528113
Subscription-ID : 1002
Parent-Sensor-Name : Not applicable
Component(s) : chassisd
Profile Information :
Name : export_1002
Reporting-interval : 6
Payload-size : 5000
Format : GPB
Sensor Information :
Name : sensor_1003
Resource : /junos/events/event[id='CHASSISD_SNMP_TRAP7']/
Version : 1.0
Sensor-id : 539528112
Subscription-ID : 1003
Parent-Sensor-Name : Not applicable
Component(s) : eventd
Profile Information :
Name : export_1003
Reporting-interval : 0
Payload-size : 5000
Format : GPB
Sensor Information :
Name : sensor_1004
Resource : /interfaces/interface/state/
Version : 1.0
Sensor-id : 539528119
Subscription-ID : 1004
Parent-Sensor-Name : Not applicable
Component(s) : PFE,PFE,PFE,chassisd,mib2d,xmlproxyd
Profile Information :
Name : export_1004
Reporting-interval : 60
Payload-size : 5000
Format : GPB
Sensor Information :
Name : sensor_1004_1_1
Resource : /junos/system/linecard/interface/queue/
Version : 1.0
Sensor-id : 3143454041
Subscription-ID : 1004
Parent-Sensor-Name : sensor_1004
Component(s) : PFE
Profile Information :
Name : export_1004
Reporting-interval : 60
Payload-size : 5000
Format : GPB
Sensor Information :
Name : sensor_1004_1_2
Resource : /junos/system/linecard/interface/queue/extended-stats/
Version : 1.0
Sensor-id : 3143454042
Subscription-ID : 1004
Parent-Sensor-Name : sensor_1004
Component(s) : PFE
Profile Information :
Name : export_1004
Reporting-interval : 60
Payload-size : 5000
Format : GPB
Sensor Information :
Name : sensor_1004_1_3
Resource : /junos/system/linecard/interface/traffic/
Version : 1.0
Sensor-id : 3143454043
Subscription-ID : 1004
Parent-Sensor-Name : sensor_1004
Component(s) : PFE
Profile Information :
Name : export_1004
Reporting-interval : 60
Payload-size : 5000
Format : GPB
Sensor Information :
Name : sensor_1004_2_1
Resource : /interfaces/interface/state/
Version : 1.0
Sensor-id : 3143450969
Subscription-ID : 1004
Parent-Sensor-Name : sensor_1004
Component(s) : chassisd
Profile Information :
Name : export_1004
Reporting-interval : 60
Payload-size : 5000
Format : GPB
Sensor Information :
Name : sensor_1004_3_1
Resource : /interfaces/interface/state/
Version : 1.0
Sensor-id : 3143451993
Subscription-ID : 1004
Parent-Sensor-Name : sensor_1004
Component(s) : mib2d
Profile Information :
Name : export_1004
Reporting-interval : 60
Payload-size : 5000
Format : GPB
Sensor Information :
Name : sensor_1004_4_1
Resource : /interfaces/interface/state/
Version : 1.0
Sensor-id : 3143457113
Subscription-ID : 1004
Parent-Sensor-Name : sensor_1004
Component(s) : xmlproxyd
Profile Information :
Name : export_1004
Reporting-interval : 60
Payload-size : 5000
Format : GPB
{master:0}
Other switches that are not yet in state 'active', and so have no collectors trying to subscribe to them, showed the same output. On the failing devices, however, the command returned nothing:
cmooney@ssw1-d8-codfw> show agent sensors
{master:0}
cmooney@ssw1-d8-codfw>
The Juniper docs say this command shows "whether or not J-Insight has successfully subscribed to sensors on which it is dependent", and some of their pages recommend it as a troubleshooting step. According to the docs the process responsible is called agentd, which did seem to be running when I checked from the shell:
% ps aux | grep agentd
root 11221 0.0 0.7 495064 28176 - S 1Jul24 0:26.38 /usr/sbin/agentd -N
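Since not every device has been checked, the "show agent sensors" comparison could be automated once the output has been collected from each device (how it is collected, e.g. over SSH, is left out here). This is a minimal sketch, assuming the "Key : value" layout shown above. `parse_agent_sensors` and `has_sensors` are hypothetical helper names.

```python
def parse_agent_sensors(text):
    """Parse "show agent sensors" output into a list of per-sensor dicts.

    Only fields under "Sensor Information" are kept; the "Profile Information"
    blocks reuse the "Name" key and are skipped to avoid overwriting it.
    """
    sensors = []
    section = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("Sensor Information"):
            section = "sensor"
            sensors.append({})
            continue
        if line.startswith("Profile Information"):
            section = "profile"
            continue
        if section != "sensor" or ":" not in line:
            # Prompt lines, "{master:0}" and profile fields all end up here.
            continue
        # Split on the first colon only, since resource paths may contain more.
        key, _, value = line.partition(":")
        sensors[-1][key.strip()] = value.strip()
    return sensors


def has_sensors(text):
    """True if the device reported at least one sensor."""
    return len(parse_agent_sensors(text)) > 0
```

A device for which `has_sensors` returns False would match the symptom seen on ssw1-d8-codfw before its reboot, and would be a candidate for the same gnmic subscription failure.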
In the end, somewhat in desperation, and aware of upcoming work that would make ssw1-d8-codfw a live device (ruling out further experimentation), I decided to reboot it. As soon as it came back up, "show agent sensors" showed the full output, and gnmic was able to subscribe and collect stats without problems.
Probably the next step here is a TAC case with Juniper to try and get to the bottom of it. No doubt it's some bug that will require a disruptive upgrade so not great news, but let's see.