
Issue creating GNMI telemetry subscription to certain QFX5120 devices
Closed, Resolved (Public)

Description

Luca fixed our certificate chain issues with the sre.network.tls cookbook (see T355750), and we have since generated certs for a number of new network devices. Unfortunately, stats collection has not started working on all of them.

The three devices I'm aware of that had difficulties are below, but I've not checked them all:

ssw1-d8-codfw
asw1-b3-magru
asw1-b4-magru

The spine in codfw caught my attention for two reasons: I wanted stats working for it ahead of the upcoming migration, and a similar device, ssw1-d1-codfw (same hardware, same JunOS version), started working fine once a cert was added for it.

Running gnmic manually in debug mode showed the following errors when it tried to connect and subscribe for stats:

2024/07/17 19:20:07.129021 /home/runner/go/pkg/mod/google.golang.org/grpc@v1.56.1/grpclog/logger.go:53: [gnmic] [core] [Channel #10 SubChannel #15] Subchannel picks a new address "ssw1-d8-codfw.mgmt.codfw.wmnet:32767" to connect
2024/07/17 19:20:07.129198 /home/runner/go/pkg/mod/google.golang.org/grpc@v1.56.1/grpclog/logger.go:53: [gnmic] [core] pickfirstBalancer: UpdateSubConnState: 0xc0004dfd70, {CONNECTING <nil>}
2024/07/17 19:20:07.129253 /home/runner/go/pkg/mod/google.golang.org/grpc@v1.56.1/grpclog/logger.go:53: [gnmic] [core] [Channel #10] Channel Connectivity change to CONNECTING
2024/07/17 19:20:07.162484 /home/runner/go/pkg/mod/google.golang.org/grpc@v1.56.1/grpclog/logger.go:53: [gnmic] [core] [Channel #10 SubChannel #15] Subchannel Connectivity change to READY
2024/07/17 19:20:07.162554 /home/runner/go/pkg/mod/google.golang.org/grpc@v1.56.1/grpclog/logger.go:53: [gnmic] [core] pickfirstBalancer: UpdateSubConnState: 0xc0004dfd70, {READY <nil>}
2024/07/17 19:20:07.162578 /home/runner/go/pkg/mod/google.golang.org/grpc@v1.56.1/grpclog/logger.go:53: [gnmic] [core] [Channel #10] Channel Connectivity change to READY
2024/07/17 19:20:07.209090 /home/runner/go/pkg/mod/google.golang.org/grpc@v1.56.1/grpclog/logger.go:53: [gnmic] [transport] [client-transport 0xc000e8d440] Closing: connection error: desc = "error reading from server: EOF"
2024/07/17 19:20:07.209578 /home/runner/go/pkg/mod/google.golang.org/grpc@v1.56.1/grpclog/logger.go:53: [gnmic] [core] [Channel #10 SubChannel #15] Subchannel Connectivity change to IDLE
2024/07/17 19:20:07.210044 /home/runner/go/pkg/mod/google.golang.org/grpc@v1.56.1/grpclog/logger.go:53: [gnmic] [core] pickfirstBalancer: UpdateSubConnState: 0xc0004dfd70, {IDLE <nil>}
2024/07/17 19:20:07.210109 /home/runner/go/pkg/mod/google.golang.org/grpc@v1.56.1/grpclog/logger.go:53: [gnmic] [core] [Channel #10] Channel Connectivity change to IDLE
2024/07/17 19:20:07.210236 /home/runner/go/pkg/mod/google.golang.org/grpc@v1.56.1/grpclog/logger.go:53: [gnmic] [transport] [client-transport 0xc000e8d440] loopyWriter exiting with error: transport closed by client
2024/07/17 19:20:07.210834 /home/runner/work/gnmic/gnmic/app/collector.go:111: [gnmic] target "ssw1-d8-codfw.mgmt.codfw.wmnet:32767": subscription interfaces-states rcv error: rpc error: code = Unavailable desc = error reading from server: EOF
2024/07/17 19:20:07.210930 /home/runner/work/gnmic/gnmic/app/collector.go:111: [gnmic] target "ssw1-d8-codfw.mgmt.codfw.wmnet:32767": subscription interfaces-states rcv error: retrying in 10s
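
As a reading aid for the log above, here is a minimal Python sketch that pulls the channel state changes and subscription errors out of gnmic debug output. The function names (summarize, dropped_after_ready) are hypothetical, and the regular expressions only assume the line shapes seen in this paste, not any documented gnmic log format. The point it illustrates: the channel reaches READY, which in gRPC generally means the transport (including any TLS handshake) came up, and is dropped straight afterwards.

```python
import re

# Patterns are based only on the log lines pasted in this task.
STATE_RE = re.compile(r"\[Channel #(\d+)\] Channel Connectivity change to (\w+)")
ERROR_RE = re.compile(r'target "([^"]+)": subscription (\S+) rcv error: (.+)$')


def summarize(log_lines):
    """Return (channel_states, subscription_errors) found in the log lines."""
    states = []
    errors = []
    for line in log_lines:
        match = STATE_RE.search(line)
        if match:
            states.append(match.group(2))
            continue
        match = ERROR_RE.search(line)
        if match:
            errors.append((match.group(1), match.group(2), match.group(3).strip()))
    return states, errors


def dropped_after_ready(states):
    """True if the channel reached READY and then immediately fell back."""
    for prev, nxt in zip(states, states[1:]):
        if prev == "READY" and nxt in ("IDLE", "TRANSIENT_FAILURE"):
            return True
    return False
```

Fed the lines above, this would report the CONNECTING, READY, IDLE sequence plus the two "rcv error" entries, which is the pattern that points away from a plain connectivity problem.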

A tcpdump showed the TCP handshake was ok and there were several packets back and forth between the two devices. Thinking that the issue might be related to the TLS handshake and certificates, I tried to verify that the cert on the switch was ok, which it appeared to be:

cmooney@netflow2003:/etc# openssl s_client -showcerts -connect ssw1-d8-codfw.mgmt.codfw.wmnet:32767 2>/dev/null | openssl x509 > /tmp/ssw1-d8.cert 
cmooney@netflow2003:/etc#
cmooney@netflow2003:/etc# sudo openssl verify -verbose -show_chain -CAfile /etc/ssl/localcerts/network_devices_bundle.pem /tmp/ssw1-d8.cert 
/tmp/ssw1-d8.cert: OK
Chain:
depth=0: CN = ssw1-d8-codfw.mgmt.codfw.wmnet (untrusted)
depth=1: C = US, L = San Francisco, O = "Wikimedia Foundation, Inc", OU = SRE Foundations, CN = network_devices
depth=2: C = US, ST = California, L = San Francisco, O = "Wikimedia Foundation, Inc", OU = Cloud Services, CN = Wikimedia_Internal_Root_CA

To troubleshoot further I attempted to establish a gnmic subscription from my laptop, with a similar config but passing "--skip-verify" to tell my local instance not to validate the TLS chain. This resulted in the same error, so it's not an issue with the certs/PKI.
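
The verified-versus-unverified comparison described above can also be sketched with Python's standard library. This is only an illustration of the reasoning, not what was run in this task: the helper names (build_context, tls_handshake_ok) are hypothetical, and it checks the TLS handshake alone, not the gNMI subscription on top of it. If both modes behave the same way, certificate validation is unlikely to be the cause.

```python
import socket
import ssl


def build_context(cafile=None, skip_verify=False):
    """Build a client TLS context, optionally with chain validation disabled."""
    ctx = ssl.create_default_context(cafile=cafile)
    if skip_verify:
        # Rough equivalent of a client-side "skip verify" option:
        # accept any certificate and do not match the hostname.
        ctx.check_hostname = False
        ctx.verify_mode = ssl.CERT_NONE
    return ctx


def tls_handshake_ok(host, port, ctx, timeout=5.0):
    """Attempt a TLS handshake; return (ok, detail)."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            with ctx.wrap_socket(sock, server_hostname=host) as tls:
                return True, tls.version()
    except (ssl.SSLError, OSError) as exc:
        return False, str(exc)
```

A caller would run tls_handshake_ok once with a context built from the CA bundle and once with skip_verify=True, then compare the two results.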

I also checked and there was no issue with the user/password gnmic was set up to use.

I tried to reset the service by running "delete system services extension-service" and committing. This did change the errors received to a more typical connection reset (i.e. the service was no longer running), but there was no change after re-enabling the service.

Troubleshooting further, I could see one difference between the working and non-working devices. Running this command on a working device showed the following output:

cmooney@ssw1-d1-codfw> show agent sensors   

Sensor Information : 
           
    Name                                    : sensor_1000           
    Resource                                : /junos/system/cmerror/configuration/ 
    Version                                 : 1.0                  
    Sensor-id                               : 539528115             
    Subscription-ID                         : 1000                 
    Parent-Sensor-Name                      : Not applicable       
    Component(s)                            : PFE                   

    Profile Information : 
           
        Name                                : export_1000           
        Reporting-interval                  : 6                     
        Payload-size                        : 5000                  
        Format                              : GPB                   

Sensor Information : 
           
    Name                                    : sensor_1001           
    Resource                                : /junos/system/cmerror/counters/ 
    Version                                 : 1.0                  
    Sensor-id                               : 539528114             
    Subscription-ID                         : 1001                 
    Parent-Sensor-Name                      : Not applicable       
    Component(s)                            : PFE                   

    Profile Information : 
           
        Name                                : export_1001           
        Reporting-interval                  : 6                     
        Payload-size                        : 5000                  
        Format                              : GPB                   

Sensor Information : 
           
    Name                                    : sensor_1002           
    Resource                                : /components/component/properties/property[name='state']/ 
    Version                                 : 1.0                  
    Sensor-id                               : 539528113             
    Subscription-ID                         : 1002                 
    Parent-Sensor-Name                      : Not applicable       
    Component(s)                            : chassisd              

    Profile Information : 
           
        Name                                : export_1002           
        Reporting-interval                  : 6                     
        Payload-size                        : 5000                  
        Format                              : GPB                   

Sensor Information : 
           
    Name                                    : sensor_1003           
    Resource                                : /junos/events/event[id='CHASSISD_SNMP_TRAP7']/ 
    Version                                 : 1.0                  
    Sensor-id                               : 539528112             
    Subscription-ID                         : 1003                 
    Parent-Sensor-Name                      : Not applicable       
    Component(s)                            : eventd                

    Profile Information : 
           
        Name                                : export_1003           
        Reporting-interval                  : 0                     
        Payload-size                        : 5000                  
        Format                              : GPB                   

Sensor Information : 
           
    Name                                    : sensor_1004           
    Resource                                : /interfaces/interface/state/ 
    Version                                 : 1.0                  
    Sensor-id                               : 539528119             
    Subscription-ID                         : 1004                 
    Parent-Sensor-Name                      : Not applicable       
    Component(s)                            : PFE,PFE,PFE,chassisd,mib2d,xmlproxyd 

    Profile Information : 
           
        Name                                : export_1004           
        Reporting-interval                  : 60                    
        Payload-size                        : 5000                  
        Format                              : GPB                   

Sensor Information : 
           
    Name                                    : sensor_1004_1_1       
    Resource                                : /junos/system/linecard/interface/queue/ 
    Version                                 : 1.0                  
    Sensor-id                               : 3143454041            
    Subscription-ID                         : 1004                 
    Parent-Sensor-Name                      : sensor_1004          
    Component(s)                            : PFE                   

    Profile Information : 
                                        
        Name                                : export_1004           
        Reporting-interval                  : 60                    
        Payload-size                        : 5000                  
        Format                              : GPB                   

Sensor Information : 
           
    Name                                    : sensor_1004_1_2       
    Resource                                : /junos/system/linecard/interface/queue/extended-stats/ 
    Version                                 : 1.0                  
    Sensor-id                               : 3143454042            
    Subscription-ID                         : 1004                 
    Parent-Sensor-Name                      : sensor_1004          
    Component(s)                            : PFE                   

    Profile Information : 
           
        Name                                : export_1004           
        Reporting-interval                  : 60                    
        Payload-size                        : 5000                  
        Format                              : GPB                   

Sensor Information : 
           
    Name                                    : sensor_1004_1_3       
    Resource                                : /junos/system/linecard/interface/traffic/ 
    Version                                 : 1.0                  
    Sensor-id                               : 3143454043            
    Subscription-ID                         : 1004                 
    Parent-Sensor-Name                      : sensor_1004          
    Component(s)                            : PFE                   

    Profile Information : 
           
        Name                                : export_1004           
        Reporting-interval                  : 60                    
        Payload-size                        : 5000                  
        Format                              : GPB                   

Sensor Information : 
           
    Name                                    : sensor_1004_2_1       
    Resource                                : /interfaces/interface/state/ 
    Version                                 : 1.0                  
    Sensor-id                               : 3143450969            
    Subscription-ID                         : 1004                 
    Parent-Sensor-Name                      : sensor_1004          
    Component(s)                            : chassisd              
                                        
    Profile Information : 
           
        Name                                : export_1004           
        Reporting-interval                  : 60                    
        Payload-size                        : 5000                  
        Format                              : GPB                   

Sensor Information : 
           
    Name                                    : sensor_1004_3_1       
    Resource                                : /interfaces/interface/state/ 
    Version                                 : 1.0                  
    Sensor-id                               : 3143451993            
    Subscription-ID                         : 1004                 
    Parent-Sensor-Name                      : sensor_1004          
    Component(s)                            : mib2d                 

    Profile Information : 
           
        Name                                : export_1004           
        Reporting-interval                  : 60                    
        Payload-size                        : 5000                  
        Format                              : GPB                   

Sensor Information : 
           
    Name                                    : sensor_1004_4_1       
    Resource                                : /interfaces/interface/state/ 
    Version                                 : 1.0                  
    Sensor-id                               : 3143457113            
    Subscription-ID                         : 1004                 
    Parent-Sensor-Name                      : sensor_1004          
    Component(s)                            : xmlproxyd             

    Profile Information : 
           
        Name                                : export_1004           
        Reporting-interval                  : 60                    
        Payload-size                        : 5000                  
        Format                              : GPB                   

{master:0}

Other switches not yet in state 'active', which therefore have no collectors trying to subscribe to them, also showed the above. But on the devices where subscriptions were failing, the command returned nothing:

cmooney@ssw1-d8-codfw> show agent sensors 

{master:0}
cmooney@ssw1-d8-codfw>
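
To make the working-versus-failing comparison easier to repeat across devices, here is a minimal Python sketch that parses captured "show agent sensors" text into a list of sensors. The function name parse_agent_sensors is hypothetical, and the parsing assumes only the "Key : value" layout visible in the output pasted above, not a documented format. An empty result corresponds to the failing state seen on ssw1-d8-codfw.

```python
import re

# "Key   : value" lines; the key must start with a letter, which skips
# prompt residue such as "{master:0}".
FIELD_RE = re.compile(r"^([A-Za-z][A-Za-z()\- ]*?)\s+:\s*(.*)$")


def parse_agent_sensors(text):
    """Parse 'show agent sensors' output into a list of sensor dicts."""
    sensors = []
    current = None
    in_profile = False
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("Sensor Information"):
            current = {"profile": {}}
            sensors.append(current)
            in_profile = False
        elif line.startswith("Profile Information"):
            in_profile = True
        elif current is not None:
            match = FIELD_RE.match(line)
            if match:
                key, value = match.group(1), match.group(2).strip()
                target = current["profile"] if in_profile else current
                target[key] = value
    return sensors
```

For example, a check such as "not parse_agent_sensors(output)" would flag a device whose command output is empty, as in the capture above.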

The Juniper docs say this command shows "whether or not J-Insight has successfully subscribed to sensors on which it is dependent", and it was a recommended troubleshooting step on some of their pages. The docs said the process responsible is called agentd, which when I checked from the shell did seem to be running:

% ps aux | grep agentd
root    11221   0.0  0.7 495064  28176  -  S     1Jul24     0:26.38 /usr/sbin/agentd -N

In the end, somewhat in desperation and aware of upcoming work that would make ssw1-d8-codfw a live device (and thus no more experimentation) I decided to reboot it. As soon as it came back up the full output showed for "show agent sensors", and gnmic was able to subscribe and collect stats without problem.

Probably the next step here is a TAC case with Juniper to try and get to the bottom of it. No doubt it's some bug that will require a disruptive upgrade so not great news, but let's see.

Event Timeline

cmooney triaged this task as Low priority.
cmooney renamed this task from Issue with subscribing to GNMI telemetry on certain QFX5120 devices to Issue creating GNMI telemetry subscription to certain QFX5120 devices. (Jul 17 2024, 9:20 PM)
ayounsi claimed this task.

Thanks for the investigation! It seems the last step was:

asw1-b3-magru> restart analytics-agent gracefully
Analytics agent started, pid 87102

That solved the issue.

I updated the doc: https://wikitech.wikimedia.org/wiki/Network_telemetry#Juniper_show_agent_sensors_is_empty_despite_extension-service_being_configured