Enable gNMI on SRX devices and fasw
Closed, Resolved · Public

Description

In gNMIc's targets, we currently filter for:

$attributes['role'] in ['asw', 'cr', 'cloudsw'] and $device !~ /^(asw2|fasw)/

To have 1:1 parity between Icinga and Alertmanager, we need to add collection for at least the pfw and mr device types, as well as the devices named fasw.
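To make the selection logic concrete, here is a minimal Python sketch of the filter quoted above. It is only an illustration, not the actual gNMIc target configuration: the function name `is_gnmic_target` and the sample hostname/role pairs in the usage note are assumptions.

```python
import re

# Roles and exclusion pattern copied from the filter quoted in this task.
CURRENT_ROLES = {"asw", "cr", "cloudsw"}
EXCLUDED_NAMES = re.compile(r"^(asw2|fasw)")


def is_gnmic_target(device, role, roles=CURRENT_ROLES, excluded=EXCLUDED_NAMES):
    """Return True when the device passes both the role and the name filter.

    Hypothetical helper: mirrors
    $attributes['role'] in [...] and $device !~ /^(asw2|fasw)/
    """
    return role in roles and excluded.search(device) is None
```

For example, under these assumptions `is_gnmic_target("cr1-eqiad", "cr")` is accepted, while `is_gnmic_target("mr1-ulsfo", "mr")` is rejected by the role list and a device whose name starts with `fasw` is rejected by the name pattern regardless of its role. Widening collection then means extending the role set and relaxing the exclusion pattern.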

Event Timeline

Restricted Application added a subscriber: Aklapper.

+1. I think I disabled it on the fasw a while ago as it was unable to connect to them, and I was worried about wasting clock cycles trying. But since their upgrades I think they should be fine.

Leaving out asw2 is just to stop gnmic trying in vain to connect to the eqiad VC switches; we can leave that in place for now.

Change #1132671 had a related patch set uploaded (by Ayounsi; author: Ayounsi):

[operations/cookbooks@master] sre.network.tls: allow running it on more types

https://gerrit.wikimedia.org/r/1132671

Change #1132675 had a related patch set uploaded (by Ayounsi; author: Ayounsi):

[operations/puppet@production] gNMIc: start collecting metrics from fasw, ignore asw1-eqsin VC

https://gerrit.wikimedia.org/r/1132675

Change #1132675 merged by Ayounsi:

[operations/puppet@production] gNMIc: start collecting metrics from fasw, ignore asw1-eqsin VC

https://gerrit.wikimedia.org/r/1132675

Change #1132671 merged by jenkins-bot:

[operations/cookbooks@master] sre.network.tls: allow running it on more types

https://gerrit.wikimedia.org/r/1132671

Change #1133058 had a related patch set uploaded (by Ayounsi; author: Ayounsi):

[operations/homer/public@master] Allow gNMI to mgmt routers control-plane

https://gerrit.wikimedia.org/r/1133058

Change #1133060 had a related patch set uploaded (by Ayounsi; author: Ayounsi):

[operations/homer/public@master] Enable gNMI on management routers

https://gerrit.wikimedia.org/r/1133060

Change #1133072 had a related patch set uploaded (by Ayounsi; author: Ayounsi):

[operations/homer/public@master] Enable gNMI on management switches

https://gerrit.wikimedia.org/r/1133072

Change #1133058 merged by jenkins-bot:

[operations/homer/public@master] Allow gNMI to mgmt routers control-plane

https://gerrit.wikimedia.org/r/1133058

Change #1133075 had a related patch set uploaded (by Ayounsi; author: Ayounsi):

[operations/homer/public@master] gnmi policy: fix small issue

https://gerrit.wikimedia.org/r/1133075

Change #1133075 merged by jenkins-bot:

[operations/homer/public@master] gnmi policy: fix small issue

https://gerrit.wikimedia.org/r/1133075

Change #1133060 merged by jenkins-bot:

[operations/homer/public@master] Enable gNMI on management routers

https://gerrit.wikimedia.org/r/1133060

Change #1133072 merged by jenkins-bot:

[operations/homer/public@master] Enable gNMI on management switches

https://gerrit.wikimedia.org/r/1133072

Mentioned in SAL (#wikimedia-operations) [2025-04-01T13:04:58Z] <XioNoX> msw2-eqiad> restart jsd gracefully - T390052

Some updates on that front:

Fundraising switches (fasw)
All good.

Fundraising firewalls (pfw)
Waiting for the mr* to be working.

Management switches (msw)
After configuration, it seems that only msw2-codfw has gNMI listening on port 32767, either because it's running the most recent Junos (23.4) or because it's an EX4400 (vs. EX4300).
The others accept the configuration, so it's possible that there is a bug or that a daemon needs a bounce. Worst case, we upgrade them.

restart jsd gracefully does seem to properly start the daemon: Apr 1 13:02:02 msw2-eqiad jsd[85887]: JSD_GRPC_SERVER_VRF_BIND: mgmt_junos routing-instance bind to gRPC server succeeded.
However, show extension-service request-response servers stays empty.
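A quick way to check from a monitoring host whether a device is listening on the gNMI port is a plain TCP connect test. The sketch below assumes port 32767 as mentioned above; the helper name `port_is_listening` is hypothetical, and a successful connect only shows that something accepts TCP connections there, not that gNMI or TLS is working.

```python
import socket

GNMI_PORT = 32767  # port mentioned in this task


def port_is_listening(host, port=GNMI_PORT, timeout=3.0):
    """Return True if a TCP connection to host:port can be established.

    Hypothetical helper: covers refused connections, timeouts and
    name-resolution failures, all of which raise OSError subclasses.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Running this against each msw in turn would give the same picture as described above: a listener on one device and refused or timed-out connections on the others.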

Management routers (mr)
Here it's not working on any of them.
restart jsd gracefully shows lots of errors:

Apr  1 12:17:10  mr1-ulsfo init: can not access /usr/sbin/jsd: No such file or directory
[...]
Apr  1 12:17:10  mr1-ulsfo init: na-grpc-server (PID 17134) exited with status=1
Apr  1 12:17:10  mr1-ulsfo init: na-grpc-server (PID 17138) started
[...]
UI_RESTART_FAILED_EVENT: Failed to restart 'JET Services Daemon'

Or, more self-explanatory:

mr1-ulsfo> show extension-service request-response servers 
error: the jsd subsystem is not running
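When comparing several devices, it can help to classify the output mechanically. The following is a rough sketch that scans log or CLI lines for the messages quoted in this task; the function name and the category labels are assumptions, and the patterns cover only the strings seen here.

```python
import re

# Patterns taken from the log and CLI excerpts quoted in this task.
FAILURE_PATTERNS = {
    "missing_binary": re.compile(r"can not access /usr/sbin/jsd"),
    "restart_failed": re.compile(r"UI_RESTART_FAILED_EVENT"),
    "not_running": re.compile(r"the jsd subsystem is not running"),
}
SUCCESS_PATTERN = re.compile(r"JSD_GRPC_SERVER_VRF_BIND: .*succeeded")


def classify_jsd_output(lines):
    """Summarise jsd-related lines (hypothetical helper).

    Returns a dict with 'bound' (the gRPC bind message was seen) and
    'failures' (sorted labels of the failure patterns that matched).
    """
    lines = list(lines)
    failures = sorted(
        name for name, pattern in FAILURE_PATTERNS.items()
        if any(pattern.search(line) for line in lines)
    )
    bound = any(SUCCESS_PATTERN.search(line) for line in lines)
    return {"bound": bound, "failures": failures}
```

Note that a successful bind message alone is not sufficient: as seen on msw2-eqiad, the daemon can log a successful bind while the list of request-response servers stays empty.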

I couldn't find any documentation on how to get it running. I might have to contact JTAC.

I went to open a JTAC case for the non-working msw, but they're all out of support, so I opened T390814: Upgrade management switches to Junos 21.4 to track their upgrade.

Opened JTAC case 2025-0402-657200 for the SRXs.

Icinga downtime and Alertmanager silence (ID=18436b96-18d3-4109-9dbe-088b91594c7c) set by ayounsi@cumin1002 for 0:30:00 on 1 host(s) and their services with reason: reboot

mr1-ulsfo

JTAC asked us to reboot it. It didn't help.

Bad news: JTAC told me that gNMI is not supported on the SRX300 (or any branch-level SRX) or on the EX4300.

Some pointers:
https://apps.juniper.net/feature-explorer/feature/4332
https://apps.juniper.net/feature-explorer/feature/4327

On the other hand, the EX4300 seems to be supported for features like https://apps.juniper.net/feature-explorer/feature/4284?fn=OpenConfig%20configuration%20model%20version%204.0.1%20for%20BGP%20with%20JTI so an upgrade might still be worth it.

TBD for the SRX1600.

Change #1133398 had a related patch set uploaded (by Ayounsi; author: Ayounsi):

[operations/homer/public@master] MR: rollback gNMI

https://gerrit.wikimedia.org/r/1133398

Mentioned in SAL (#wikimedia-operations) [2025-04-28T12:57:33Z] <XioNoX> test host-inbound-traffic system-services on pfw1-codfw - T390052

Change #1140162 had a related patch set uploaded (by Ayounsi; author: Ayounsi):

[operations/puppet@production] gNMIc start collecting data from pfw

https://gerrit.wikimedia.org/r/1140162

Change #1140163 had a related patch set uploaded (by Ayounsi; author: Ayounsi):

[operations/homer/public@master] Enable gNMI on pfw devices

https://gerrit.wikimedia.org/r/1140163

Change #1140163 abandoned by Ayounsi:

[operations/homer/public@master] Enable gNMI on pfw devices

Reason:

wrong file, should use the one in -private

https://gerrit.wikimedia.org/r/1140163

Change #1140162 merged by Ayounsi:

[operations/puppet@production] gNMIc start collecting data from pfw

https://gerrit.wikimedia.org/r/1140162

The equivalent of this is needed on the pfw:

[edit security zones security-zone production interfaces lo0.0 host-inbound-traffic system-services]
-       any-service;
[edit security zones security-zone production interfaces ge-0/0/4.401 host-inbound-traffic system-services]
-       any-service;
[edit security zones security-zone production interfaces ge-0/0/4.402 host-inbound-traffic system-services]
-       any-service;

However, it does not solve the main issue on the management routers.

Mentioned in SAL (#wikimedia-operations) [2025-05-07T08:54:04Z] <XioNoX> update host-inbound-traffic system-services on pfw1-eqiad - T390052

Mentioned in SAL (#wikimedia-operations) [2025-05-13T08:31:40Z] <XioNoX> pfw1-codfw - delete specific system-services in favor of "any-service" T390052

Mentioned in SAL (#wikimedia-operations) [2025-05-13T08:34:35Z] <XioNoX> pfw1-eqiad - delete specific system-services in favor of "any-service" T390052

ayounsi changed the task status from Open to Stalled. · May 13 2025, 8:46 AM

Current status:

mr update:

Thanks for your patience. I’ve observed similar errors in the lab setup, even though the latest documentation confirms that the feature should be supported in this version.
To address this, I’ve opened a bug and escalated the issue to the Engineering team to investigate the symptoms at the code level. I’ll keep you posted as soon as I receive any updates from them.

Change #1133398 abandoned by Ayounsi:

[operations/homer/public@master] MR: rollback gNMI

Reason:

not needed for now, will rollback later if needed

https://gerrit.wikimedia.org/r/1133398

From JTAC:

Hello Team,

Engineering found out the issue and I am awaiting the details of the RC from them and will share it here as soon as they have finalized the report along with the next step forward.

Thanks,

From JTAC:

Engineering stated that JSD the process code that manages gRPC missed being shipped to this platform and they are working to push the grpc library code to this platform as a solution.

Change #1133398 restored by Ayounsi:

[operations/homer/public@master] MR: rollback gNMI

https://gerrit.wikimedia.org/r/1133398

After a long wait, Juniper says it's not a bug; rather, they never implemented the feature:

Following discussions with our Engineering team, I’d like to update you regarding the gRPC/gNMI telemetry feature you were looking to test on your SRX300. Unfortunately, this feature is not currently supported on the SRX3xx platform. Enabling it would require raising an Enhancement Request (ER) and cannot be addressed through the PR I previously raised with Engineering/Development (referenced in the subject line) team.

Bringing back https://gerrit.wikimedia.org/r/c/operations/homer/public/+/1133398 to roll back the gNMI config on the SRXs.

Change #1133398 merged by jenkins-bot:

[operations/homer/public@master] MR: rollback gNMI

https://gerrit.wikimedia.org/r/1133398

Everything that can be done has been done.

We can revisit it the day the management switches or routers support gNMI.

Fundraising switches and routers are using gNMI properly.