In recent weeks we have started to see gaps in the network statistics in Prometheus that we gather using gNMI. This began after the recent configuration changes and the additional BGP subscriptions. The strange thing, however, is that we didn't have any problems in the first few days following those changes.
It may be a performance issue. However, it's worth noting that we do not have the problem in codfw, where we have more subscriptions and metrics (because the top-of-rack switches there are independent L3 devices that support gNMI, unlike in eqiad).
The gaps are not consistent between different measurements or devices. In other words, we might have stats for a particular interface on a device at a given time, but be missing the data for another interface on the same device at that moment. The gaps are frequent and occur on all devices, however, effectively making the stats unusable right now.
The gaps can be seen in the raw stats returned when querying Thanos (see P73487), so it's not simply a matter of the PromQL queries we are using in Grafana.
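For anyone wanting to quantify the gaps per series rather than eyeball them, here is a minimal sketch of how missing samples could be flagged from the raw timestamps of a range query. The `find_gaps` helper, the 60-second sample interval and the sample data are all assumptions for illustration, not taken from our gnmic config or from P73487.

```python
from typing import List, Tuple


def find_gaps(
    timestamps: List[float],
    expected_interval: float,
    tolerance: float = 1.5,
) -> List[Tuple[float, float]]:
    """Return (previous, next) timestamp pairs that are further apart
    than tolerance * expected_interval, i.e. where samples are missing."""
    ordered = sorted(timestamps)
    threshold = expected_interval * tolerance
    gaps = []
    # Compare each sample with the one that follows it.
    for prev, cur in zip(ordered, ordered[1:]):
        if cur - prev > threshold:
            gaps.append((prev, cur))
    return gaps


# Hypothetical timestamps (seconds) for one interface counter,
# assuming a 60-second sample interval.
samples = [0, 60, 120, 300, 360, 600]
print(find_gaps(samples, expected_interval=60))
# → [(120, 300), (360, 600)]
```

Running this per interface on the same device would make it easy to confirm that the gaps do not line up between series.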
It may just be a performance issue, in which case moving to K8s is perhaps a way forward. However, it's a mystery why codfw is OK, so I'd rather get to the bottom of it before assuming that. The first step will be to run gnmic in debug mode on netflow1002.eqiad.wmnet and see what that shows.

