It looks like Grafana shows high latency whenever the push notification service gets initialized. This happens even without any traffic hitting the service.
https://grafana.wikimedia.org/d/NQO_pqvMk/push-notifications?orgId=1&from=1602201152770&to=1602403260466&var-dc=eqiad%20prometheus%2Fk8s&var-service=push-notifications
Description
Related Objects
- Mentioned Here
- T263058: Memory leak in node-apn
Event Timeline
Mentioned in SAL (#wikimedia-operations) [2020-10-13T10:47:26Z] <jayme> no-change rolling restart of push-notifications in codfw - T265258
A simple restart of the pods in codfw (without any actual change) triggered the same behavior, so I will deploy the update to envoy 1.15.1 (T264157) without waiting for this to be solved, as the two are unrelated.
@jijiki Nothing comes to mind for that specific task, given that we still don't use it with production data. We still need to figure out why this is happening at the app level.
My assumption is that the memory leak in node-apn (T263058) triggers high CPU load when the garbage collector is invoked after the restart. Still investigating, though.
@MSantos the memory leak fix has been in production for some time now. Should we close this one?