
Instrumentation for Push Notifications
Closed, Resolved · Public

Description

This task needs to be broken down further, but for now, here are the things we need to instrument:

  • Classical observability/operational metrics (provided by service-runner)
  • Types of notifications that are being spawned
  • Enrollment & disenrollment
  • Callbacks to notification API endpoints for retrieval

Be aware of potential needs for sampling as this can be very high throughput.
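One way to approach sampling, as a minimal sketch only: decide per event with a stable hash, so the same key is always in or out of the sample regardless of which worker handles it. The function name `should_sample` and the idea of keying on a per-notification ID are assumptions for illustration, not something specified in this task.

```python
import hashlib


def should_sample(event_key: str, rate: float) -> bool:
    """Return True for roughly `rate` of all keys, deterministically per key.

    `event_key` is a hypothetical stable identifier (e.g. a notification ID);
    `rate` is the fraction of events to keep, between 0 and 1.
    """
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    digest = hashlib.sha256(event_key.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an integer in [0, 2**64) and keep the
    # event when it falls below the matching fraction of that range.
    bucket = int.from_bytes(digest[:8], "big")
    return bucket < rate * 2**64
```

Because the decision depends only on the key, it can be repeated consistently in more than one place. Using a user identifier as the key would need a separate privacy review.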

Consider user privacy when designing schema(s), and be aware of where events may be published.

Metrics section from the RFC:

We will track the Four Golden Signals: latency, traffic, errors, and saturation.

Additionally, we will track product-oriented metrics both overall and per-platform, including:

  • Subscription request rate (req/s)
  • Subscription deletion request rate (req/s)
  • Total subscription count

Metrics must be compatible with Prometheus. Alerts will be configured for request spikes or when error rates pass a reasonable threshold.
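To make "compatible with Prometheus" concrete, here is a rough sketch of a per-platform counter rendered in the Prometheus text exposition format. The class `LabeledCounter` and the metric name `push_subscription_requests_total` are hypothetical, chosen only for illustration; a real service would use a metrics library rather than hand-rolling this.

```python
from collections import Counter


class LabeledCounter:
    """A monotonically increasing counter with a single `platform` label."""

    def __init__(self, name: str, help_text: str) -> None:
        self.name = name
        self.help_text = help_text
        self._values = Counter()

    def inc(self, platform: str, amount: int = 1) -> None:
        if amount < 0:
            raise ValueError("counters can only increase")
        self._values[platform] += amount

    def render(self) -> str:
        """Render the counter in the Prometheus text exposition format."""
        lines = [
            f"# HELP {self.name} {self.help_text}",
            f"# TYPE {self.name} counter",
        ]
        # Sort labels so the output is stable between scrapes.
        for platform in sorted(self._values):
            lines.append(
                f'{self.name}{{platform="{platform}"}} {self._values[platform]}'
            )
        return "\n".join(lines) + "\n"
```

With counters exposed this way, the req/s figures would not be stored directly; they would be derived at query time, for example with PromQL's `rate()` over the counter. Total subscription count would be a gauge rather than a counter, since it can go down.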

Event Timeline

Restricted Application added a subscriber: Aklapper.
dr0ptp4kt renamed this task from Instrumentation to Instrumentation for Push Notifications. · Apr 30 2020, 4:43 PM

Are we waiting on something from @dr0ptp4kt here, or was it assigned in error?

I forget, but yeah, let's unassign me from this epic.

As confirmed at T249065#6151008, we'll get baseline operational metrics (i.e., the Four Golden Signals) for "free" by virtue of using service-runner and deploying on the k8s pipeline. I think our biggest task here is to think through what metrics we need to collect to demonstrate the value of the project. I'll update the task description to reflect this and to pull in the metrics noted in the RFC.

Created T254923 to separately track setting up a dashboard once product metrics have been defined and instrumented.

Additionally, we will track product-oriented metrics both overall and per-platform, including:

  • Subscription request rate (req/s)
  • Subscription deletion request rate (req/s)
  • Total subscription count

Since subscription management is handled in the Echo MW extension, instrumentation needs to happen there. I wonder if we should split this ticket into one for Echo and another for the Node service.
For the Node service, do we need to instrument anything in addition to what service-runner provides? Since we are adding delayed sending of messages, a few ideas come to mind:

  • size of buffered queue
  • delay of messages before they are forwarded to the providers
  • number of messages sent to different providers (FCM, APNs, MPS, ...)
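The three ideas above could be sketched roughly as follows. This is an illustration in Python, not the service's actual code (the service is Node), and `InstrumentedQueue`, its method names, and the `send` callback are all hypothetical. Timestamps are passed in explicitly to keep the example deterministic.

```python
from collections import Counter, deque
from typing import Callable


class InstrumentedQueue:
    """Buffers messages and records the three metrics suggested above."""

    def __init__(self) -> None:
        self._queue = deque()              # entries: (enqueued_at, provider, message)
        self.sent_by_provider = Counter()  # messages forwarded per provider
        self.delays = []                   # seconds each message waited in the buffer

    @property
    def size(self) -> int:
        """Current number of buffered messages (would be reported as a gauge)."""
        return len(self._queue)

    def enqueue(self, provider: str, message: str, now: float) -> None:
        self._queue.append((now, provider, message))

    def flush(self, now: float, send: Callable[[str, str], None]) -> None:
        """Forward every buffered message and record its delay and provider."""
        while self._queue:
            enqueued_at, provider, message = self._queue.popleft()
            send(provider, message)
            self.delays.append(now - enqueued_at)
            self.sent_by_provider[provider] += 1
```

In a real deployment the delays would more likely feed a histogram or timing metric than an in-memory list, which grows without bound.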

A couple of questions about the metrics

  • Apart from the metrics provided out of the box by service-runner, are we going to use the app.metrics infrastructure provided by the package?
    • As far as I understand, our infrastructure is based on prometheus but service-runner sends statsd metrics. Is this exported to prometheus using statsd_exporter?
  • For metrics that refer both to the Echo extension and the node service, do we want to send metrics from both projects?
    • Eg: # of FCM notifications on Echo and # of FCM notifications on push-notifications service

I'm going to ping @Pchelolo to confirm my answers about service-runner metrics support here, since I'm only able to piece together what's going on by secondary evidence (i.e., some patches and PRs), but as a first attempt:

  • Apart from the metrics provided out of the box by service-runner, are we going to use the app.metrics infrastructure provided by the package?
    • As far as I understand, our infrastructure is based on prometheus but service-runner sends statsd metrics. Is this exported to prometheus using statsd_exporter?

Yes and yes. There is a PR open for native prometheus support in service-runner, but I'm not sure what the timeline or priority is for that work: https://github.com/wikimedia/service-runner/pull/230
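For anyone unfamiliar with what travels over the wire in the statsd case, here is a small sketch of the plain-text StatsD line format (`name:value|type`, with an optional `|@rate` suffix for sampled metrics). The helper `statsd_line` and the metric names in the test are made up for illustration; they are not service-runner's API or the names the service emits.

```python
def statsd_line(name: str, value, metric_type: str, sample_rate: float = 1.0) -> str:
    """Format one metric in the StatsD line protocol.

    metric_type is "c" (counter), "g" (gauge) or "ms" (timing).
    """
    if metric_type not in ("c", "g", "ms"):
        raise ValueError(f"unsupported metric type: {metric_type}")
    if not 0.0 < sample_rate <= 1.0:
        raise ValueError("sample_rate must be in (0, 1]")
    line = f"{name}:{value}|{metric_type}"
    # The sample rate is only sent when the client is actually sampling.
    if sample_rate < 1.0:
        line += f"|@{sample_rate}"
    return line
```

An exporter such as statsd_exporter then receives lines like these and exposes them for Prometheus to scrape, typically with a mapping configuration that turns dotted names into metric names and labels.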

  • For metrics that refer both to the Echo extension and the node service, do we want to send metrics from both projects?
    • Eg: # of FCM notifications on Echo and # of FCM notifications on push-notifications service

I'd be interested in hearing from Product Analytics on this point, but IMO just tracking in the push-notifications service is sufficient for now. That said, we will get tracking of MediaWiki push notification request jobs essentially for free via cpjobqueue metrics.

Yes and yes. There is a PR open for native prometheus support in service-runner, but I'm not sure what the timeline or priority is for that work: https://github.com/wikimedia/service-runner/pull/230

Yeah, confirmed. Native prometheus metrics are now in use by eventgate, and we haven't yet prioritized moving the rest to it, but it's definitely the future. So if you are ok with some pain of being early adopters, native prometheus metrics are the way to go. However, if you want to stick with the known and migrate later - use statsd metrics.

Change 618307 had a related patch set uploaded (by Jgiannelos; owner: Jgiannelos):
[mediawiki/services/push-notifications@master] Add metrics instrumentation for APNS/FCM

https://gerrit.wikimedia.org/r/618307

Change 618307 merged by jenkins-bot:
[mediawiki/services/push-notifications@master] Add metrics instrumentation for APNS/FCM

https://gerrit.wikimedia.org/r/618307

MSantos added a subscriber: MSantos.

I think this is in good shape for v1, closing as resolved.