
RFC: Wikimedia Push Notification Service
Closed, Resolved · Public

Description

  • Affected components:
    • Echo extension for MediaWiki (producer, and consumer for web).
    • Mobile apps (consumer).
  • Engineers for initial implementation: Michael Holloway, Mateus Santos, Bernd Sitzmann, John Giannelos
  • Code steward: Product Infrastructure
  • Supported push platforms: apps (v1), web (v2+)
  • Target audience: registered users (v1); all users, including anons (later)

Motivation

The Wikimedia product teams have a long-standing desire to leverage push notification technology to drive user engagement and retention.

Initial use cases
  • Edit reminders: Push notification support would allow product teams to re-engage new editors by thanking them for their contributions across platforms and offering them encouragement to continue contributing to the projects.
  • New contributor onboarding: Push notification support would allow Wikimedia product teams to engage new contributors by informing them about new features, settings, etc., across the product platforms.
  • User-specific notifications about on-wiki events: Push notification support would allow informing users in near-real time about on-wiki actions that affect them, regardless of whether they are currently logged in. This is essentially another notification mechanism for the kinds of events handled by Echo.
Requirements
  • Maintain a database of push subscriber information
  • Expose a public API allowing subscribers to create, view, update, or delete their push subscriptions
  • Expose a private (cluster-internal) API for requesting that a notification be pushed to one or more subscribers
  • Map received notification requests to push subscriptions and forward notification requests to applicable push vendors
  • Process notification requests in near-real time

Exploration

Status quo

The Echo extension to MediaWiki provides a system for defining events and associated notifications pertaining to on-wiki actions. At present, users can be notified via the Echo notifications UI on the web site, or by e-mail. Both are limited to logged-in users.

Prior art

Echo Web Push

A long-standing WIP patch from 2017 adds Web Push support to Echo. This RFC goes beyond the scope of that patch because it will support pushing notification requests to mobile app push notification providers in addition to web push.

2017 push notifications technical plan

A technical plan document for a proposed push notifications service was created in 2017, but the project never reached implementation. This RFC is heavily informed by that plan, but the primary use cases of interest have changed.

The highest product priority behind this RFC is providing notifications about on-wiki events to support new contributors. Connecting Echo to the planned push notification service is therefore in scope for this RFC. On the other hand, subscribing to page edits and topics, which were prioritized in the 2017 plan, are now out of scope for the initial investment on this RFC.


Proposal

A system for receiving and managing push notification subscriptions and transmitting notification requests to push providers.

System components

The push notifications architecture will consist of the following components:

  1. The primary component will be an external service for managing push subscriptions and receiving notification requests for submission to vendor push providers, including Mozilla Push Service (MPS), Apple Push Notification Service (APNS), and Google’s Firebase Cloud Messaging (FCM).
  2. A new push notifier type will be added to Echo to provide support for sending Echo events to the push service.

2020-05-12 revised architecture v2.png (595×602 px, 57 KB)

Subscription management

The system will be required to maintain subscription tokens for the appropriate push vendor for all subscribers. A public API will be created for clients to manage subscriptions. The specific endpoints available will be similar to those described in the 2017 technical plan.

The expected subscription/unsubscription flow is as follows:

  1. The client obtains a push subscription token (apps) or JSON blob (web) from the platform or browser, respectively.
  2. The client registers the subscription token (or blob, for web) with the currently authenticated user via the Action API.
  3. When the association between the push subscription and the currently logged in user becomes invalid (e.g., when the user is about to log out of the website or app), it is the client's responsibility to unregister the association via the Action API.
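The registration step above can be sketched from the client side. Note that the Action API module name (`echopushsubscriptions`) and parameter names below are illustrative assumptions; the RFC specifies only that registration happens "via the Action API" against the currently authenticated user.

```javascript
// Sketch of client-side push subscription registration via the Action API.
// Module and parameter names are hypothetical, not a final API.

// Build the form parameters for registering a push token with the
// currently authenticated user.
function buildSubscribeParams(providerToken, platform, csrfToken) {
  return new URLSearchParams({
    action: 'echopushsubscriptions',  // hypothetical module name
    command: 'create',
    provider: platform,               // e.g. 'fcm' or 'apns'
    providertoken: providerToken,     // token from FCM/APNS, or JSON blob for web
    token: csrfToken,                 // MediaWiki CSRF token
    format: 'json',
  });
}

// Registration is an authenticated POST; unregistration (step 3 above)
// would send command: 'delete' with the same provider token before logout.
async function subscribe(apiUrl, providerToken, platform, csrfToken) {
  const res = await fetch(apiUrl, {
    method: 'POST',
    credentials: 'include',           // send the user's session cookies
    body: buildSubscribeParams(providerToken, platform, csrfToken),
  });
  return res.json();
}
```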

Subscription flow v2.png (192 KB)

We propose storing push subscription data in a MySQL table managed by Echo. Our priorities for subscription storage are to ensure low-latency reads when processing notification requests, and to allow for easy horizontal scaling should the service’s resource requirements increase in the future.

A draft schema for subscription storage is here.

Notification request processing

The service will expose a private (cluster-internal) HTTP REST API for receiving notification requests for processing. The specific endpoints available for notification request submission will be similar to those described in the 2017 technical plan.

Our approach will be somewhat non-traditional, in that we will not push message content directly to clients. Rather, we will push messages that identify only the type of message and where to retrieve pending messages (e.g., check_echo). Such a message serves to wake up the client, which then retrieves the messages directly from Wikimedia servers (e.g., from the Action API via action=query&meta=notifications). A key strength of this approach is that it maximizes user privacy by ensuring that no substantive message content passes through third-party servers. As a side benefit, it minimizes the client-side updates required for v1 in the apps, which already poll the MediaWiki API for notifications while a user is logged in. The approach of pushing a content-free message that prompts the client to wake up and retrieve messages is inspired by Signal.
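A minimal sketch of the content-free wake-up message described above. The field names are illustrative assumptions; the essential point is that only a message type and timestamp are pushed, never notification content.

```javascript
// Sketch of the content-free wake-up message. Only a message type and a
// timestamp transit the vendor push provider; field names are illustrative.
function buildWakeupPayload(messageType, timestamp) {
  return {
    data: {
      type: messageType,              // e.g. 'check_echo'
      timestamp: String(timestamp),
    },
    // Deliberately no title or body: the client wakes up and fetches
    // pending notifications from action=query&meta=notifications itself.
  };
}
```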

The proposed notification request flow is as follows:

  1. Echo receives an event relevant to user User and emits a "push" type notification
  2. Asynchronously (in a deferred update or job queue job), the Echo push notifier finds the stored subscription ID for user User and makes a request to the push service with the subscription ID, a message type (e.g., echo), and a timestamp
  3. Upon receiving the request, the push service identifies the push platform and platform subscriber identifier for the received subscription ID, and forwards the message to the relevant push service (e.g., Firebase)
  4. The push platform delivers the message to the user device or browser
  5. The receiving client wakes up and retrieves the full content of all pending messages of the relevant type from Wikimedia servers (e.g., by requesting action=query&meta=notifications)
  6. When a notification is read, the client follows up with a request to action=echomarkread
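Steps 3 and 4 above, as seen from the push service, can be sketched as follows. The lookup function and provider clients are stand-ins for whatever the real service uses; the point is the mapping from a subscription ID to a (platform, device token) pair and the forwarding of a content-free message.

```javascript
// Service-side sketch of steps 3-4: resolve a subscription ID to a
// platform + device token and forward the message to the matching vendor
// provider (e.g. FCM, APNS). Lookup and provider objects are hypothetical.
async function forwardNotification(request, { lookupSubscription, providers }) {
  // Step 3: identify push platform and platform subscriber identifier.
  const sub = await lookupSubscription(request.subscriptionId);
  if (!sub) {
    return { status: 'dropped', reason: 'unknown subscription' };
  }
  const provider = providers[sub.platform];  // e.g. providers.fcm
  if (!provider) {
    return { status: 'dropped', reason: `no provider for ${sub.platform}` };
  }
  // Forward only the message type and timestamp -- never message content.
  await provider.send(sub.deviceToken, {
    type: request.messageType,               // e.g. 'check_echo'
    timestamp: request.timestamp,
  });
  return { status: 'forwarded', platform: sub.platform };
}
```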

message flow android v5 (mdh).png (214 KB)

Current estimates project a modest level of expected incoming notification request traffic (<1 req/s), though the rate of incoming requests will likely vary by time of day, month of the year (school in or out of session), and, given the product focus on new contributors, any campaigns in effect targeting new contributors.

Notification limits per event type

The product teams require configurable limits on the number of push notifications received within a specific period of time. Further, these limits should be configurable by event type. More specific requirements of this notification limit functionality will be entered in a Phabricator task after further product manager discussion.
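One plausible shape for these limits is a fixed-window counter keyed by user and event type. This is only a sketch under assumed semantics; the actual requirements are still to be defined by the product managers.

```javascript
// Sketch of configurable per-event-type push limits: a fixed-window
// counter keyed by (user, event type). Limits and window length are
// illustrative assumptions pending the product requirements.
class NotificationLimiter {
  constructor(limitsPerType, windowMs) {
    this.limits = limitsPerType;  // e.g. { 'edit-thank': 5, mention: 20 }
    this.windowMs = windowMs;
    this.counts = new Map();      // key -> { windowStart, count }
  }

  // Returns true if a push for (userId, eventType) may be sent now.
  allow(userId, eventType, now = Date.now()) {
    const limit = this.limits[eventType];
    if (limit === undefined) return true;  // no limit configured for this type
    const key = `${userId}:${eventType}`;
    let entry = this.counts.get(key);
    if (!entry || now - entry.windowStart >= this.windowMs) {
      entry = { windowStart: now, count: 0 };  // start a fresh window
      this.counts.set(key, entry);
    }
    if (entry.count >= limit) return false;
    entry.count += 1;
    return true;
  }
}
```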

Metrics

We will track the Four Golden Signals: latency, traffic, errors, and saturation.

Additionally, we will track product-oriented metrics both overall and per-platform, including:

  • Subscription request rate (req/s)
  • Subscription deletion request rate (req/s)
  • Total subscription count

Metrics must be compatible with Prometheus. Alerts will be configured for request spikes or when error rates pass a reasonable threshold.
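For illustration, the product counters above could be exposed in the Prometheus text exposition format as sketched below. In practice the service would likely get this from service-runner's built-in metrics support rather than hand-rolling it; metric names here are assumptions.

```javascript
// Hand-rolled sketch of Prometheus-compatible counters for the product
// metrics listed above. Metric names are illustrative; a real service
// would use service-runner / a metrics library instead.
class Counter {
  constructor(name, help) {
    this.name = name;
    this.help = help;
    this.value = 0;
  }
  inc(n = 1) { this.value += n; }
  // Prometheus text exposition format: HELP, TYPE, then the sample.
  expose() {
    return `# HELP ${this.name} ${this.help}\n` +
           `# TYPE ${this.name} counter\n` +
           `${this.name} ${this.value}\n`;
  }
}

const subscriptionCreates = new Counter(
  'push_subscription_create_total', 'Subscription create requests');
const subscriptionDeletes = new Counter(
  'push_subscription_delete_total', 'Subscription delete requests');

// A GET /metrics handler would return the concatenated exposition.
function metricsBody() {
  return [subscriptionCreates, subscriptionDeletes].map(c => c.expose()).join('');
}
```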

Sunset/Rollback

Push notifications are not critical to the operation of MediaWiki, and disabling them should not negatively affect any other software component running in Wikimedia production. The push notifications infrastructure will be largely self-contained to permit easy shutdown in the event of emergency or if push notifications are no longer needed by the Wikimedia products.

Why an external service?

The Wikimedia developer community has adopted a set of criteria for assessing whether a feature may be implemented as a service external to MediaWiki. According to these criteria, the proposed functionality is suitable for implementation as an external service. The functionality is self-contained and does not depend on having a consistent and current view of MediaWiki state. It does not require direct access to the MediaWiki database, and does not require features or functionality provided by MediaWiki or its extensions. Furthermore, a push notification service will likely involve resource usage spikes that make the functionality more suitable for running in a separate, dedicated environment.

Several open-source push notification server projects exist on the web. Building in an external service will allow us to scale independently of MediaWiki, and working from an existing open-source project will allow us to build on the lessons learned by prior push notification service implementers so that we can focus the greater part of our efforts on implementing any specific custom functionality required by Wikimedia Product and ensuring that the service meets the requirements of Wikimedia’s production environment.

The push service will be written in Node.js, which we have substantial experience working with and running in production. We reviewed several existing open-source push service projects, and found that any of them would require significant updating in order to meet our product and operational requirements. None are suitable out of the box; indeed, most of the projects have been unmaintained for years. This being the case, we plan to build a new Node.js service based on the Wikimedia node service template. Implementation details on request handling and interactions with push vendors will be informed substantially by DailyMotion’s pushd project.

Why not use Change Propagation rules rather than building a new service?

The service proposed here responds to events in MediaWiki by making HTTP requests to vendor push service providers. This is similar at a high level to what the existing Change Propagation service does. But we cannot simply use Change Propagation rules in lieu of building a dedicated service, because there are some specific requirements here that go beyond what Change Propagation is meant to support. Change Propagation rules should be stateless, but we will need to be able to do things like batch outgoing requests or conditionally enqueue requests for submission at a later time. We may also need to implement one or more strategies currently under discussion with Privacy Engineering for mitigating risks to user privacy, such as sending "decoy" messages along with real notification requests, or introducing a randomized delay period between receipt of an Echo notification and submission of a push notification request.
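The two mitigation strategies mentioned (decoy messages and randomized delays) can be sketched as below. Both were still under discussion with Privacy Engineering at the time of writing, so the parameters and shapes here are purely illustrative.

```javascript
// Sketch of the privacy mitigations under discussion: a randomized
// submission delay, and decoy requests mixed in with real ones.
// All parameters are illustrative assumptions.

// Pick a submission delay uniformly in [0, maxDelayMs).
function randomizedDelayMs(maxDelayMs, rand = Math.random) {
  return Math.floor(rand() * maxDelayMs);
}

// Given a real notification request, produce the batch actually sent:
// the real request plus `decoyCount` content-free decoys addressed to
// random subscribers, to make traffic analysis harder.
function withDecoys(realRequest, decoyCount, pickRandomSubscriber) {
  const batch = [realRequest];
  for (let i = 0; i < decoyCount; i++) {
    batch.push({
      subscriptionId: pickRandomSubscriber(),
      messageType: realRequest.messageType,
      decoy: true,
    });
  }
  return batch;
}
```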

Note: Separately, Echo will most likely use the MediaWiki job queue, which is backed by Change Propagation in Wikimedia's environment, for enqueuing the HTTP requests for forwarding Echo notifications to the push service.

Scope of work (Q4 2019-2020)

The Product Infrastructure team will work in Q4 to launch the baseline push notification infrastructure. This effort will include:

  1. Building and launching the external service to manage push subscriptions and notification requests;
  2. Adding a new Echo notifier type to allow for submitting events to the external push notification system; and
  3. Adding a table to Echo to maintain a mapping of Wikimedia central user IDs to push service subscription IDs.

Our initial focus will be on supporting push notifications in the apps. The Wikipedia apps have long been missing native push notifications, a basic feature of most mobile apps today. Supporting the apps first will also provide us with the opportunity to validate the push notification infrastructure and to work out any issues before exposing it to traffic at Wikipedia’s web scale. Support for web push will be a reach goal for initial rollout.

We anticipate that new Echo event types will be required in the future to support envisioned Product use cases for push notifications. However, this is fundamentally an infrastructure project and not a feature project, and no new Echo event types will be created as part of the initial release.

See also

Platform-specific push API docs

Web:

iOS:

Android:

Event Timeline


I imagine notifications will be real-time(ish) so Echo's event bundling and event cancelling is not an issue here (and if clients need them, they will have to reimplement them on the client side)?

@Tgr I searched the Echo codebase for "cancel" but couldn't find anything. It does look like there are provisions for deleting events/notifications from the DB for one reason or another, though — is that what you're referring to?

Anyway, in general, yeah, push notifications are ephemeral and ideally delivered in near-real time, and this won't involve any of the presentation-focused pieces in Echo like bundling (though it will take advantage of platform-specific mechanisms for doing the same kind of thing).

Krinkle updated the task description.
Krinkle moved this task from P3: Explore to P1: Define on the TechCom-RFC board.
Krinkle subscribed.

@Mholloway I've updated the task to use the new RFC template. Can you fill in a short summary of the requirements that the service or extension must have at minimum, and confirm who the code steward will be? After that, feel free to move it back to Phase 3: Explore.

Mholloway moved this task from P1: Define to P3: Explore on the TechCom-RFC board.

Thanks for the updates, @Krinkle!

@Tgr I searched the Echo codebase for "cancel" but couldn't find anything. It does look like there are provisions for deleting events/notifications from the DB for one reason or another, though — is that what you're referring to?

"Cancelling" was a misleading way of calling it, sorry. EchoEventPresentationModel::canRender can be used to skip displaying a notification if, based on the current state, it is not relevant anymore (e.g. it's a notification about how a page you created has been linked to a Wikidata item, but the page has been since deleted).

Updated the diagram to explicitly include Cassandra (and other minor, cosmetic updates).

Now tuning the proposal based on several consultations.

At risk of muddying the waters, @JoeWalsh and I were looking at the Signal app source code recently and noticed that it uses a rather interesting approach to push notifications: it sends an empty notification which prompts the app to wake up and retrieve message content from Signal's servers. I think this approach could also translate well to web push. The chief benefit of following this pattern would be that no message content would transit through push provider servers, which is a privacy win. Are there any strong reasons (particularly from an operations perspective) to rule out this approach?

At risk of muddying the waters, @JoeWalsh and I were looking at the Signal app source code recently and noticed that it uses a rather interesting approach to push notifications: it sends an empty notification which prompts the app to wake up and retrieve message content from Signal's servers. I think this approach could also translate well to web push. The chief benefit of following this pattern would be that no message content would transit through push provider servers, which is a privacy win. Are there any strong reasons (particularly from an operations perspective) to rule out this approach?

The only reason would be we need an additional endpoint to show the notifications - and to either store them in the push system or gather them via an API call to the source. I don't see either as a big issue, but it will need more detailed clarification.

I have a few questions and observations about this RFC, but let's start from the basics:

I think the proposal needs expansion on some important details:

  • It's not clear how Extension:Echo would send an event to Extension:Pushnotification in your diagram. I suppose a simple call to a method?
  • MediaWiki communicates with pushd; how? I would expect us to use an asynchronous method, and I also expect it to use our Modern Event Platform. Specifically, I see 3 options - and I would like the RFC to be more explicit about it:
    • MediaWiki sends an event to Eventgate, pushd consumes the kafka topic directly
    • MediaWiki sends an event to Eventgate, and we configure change-propagation to submit the notification to pushd
    • MediaWiki spawns a job, and the job takes care of submitting the notification to pushd
  • Likewise, submission from other internal services (if ever) should only happen via the same method. I would advise against having anything coupled to this API synchronously.
  • The reasons for choosing Cassandra are not well explained, and that needs expansion. See my next comment for further details
  • It's not clear to me if it's a good idea to keep the data about push notifications completely in the notification service, or leave it to the caller to manage all those details and keep the service "dumb". This is not discussed in the RfC though, and I'd like to see some more details / rationale on that.
  • For metrics to collect, let's ensure we make them compatible with Prometheus, and let's use the standard Latency, Traffic, Errors, and Saturation metrics. I will expand in a later comment on this.

A general observation first: It's not clear which application will be responsible of storing subscription data. I would assume, if we expect multiple possible sources of subscriptions, that those sources would keep track of their own subscriptions. But I can see arguments in the other direction - for example, maintaining those in a centralized place would make it easier for people to manage them. Anyways, this should be clarified in the RFC.

Getting to the specifics, you state

We propose storing push subscription data in Cassandra to ensure low-latency reads when processing notification requests, and to allow for easy horizontal scaling should the service’s resource requirements increase in the future. Eventual consistency as provided by Cassandra is sufficient for this use case.

Let me first clarify a couple of misconceptions here:

  • Cassandra is not optimized for low-latency reads; it's optimized for having a high write throughput and good read latency. Read latencies on Cassandra are in the order of several milliseconds, compared to sub-millisecond on e.g. Redis or Memcached. It's also definitively slower than a properly tuned MySQL for almost any read pattern.
  • We know fairly well how to horizontally scale MySQL

The only reason for using Cassandra would be for being multi-dc aware. If, as I suspect, this storage will have a lot of ties to MediaWiki (as in - subscription logic and authn/authz will be tied to it), then that's not even an issue. Also, I see pushd uses redis, which is *definitely* not multi-dc aware. So at least that component will need to be "stateless" and to not sync across datacenters.

In all, I think the RFC needs to be expanded on the topic of what you want to store, where and how. I don't see a strong argument for using Cassandra here, and please remember MySQL has a much lower cost/GB and in general we have better management of that storage system.

At risk of muddying the waters, @JoeWalsh and I were looking at the Signal app source code recently and noticed that it uses a rather interesting approach to push notifications: it sends an empty notification which prompts the app to wake up and retrieve message content from Signal's servers. I think this approach could also translate well to web push. The chief benefit of following this pattern would be that no message content would transit through push provider servers, which is a privacy win. Are there any strong reasons (particularly from an operations perspective) to rule out this approach?

Doesn't end-to-end encryption give you the same privacy benefits without the need for the extra fetch? (Which means it also wouldn't break offline / poor connectivity notifications, although we probably don't have much use for that.)

chasemp mentioned this in Unknown Object (Task). Apr 29 2020, 2:44 PM

Hey @Joe, thanks for your feedback on this. Our plans evolved somewhat last week based on your feedback as well as ongoing team-internal discussions. See brief notes below...

  • It's not clear how Extension:Echo would send an event to Extension:Pushnotification in your diagram. I suppose a simple call to a method?

In the plan described above, the PushNotifications extension registers a new Echo notifier type, "push"; in doing so, it also registers a static method that Echo can call when Echo identifies that (a) a user should be notified about an event that has occurred, and (b) push is one of the ways in which the user has chosen to be notified about such events.

That said, we recently agreed that it's best to keep things simpler by defining the new notifier type directly in Echo itself rather than launching a new extension, provided that the Growth team (Echo's primary maintainers) are on board with that plan. (N.B. This is the approach taken in Roan's older WIP patch adding web push.)

  • MediaWiki communicates with pushd; how? I would expect us to use an asynchronous method, and I also expect it to use our Modern Event Platform. Specifically, I see 3 options - and I would like the RFC to be more explicit about it:
    • MediaWiki sends an event to Eventgate, pushd consumes the kafka topic directly
    • MediaWiki sends an event to Eventgate, and we configure change-propagation to submit the notification to pushd
    • MediaWiki spawns a job, and the job takes care of submitting the notification to pushd

Yes, this will definitely happen asynchronously. My currently proposed patch has the HTTP POST request happening in a deferred update, but it could easily be converted to a job to be handled by cpjobqueue. I'm not sure where the line is between what belongs in a deferred update and what belongs in a job queue job; the task to be performed would be an HTTP request to a service running in the same cluster, where we won't be doing anything with the response except probably logging error responses.

  • The reasons for choosing Cassandra are not well explained, and that needs expansion. See my next comment for further details

After some further internal discussion, we've arrived at consensus that MySQL will probably suit our needs as well or better. I don't think we have any specific need to be multi-DC aware at this point.

  • It's not clear to me if it's a good idea to keep the data about push notifications completely in the notification service, or leave it to the caller to manage all those details and keep the service "dumb". This is not discussed in the RfC though, and I'd like to see some more details / rationale on that.

There's no plan for the service to store any data about notifications themselves. Are you referring specifically to subscription data? The main idea here is that the service should be agnostic with respect to the source of incoming notification requests. We want the service to be able to handle notification requests from arbitrary sources without having to make a request in turn to MediaWiki for subscription data.

  • For metrics to collect, let's ensure we make them compatible with Prometheus, and let's use the standard Latency, Traffic, Errors, and Saturation metrics. I will expand in a later comment on this.

Will update, along with other updates to the description to reflect the current state of our thinking.

  • For metrics to collect, let's ensure we make them compatible with Prometheus, and let's use the standard Latency, Traffic, Errors, and Saturation metrics. I will expand in a later comment on this.

When we launched the Wikifeeds service on the pipeline a little while ago, I noticed that a dashboard for the service with those metrics appeared in Grafana, though I'm not sure if it happened automagically or someone other than me went to the trouble of creating it manually. In any case, that suggests that we can pattern our metrics collection after what Wikifeeds is doing.

For now I'll make a note in the Metrics section that we should be collecting these.

Doesn't end-to-end encryption give you the same privacy benefits without the need for the extra fetch? (Which means it also wouldn't break offline / poor connectivity notifications, although we probably don't have much use for that.)

Yeah. We debated this for a while, and decided that the ease of implementing it this way was worth the cost of the extra fetch.

Updated to reflect that subscription data will be managed by Echo.

I think the subscription management diagram should be updated as well.

We don't expect any further changes to the overall architecture. Moving to Last Call.

I should note that we are also consulting separately with security and privacy engineering on user privacy risks and mitigation strategies.

Last Calls are started by TechCom and involve an announcement with end date as well. I've taken note of this and will make sure TechCom reviews this within two weeks. We actually have a meeting later today, so I'll try to bring it up there.

I'll at least want to hear from Joe to know that they have read and confirmed their concerns have been addressed before starting the Last Call. Security and privacy seem like important stakeholders here. That kind of consultation fits within the phase 4 that we are in. I'm not sure it makes sense for us to review and approve this knowing your consultation with them is still pending.

@Mholloway Can you confirm that you've read through and identified no compromises or open questions from the Architecture principles in relation to this service? (see process)

@Krinkle Sorry for jumping the gun. I was under the impression that the RFC process had or was in the process of moving toward a self-service model.

Thanks for the pointers to the process overview doc and Architecture Principles. I've reviewed those principles and updated the RFC description to clearly define the intended target audience and supported target platforms per RUN/MORE. The decision to require an extra fetch for users to retrieve notification content also implicates device and network equity; I've updated our decision record page per EQUITY/DEVICE to explicitly note the tradeoff and indicate that we may need to revisit the decision if it turns out in practice to exclude a significant number of users from receiving push notifications.

kchapman subscribed.

TechCom is placing this on Last Call ending on 27th of May.

@Joe can you confirm your concerns have been addressed?

Sorry, I misspoke yesterday; the Security Preview for this plan is complete. However, the privacy review is still in progress.

  • For metrics to collect, let's ensure we make them compatible with Prometheus, and let's use the standard Latency, Traffic, Errors, and Saturation metrics. I will expand in a later comment on this.

When we launched the Wikifeeds service on the pipeline a little while ago, I noticed that a dashboard for the service with those metrics appeared in Grafana, though I'm not sure if it happened automagically or someone other than me went to the trouble of creating it manually. In any case, that suggests that we can pattern our metrics collection after what Wikifeeds is doing.

For now I'll make a note in the Metrics section that we should be collecting these.

There is some magic involved, of course, but that was because your application uses service-runner. Given pushd is a third-party application, we might need to add support for exposing metrics to it (I didn't check what it exposes already).

TechCom is placing this on Last Call ending on 27th of May.

@Joe can you confirm your concerns have been addressed?

I have just one additional concern at this point.

We're basically creating a service that listens to events emitted by MediaWiki and sends HTTP requests to some vendor, depending on the characteristics of the notification message. This seems to be exactly the job that change-propagation does. I would ask the author of the RFC to explain - briefly - why change-propagation wouldn't be suitable to do the part of the work that pushd is supposed to do, or if further considerations rather than features made them opt for pushd.

Anything else is taken care of as far as I'm concerned and is pretty solidly laid out.

@Joe I've updated the description with a brief discussion of our needs above and beyond what ChangeProp rules alone can handle.

Also, I should clarify our current thinking around development of the push service: our current plan is to start from service-template-node for consistency with our other services, and to incorporate patterns for interacting with the external push services from pushd. So, we do expect to be using service-runner.

@Joe Based on Michael's response it seems that at least as-is ChangeProp cannot support this use case, but I can't tell whether it is feasible/desirable for it to accommodate this use case, e.g. by adding support for it there.

Can you say whether this is desirable to add as responsibility to ChangeProp and/or a derived service that works similar to it? Or would it be fine operationally to have its own service as currently proposed?

/cc @Pchelolo as ChangeProp maintainer. Do you think this is something ChangeProp could/should be made to support?

Based on today's TechCom meeting there are no other concerns and previous issues were addressed, so we're ready to approve this if the above is addressed by Joe/Petr.

@Krinkle I think it's perfectly ok to not use changeprop - I just wanted to get some clarification as to why to be in the RFC so that that analysis is explicit and documented. I have no concerns regarding the RFC as it is.