
Performance review of push notifications infrastructure
Closed, Resolved · Public

Description

The goal of the project is to use push technology to provide MediaWiki notifications to Wikimedia web and app users. The initial focus of the project is on providing push notifications for the apps.

The push notification system consists of two components:

  • Push subscription management and notification request support in the Echo extension; and
  • A standalone service (push-notifications) for batching requests for submission to third party push providers.

In the initial (apps-only) implementation, to mitigate the user privacy and security risks associated with sending potentially sensitive information to users via third-party systems, pushed messages will contain only a short descriptor (e.g., checkEchoV1) to prompt the apps to retrieve message content from Wikimedia servers (e.g., by request to the Action API's notifications module). Further, the push-notifications service will collect push notification requests and send them in periodic batches in order to increase the difficulty of drawing inferences between on-wiki actions and messages traversing the public internet.
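To illustrate the flow (a rough sketch only; the wiki URL and payload field names here are illustrative, not a final wire format):

```typescript
// Illustrative sketch of app-side handling of a content-free push message.
// The push payload carries only a short descriptor; the actual notification
// content is fetched from Wikimedia servers afterward.
interface PushPayload {
    type: string; // e.g. "checkEchoV1"
}

async function onPushReceived(payload: PushPayload): Promise<void> {
    if (payload.type !== 'checkEchoV1') {
        return; // unknown descriptor; ignore
    }
    // Fetch the user's notifications from the Action API notifications
    // module (the apps attach their own session credentials).
    const params = new URLSearchParams({
        action: 'query',
        meta: 'notifications',
        format: 'json',
    });
    const resp = await fetch(`https://en.wikipedia.org/w/api.php?${params}`);
    const data = await resp.json();
    // Render data.query.notifications locally in the app.
    console.log(data.query?.notifications);
}
```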

TechCom RFC: T249065: RFC: Wikimedia Push Notification Service
Target deployment date: TBD

Preview environment

This code is deployed to the Beta Cluster. The push-specific additions to Echo are live on all Beta Cluster wikis, and the extension is configured to send push notification requests to the push-notifications service instance running on deployment-push-notifications01.

Which code to review

Additional work in progress:

Performance assessment

Please initiate the performance assessment by answering the below:

  • What work has been done to ensure the best possible performance of the feature?

I would expect this feature to have little to no performance impact on the sites.

With these changes to Echo, if a site is configured to add a push notifier type, when an Echo event occurs, the push notifier's handler function determines whether the user to be notified has elected to be notified about events of that type by push, and if so, enqueues a job to asynchronously retrieve the user's stored push subscriptions and make a request to the push service. An index has been added to the push subscriptions table (echo_push_subscription) to ensure fast lookups of push subscriptions by a user's central ID.
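For illustration, the shape of that path is roughly as follows (TypeScript pseudocode only; the real logic lives in Echo's PHP code, and all names below are invented stand-ins):

```typescript
// Illustrative outline of the push notifier path on an Echo event.
type EchoEvent = { type: string };
type User = { centralId: number };

// Stubbed preference lookup (in MediaWiki this reads user_properties).
async function userWantsPushFor(user: User, eventType: string): Promise<boolean> {
    return true;
}

// Stubbed job queue insertion (in MediaWiki this enqueues a push notification job).
function enqueueJob(type: string, params: Record<string, unknown>): void {
    console.log('enqueue', type, params);
}

async function onEchoEvent(event: EchoEvent, user: User): Promise<void> {
    // Cheap check: has the user opted in to push for this event type?
    if (!(await userWantsPushFor(user, event.type))) {
        return;
    }
    // Defer the expensive work: the job later fetches the user's rows from
    // echo_push_subscription (indexed by central user ID) and calls the
    // push-notifications service.
    enqueueJob('EchoPushNotificationRequest', {
        centralId: user.centralId,
        eventType: event.type,
    });
}
```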

The push service itself should have no impact on the performance of the sites.

  • What are likely to be the weak areas (e.g. bottlenecks) of the code in terms of performance?

A primary reason for having a standalone service for forwarding push notification requests on to the push platform gateways, as opposed to a lighter-weight solution such as ChangeProp rules, is to implement custom batching behavior recommended during privacy review to help mitigate risks to user privacy. In brief, the goal is to add entropy to the system to make it more difficult for an attacker to infer connections between actions on-wiki and notifications sent to a particular device. These recommendations include sending requests on to push provider gateways in batches; introducing randomized delays before batched request submission; and sending no-op requests to add additional noise.
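As a rough illustration of the intended buffering behavior (a sketch only; names and delay values are placeholders, not the service's actual implementation):

```typescript
// Buffer notification requests and flush them in batches after a randomized
// delay, so that submission timing does not directly mirror on-wiki event
// timing. Placeholder values throughout.
interface NotificationRequest {
    provider: 'apns' | 'fcm';
    deviceToken: string;
    messageType: string; // e.g. "checkEchoV1"
}

const buffer: NotificationRequest[] = [];
const MIN_DELAY_MS = 5_000;
const MAX_DELAY_MS = 60_000;
let flushTimer: NodeJS.Timeout | null = null;

function enqueue(req: NotificationRequest): void {
    buffer.push(req);
    if (flushTimer === null) {
        // Randomized delay before the buffered batch is flushed.
        const delay = MIN_DELAY_MS + Math.random() * (MAX_DELAY_MS - MIN_DELAY_MS);
        flushTimer = setTimeout(flush, delay);
    }
}

async function flush(): Promise<void> {
    const batch = buffer.splice(0, buffer.length);
    flushTimer = null;
    // Forward the batch to the provider gateways; occasional no-op requests
    // could also be mixed in here as additional noise.
    await sendToProviders(batch);
}

async function sendToProviders(batch: NotificationRequest[]): Promise<void> {
    // Stub: the real service talks to APNS/FCM here.
    console.log(`flushing ${batch.length} request(s)`);
}
```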

As such, the service is essentially intended to slow down requests. A key question (and a potential weak point) is how the service will perform under load.

  • Are there potential optimisations that haven't been performed yet?

A more aggressive optimization on the MediaWiki side would be to have the push notifier handler function immediately enqueue a job for each Echo event without doing any check for whether push notifications apply to the user and event type; that would instead be checked at the time the job is performed. This would likely eliminate any possibility of a direct performance impact arising from the feature, but at the cost of sharply increasing the number of jobs enqueued (which might have an indirect performance impact).

  • Please list which performance measurements are in place for the feature and/or what you've measured ad-hoc so far. If you are unsure what to measure, ask the Performance Team for advice: performance-team@wikimedia.org.

No measurements are yet in place. We would appreciate guidance on what measurements are needed in this case, if any.

Event Timeline

Hey @Gilles, we're submitting this as you asked with an initial overview, but we are still fleshing out a tech spec and will be discussing it with TechCom and the architecture team during March.

We will come back and flesh the task out with more detail when we have more detailed information.

I've reviewed the RFC, the plan sounds good to me. Let us know when you have something running.

Gilles changed the task status from Open to Stalled. May 4 2020, 11:05 AM

Just for my own tracking, change it back when there is something new for me to review. Thanks!

Mholloway changed the task status from Stalled to Open. Jul 21 2020, 7:43 PM

Flipping this back to "open" per T246707#6104092. There are a couple of patches still open (mostly adding tests) and one point on which we're waiting for confirmation from Security and Privacy Engineering, but I don't think it needs to block this from moving forward.

Is the existence of the push-notifications service merely for the custom batching? Wasn't it possible to use existing batching in the job queue or extend it to support the needed privacy requirements?

The addition of this service layer introduces unknowns in terms of reliability under load, memory leaks, data loss, etc. It sounds like a custom job queue after the job queue. As much as both the job queue and Firebase Cloud Messaging are well battle-tested, this new service in between has yet to handle this kind of load and to run long-term while processing large amounts of data, which means it's likely to be a pain point and take some time to iron out. It also seems like a missed opportunity not to make the general job queue capable of privacy-preserving batching, which is bound to be a feature we will need again in the future in a different async job context.

I looked around in the Echo code but couldn't find where the notification jobs are actually created. Can you point me to that? You described that some checks (with DB access, I presume) are performed before the decision to create a job is made. Is this logic running during a POST request, or during the "post-send" phase of a GET request? Or maybe it's actually happening inside an Echo job? Essentially I'm trying to understand if those checks and the insertion of the job are happening at a time cost visible to the user, or if it's happening after the HTML of the request has been served to the user, if any.

To start with the easy questions first, about what the code is doing now: jobs are created and pushed onto the job queue in EchoPush\PushNotifier::notifyWithPush(). The only check performed at this point, before enqueuing an EchoPushNotificationRequest job, is to determine whether the user has elected to be notified about the event type with push. This involves querying the user_properties table. The context in which this method is called depends on whether MediaWiki is configured to process notifications for the event type immediately or via the job queue. For event types for which notifications are processed using the job queue (which is the case for most events), it will be called from the context of an EchoNotificationJob. I believe the only type not processed in an EchoNotificationJob is the notification sent when a user's talk page is edited; in this case, notifyWithPush will be called in a deferred update at the end of the POST request in which the talk page edit is made. At the time the EchoPushNotificationRequest job is executed, it then queries the echo_push_subscription table for the user's subscriptions and issues requests to the service as needed.

As for the role of the Node.js service in the overall architecture: yes, I am not entirely surprised at this feedback. Since the decision was made to store subscriptions in a table managed by Echo, rather than having the service itself manage subscription storage, all the service is really doing is providing the custom batching behavior we need and communicating with the provider APIs. That said, there are better libraries available for interacting with the provider APIs in Node.js than in PHP. For example, Firebase provides an official library for API interactions in Node.js but not PHP, and I believe the community-developed Node.js library for interacting with Apple Push Notification Service is more battle-tested than its PHP counterpart. I assume you are envisioning batched requests to the provider APIs being issued directly from the MediaWiki job runners, but I could be wrong about that.

Ok, great, in that case since the notifications are sent from jobs or deferred updates there are no performance concerns for the end user. You might want to instrument the time spent actually calling the service(s) from PHP's perspective, since I assume you wait for request completion to decide whether the job needs to be retried.

Indeed, what I was getting at is that the PHP job would be talking directly to Google and Apple's services and that any special batching you might need could be handled by the job queue.

I understand the concern about the libraries, but I assume that those services expose simple REST APIs? Furthermore, based on your description of the solution to send a generic "please phone home" notification, it sounds like you're not using any advanced features with these notification services. Meaning that as long as the library can send the couple of HTTP requests expected for the general case, it should work just fine.

The APN PHP library seems to be kept up to date: https://github.com/ZhukV/AppleApnPush

As for Google's API, it seems to be just an HTTP request after OAuth2 auth for the general case, which you can really do without a library. And if you want one, this one is actively maintained: https://github.com/kreait/firebase-php and seems trivial to use for basic notification sending.

If you were using some cutting-edge features of those notification services and the PHP libraries were lagging behind, I would understand that a more active Node library would be more appealing. But if we're talking about the most basic kind of notification sending, any reasonably active library project should do. Or even no library if the REST API request construction is trivial for the basic case.
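For reference, the basic-case send against FCM's HTTP v1 API amounts to a single authenticated POST per message; sketched here in TypeScript purely to show the request shape (the project ID is a placeholder, and obtaining the OAuth2 access token from a service account is elided):

```typescript
// Minimal FCM HTTP v1 send: one authenticated POST per message.
async function sendFcmMessage(
    accessToken: string,
    deviceToken: string,
    messageType: string
): Promise<void> {
    const projectId = 'example-project'; // placeholder
    const resp = await fetch(
        `https://fcm.googleapis.com/v1/projects/${projectId}/messages:send`,
        {
            method: 'POST',
            headers: {
                Authorization: `Bearer ${accessToken}`,
                'Content-Type': 'application/json',
            },
            body: JSON.stringify({
                message: {
                    token: deviceToken,
                    data: { type: messageType }, // e.g. "checkEchoV1"
                },
            }),
        }
    );
    if (!resp.ok) {
        throw new Error(`FCM send failed: ${resp.status}`);
    }
}
```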

Just want to note that, although the service is pretty simple as of v1, complexity may increase over time as we keep up with the road map [1]. Even though we might lack knowledge regarding the nuances of the job queue architecture, the reason we chose the service approach was to have more control over the underlying notification logic and possibly to have separate standalone storage to manage notification state, because we might need to solve edge cases around privacy, per-language setup, anonymous users, etc.

[1] https://mediawiki.org/wiki/Wikimedia_Product_Infrastructure_team/Push_Notifications_Infrastructure

I don't see that this added business logic complexity is incompatible with this happening inside a job. In fact it sounds like your service is going to run that logic as a job/request inside the service.

I think that premature separation as a service, before the product demands for features really settle (I expect that once we start having notifications, new demands for new types of notifications will emerge), might put you in a Node Parsoid type of situation. Down the line you might need more context from MediaWiki for your notifications and your service will end up calling MediaWiki APIs, turning what was initially a cleanly separated service into something very tangled with a lot of back-and-forth data chatter. Whereas by "staying" in MediaWiki, you retain direct access to all the data you need. Maybe that's a desirable thing for a new product whose later direction is still a bit unknown.

There might be things you can do with a separate service that are impossible inside the context of the job queue, and where extending the job queue to do those things would be undesirable or impossible. I think you should identify what those are, even if they're only on the roadmap for now.

It's also possible that by looking harder at the job queue and what it can or could do, that some of the batching responsibilities you were envisioning for the service might end up being handled by the MediaWiki job queue, while still retaining a separate service for other reasons. It simply seems to me like you should thoroughly explore the batching potential at that level, because right now you're seemingly creating a layer of batching right after an existing one.

I've looked at the batching logic: is it actually joining messages into a single HTTP request, or is it just sending out all the buffered requests one after the other? The code suggests that it just sends them one after the other, if I understand it correctly.

If the messages aren't batched into fewer HTTP requests sent to the 3rd-party notification services, it sounds like they are only grouped in time?

If the goal for this architecture is to reduce the correlation between when the event happened and when the notification was sent, why isn't a randomised delay after each event sufficient? This is something the MediaWiki job queue can do out of the box with getReleaseTimestamp(). If you want to preserve notification order in time for a given user, you can probably look up existing jobs for the user currently in the queue, or store the next notification timestamp somewhere in the user session.

I don't think that the batching (or rather, delaying all until a counter reaches a certain value) + random time limit adds more entropy than a random delay for each individual event/notification. In fact, in cases of high event traffic I think that in your service's current implementation, there would be no delay, as the max buffer size would be reached quickly/instantly, which defeats the privacy protection attempt. In that sense adding a random delay to each event/notification is a better privacy protection, as it works in any volume scenario.

I've looked at the batching logic: is it actually joining messages into a single HTTP request, or is it just sending out all the buffered requests one after the other? The code suggests that it just sends them one after the other, if I understand it correctly.

Notifications are grouped per type (e.g. CheckEchoV1) and provider (APNS/FCM) and then batched into multicast API requests that each target up to a maximum number of devices (MAX_MULTICAST_RECIPIENTS, defined in config).
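Conceptually, the grouping is along these lines (a simplified sketch, not the service's exact code; the recipient limit value below is a placeholder):

```typescript
// Group queued requests by (provider, message type), then split each group
// into multicast batches of at most MAX_MULTICAST_RECIPIENTS device tokens.
interface QueuedRequest {
    provider: 'apns' | 'fcm';
    messageType: string;
    deviceToken: string;
}

interface MulticastBatch {
    provider: 'apns' | 'fcm';
    messageType: string;
    deviceTokens: string[];
}

const MAX_MULTICAST_RECIPIENTS = 500; // placeholder; the real value comes from config

function batchRequests(requests: QueuedRequest[]): MulticastBatch[] {
    const groups = new Map<string, QueuedRequest[]>();
    for (const req of requests) {
        const key = `${req.provider}:${req.messageType}`;
        const group = groups.get(key) ?? [];
        group.push(req);
        groups.set(key, group);
    }

    const batches: MulticastBatch[] = [];
    for (const group of groups.values()) {
        for (let i = 0; i < group.length; i += MAX_MULTICAST_RECIPIENTS) {
            const chunk = group.slice(i, i + MAX_MULTICAST_RECIPIENTS);
            batches.push({
                provider: chunk[0].provider,
                messageType: chunk[0].messageType,
                deviceTokens: chunk.map((r) => r.deviceToken),
            });
        }
    }
    return batches;
}
```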

OK, I see it now, that's done via getBatchedMessages().

In that case, the job queue doesn't support, as far as I can see, this type of deduplication/grouping of individual jobs. But one could easily store notifications to be sent and schedule a deduplicated job for the type, running with the randomised delay of your liking. That job would pull all pending notifications of a type, group them and send them.

The advantage of storing that list of pending notifications (in a DB table, probably) is that it would be more robust. Currently if this service dies, you lose an unlimited amount of notifications that were only stored in the service's memory. If the service dies due to overload or OOM, the amount of lost notifications could be large. There's no redundancy either if the machine hosting an instance of the service dies. Worse, you don't even know which notifications were lost when a failure event happens without digging in the service logs (if the service logs this at all).

A DB-backed queue consumed by a MediaWiki job would, thanks to DB transactions, only delete pending notifications once they've been successfully delivered to the different notification services. This also makes the system more robust in case of an outage at Apple or Google, for example. Pending notifications would keep being queued and retried until the 3rd party becomes responsive again.

The advantage of storing that list of pending notifications (in a DB table, probably) is that it would be more robust. Currently if this service dies, you lose an unlimited amount of notifications that were only stored in the service's memory. If the service dies due to overload or OOM, the amount of lost notifications could be large. There's no redundancy either if the machine hosting an instance of the service dies. Worse, you don't even know which notifications were lost when a failure event happens without digging in the service logs (if the service logs this at all).

Agreed, just want to note that this is currently expected behavior since the goal for v1 was to have something minimal in order to validate the whole push-notification concept with the app's clients. For that, we chose [1] to "fire and forget" the notifications and work on persistence in a later stage coordinating with the product road-map needs.

[1] https://www.mediawiki.org/wiki/Wikimedia_Product_Infrastructure_team/Push_Notifications_Infrastructure/Design_Decisions#No_delivery_guarantees_for_v1

Sure, what I meant is that if this was built on top of MediaWiki jobs, you would get persistence "for free", because that's how you would transmit the information from the events to a later job that does the batching. For the architecture choice it's important to see where this is going, or likely to be going, even if some features aren't part of the first phase requirements.

If you're going to stick with the Node service, the issue of lack of batching delay under high load needs to be addressed anyway. One way to do this in the current setup would be to add a random delay to NotificationRequestJob via getReleaseTimestamp(). This would avoid reinventing the wheel in the service and have the same effect, with individual notifications spread randomly over time, regardless of batching spikes. If notification order is a concern, this requires a bit more work and you might have to store a cursor of some kind for the user, i.e. something to ensure that the next fuzzed timestamp is always beyond the last notification's fuzzed timestamp.
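To make the cursor idea concrete (an illustrative sketch; the value computed here is what you would feed into getReleaseTimestamp(), and the in-memory map stands in for wherever the per-user cursor would actually be stored):

```typescript
// Compute a randomized "release" timestamp for a notification job while
// keeping per-user ordering: the next fuzzed timestamp is never earlier
// than the previous one issued for the same user. All names are illustrative.
const MAX_FUZZ_MS = 10 * 60 * 1000; // up to 10 minutes of random delay (placeholder)

// Last fuzzed timestamp issued per user (stand-in for a cursor kept in the
// user session or a small table).
const lastReleaseByUser = new Map<number, number>();

function fuzzedReleaseTimestamp(centralUserId: number, now = Date.now()): number {
    const candidate = now + Math.random() * MAX_FUZZ_MS;
    const previous = lastReleaseByUser.get(centralUserId) ?? 0;
    // Preserve notification order for the user.
    const release = Math.max(candidate, previous + 1);
    lastReleaseByUser.set(centralUserId, release);
    return release;
}
```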

With this, you can actually get rid of the timestamp-based random dispatch logic in the service, as the notifications would already arrive to the service with random timing.

It's hard for me to figure out ahead of time how well this particular service is going to perform. I would advise doing some load testing on the beta instance, watching out for memory leaks and spikes in particular when the service gets a lot of queued notifications in memory. It might be a good idea to have a hard limit on that, with the service returning errors at individual notification intake when its in-memory queue gets too large. This would give the MediaWiki job feeding them a chance to retry later once the service has gone over the bump.
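Something along these lines at intake would do (a minimal sketch; the limit value is a placeholder):

```typescript
// Reject new notifications at intake when the in-memory queue is already too
// large, so the MediaWiki job that submitted them can retry later.
const MAX_PENDING = 10_000; // placeholder limit
const pending: object[] = [];

// Returns an HTTP-style status for the intake attempt.
function accept(notification: object): { status: number; body: string } {
    if (pending.length >= MAX_PENDING) {
        // 503 signals "try again later" to the submitting job.
        return { status: 503, body: 'queue full, retry later' };
    }
    pending.push(notification);
    return { status: 200, body: 'accepted' };
}
```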

The scale of the load tests should be straightforward to determine based on the relevant Echo events happening in production at the moment. If that's not instrumented yet, then it should be ahead of the introduction of this new feature.

It would be nice to have a dashboard that shows:

  • Count of relevant Echo events (with or without notifications). This shows the potential volume and spikes if all users used notifications.
  • Time spent inside the notification job while sending notifications to the service
  • Error counts for failed service calls
  • Memory usage of the service (this will indirectly reveal OOMs and restarts)
  • Time spent while sending HTTP requests to the 3rd-party services
  • Error counts (and success counts) for requests to the 3rd-party services
  • Time spent by notifications in the service's in-memory queue
  • Total delta time between the Echo event and when the notification was delivered to the 3rd-party service. Some of that delay is deliberate, but it's important to catch outliers where notifications might get stuck for hours

Change 621031 had a related patch set uploaded (by Mholloway; owner: Michael Holloway):
[mediawiki/services/push-notifications@master] Do not limit the queue size by default

https://gerrit.wikimedia.org/r/621031

Change 621031 merged by jenkins-bot:
[mediawiki/services/push-notifications@master] Do not limit the queue size by default

https://gerrit.wikimedia.org/r/621031

Change 618307 had a related patch set uploaded (by Jgiannelos; owner: Jgiannelos):
[mediawiki/services/push-notifications@master] Add metrics instrumentation for APNS/FCM

https://gerrit.wikimedia.org/r/618307

Change 618307 merged by jenkins-bot:
[mediawiki/services/push-notifications@master] Add metrics instrumentation for APNS/FCM

https://gerrit.wikimedia.org/r/618307

Since the follow-up tasks have been filed, this concludes the perf review. Thank you all!

Feel free to reopen one if significant changes happen in a later phase of the project, etc.

Thanks again for the review, Gilles.