Explore an API for logging events sampled by session
Open, Medium, Public

Description

In several extensions (RelatedArticles, Page-Previews) we have moved to sampling logging by buckets determined from a stable session id, in order to consistently log all events in a user session. For context see, for example, T167236: userSessionToken in RelatedArticles schema does not seem to survive beyond one pageview.

Things we duplicate across projects right now for sampling by user session:

  • Checking if window.navigator.sendBeacon is available
  • Calling mw.experiments with mw.user.sessionId and checking the bucket to set the logging on or off
  • Setting the logger to mw.track or noop or the schema sampling to 0 or 1 depending on the previous check.

This is a bit of boilerplate. It is nothing too serious, but if this type of logging is going to be common then it would be worth abstracting.

What

Explore different options for a unique API shared across mediawiki extensions to log events sampled by user session.

Some must haves

  • Configurable sampling rate
  • Initial events will be delivered (to log events like page/feature loaded)
  • No PHP conditional registration of ResourceLoader modules and async loading of such modules in the client

AC

  • List proposals of APIs w/ pros and cons
  • Discuss
  • Create follow up tasks to implement the API, and update existing code to use it

Proposals

1. Use Extension:WikimediaEvents with new event buses

  • Unconditionally log with mw.track
  • Implement two more buses in WikimediaEvents that take a configuration param:
    • sampled_event.session.<Schema> (config: Config, eventData: any)
      • buckets/samples based on mw.user.sessionId
      • do we need mw.experiments.getBucket({ name }) in the config or would the schema name be enough to derive a unique name?
    • sampled_event.page.<Schema> (config: Config, eventData: any)
      • samples based on mw.eventLog.inSample like mw.eventLog.Schema
    • Where Config: { samplingRate: BoundedNumber<0, 1> }
Examples
// Log event unconditionally
mw.track( 'event.Popups', { action: 'page_loaded' } )

// Log event sampled by page view
mw.track(
  'sampled_event.page.Popups',
  { samplingRate: 0.01 },
  { action: 'page_loaded' } )

// Log event sampled by user session
mw.track(
  'sampled_event.session.Popups',
  { samplingRate: 0.01 },
  { action: 'page_loaded' } )
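
A rough sketch of what the subscriber side of Proposal 1 could look like. Everything here is hypothetical: makeSampledHandler, deps.inSample and deps.logEvent are names invented for illustration and stand in for whatever WikimediaEvents would wire up to mw.trackSubscribe, mw.user.sessionId and mw.eventLog. It only shows how the topic could be split into a sampling strategy and a schema name, and how the Config could be validated.

```javascript
// Hypothetical sketch: none of these names exist in WikimediaEvents.
// deps.inSample( strategy, rate ) stands in for the real sampling check,
// deps.logEvent( schema, eventData ) for the real logging call.
function makeSampledHandler( deps ) {
	var PREFIX = 'sampled_event.';

	// Returns true when the event was logged, false when it was dropped.
	return function ( topic, config, eventData ) {
		var parts, strategy, schema, rate;

		if ( topic.indexOf( PREFIX ) !== 0 ) {
			return false;
		}
		// 'sampled_event.session.Popups' -> [ 'session', 'Popups' ]
		parts = topic.slice( PREFIX.length ).split( '.' );
		strategy = parts[ 0 ];
		schema = parts.slice( 1 ).join( '.' );
		rate = config && config.samplingRate;

		if ( !schema || ( strategy !== 'session' && strategy !== 'page' ) ) {
			return false;
		}
		// Config: { samplingRate: BoundedNumber<0, 1> }
		if ( typeof rate !== 'number' || rate < 0 || rate > 1 ) {
			return false;
		}
		if ( !deps.inSample( strategy, rate ) ) {
			return false;
		}
		deps.logEvent( schema, eventData );
		return true;
	};
}
```

With this shape the calling extension never touches the sampling logic; it only emits the event and the rate.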

2. Use Extension:WikimediaEvents with the existing bus

  • Unconditionally log with mw.track( 'event.<Schema>' )
  • Overload the event signature to be able to receive a second parameter that would be logging options
    • Now: event.<Schema> (eventData: any)
    • Proposed: event.<Schema> (eventData: any, [config: Config])
      • Optional config: Config: { samplingRate: BoundedNumber<0, 1>, sampleBy: SamplingStrategy }
        • SamplingStrategy: SESSION | PAGE
          • SESSION samples based on mw.user.sessionId
          • PAGE samples based on mw.eventLog.inSample like mw.eventLog.Schema
      • Do we need mw.experiments.getBucket({ name }) in the config or would the schema name be enough to derive a unique name?
Examples
// Log event unconditionally
mw.track( 'event.Popups', { action: 'page_loaded' } )

// Log event sampled by page view
mw.track(
  'event.Popups',
  { action: 'page_loaded' },
  { sampleBy: "PAGE", samplingRate: 0.01 } )

// Log event sampled by user session
mw.track(
  'event.Popups',
  { action: 'page_loaded' },
  { sampleBy: "SESSION", samplingRate: 0.01 } )
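
For comparison, a minimal sketch of how the optional second parameter of Proposal 2 might be interpreted. shouldLog and the samplers map are hypothetical names, not an existing EventLogging API; the samplers would be backed by mw.user.sessionId and mw.eventLog.inSample in a real implementation. The point is that omitting the config keeps today's unconditional behaviour.

```javascript
// Hypothetical helper for Proposal 2. `samplers` maps a SamplingStrategy
// name ( 'SESSION' | 'PAGE' ) to a function ( samplingRate ) => boolean.
function shouldLog( config, samplers ) {
	var rate;

	// No config: behave like event.<Schema> does now and always log.
	if ( config === undefined ) {
		return true;
	}
	if ( !Object.prototype.hasOwnProperty.call( samplers, config.sampleBy ) ) {
		throw new Error( 'Unknown sampleBy: ' + config.sampleBy );
	}
	rate = config.samplingRate;
	if ( typeof rate !== 'number' || rate < 0 || rate > 1 ) {
		throw new Error( 'samplingRate must be a number between 0 and 1' );
	}
	return samplers[ config.sampleBy ]( rate );
}
```

Because existing callers pass no second argument, they would be unaffected by the overload.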

Event Timeline

Jhernandez updated the task description.

I've listed a proposal as an example. It may suck so comment or add other proposals to it.

Jdlrobson triaged this task as Medium priority. Jun 20 2017, 4:42 PM
Jdlrobson added a subscriber: Nuria.

Ah OK, I thought we had already decided to settle on a per-schema approach for now. But yes, there are good arguments for implementing a general solution too. Apropos, a third web schema that was meant to sample by browser session (but didn't, in that case) was MobileWebSectionUsage. See the May 2016 discussion at T128931#2283390 (comments by myself, @dr0ptp4kt and @Jdlrobson - unfortunately in that case it was too late to re-run the experiment and we had to abandon part of the planned analysis).

CCing @EBernhardson and @mpopov because they have experience and perspective from somewhat similar situations in Discovery (not using browser sessions, but search sessions - still, I guess there are parallels).

We have settled on a per-schema approach for now but would like to have a general solution long term given the amount of pain we have setting these up each time.

To be clear, the approach listed on WikimediaEvents is per schema.

An event would be emitted to a bus with the schema name like we already do:

mw.track( 'sampled_event.session.Popups', { samplingRate: 0.01 }, { action: 'pageLoaded' } )
// or
mw.track(
  'sampled_event.session.Popups',
  { samplingRate: mw.config.get( 'wgPopupsSchemaSamplingRate' ) },
  { action: 'pageLoaded' } )

This proposal adds two different channels for sampled logging, by page view or by session, each accepting some configuration for sampling purposes.

The existing channel event.<Schema> logs unconditionally, which is useful, but we need these other two ways of logging.

Example of what we already do (by duplicating the sampling setup and logic across projects, which is what we want to fix):

// get sampling rate and bucket users with mw.experiments boilerplate
var samplingRate = mw.config.get( 'wgPopupsSchemaSamplingRate' )
var enabled = mw.experiments.getBucket( {
	enabled: true,
	name: 'ext.popups.logging',
	buckets: {
		'enabled': samplingRate,
		'disabled': 1 - samplingRate
	}
}, mw.user.sessionId() ) === 'enabled'
var log = enabled ? mw.track : $.noop

// Somewhere else where we want to log
log(
  'event.Popups',
  { action: 'pageLoaded' } )

I don't like the sampled_event prefix name, but event already grabs everything after it as the schema name event.<Schema> so if we were to add the sub namespaces to that name it would be a breaking change (event.<sampled> would collide with event.<schemaName>).
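
To illustrate the collision, here is a small sketch that assumes the existing subscriber matches topics by the 'event.' prefix and treats the remainder as the schema name. schemaFromTopic is a made-up helper, not the actual EventLogging code.

```javascript
// Assumed behaviour: everything after the 'event.' prefix is the schema name.
function schemaFromTopic( topic ) {
	return topic.slice( 'event.'.length );
}

schemaFromTopic( 'event.Popups' ); // 'Popups'
// A sub-namespace under 'event.' still matches the prefix, so the existing
// subscriber would look for a schema literally named 'session.Popups'.
schemaFromTopic( 'event.session.Popups' ); // 'session.Popups'
```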

Could we please clarify what the overall goal is here, because I am not sure I understand. An experiment and a schema are not the same thing, as an experiment might include various schemas. Is the idea to set up an experiment to be sticky (or not) and afterwards just instrument all the code and let the calls themselves decide whether the event needs to be sent? (I might be totally off here)

Some side notes:

Checking if window.navigator.sendBeacon is available

Whether sendBeacon is available has little to do with sampling: if it is not available, events will be logged the old-fashioned way (image beacon). I am not sure that code is needed at all; the only benefit is getting logging on page transitions for some IE users for whom the logging might not happen otherwise.

To be clear, the approach listed on WikimediaEvents is per schema.

The approach is per "experiment", right? You could group several schemas under an experiment, correct? An experiment is just a string identifier.

Could we please clarify what the overall goal is here, because I am not sure I understand.

Sure. In a few Reading Web projects we are changing event sampling from per page view to per user session.

We are duplicating code and logic for sampling by user session.

This task is about exploring a unifying API for logging events sampled by session so that we don't duplicate setup code and sampling logic across projects that much.

An experiment and a schema are not the same thing, as an experiment might include various schemas. Is the idea to set up an experiment to be sticky (or not) and afterwards just instrument all the code and let the calls themselves decide whether the event needs to be sent? (I might be totally off here)

The word "experiment" appears in the description and in previous comments only because mw.experiments.getBucket takes a name; hence the mentions of an experiment name.

This task has nothing to do with A/B tests or experiments in the analytics sense. It is just about coming up with an API for logging events sampled by session.

Sorry for the confusion, I'll update the title and description.

Some side notes:

Checking if window.navigator.sendBeacon is available

Whether sendBeacon is available has little to do with sampling: if it is not available, events will be logged the old-fashioned way (image beacon). I am not sure that code is needed at all; the only benefit is getting logging on page transitions for some IE users for whom the logging might not happen otherwise.

I believe that is the reason. Since we would sample events by user session, I understand we want to get all events in that user session and not lose them, so sendBeacon helps with that.

I have nothing against dropping this check. We can filter events by browser too, so if it is IE that loses events we can exclude it at analysis time if necessary.

To be clear, the approach listed on WikimediaEvents is per schema.

The approach is per "experiment", right? You could group several schemas under an experiment, correct? An experiment is just a string identifier.

No mention of experiments, just sampling events by page view or user session using the event buses.

Let me know if it makes sense.

Jhernandez renamed this task from Spike: Explore a unifying API for logging by session to Spike: Explore an API for logging events sampled by session. Jun 28 2017, 9:33 AM
Jhernandez updated the task description.

I've clarified title and description, removed the experiment word from the description.


I think there is another option with the existing event.<Schema> event bus, by overloading the existing signature with arity 2 and having the second parameter be the logging options.

I'll flesh it out now.

Does Wikimedia-Events have a project?

Jhernandez raised the priority of this task from Low to Medium. May 16 2018, 3:04 PM
Jhernandez renamed this task from Spike: Explore an API for logging events sampled by session to Explore an API for logging events sampled by session. May 16 2018, 3:07 PM

I think this might be of interest. I am not sure whether this code existed in EL when this ticket was created, but you can sample per session consistently like this:

// Consistent sampling within browser session
mw.eventLog.randomTokenMatch( num popSize, str token )

Link to source:

https://github.com/wikimedia/mediawiki-extensions-EventLogging/blob/ab184351ede28dfaf12e3ad367d53a556be69c9d/modules/ext.eventLogging.subscriber.js#L65
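
For readers who don't want to follow the link, a "1 in N" token match of this kind can be sketched roughly as follows. This is a simplified illustration, not a copy of the linked source: tokenMatch is a made-up name, and it assumes the token is a hex string whose first 8 characters are read as an integer.

```javascript
// Simplified illustration of a "1 in popSize" match on a token.
// Assumption: `token` is a hex string of at least 8 characters.
function tokenMatch( popSize, token ) {
	var rand = parseInt( token.slice( 0, 8 ), 16 );
	// The same token always gives the same answer, so the decision is
	// stable for as long as the token (for example a session id) is.
	return rand % popSize === 0;
}

tokenMatch( 4, '00000010aaaaaaaa' ); // true: 0x10 = 16, and 16 % 4 === 0
tokenMatch( 5, '00000010aaaaaaaa' ); // false: 16 % 5 === 1
```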

Thanks! interesting to see.

The helper seems to be there. The issue with this approach is that extensions then have to depend unconditionally on the EventLogging extension, which we were avoiding by just using mw.track. This makes every extension that needs to work with both MediaWiki and Wikimedia more complex and potentially less performant: the possible absence of the EventLogging extension has to be considered everywhere we want to use one of its helpers, and the helpers can't be declared as unconditional dependencies in extension.json or the extension would break for third parties. Instead they need to be lazy-loaded with mw.loader, potentially delaying the bootstrapping of code and making the feature less performant.

When we use mw.track, the extension sends events to the bus and can go about its business without worrying about other extensions; when trackSubscribe happens on the other side, the subscriber gets all the events at once to act on. This makes for a great separation of concerns, especially for MediaWiki extensions that need to work regardless of whether the EventLogging or WikimediaEvents extension is present.
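
The decoupling described here can be illustrated with a toy bus. This is hypothetical code written for this comment, not mw.track's source; createBus is an invented name. It only demonstrates the publish-before-subscribe behaviour: events tracked before any subscriber exists are queued and replayed once one registers.

```javascript
// Toy event bus: producers call track() without knowing whether anyone
// listens; a late subscriber still receives everything tracked so far.
function createBus() {
	var queue = [],
		subscribers = [];

	return {
		track: function ( topic, data ) {
			queue.push( { topic: topic, data: data } );
			subscribers.forEach( function ( s ) {
				if ( topic.indexOf( s.prefix ) === 0 ) {
					s.handler( topic, data );
				}
			} );
		},
		trackSubscribe: function ( prefix, handler ) {
			// Replay matching events from before the subscription...
			queue.forEach( function ( e ) {
				if ( e.topic.indexOf( prefix ) === 0 ) {
					handler( e.topic, e.data );
				}
			} );
			// ...then keep receiving new ones.
			subscribers.push( { prefix: prefix, handler: handler } );
		}
	};
}
```

If no subscriber ever registers (for example on a third-party wiki without EventLogging), the producer's calls are simply never acted on and nothing breaks.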

Maybe we can upstream these kinds of helpers to core so that they can be relied on by extensions without having to depend on other extensions.

EL will be simplified, and from that moment onwards there should not be any performance issues in depending on it: https://phabricator.wikimedia.org/T187207

Jdlrobson renamed this task from Explore an API for logging events sampled by session to [EPIC] Explore an API for logging events sampled by session. Jun 12 2018, 4:27 PM
Jdlrobson added a project: Epic.

This feels like a good technical project relating to beta data, but it would need coordination and discussion among a number of teams, and it is not something we can commit to right now.

Jhernandez renamed this task from [EPIC] Explore an API for logging events sampled by session to Explore an API for logging events sampled by session. Jul 6 2018, 12:49 PM
Restricted Application edited projects, added Analytics; removed Analytics-Radar. Jun 10 2020, 6:33 AM
Restricted Application edited projects, added Analytics; removed Analytics-Radar. Jun 10 2020, 6:36 AM

I believe this was addressed by the addition of mw.eventLog.inSample ?

I believe this was addressed by the addition of mw.eventLog.inSample ?

There are currently two mechanisms:

We have mw.eventLog.eventInSample() and mw.eventLog.sessionInSample(), which determine in-sample/out-sample based on the pageview token and session token respectively; and we have mw.eventLog.randomTokenMatch(), which determines in-sample/out-sample based on the given token.

Per https://wiki.c2.com/?ApiVsProtocol:

An API provides a library that you must link with to use the services. This tightly binds the client and server together. The API tends invade all code layers and creates massive dependencies between layers. It also tends to be simple to use.

A protocol defines a standard request response layer and a common transport. Nothing other than the standard binds the client and server together. Protocols are more complex to use as they are less direct and take a lot of serializing/deserializing/dispatching type logic.

We do have an API and a protocol for submitting events:

API: mw.eventLog.submit( 'Foo' )
Protocol: mw.track( 'event.Foo' )

NOTE: There's a little work to do in updating the protocol so that it works with EP streams as well as legacy EventLogging streams.

We do have an API for sampling but we don't have a protocol. I think this is what @Jhernandez was driving at in T168380#3366956 all those years ago. We also don't know if there's a need for a protocol.

I propose that:

  1. We close this task as Resolved because it's about an API for sampling (per the task's title), which we definitely have
  2. We create two tasks:
    1. A task to determine whether a protocol for sampling is needed
    2. A task to determine what the protocol could look like, with the initial proposal being that in T168380#3366956

If we do determine that there's a need, then we can create a task to implement the protocol.

@phuedx: Some extra details as you follow up on the protocol part. Here are some thoughts on the topic from me and Jason: https://docs.google.com/document/d/1cZwIr6CsFtf6YZjfwr8xjscYN2f2ZhX9eAskiL1UePo/edit

We ended up adopting the algorithm used in Search Satisfaction instrumentation (removed as the instrument switched to 100% in-sample)

You can see it's the same algorithm as used in Event Platform client for iOS (https://github.com/wikimedia/wikipedia-ios/blob/1fedbef5975ed34595c9f7daf4dfbabb74401318/WMF%20Framework/Event%20Platform/SamplingController.swift#L84-L103) and for Android (https://github.com/wikimedia/apps-android-wikipedia/blob/eb905f179b2efb2cb27a564e744b088263010688/app/src/main/java/org/wikipedia/analytics/eventplatform/EventPlatformClient.kt#L283-L315), which is why we made sure to have the session IDs have the same spec as https://doc.wikimedia.org/mediawiki-core/master/js/#!/api/mw.user-method-generateRandomSessionId

It's a great algorithm because it has the properties we want:

  • Specifying sampling as a % instead of a "1 in X" method
  • A session that's in-sample at 1% is also in-sample at 2%, 5%, etc
  • Uniform distribution

I did a simulation study and tested the algorithm to verify these properties.
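
A minimal sketch of a sampler with these properties, for illustration only. It assumes the session ID is a hex string at least 8 characters long and maps its first 32 bits onto [0, 1]; sessionInSample here is a standalone function written for this example (not the mw.eventLog method of the same name), and the explicit handling of rates at or beyond 0 and 1 is an assumption rather than something taken from the linked clients.

```javascript
// Percentage-based, deterministic sampling on a session ID.
// Assumption: `sessionId` is a hex string of at least 8 characters.
function sessionInSample( sessionId, rate ) {
	var UINT32_MAX = 4294967295, // 2^32 - 1
		token;

	if ( rate >= 1 ) {
		return true;
	}
	if ( rate <= 0 ) {
		return false;
	}
	// Map the first 8 hex characters onto [0, 1].
	token = parseInt( sessionId.slice( 0, 8 ), 16 );
	// The comparison is against a threshold, so raising the rate can only
	// add sessions to the sample, never remove them.
	return token / UINT32_MAX < rate;
}

sessionInSample( '01000000ffffffffffff', 0.01 ); // true (about 0.0039 < 0.01)
sessionInSample( '01000000ffffffffffff', 0.05 ); // true, still in-sample
sessionInSample( '40000000ffffffffffff', 0.01 ); // false (about 0.25)
```

Uniformity of the sample then follows from the session IDs themselves being uniformly random, which the sketch takes as given.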