
Explore an API for logging events sampled by session
Open, NormalPublic

Description

In several extensions (RelatedArticles, Page-Previews) we have moved to sampling logging by buckets determined from a stable session id, in order to consistently log all events in a user session. See T167236: userSessionToken in RelatedArticles schema does not seem to survive beyond one pageview for context, for example.

Things we duplicate across projects right now for sampling by user session:

  • Checking if window.navigator.sendBeacon is available
  • Calling mw.experiments with mw.user.sessionId and checking the bucket to turn logging on or off
  • Setting the logger to mw.track or noop or the schema sampling to 0 or 1 depending on the previous check.

This is a bit of boilerplate. It is nothing too serious, but if this type of logging is going to become common then it would be worth abstracting.

What

Explore different options for a single API, shared across MediaWiki extensions, to log events sampled by user session.

Some must haves

  • Configurable sampling rate
  • Initial events will be delivered (to log events like page/feature loaded)
  • No PHP conditional registration of ResourceLoader modules, and no async loading of such modules in the client

AC

  • List proposals of APIs w/ pros and cons
  • Discuss
  • Create follow up tasks to implement the API, and update existing code to use it

Proposals

1. Use Extension:WikimediaEvents with new event buses

  • Unconditionally log with mw.track
  • Implement two more buses in WikimediaEvents that take a configuration param:
    • sampled_event.session.<Schema> (config: Config, eventData: any)
      • buckets/samples based on mw.user.sessionId
      • do we need mw.experiments.getBucket({ name }) in the config or would the schema name be enough to derive a unique name?
    • sampled_event.page.<Schema> (config: Config, eventData: any)
      • samples based on mw.eventLog.inSample like mw.eventLog.Schema
    • Where Config: { samplingRate: BoundedNumber<0, 1> }
Examples
// Log event unconditionally
mw.track( 'event.Popups', { action: 'page_loaded' } )

// Log event sampled by page view
mw.track(
  'sampled_event.page.Popups',
  { samplingRate: 0.01 },
  { action: 'page_loaded' } )

// Log event sampled by user session
mw.track(
  'sampled_event.session.Popups',
  { samplingRate: 0.01 },
  { action: 'page_loaded' } )
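
As a rough illustration of the subscriber side of the session-sampled bus, here is a minimal sketch. It assumes the session token is a hex string (as a token from mw.user.sessionId is expected to be), and it injects the logger and the token getter so it can run outside MediaWiki. The names isSessionInSample and makeSessionSampledHandler are hypothetical and are not part of WikimediaEvents.

```javascript
// Hypothetical helper: map a hex session token to a stable number in [0, 1)
// and compare it with the sampling rate. The same token always gets the
// same decision, so every event in a session is either logged or dropped.
function isSessionInSample( sessionToken, samplingRate ) {
	// Assumption: the first 8 characters of the token are hex digits (32 bits).
	var value = parseInt( sessionToken.slice( 0, 8 ), 16 );
	return value / 0x100000000 < samplingRate;
}

// Hypothetical handler factory for topics like 'sampled_event.session.<Schema>'.
// logEvent( schema, eventData ) and getSessionToken() are injected stand-ins
// for whatever the real subscriber would use.
function makeSessionSampledHandler( logEvent, getSessionToken ) {
	var prefix = 'sampled_event.session.';
	return function ( topic, config, eventData ) {
		var schema = topic.slice( prefix.length );
		if ( isSessionInSample( getSessionToken(), config.samplingRate ) ) {
			logEvent( schema, eventData );
			return true;
		}
		return false;
	};
}
```

With this shape the calling extension never needs to know whether the session is in the sample; it always emits the event and the handler decides.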

2. Use Extension:WikimediaEvents with the existing bus

  • Unconditionally log with mw.track( 'event.<Schema>' )
  • Overload the event signature to be able to receive a second parameter that would be logging options
    • Now: event.<Schema> (eventData: any)
    • Proposed: event.<Schema> (eventData: any, [config: Config])
      • Optional config: Config: { samplingRate: BoundedNumber<0, 1>, sampleBy: SamplingStrategy }
        • SamplingStrategy: SESSION | PAGE
          • SESSION samples based on mw.user.sessionId
          • PAGE samples based on mw.eventLog.inSample like mw.eventLog.Schema
      • Do we need mw.experiments.getBucket({ name }) in the config or would the schema name be enough to derive a unique name?
Examples
// Log event unconditionally
mw.track( 'event.Popups', { action: 'page_loaded' } )

// Log event sampled by page view
mw.track(
  'event.Popups',
  { action: 'page_loaded' },
  { sampleBy: "PAGE", samplingRate: 0.01 } )

// Log event sampled by user session
mw.track(
  'event.Popups',
  { action: 'page_loaded' },
  { sampleBy: "SESSION", samplingRate: 0.01 } )
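
A minimal sketch of how a handler for the overloaded signature might interpret the optional config. The function shouldLog and the samplers map are hypothetical; in MediaWiki the two samplers would wrap session-based and page-based sampling, but here they are injected so the sketch is self-contained.

```javascript
// Hypothetical decision function for 'event.<Schema>' with an optional config.
// samplers maps a strategy name ('SESSION', 'PAGE') to a
// function( samplingRate ) that returns true when the event should be logged.
function shouldLog( config, samplers ) {
	var sampler;
	// No config: keep the current behaviour and log unconditionally.
	if ( !config ) {
		return true;
	}
	if ( !Object.prototype.hasOwnProperty.call( samplers, config.sampleBy ) ) {
		throw new Error( 'Unknown sampleBy strategy: ' + config.sampleBy );
	}
	sampler = samplers[ config.sampleBy ];
	return sampler( config.samplingRate );
}
```

Because a missing config falls through to unconditional logging, existing callers of event.<Schema> would keep working unchanged under this assumption.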

Event Timeline

Restricted Application added a subscriber: Aklapper. · Jun 20 2017, 10:37 AM
Jhernandez updated the task description.

I've listed a proposal as an example. It may suck so comment or add other proposals to it.

phuedx removed a subscriber: phuedx.
phuedx added a subscriber: phuedx.

^ lol

Jdlrobson triaged this task as Normal priority. · Jun 20 2017, 4:42 PM
Jdlrobson added a subscriber: Nuria.

Ah OK, I thought we had already decided to settle on a per-schema approach for now. But yes, there are good arguments for implementing a general solution too. Apropos, a third web schema that was meant to sample by browser session (but didn't, in that case) was MobileWebSectionUsage. See the May 2016 discussion at T128931#2283390 (comments by myself, @dr0ptp4kt and @Jdlrobson - unfortunately in that case it was too late to re-run the experiment and we had to abandon part of the planned analysis).

CCing @EBernhardson and @mpopov because they have experience and perspective from somewhat similar situations in Discovery (not using browser sessions, but search sessions - still, I guess there are parallels).

We have settled on a per-schema approach for now but would like to have a general solution long term given the amount of pain we have setting these up each time.

To be clear, the approach listed on WikimediaEvents is per schema.

An event would be emitted to a bus with the schema name like we already do:

mw.track( 'sampled_event.session.Popups', { samplingRate: 0.01 }, { action: 'pageLoaded' } )
// or
mw.track(
  'sampled_event.session.Popups',
  { samplingRate: mw.config.get( 'wgPopupsSchemaSamplingRate' ) },
  { action: 'pageLoaded' } )

This proposal adds two different channels for sampled logging, one sampled by page view and one by session, each of which accepts some configuration for the sampling.

The existing channel event.<Schema> logs unconditionally, which is useful, but we need these other two ways of logging.

Example of what we already do (by duplicating the sampling setup and logic across projects, which is what we want to fix):

// get sampling rate and bucket users with mw.experiments boilerplate
var samplingRate = mw.config.get( 'wgPopupsSchemaSamplingRate' )
var enabled = mw.experiments.getBucket( {
	enabled: true,
	name: 'ext.popups.logging',
	buckets: {
		'enabled': samplingRate,
		'disabled': 1 - samplingRate
	}
}, mw.user.sessionId() ) === 'enabled'
var log = enabled ? mw.track : $.noop

// Somewhere else where we want to log
log(
  'event.Popups',
  { action: 'pageLoaded' } )

I don't like the sampled_event prefix name, but event already grabs everything after it as the schema name event.<Schema> so if we were to add the sub namespaces to that name it would be a breaking change (event.<sampled> would collide with event.<schemaName>).
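
To illustrate the collision, here is a sketch that assumes the subscriber treats everything after the event. prefix as the schema name. The function schemaFromTopic is hypothetical and only mimics that prefix stripping.

```javascript
// Hypothetical: derive the schema name the way a subscriber to the
// 'event.' prefix is assumed to, by stripping the prefix from the topic.
function schemaFromTopic( topic ) {
	return topic.replace( /^event\./, '' );
}

schemaFromTopic( 'event.Popups' ); // 'Popups'
// A sub-namespace under the same prefix is read as part of the schema name:
schemaFromTopic( 'event.session.Popups' ); // 'session.Popups'
```

Under that assumption an existing subscriber would try to log against a schema called session.Popups, which is why a separate prefix is used.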

Nuria added a comment. · Edited · Jun 27 2017, 8:48 PM

Could we please clarify what the overall goal is here, because I am not sure I understand. An experiment and a schema are not the same thing, as an experiment might include various schemas. Is the idea to set up an experiment to be sticky (or not), then instrument all the code and let the calls themselves decide whether the event needs to be sent? (I might be totally off here)

Some side notes:

Checking if window.navigator.sendBeacon is available

Whether sendBeacon is available has little to do with sampling: if it is not available, events will be logged the old-fashioned way (image beacon). I am not sure that code is needed at all; the only benefit is getting logging on page transitions for some IE users for whom the logging might not happen otherwise.

To be clear, the approach listed on WikimediaEvents is per schema.

The approach is per "experiment", right? You could group several schemas under an experiment, correct? An experiment is just a string identifier.

Could we please clarify what the overall goal is here, because I am not sure I understand.

Sure. In a few Reading Web projects we are changing event sampling from per page view to per user session.

We are duplicating code and logic for sampling by user session.

This task is about exploring a unifying API for logging events sampled by session so that we don't duplicate setup code and sampling logic across projects that much.

An experiment and a schema are not the same thing, as an experiment might include various schemas. Is the idea to set up an experiment to be sticky (or not), then instrument all the code and let the calls themselves decide whether the event needs to be sent? (I might be totally off here)

The experiment wording in the description and in previous comments is there because mw.experiments.getBucket takes a name, hence the mentions of an experiment name.

This task has nothing to do with A/B tests or experiments in the analytics sense. It is just about coming up with an API for logging events sampled by session.

Sorry for the confusion, I'll update the title and description.

Some side notes:

Checking if window.navigator.sendBeacon is available

Whether sendBeacon is available has little to do with sampling: if it is not available, events will be logged the old-fashioned way (image beacon). I am not sure that code is needed at all; the only benefit is getting logging on page transitions for some IE users for whom the logging might not happen otherwise.

I believe that is the reason. Since we would sample events by user session, I understand we want to get all events in that user session and not lose them, so sendBeacon helps with that.

Nothing against not doing this, we can filter events by browser too so if it is IE that loses events we can exclude it at analysis time if necessary.

To be clear, the approach listed on WikimediaEvents is per schema.

The approach is per "experiment", right? You could group several schemas under an experiment, correct? An experiment is just a string identifier.

No mention of experiments, just sampling events by page view or user session using the event buses.

Let me know if it makes sense.

Jhernandez renamed this task from Spike: Explore a unifying API for logging by session to Spike: Explore an API for logging events sampled by session. · Jun 28 2017, 9:33 AM
Jhernandez updated the task description.

I've clarified title and description, removed the experiment word from the description.


I think there is another option with the existing event.<Schema> event bus, by overloading the existing signature with arity 2 and having the second parameter be the logging options.

I'll flesh it out now.

Jhernandez updated the task description. · Jun 28 2017, 9:45 AM
Jhernandez updated the task description. · Jun 28 2017, 9:53 AM
Jdlrobson lowered the priority of this task from Normal to Low. · Jul 6 2017, 7:19 PM

@Nuria does the goal make sense now..?

Restricted Application added a project: Analytics. · Feb 14 2018, 6:55 PM

Does Wikimedia-Events have a project?

Jdlrobson moved this task from Later to Blocked on the Readers-Web-Backlog (Tracking) board.
mforns added a subscriber: mforns. · Feb 22 2018, 6:32 PM
fdans moved this task from Incoming to Radar on the Analytics board. · Feb 22 2018, 6:32 PM
Jhernandez raised the priority of this task from Low to Normal. · May 16 2018, 3:04 PM
Jhernandez renamed this task from Spike: Explore an API for logging events sampled by session to Explore an API for logging events sampled by session. · May 16 2018, 3:07 PM
Nuria added a comment. · Edited · May 17 2018, 6:25 PM

I think this might be of interest. I am not sure whether this code existed in EL when this ticket was created, but you can sample per session consistently like this:

// Consistent sampling within browser session
mw.eventLog.randomTokenMatch( num popSize, str token )

Link to source:

https://github.com/wikimedia/mediawiki-extensions-EventLogging/blob/ab184351ede28dfaf12e3ad367d53a556be69c9d/modules/ext.eventLogging.subscriber.js#L65
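
The linked source is authoritative. As a rough illustration only, token-based matching of this kind can be sketched as below, assuming the token is a hex string and that about one token in popSize should match. The function tokenMatch is a hypothetical stand-in and is not the EventLogging function itself.

```javascript
// Sketch of consistent token-based sampling: interpret the start of the
// token as a number and match when it is divisible by popSize. A given
// token always gets the same answer, so sampling is stable within a session.
function tokenMatch( popSize, token ) {
	// Assumption: the first 8 characters of the token are hex digits.
	var value = parseInt( token.slice( 0, 8 ), 16 );
	return value % popSize === 0;
}
```

For example, a popSize of 100 would put roughly 1% of uniformly distributed tokens in the sample.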

Thanks! interesting to see.

The helper seems to be there. The issue with this approach is that extensions then have to depend unconditionally on the EventLogging extension, which we were avoiding by just using mw.track. This makes every extension that needs to work with both MediaWiki and Wikimedia more complex and potentially less performant: the possible absence of the EventLogging extension has to be considered at every point where we want to use any of its helpers, and the helpers can't be depended upon unconditionally in extension.json or the extension would break for 3rd parties. As such, they need to be lazy loaded with mw.loader, potentially delaying the bootstrapping of code and making the feature less performant.

When we use mw.track, events get messaged to the bus by the extension, which can go about its business without worrying about other extensions, and then when trackSubscribe happens on the other side it gets all the events at once to act on. This makes for a great separation of concerns, especially for MediaWiki extensions that need to work regardless of whether the EventLogging or WikimediaEvents extension is present.
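
The decoupling described above can be sketched with a tiny queue-backed bus. This is a simplified stand-in written for illustration, not MediaWiki's implementation; createBus is a hypothetical name, and only the track and trackSubscribe method names mirror the mw functions.

```javascript
// Simplified stand-in for a track/trackSubscribe bus.
function createBus() {
	var queue = [];
	var subscribers = [];
	return {
		// Record the event and deliver it to any current subscribers.
		track: function ( topic, data ) {
			queue.push( { topic: topic, data: data } );
			subscribers.forEach( function ( sub ) {
				if ( topic.indexOf( sub.prefix ) === 0 ) {
					sub.callback( topic, data );
				}
			} );
		},
		// Replay matching events tracked earlier, then keep listening.
		trackSubscribe: function ( prefix, callback ) {
			queue.forEach( function ( entry ) {
				if ( entry.topic.indexOf( prefix ) === 0 ) {
					callback( entry.topic, entry.data );
				}
			} );
			subscribers.push( { prefix: prefix, callback: callback } );
		}
	};
}
```

In this sketch an event tracked before any subscriber exists is still delivered once a subscriber registers, which is the property that lets an extension log early events without depending on the logging code being loaded.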

Maybe we can upstream these kinds of helpers to core so that they can be relied on by extensions without having to depend on other extensions.

Nuria added a comment. · May 31 2018, 6:35 AM

EL will be simplified, and from that moment onwards there should not be any performance issues with depending on it: https://phabricator.wikimedia.org/T187207

Jdlrobson renamed this task from Explore an API for logging events sampled by session to [EPIC] Explore an API for logging events sampled by session. · Jun 12 2018, 4:27 PM
Jdlrobson added a project: Epic.

This feels like a good technical project relating to beta data. It would need some coordination and discussion among a bunch of teams, and it is not something we can commit to right now.

Jhernandez renamed this task from [EPIC] Explore an API for logging events sampled by session to Explore an API for logging events sampled by session. · Jul 6 2018, 12:49 PM