Go over analytics events and make sure they're doing what we think they're doing
Closed, Resolved | Public | 10 Estimated Story Points

Event Timeline

Charlotte renamed this task from "Go over analytics events and make sure they're doing what we think they're doing]" to "Go over analytics events and make sure they're doing what we think they're doing". Feb 13 2018, 7:39 PM

@Dbrant @Sharvaniharan @cooltey: I'm trying to figure out the different sampling configurations as part of this audit. Can one of you please review my guesses? Thanks!

/** 1% chance of search action being logged regardless of whether the user as a whole is tracked?
  * Or 1% chance of search action being logged ONLY IF the appInstallId was selected for logging?
  * Or 1% chance of search session being logged if the appInstallId was selected? */
public SearchFunnel(WikipediaApp app, SearchInvokeSource source) {
    super(app, SCHEMA_NAME, REVISION, Funnel.SAMPLE_LOG_100);
    this.source = source;
}

/* Funnel.SAMPLE_LOG_ALL sampling by default, right? Also channel info is missing :( Phab task incoming... */
public InstallReferrerFunnel(WikipediaApp app) {
    super(app, SCHEMA_NAME, REV_ID);
}

/* All 'On This Day' events logged IF the appInstallId was selected? */
public OnThisDayFunnel(WikipediaApp app, WikiSite wiki, int source) {
    super(app, SCHEMA_NAME, REV_ID, Funnel.SAMPLE_LOG_ALL, wiki);
    this.source = source;
}

/* If prod release then every page scroll summary event (one per page) has 1% chance of being logged? Otherwise all scroll summary events are logged? */
public PageScrollFunnel(WikipediaApp app, int pageId) {
    super(app, SCHEMA_NAME, REV_ID, ReleaseUtil.isProdRelease() ? Funnel.SAMPLE_LOG_100 : Funnel.SAMPLE_LOG_ALL);
    this.pageId = pageId;
}

To clarify, the sampling logic in the app works like this:

  • When the app is installed, a unique appInstallId (a random UUID) is generated and saved. This identifies the "user" across different schemas.
  • To decide whether a certain funnel's events are sent or not, we take the last digits of the appInstallId as a number. If that number equals zero modulo the funnel's sample rate (e.g. SAMPLE_LOG_100), the events from that funnel are sent. Otherwise the funnel is silent.

That's basically it; there's no other random selection at work. This has the following implications:

  • If the appInstallId = 0 mod 100, then all funnels with SAMPLE_LOG_100 will be enabled, as well as all funnels with SAMPLE_LOG_10.
  • If the appInstallId = 0 mod 10, then all funnels with SAMPLE_LOG_10 will be enabled, but not necessarily funnels with SAMPLE_LOG_100.
  • Funnels with SAMPLE_LOG_ALL are always enabled, since appInstallId mod 1 is always 0.
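
For anyone following along, the check described above could be sketched roughly as below. This is only an illustration, not the app's actual code: the class and method names are hypothetical, the constant values (1, 10, 100) are inferred from the constant names and the "mod 1" remark, and the choice of the last four hexadecimal characters as "the last digits" is an assumption.

```java
import java.util.UUID;

/** Hypothetical sketch of the per-install sampling check; not the app's real implementation. */
final class SamplingSketch {
    // Sample rates expressed as divisors. Values are assumed from the constant names.
    static final int SAMPLE_LOG_ALL = 1;
    static final int SAMPLE_LOG_10 = 10;
    static final int SAMPLE_LOG_100 = 100;

    // Assumption: "the last digits" means the final four hex characters of the UUID string.
    private static final int ID_SUFFIX_LENGTH = 4;

    /** Interprets the tail of the install ID as a non-negative integer. */
    static int samplingValue(String appInstallId) {
        String suffix = appInstallId.substring(appInstallId.length() - ID_SUFFIX_LENGTH);
        return Integer.parseInt(suffix, 16);
    }

    /** A funnel is enabled for this install when the value is 0 modulo the sample rate. */
    static boolean isFunnelEnabled(String appInstallId, int sampleRate) {
        return samplingValue(appInstallId) % sampleRate == 0;
    }

    public static void main(String[] args) {
        // Generated once per install in the real app; generated here just for the demo.
        String appInstallId = UUID.randomUUID().toString();
        System.out.println("appInstallId:   " + appInstallId);
        System.out.println("SAMPLE_LOG_ALL: " + isFunnelEnabled(appInstallId, SAMPLE_LOG_ALL));
        System.out.println("SAMPLE_LOG_10:  " + isFunnelEnabled(appInstallId, SAMPLE_LOG_10));
        System.out.println("SAMPLE_LOG_100: " + isFunnelEnabled(appInstallId, SAMPLE_LOG_100));
    }
}
```

Because the decision depends only on the stored ID, a given install gets the same answer for every event and every session, which is why the sampling is per user rather than per event.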

Therefore, the interpretation of your comments would be:

/** 1% chance of search action being logged regardless of whether the user as a whole is tracked?
  * Or 1% chance of search action being logged ONLY IF the appInstallId was selected for logging?
  * Or 1% chance of search session being logged if the appInstallId was selected? */
public SearchFunnel(WikipediaApp app, SearchInvokeSource source) {
    super(app, SCHEMA_NAME, REVISION, Funnel.SAMPLE_LOG_100);
    this.source = source;
}

There's no specific concept of "the user as a whole is tracked". The parameter of SAMPLE_LOG_100 is precisely what decides whether this funnel's events will be sent for the current user.

/* Funnel.SAMPLE_LOG_ALL sampling by default, right? Also channel info is missing :( Phab task incoming... */
public InstallReferrerFunnel(WikipediaApp app) {
    super(app, SCHEMA_NAME, REV_ID);
}

Correct; all of these events are sent for all users. (Will reply in separate task regarding channel info.)

/* All 'On This Day' events logged IF the appInstallId was selected? */
public OnThisDayFunnel(WikipediaApp app, WikiSite wiki, int source) {
    super(app, SCHEMA_NAME, REV_ID, Funnel.SAMPLE_LOG_ALL, wiki);
    this.source = source;
}

Similar to the above, all of this funnel's events are sent for all users.

/* If prod release then every page scroll summary event (one per page) has 1% chance of being logged? Otherwise all scroll summary events are logged? */
public PageScrollFunnel(WikipediaApp app, int pageId) {
    super(app, SCHEMA_NAME, REV_ID, ReleaseUtil.isProdRelease() ? Funnel.SAMPLE_LOG_100 : Funnel.SAMPLE_LOG_ALL);
    this.pageId = pageId;
}

That is correct; this funnel is unsampled in non-production builds, but switches to 1:100 for production.

Thank you for the clarification, @Dbrant! So in summary, if:

  • N is the number of users who have "send usage reports" turned on
  • we assume a uniform distribution of the number formed by the last digits of appInstallId
  • F is a funnel

…then:

  • In case of SAMPLE_LOG_ALL, F is activated for all of those N users
  • In case of SAMPLE_LOG_10, F is activated for approx. 10% of those N users
  • In case of SAMPLE_LOG_100, F is activated for approx. 1% of those N users

Correct?
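
As a quick sanity check on the "approx." figures, the expected share can be counted directly. The sketch below is hypothetical and assumes the sampling value is the last four hex characters of appInstallId (so 65,536 equally likely values) and that the constants are 1, 10, and 100; neither detail is confirmed in this thread.

```java
/** Hypothetical check of the expected share of installs per sample rate. */
final class SampleShareSketch {
    // Assumption: four hex characters, i.e. values 0x0000 through 0xFFFF.
    static final int VALUE_COUNT = 0x10000;

    /** Counts how many possible sampling values are 0 modulo the given sample rate. */
    static int enabledCount(int sampleRate) {
        int count = 0;
        for (int value = 0; value < VALUE_COUNT; value++) {
            if (value % sampleRate == 0) {
                count++;
            }
        }
        return count;
    }

    /** Share of installs (0.0 to 1.0) for which a funnel with this rate is active. */
    static double enabledShare(int sampleRate) {
        return (double) enabledCount(sampleRate) / VALUE_COUNT;
    }

    public static void main(String[] args) {
        for (int rate : new int[] {1, 10, 100}) {
            System.out.printf("rate 1:%d -> %d of %d values (%.3f%%)%n",
                    rate, enabledCount(rate), VALUE_COUNT, 100.0 * enabledShare(rate));
        }
    }
}
```

Under those assumptions the shares land very close to 100%, 10%, and 1%. They are not exact because 65,536 is not a multiple of 10 or 100.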

mpopov triaged this task as High priority.
mpopov set the point value for this task to 10.
mpopov updated the task description. (Show Details)
mpopov moved this task from Backlog to Needs review on the Discovery-Analysis (Current work) board.

@Charlotte: I'm going to share the spreadsheet with you and Dmitry. Once you review, can you please comment on this ticket so we know whether to move it into our Done column? ta!

/* Funnel.SAMPLE_LOG_ALL sampling by default, right? Also channel info is missing :( Phab task incoming... */
public InstallReferrerFunnel(WikipediaApp app) {
    super(app, SCHEMA_NAME, REV_ID);
}

Correct; all of these events are sent for all users. (Will reply in separate task regarding channel info.)

@Dbrant: By the way, that task is T188146, and now I'm wondering whether it should be included in the hot pepper sprint so we can start recording that info sooner rather than later. It's not high priority, though.

Great work! Are we going to document findings about each schema on that schema's documentation page (or associated talk page) too?

Thank you! And sort of! The findings that are actual problems (and maybe other things) will be turned into Phab tasks and I'm thinking that everything else (things like the offline library events having packList="") would go on the talk pages.