
Define cross-schema event stitching approach
Open, Needs Triage · Public

Description

THIS TASK DEFINITION IS A WORK IN PROGRESS

Problem definition

Better Use of Data program output 3.1 requirements analysis identified the following needs:

  • Complex workflows that contain multiple events should be instrumented in such a way that the multiple events can be gracefully stitched together to re-compose the sequence of events undertaken by the user.
  • Events from otherwise unrelated features can be stitched together via common identifiers. For instrumentations that use sampling, this requires the ability to apply the same sampling to multiple events.

This task will be used to collaboratively define the mechanics for meeting these needs.

Notes

  • This topic focuses on the Wikimedia production content projects. It is not inclusive of donate.wikimedia.org and donation banners.
  • This is in draft, and so the wording is likely to change.
  • Generally, logged data fields should be constructed in the least identifying way possible. For example, if it is sufficient to use the edit bucket count of a user in order to perform longitudinal analysis on a user cohort instead of using user IDs, the edit bucket count should be used. As another example, if it is sufficient to use the namespace for articles instead of actual article titles, the namespace for articles should be used.

Use Cases

Broadly speaking, there are five cases of event logging. Here are the five cases and their correlation approaches.

1. Users who have opted into data collection explicitly. For such users, correlation may be done with a fixed identifier, and potentially all event logging may be done without sampling. In the apps this is usually an app installation ID. On the web this would most likely be a similar value stored in a localStorage variable (n.b., this is not presently being entertained on the web, except perhaps for narrowly scoped longitudinal analysis of a small random sample of users). Data should be handled to comply with the data retention guidelines for data persisting beyond 90 days.

2. Users who are opted out of data collection explicitly. Users on the web indicate this via Do Not Track. Users on the apps are opted out of data collection by default. For such users, event logging should not take place and, therefore, besides standard tracing behavior in web logs or MediaWiki database inserts and the collection in #5 below, correlation would typically be out of scope. (N.B., Virtual Pageviews and other intentful impressions may use the same transport used for event collection.)

3. Users who have neither opted in nor opted out explicitly - the largest base of users. For new event logging schemas, event logging on a per-session basis would be available via a boolean flag at the systemwide specified default. Recommendation: 1 in 10,000 sessions on the web. That is, there is a 1 in 10,000 chance that, for all event logging using the boolean flag, events in a given session will be captured. Higher sampling rates on a per-wiki or per-feature basis may be established in consultation with Privacy on a case-by-case basis when needed for a specific purpose (e.g., to address issues with data sparsity).

4. Users who log in or make contributions may have two classes of correlation applied:

4.1 In-feature unsampled contribution and persisted user data behavior. Contribution feature and persisted user data behavior may be tracked in an unsampled fashion and may include (although does not require) user IDs (or, upon contribution from anonymous access, masked IP addresses). As a side effect, such behavior may be stitched with events in #3.

4.2. Out-of-feature behavior. Such behavior may, on a case-by-case basis and in consultation with Privacy, be captured unsampled. Such behavior must be clearly scoped and should be captured in a way that proactively avoids easy linking of identity to consumption habits.

5. Security, privacy, and error events. Such events may be collected in an unsampled fashion on Wikimedia infrastructure and correlated with any other collected events.
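The opt-out in case #2 can be checked on the web via Do Not Track. A minimal sketch, assuming the standard navigator.doNotTrack property plus the legacy fallbacks some older browsers used (the function name is invented for illustration):

```javascript
// Sketch: respect the Do Not Track opt-out before any event logging.
// navigator.doNotTrack is '1' when the user has opted out; some older
// browsers exposed window.doNotTrack or navigator.msDoNotTrack instead.
function isDoNotTrackEnabled( nav, win ) {
	var dnt = nav.doNotTrack || win.doNotTrack || nav.msDoNotTrack;
	return dnt === '1' || dnt === 'yes';
}

// Callers would bail out early, so no event (and no identifier) is sent:
// if ( isDoNotTrackEnabled( navigator, window ) ) { return; }
```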

Technical Specifics

TODO: define session identifier and any hashing technique
TODO: define default and by-enrollment correlation configuration scheme

Related

T201124: Provide standard/reproducible way to access a PageToken
T199898: EventLogging sanitization
T201409: Harmonise the identification of requests across our stack
Data retention guidelines
Privacy policy
Instrumentation DACI

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald Transcript · Sep 26 2018, 6:53 PM
Ottomata added a subscriber: Nuria. · Oct 10 2018, 4:14 PM
dr0ptp4kt updated the task description. (Show Details) · Oct 10 2018, 4:53 PM
dr0ptp4kt updated the task description. (Show Details) · Oct 10 2018, 4:57 PM
dr0ptp4kt updated the task description. (Show Details)
phuedx added a subscriber: phuedx. · Oct 10 2018, 5:12 PM
Nuria added a comment. · Oct 10 2018, 6:01 PM

I am not clear on whether the "Define cross-schema event correlation approach" means cross-schema data is to remain after 90 days; if so, how is that in agreement with the Data retention guidelines? (It seems like it could not possibly be.) And if we want to cross-relate schemas for just 90 days, do we really need anything beyond a session ID?

dr0ptp4kt updated the task description. (Show Details) · Oct 10 2018, 6:04 PM
dr0ptp4kt updated the task description. (Show Details)
dr0ptp4kt updated the task description. (Show Details)

I am not clear on whether the "Define cross-schema event correlation approach" means cross-schema data is to remain after 90 days; if so, how is that in agreement with the Data retention guidelines? (It seems like it could not possibly be.) And if we want to cross-relate schemas for just 90 days, do we really need anything beyond a session ID?

My understanding has been that this task is largely separate from the question of which of the resulting data can be kept beyond 90 days. I would expect we will receive guidance from the Legal team (or, in the future, Privacy) regarding this question, and that this guidance would depend on the specific data being logged (e.g., whether it contains page names or not). @dr0ptp4kt, can you clarify the scope?

Nuria added a comment. · Oct 10 2018, 8:22 PM

the Legal team (or in the future, Privacy)

Not sure what Privacy refers to here, can you clarify a bit?

the Legal team (or in the future, Privacy)

Not sure what Privacy refers to here, can you clarify a bit?

This is a term from the instrumentation DACI, it's perhaps useful to get familiar with that first (and then work on any necessary clarifications there).

The understanding from @Tbayer is correct about this task being separate from the question on retention beyond 90 days.

@Nuria I believe the session ID is the right token for correlation in general.

dr0ptp4kt updated the task description. (Show Details) · Oct 11 2018, 10:41 AM
dr0ptp4kt updated the task description. (Show Details) · Oct 11 2018, 10:45 AM
dr0ptp4kt updated the task description. (Show Details) · Oct 11 2018, 10:49 AM
dr0ptp4kt updated the task description. (Show Details) · Oct 12 2018, 5:00 PM
dr0ptp4kt updated the task description. (Show Details) · Oct 12 2018, 5:03 PM
dr0ptp4kt added subscribers: Gilles, Krinkle. · Edited · Oct 12 2018, 5:55 PM

I'm interested in filling out the TODOs here.

TODO: define session identifier and any hashing technique
TODO: define default and by-enrollment correlation configuration scheme

First, my impression is that mw.user.sessionId() as the basis for determining whether to send events in a session for schemas using this approach (presumably new schemas by default) would be fine. However, does the raw value of mw.user.sessionId() sufficiently avoid collisions for longitudinal analysis or should it be salted (and possibly hashed) with something before being included in logged events? @Tbayer @Krinkle @Gilles I believe you were discussing this in one of the related tickets.

Second, @Ottomata @Jhernandez @phuedx any thoughts on the correlation configuration scheme? In an ideal world, I would hope that client implementers could fire and forget and have their events end up in session graphs for new schemas (implication: client JS, server PHP, and potentially iOS Swift & Android Java/Kotlin libraries - if they were using session sampling - will need to port code). So I think there would be three cases:

  1. Default: It just works. The value of mw.user.sessionId() is sent in the event.
  2. Not using session sampling: set optional boolean of isSessionSampled = false. The value of mw.user.sessionId() is not sent in the event.
  3. Using different level of session sampling: set optional double precision number of customSessionSamplingRateForFeature = (0.00,1.00]. This would add a non-null customFeatureSessionSamplingRateForFeature field to the event itself.

Third, I believe #1 and #2 from the previous point are straightforward in most respects. #3 is perhaps more complex, at least if it's important that other events in the session for a session that's considered in scope (i.e., those from #1) intersect with event types for which the customSessionSamplingRate is included. Any thoughts on how to make this work? I think in general it would be best that clients are encouraged to simply use #1 or #2, but can't rule out the possibility of the need for #3 (indeed it would seem most useful for A/B testing and even simple analysis when things are first rolled out to small wikis).

dr0ptp4kt updated the task description. (Show Details) · Oct 12 2018, 5:58 PM

(BTW, not sure if you've seen T205319: Modern Event Platform: Stream Configuration, but I think the use cases there overlap. We won't be working on that task for a while, but we do want to do it.)

  1. Default: It just works. The value of mw.user.sessionId() is sent in the event.
  2. Not using session sampling: set optional boolean of isSessionSampled = false. The value of mw.user.sessionId() is not sent in the event.

Could we invert this? I think it would be better to not collect the sessionId unless the event designer explicitly wants it.

This would add a non-null customFeatureSessionSamplingRateForFeature field to the event itself.

This is a good idea, and something we've talked about before in T205319.

In T205319 folks (Leila, if I remember) really wanted stronger control on how events are sampled, not just on a hashed sessionId. I don't fully understand this use case, but it had been mentioned.

The client side can do whatever it needs to do. Hopefully a Stream Configuration Service will make it easy to set things like sampling rate, sampling field, salt, etc. for a given stream. For now this stuff would continue to be put in some deployed MediaWiki config file.

(BTW, not sure if you've seen T205319: Modern Event Platform: Stream Configuration, but I think the use cases there overlap. We won't be working on that task for a while, but we do want to do it.)

  1. Default: It just works. The value of mw.user.sessionId() is sent in the event.
  2. Not using session sampling: set optional boolean of isSessionSampled = false. The value of mw.user.sessionId() is not sent in the event.

Could we invert this? I think it would be better to not collect the sessionId unless the event designer explicitly wants it.

💯


I believe that this too was mentioned in the context of T205319: Modern Event Platform: Stream Configuration, but there may be instances where you have to exclude users from the instrumentation outright prior to determining whether they're in the sample. E.g., in both the Page Previews and ReadingDepth A/B tests we didn't instrument client sessions in UAs that didn't support the Beacon API (i.e. navigator.sendBeacon !== undefined). Whatever solution we land on will need to be flexible enough to account for this.
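The exclusion-before-sampling described above can be sketched as a simple capability check (the function name is invented for illustration):

```javascript
// Sketch: exclude unsupported UAs outright before the sampling decision,
// as was done for the Page Previews and ReadingDepth A/B tests. UAs
// without the Beacon API never enter the sample at all.
function isEligibleForInstrumentation( nav ) {
	return typeof nav.sendBeacon === 'function';
}
```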

dr0ptp4kt updated the task description. (Show Details) · Oct 18 2018, 3:53 AM

@Ottomata I agree with @phuedx on your question that opt-in (eventually via SCS) makes sense. After all, for feature teams or feature clusters where session sampling as the norm would be wanted, they could follow some convention of their own to make it simple.

I've updated the verbiage in the Description accordingly. Following is a revision of my comment. Now, how do you think point 3 should work?

  1. Default: The value of mw.user.sessionId() is not sent in the event.
  2. Using session sampling: set optional boolean of isSessionSampled = true. The value of mw.user.sessionId() is sent in the event.
  3. Using different level of session sampling: set optional double precision number of customSessionSamplingRateForFeature = (0.00,1.00]. This would add a non-null customFeatureSessionSamplingRateForFeature field to the event itself.

I guess the specific implementation of how to make events in (3) able to be correlated with (2) is an exercise for later. It's sufficient to say that sessions included in (2) must also be included in (3), although the reverse isn't true... and that the rate in (3) must be greater than the systemwide default sampling rate.
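A hypothetical wrapper honoring the three cases above might look like the following (the wrapper name, event shape, and the rate check are assumptions; one spelling of the custom-rate field is used throughout):

```javascript
// Hypothetical sketch of the revised scheme: the session ID is omitted by
// default and only attached when the schema opts in to session sampling.
var DEFAULT_SAMPLING_RATE = 1 / 10000; // systemwide default from this task

function buildEvent( data, options, sessionId ) {
	var event = Object.assign( {}, data );
	options = options || {};
	if ( options.isSessionSampled ) {
		// Cases 2 and 3: session sampling in use, so the session ID is sent.
		event.sessionId = sessionId;
		if ( options.customSessionSamplingRateForFeature !== undefined ) {
			// Case 3: record the custom rate in the event itself, and
			// require it to exceed the systemwide default (per the comment).
			if ( options.customSessionSamplingRateForFeature <= DEFAULT_SAMPLING_RATE ) {
				throw new Error( 'Custom rate must exceed the default rate' );
			}
			event.customSessionSamplingRateForFeature =
				options.customSessionSamplingRateForFeature;
		}
	}
	return event;
}
```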

This said, @leila would you be able to explain more about the sampling rate mechanics you had in mind?

@phuedx do you think it might be sensible to simply make sendBeacon a pre-requisite at this point for client side event logging?

I'm not referring to recording a virtual pageview, like a preview, with EL as the transport, which I gather might need to make an XHR, but rather to the other logging.

dr0ptp4kt added a comment. · Edited · Oct 21 2018, 11:23 PM

@Tbayer @Neil_P._Quinn_WMF @chelsyx @mpopov @nettrom_WMF curious about your thinking here for session overlap between events that are sent at the global (perhaps per-project, if we need that) default and those that are oversampled for the sessions.

Is it sufficient for the session ID of oversampled events to merely also be eligible for default session sampling? I was thinking you *wouldn't* want oversampled events to only incidentally overlap with global default sampled events' session IDs as it would probably create too sparse of data for joins.

Also, should we just always be sending the session sampling rate along, even for events that land because of being in the default global sample? It dawned on me we might change the thresholds at some point, and it's usually best to have the explicit values instead of having to work backward to figure out what the sampling rate was before such a change.

Are you aware of good hashing and sampling routines to facilitate these sorts of sampling techniques that we should consider?

Also, should we just always be sending the session sampling rate along, even for events that land because of being on the default global sampling?

yes! :)

leila added a comment. · Oct 23 2018, 5:00 PM

This said, @leila would you be able to explain more about the sampling rate mechanics you had in mind?

Flexibility over sampling mechanism is key for research. This means we would like to see support for sampling based on (and not limited to) one or more of the following:

  • request
  • session
  • unique device
  • country
  • some editor characteristics such as edit count, last article edited, ...
  • some combinations of the above

I want to emphasize that I understand this is a long and idealistic list. :) There are always solutions if we don't have the above, but they can be very resource-intensive (both on our end and, in the cases where we interact with users, on the user end). For example, if we can't sample by country and we're interested in running quicksurveys in one specific region, we won't have a choice but to collect data from everyone and discard the parts that we don't need, which is not something we should be doing.

Nuria added a comment. · Edited · Oct 23 2018, 5:17 PM

I want to emphasize that I understand this is a long and idealistic list. :)

nah, most of these are possible right now as long as you equate device with browser session, which more often than not is the case.

Actually it is possible to sample right now per request and pageview.

Sampling by country (in other than fresh hits where geo cookies are not yet defined) should also be possible (we just did something similar to oversample some countries for perf when we rolled out the Singapore datacenter), and sampling by editor activity is also doable.

The only hard one on your list is last_page_edited, as that byte of info is probably not available to the client (not super sure about this one).

dr0ptp4kt updated the task description. (Show Details) · Oct 30 2018, 11:44 AM

@leila to clarify, which of the following do you desire?

  1. Have the capability to (a) activate sampling on those dimensions independent of any A/B testing, and (b) correlate such events
  2. Have an A/B testing suite that makes it possible to activate interventions on one or more of those dimensions (multivariate testing)

Would it be possible to list several use cases?

While on the one hand the proposal here suggests a more specific consult once correlation tips past the 1 in 10,000 rate, it's worth considering how to engineer the underlying solution to support potential use cases with greater ease. As @Nuria notes some cases are trivially done, although of course we want to formalize more of the trivial, as well as non-trivial, techniques.

leila added a comment. · Nov 1 2018, 11:11 PM

I want to emphasize that I understand this is a long and idealistic list. :)

nah, most of these are possible right now as long as you equate device with browser session, which more often than not is the case.

the unique device part is important since for quite a few applications we need to check back on the user a few days/weeks before or after the event of interest.

Sampling by country (in other than fresh hits where geo cookies are not defined) should also be possible (we just did something similar to oversample some countries for perf when we rolled out Singapore datacenter) and sampling by editor activity is also doable.

what do you mean by fresh hits?

The only hard one on your list is last_page_edited as that byte of info is probably not available to the client (not super sure about this one).

is unique device doable? and, is mixing the ones already available straightforward? For example, requests from country_x that have session length more than y?

leila added a comment. · Edited · Nov 1 2018, 11:20 PM

@leila to clarify, which of the following do you desire?

  1. Have the capability to (a) activate sampling on those dimensions independent of any A/B testing, and (b) correlate such events
  2. Have an A/B testing suite that makes it possible to activate interventions on one or more of those dimensions (multivariate testing)

both (except that I'm not sure why (b) is important; the output of (a) can be analyzed in many ways, including checking for correlation).

Would it be possible to list several use cases?

  • For unique device:
    • (an example that can include both scenarios you mentioned above) Any kind of experiment or data collection that requires asking the same unique device multiple questions across a period of time. For example, when we want to learn about how users "learn" on Wikipedia, we need to be able to interfere with their experience on Wikipedia in multiple stages of their interaction and ask them questions. Not being able to say which unique device has answered the first batch of questions is a blocker for this line of research.
    • (an example of the first scenario) For Why We Read Wikipedia we rely on a notion of unique device because we need to include features in the model which say how often the user comes to Wikipedia. At the moment, we go through a lengthy, computationally intensive (and not very accurate) way of building sessions, defining a notion of unique device, and then analyzing that unique device across weeks. The better solution would be to be able to sample by unique device and keep that information throughout the analysis.

The examples above are both kind (a).

While on the one hand the proposal here suggests a more specific consult once correlation tips past the 1 in 10,000 rate, it's worth considering how to engineer the underlying solution to support potential use cases with greater ease. As @Nuria notes some cases are trivially done, although of course we want to formalize more of the trivial, as well as non-trivial, techniques.

I'm sorry, I'm missing this part. What do you mean by correlation rate?

Nuria added a comment. · Edited · Nov 2 2018, 12:04 AM

The use cases here are of a different nature: @leila is asking for session information to be kept with reader logs (webrequest data) so she does not have to compute "signatures" as part of her research to estimate sessions. I do not think this is going to happen in the near term, as it equals adding long-lived session tokens to all our pageview data, which we decided not to do some years back; the details of why the community (and much of our staff) feels that is too intrusive privacy-wise are thoroughly documented elsewhere. The issue of long-lived sessions does not have much in common with sampling.

The question of whether you can sample events per session with stickiness is a different one, and the answer to that is yes: you can do that as of today deterministically and decide that event 1 and event 2 are always going to be sampled for session "25". Session here means "identifier assigned to your browser until you close it down". This identifier is sent in EventLogging events but it is not sent in general requests. It will be reset when you restart your browser.

As I said, for any session that has a GeoIP cookie set you CAN sample by geography; it is doable. There are just no wrapper methods in EventLogging that make this use case easier.
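Such a wrapper could be sketched like this (a hypothetical sketch, assuming the GeoIP cookie's first field is a country code, e.g. GeoIP=US:...; as noted above, no such helper exists in EventLogging today, and fresh hits without the cookie cannot be sampled this way):

```javascript
// Sketch: country-based sampling from a GeoIP cookie.
function getCountryFromCookie( cookieString ) {
	// Capture the first colon-delimited field of the GeoIP cookie value.
	var match = /(?:^|;\s*)GeoIP=([^:;]*)/.exec( cookieString || '' );
	return match ? match[ 1 ] : null;
}

function isInCountrySample( cookieString, countries ) {
	var country = getCountryFromCookie( cookieString );
	return country !== null && countries.indexOf( country ) !== -1;
}
```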

kzimmerman renamed this task from Define cross-schema event correlation approach to Define cross-schema event stitching approach. · Nov 2 2018, 6:17 PM
kzimmerman updated the task description. (Show Details)

I'm interested in filling out the TODOs here.
TODO: define session identifier and any hashing technique
TODO: define default and by-enrollment correlation configuration scheme
First, my impression is that mw.user.sessionId() as the basis for determining whether to send events in a session for schemas using this approach (presumably new schemas by default) would be fine. However, does the raw value of mw.user.sessionId() sufficiently avoid collisions for longitudinal analysis or should it be salted (and possibly hashed) with something before being included in logged events? @Tbayer @Krinkle @Gilles I believe you were discussing this in one of the related tickets.

Yes, we discussed collision avoidance as part of T201124 and increased the length of mw.user.sessionId() to a value that should be safe for all foreseeable scenarios (see in particular T201124#4521002). I'm not quite sure what salting and hashing has to do with that though.

Nuria added a comment. · Nov 2 2018, 8:21 PM

As @Tbayer mentioned, mw.user.sessionId() is not at risk of colliding across schemas. The session token has 80 bits of entropy (2^80 possible values). Please see: https://github.com/wikimedia/mediawiki/blob/master/resources/src/mediawiki.user.js#L39
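For reference, an 80-bit token is 20 hexadecimal characters (4 bits each). A minimal sketch of generating one (mediawiki.user.js builds its token from random hex digits in a similar spirit, though not with this exact code):

```javascript
// Sketch: a 20-hex-character token carries 80 bits, i.e. 2^80 possible
// values, which makes cross-schema collisions negligible in practice.
function generateSessionToken() {
	var hex = '0123456789abcdef';
	var token = '';
	for ( var i = 0; i < 20; i++ ) {
		token += hex.charAt( Math.floor( Math.random() * 16 ) );
	}
	return token;
}
```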

phuedx added a comment. · Nov 5 2018, 9:26 AM

@phuedx do you think it might be sensible to simply make sendBeacon a pre-requisite at this point for client side event logging?

Looking at the data from Can I use… and our browser compatibility matrices, this would exclude the following UA versions from all of Wikimedia's instrumentation efforts moving forward (percentage of requests received shown in brackets, which are taken from our Simple Request Breakdowns Dashiki).

@leila @Nuria @Tbayer @phuedx thanks. I'll try to respond in order.

  • For unique device:
    • (an example that can include both scenarios you mentioned above) Any kind of experiment or data collection that requires asking the same unique device multiple questions across a period of time. For example, when we want to learn about how users "learn" on Wikipedia, we need to be able to interfere with their experience on Wikipedia in multiple stages of their interaction and ask them questions. Not being able to say which unique device has answered the first batch of questions is a blocker for this line of research.

@leila, following up on our brief discussion, it sounds like this might be a special case of A/B testing with LocalStorage (cf. T132604: Prepare HoverCards for A/B test on smaller wikipedia project) for longitudinal analysis. Depending on whether aggregate analysis is sufficient, it seems this could be (3) (aggregate analysis of cohorts is sufficient) or (1). Technically the A/B testing work is a different part of the Better Use of Data program, but I think we'll need to discuss this distinctive use case.

  • (an example of the first scenario) For Why We Read Wikipedia we rely on a notion of unique device because we need to include features in the model which say how often the user comes to Wikipedia. At the moment, we go through a lengthy and computationally intensive (and not very accurate way) of building sessions, defining a notion of unique device, and then analyzing that unique device across weeks. The better solution would be to be able to sample by unique device and keep that information throughout the analysis.

Provided the privacy statement is clear, it seems like upon survey there could be a notion of oversampling for one or more schemas. For example, it might be sufficient to oversample for the given user at https://meta.wikimedia.org/wiki/Schema:ReadingDepth. This might be a case of (1) in the Description depending on how it's done.

I'm sorry, I'm missing this part. What do you mean by correlation rate?

We've since renamed this to "stitching" instead of "correlation". The thing I was trying to get at was we might have some global default (with potential per-wiki overrides) at which stitching together events is normative provided the team enrolls the schema for such stitching.

Yes, we discussed collision avoidance as part of T201124 and increased the length of mw.user.sessionId() to a value that should be safe for all foreseeable scenarios (see in particular T201124#4521002). I'm not quite sure what salting and hashing has to do with that though.

@Tbayer thanks again. The salting and hashing had to do with the notion of something in the app server layer to further reduce the chance of a collision. But given that we're not concerned about collisions, it seems moot.

Thanks @phuedx. I'll leave this for @kzimmerman to review in the future, but it seems like this might still be worth consideration for determining event sampling.

The question of whether you can sample events per session with stickiness is a different one, and the answer to that is yes, you can do that as of today deterministically and decide that event 1 and event2 are always going to be sampled for session "25". Session here means " identifier assigned to your browser until you close it down" . This identifier is sent in eventlogging events but it is not sent in general requests. It will be reset when you re-start your browser.

@Nuria any guidance here on how this is done most optimally in practice if the sampling rates are actually different for event type 1 and event type 2 and we want to be able to stitch them together?

Nuria added a comment. · Nov 6 2018, 10:10 PM

any guidance here on how this is done most optimally in practice if the sampling rates are actually different for event type 1 and event type 2 and we want to be able to stitch them together?

I might have not understood the question, as this is already a solved problem for EventLogging data (data is retained and cross-linked across schemas for 90 days, not forever).

@dr0ptp4kt given a sessionId and a sampling rate, whether or not you sample that session is not probabilistic, it is deterministic. Thus if event 1 is sampled 1/10 per session (every session that is a multiple of 10 is sampled) and event 2 is sampled 1/100 per session (every session that is a multiple of 100 is sampled), every sessionId sampled for event 2 is de facto already sampled for event 1. Makes sense?

https://github.com/wikimedia/mediawiki-extensions-EventLogging/blob/master/modules/ext.eventLogging.subscriber/subscriber.js#L88
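The deterministic check can be sketched as follows (simplified from the subscriber.js logic linked above, which parses a hex prefix of the token and applies modulo; the function name here paraphrases that code):

```javascript
// Simplified sketch: the first 8 hex characters of the session token are
// parsed as an integer and tested with modulo, so inclusion is a pure
// function of (populationSize, token) -- no randomness at decision time.
function sessionInSample( populationSize, sessionToken ) {
	var value = parseInt( sessionToken.slice( 0, 8 ), 16 );
	return value % populationSize === 0;
}

// Because 100 is a multiple of 10, every session sampled at 1/100 is
// necessarily also sampled at 1/10, as described in the comment above.
```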

dr0ptp4kt added a comment. · Edited · Nov 7 2018, 12:24 PM

@Nuria thanks. You understood the question well. Okay, so my read of sessionInSample and randomTokenMatch is that the populationSize values between different schemas would need to share a common base value so that they divide cleanly in order to guarantee intersection, as populationSize is the divisor in a modulo calculation. Do I have that right?

I'm thinking a utility method that constrains the allowed values for populationSize is one way to ensure programmers don't unintentionally have mismatches in intersection defying their expectations. As a thought experiment, imagine someone session-samples at 1/107 for one type of event and at 1/10 for another kind of event and expects perfect intersection... but only sees incidental intersection for sessions that satisfy a modulo 1070 === 0 relationship.
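Such a utility method might look like this (entirely hypothetical; it enforces that the custom populationSize divides the default cleanly, so that the two modulo-based samples nest instead of intersecting only incidentally):

```javascript
// Hypothetical guard: with modulo sampling (value % N === 0), the sample
// taken at N2 is a superset of the sample taken at N1 only when N2 divides
// N1. Requiring clean division up front prevents the 1/107-vs-1/10
// surprise described above, where the samples only overlap at values
// divisible by 1070.
function assertNestedPopulationSize( defaultPopulationSize, customPopulationSize ) {
	if ( defaultPopulationSize % customPopulationSize !== 0 ) {
		throw new Error(
			'populationSize ' + customPopulationSize +
			' does not divide the default ' + defaultPopulationSize
		);
	}
	return customPopulationSize;
}
```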

dr0ptp4kt updated the task description. (Show Details) · Nov 15 2018, 2:37 AM

If we include the sampling settings in the event data itself, then it will at least be unsurprising during analysis if the sampled events intersection doesn't match up. This is easy if our sampling setting were simple ratios (e.g. sample_rate: 0.1), but we might need a more complex sample_settings object if we eventually want to support sampling by different values (not just session).
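A richer sample_settings object might look like the following sketch (the shape and field names are entirely hypothetical, invented here for illustration):

```javascript
// Hypothetical sample_settings object carried in each event, so analysts
// can always recover how the event was sampled even after rates change.
var sampleSettings = {
	rate: 0.01,         // 1 in 100
	unit: 'session',    // what the rate applies to: 'session', 'pageview', ...
	field: 'sessionId', // the event field the deterministic check hashes over
	salt: ''            // optional salt mixed into the hash, if any
};

function annotateEvent( event, settings ) {
	// Attach the settings so the event is self-describing.
	return Object.assign( {}, event, { sample_settings: settings } );
}
```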

Nuria added a comment. · Dec 12 2018, 6:45 PM

we eventually want to support sampling by different values (not just session).

Clarifying that at this time EventLogging code supports sampling per "page" and per "session". The examples provided above work in both scenarios; MediaWiki creates a page token (unique per pageview) and a session token (unique per device, alive until the user closes the browser window).