THIS TASK DEFINITION IS A WORK IN PROGRESS
Problem definition
Better Use of Data program output 3.1 requirements analysis identified the following needs:
- Complex workflows that contain multiple events should be instrumented in such a way that the multiple events can be gracefully stitched together to re-compose the sequence of events undertaken by the user.
- Events from otherwise unrelated features can be stitched together via common identifiers. For instrumentations that use sampling, this requires the ability to apply the same sampling to multiple events.
This task will be used to collaboratively define the mechanics for meeting these needs.
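As a rough illustration of the first need, the sketch below groups events by a shared identifier and orders each group to recover the sequence of steps a user took. The field names (`session_id`, `sequence`, `action`) are hypothetical and are not taken from any existing schema; this is one possible shape, not a proposed design.

```python
from collections import defaultdict


def stitch_events(events):
    """Group events by a shared identifier and order each group.

    Assumes every event dict carries a hypothetical `session_id` (the common
    identifier) and a hypothetical `sequence` counter set by the client.
    """
    by_session = defaultdict(list)
    for event in events:
        by_session[event["session_id"]].append(event)
    # Sort each group so the workflow can be read in the order it happened.
    return {
        session_id: sorted(group, key=lambda e: e["sequence"])
        for session_id, group in by_session.items()
    }


# Events may arrive out of order and interleaved across sessions.
events = [
    {"session_id": "a", "sequence": 2, "action": "save"},
    {"session_id": "b", "sequence": 1, "action": "search"},
    {"session_id": "a", "sequence": 1, "action": "edit_start"},
]
stitched = stitch_events(events)
```

Events from otherwise unrelated features can be stitched the same way, provided they carry the same identifier.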
Notes
- This topic focuses on the Wikimedia production content projects. It is not inclusive of donate.wikimedia.org and donation banners.
- This is a draft, so the wording is likely to change.
- Generally, logged data fields should be constructed in the least identifying way possible. For example, if it is sufficient to use the edit bucket count of a user in order to perform longitudinal analysis on a user cohort instead of using user IDs, the edit bucket count should be used. As another example, if it is sufficient to use the namespace for articles instead of actual article titles, the namespace for articles should be used.
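The edit bucket example above could look something like the following sketch. The bucket boundaries and labels are illustrative assumptions, not values defined by this task; the point is only that the logged field carries a coarse bucket rather than a user ID or an exact count.

```python
def edit_count_bucket(edit_count):
    """Map an exact edit count to a coarse, less identifying bucket label.

    The boundaries below are assumptions chosen for illustration.
    """
    if edit_count < 0:
        raise ValueError("edit_count must be non-negative")
    if edit_count == 0:
        return "0 edits"
    if edit_count < 5:
        return "1-4 edits"
    if edit_count < 100:
        return "5-99 edits"
    if edit_count < 1000:
        return "100-999 edits"
    return "1000+ edits"


# Log the bucket instead of the user ID or the exact count.
event = {"action": "save", "user_edit_bucket": edit_count_bucket(37)}
```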
Use Cases
Broadly speaking, there are five cases of event logging. The cases and their correlation approaches are as follows.
1. Users who have opted into data collection explicitly. For such users, correlation may be done with a fixed identifier, and potentially all event logging may be done without sampling. In the apps this is usually an app installation ID. On the web this would most likely be a similar type of value stored in a localStorage variable. (N.B., this is not presently being entertained on the web, e.g., for narrowly scoped longitudinal analysis of a small random sample of users.) Data persisting beyond 90 days should be handled in compliance with the data retention guidelines.
2. Users who are opted out of data collection explicitly. Users on the web indicate this via Do Not Track. Users on the apps are opted out of data collection by default. For such users, event logging should not take place. Correlation is therefore typically out of scope, apart from standard tracing behavior in web logs or MediaWiki database inserts and the collection described in #5 below. (N.B., Virtual Pageviews and other intentful impressions may use the same transport used for event collection.)
3. Users who have neither opted in nor opted out explicitly - the largest base of users. For new event logging schemas, event logging on a per-session basis would be available, with a boolean flag for using the system-wide default sampling rate. Recommendation: 1 in 10,000 sessions on the web. That is, each session has a 1 in 10,000 chance of being selected, and in a selected session all events from event logging that uses the boolean flag are captured. Higher sampling ratios on a per-wiki or per-feature basis, when needed for a specific purpose (e.g., to address issues with data sparsity), may be established in consultation with Privacy on a case-by-case basis.
4. Users who log in or make contributions may have two classes of correlation applied:
4.1. In-feature unsampled contribution and persisted user data behavior. Contribution feature and persisted user data behavior may be tracked in an unsampled fashion and may include (but need not include) user IDs (or, upon contribution from anonymous access, masked IP addresses). As a side effect, such behavior may be stitched with events in #3.
4.2. Out-of-feature behavior. Such behavior may, on a case-by-case basis and in consultation with Privacy, be captured on an unsampled basis. Any behavior captured this way must be clearly scoped and should be captured in a way that proactively avoids easy linking of identity to consumption habits.
5. Security, privacy, and error events. Such events may be collected in an unsampled fashion on Wikimedia infrastructure and correlated with any other collected events.
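To make cases 1 through 3 concrete, here is a minimal sketch of a per-session sampling decision. It is a sketch under stated assumptions, not a proposed implementation: the session identifier and hashing technique are still open TODOs in this task, so SHA-256 is used purely as a stand-in, and the function and parameter names (`session_in_sample`, `should_log`, `opted_in`, `opted_out`, `one_in`) are hypothetical. Because the decision depends only on the session identifier, every schema using the default makes the same decision within a session, which is what allows the sampled events to be stitched together.

```python
import hashlib

DEFAULT_ONE_IN = 10_000  # recommended web default: 1 in 10,000 sessions


def session_in_sample(session_id, one_in=DEFAULT_ONE_IN):
    """Deterministically decide whether a session falls in the sample.

    SHA-256 is a stand-in here; the actual hashing technique is undecided.
    """
    if one_in < 1:
        raise ValueError("one_in must be at least 1")
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    # Treat the first 8 bytes of the digest as an unsigned integer.
    value = int.from_bytes(digest[:8], "big")
    return value % one_in == 0


def should_log(session_id, opted_in=False, opted_out=False,
               one_in=DEFAULT_ONE_IN):
    """Apply cases 1-3: explicit opt-out, explicit opt-in, then sampling."""
    if opted_out:  # case 2, e.g. Do Not Track on the web
        return False
    if opted_in:  # case 1: potentially unsampled
        return True
    return session_in_sample(session_id, one_in)  # case 3
```

The unsampled collection in cases 4 and 5 is deliberately left out of this sketch. A higher per-wiki or per-feature ratio would correspond to passing a smaller `one_in`.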
Technical Specifics
TODO: define session identifier and any hashing technique
TODO: define default and by-enrollment correlation configuration scheme
Related
T201124: Provide standard/reproducible way to access a PageToken
T199898: EventLogging sanitization
T201409: Harmonise the identification of requests across our stack
Data retention guidelines
Privacy policy
Instrumentation DACI