== THIS TASK DEFINITION IS A WORK IN PROGRESS ==
=== Problem definition ===
Requirements analysis for [[ https://meta.wikimedia.org/wiki/Wikimedia_Foundation_Annual_Plan/2018-2019/Audiences#Outcome_3:_Data_collection | output 3.1]] of the [[ https://www.mediawiki.org/wiki/Wikimedia_Audiences/Better_use_of_data | Better Use of Data ]] program identified the following needs:
- Complex workflows that contain multiple events should be instrumented in such a way that the multiple events can be gracefully analyzed together to re-compose the sequence of events undertaken by the user.
- Events from otherwise unrelated features can be analyzed together via common identifiers. For instrumentations that use sampling, this requires the ability to apply the same sampling to multiple events.
This task will be used to collaboratively define the mechanics for meeting these needs.
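As a starting point for discussion, the "same sampling applied to multiple events" need can be met by deriving the sampling decision deterministically from a shared identifier. The sketch below is only an illustration: the function name `in_sample`, the use of SHA-256, and the modulo reduction are assumptions, not a technique this task has settled on.

```python
import hashlib


def in_sample(session_id: str, one_in: int) -> bool:
    """Return True if the session falls in a 1-in-`one_in` sample.

    Hypothetical helper: the hash function and reduction are assumptions
    chosen for illustration only.
    """
    if one_in < 1:
        raise ValueError("one_in must be a positive integer")
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an unsigned integer, then reduce it
    # modulo the sampling rate. The result depends only on the identifier.
    value = int.from_bytes(digest[:8], "big")
    return value % one_in == 0
```

Because the decision depends only on the identifier, two otherwise unrelated features that call `in_sample` with the same session identifier and rate either both log or both skip, so their events can be joined afterwards.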
=== Notes ===
* This topic focuses on the Wikimedia production content projects. It is not inclusive of donate.wikimedia.org.
* This is a draft, so the wording is likely to change.
* Generally, logged data fields should be constructed in the least identifying way possible. For example, if it is sufficient to use the edit bucket count of a user in order to perform longitudinal analysis on a user cohort instead of using user IDs, the edit bucket count should be used. As another example, if it is sufficient to use the namespace for articles instead of actual article titles, the namespace for articles should be used.
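To make the edit bucket example concrete, here is a minimal sketch of coarsening an exact edit count into a bucket label before logging. The bucket boundaries and the function name `edit_count_bucket` are hypothetical and only illustrate the idea of logging the least identifying value that still supports the analysis.

```python
def edit_count_bucket(edit_count: int) -> str:
    """Map an exact edit count to a coarse bucket label.

    The boundaries below are hypothetical, chosen for illustration only.
    """
    if edit_count < 0:
        raise ValueError("edit_count must be non-negative")
    if edit_count == 0:
        return "0 edits"
    if edit_count < 5:
        return "1-4 edits"
    if edit_count < 100:
        return "5-99 edits"
    if edit_count < 1000:
        return "100-999 edits"
    return "1000+ edits"
```

An event would then carry the bucket label rather than the user ID or the exact count, which still allows cohort analysis by experience level.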
=== Use Cases ===
Broadly speaking, there are five cases of event logging. Here are the five cases and their correlation approaches.
1. Users who have opted into data collection explicitly. For such users, correlation may be done with a fixed identifier and potentially all event logging may be done without sampling. In the apps this is usually an app installation ID. On the web this would most likely be a similar type of value stored in a localStorage variable. Data should be handled to comply with the [[https://meta.wikimedia.org/wiki/Data_retention_guidelines | data retention guidelines]] for data persisting beyond 90 days.
2. Users who are opted out of data collection explicitly. Users on the web indicate this via Do Not Track. Users on the apps are opted out of data collection by default. For such users, event logging should not take place and, therefore, besides standard tracing behavior on web logs or MediaWiki database inserts and the collection in #5 below, correlation would typically be out of scope. (N.B., Virtual Pageviews and other intentful impressions may use the same transport used for event collection.)
3. Users who have neither opted in nor opted out explicitly - the largest base of users. For new event logging schemas, event logging on a per-session basis should be on by default at a level specified systemwide. Recommendation: 1 in 10,000 sessions on the web. This means that, for all event logging using the defaults, there is a 1 in 10,000 chance that events will be captured in a given session. Higher sampling ratios on a per-wiki or per-feature basis, when needed for a specific purpose (e.g., to address issues with data sparsity), may be established in consultation with Privacy on a case-by-case basis.
4. Users who log in or make contributions may have two classes of correlation applied:
4.1 In-feature unsampled contribution behavior. Contribution feature behavior may be tracked in an unsampled fashion and may include user IDs (or, upon contribution from anonymous access, masked IP addresses). As a side effect, such behavior may be correlated with events in #3.
4.2 Out-of-feature behavior. Such behavior may be captured on an unsampled basis, decided case by case in consultation with Privacy. The behavior to be captured must be clearly scoped and should be captured in a way that proactively avoids easy linking of identity to consumption habits.
5. Security, privacy, and error events. Such events may be collected in an unsampled fashion and correlated with any other collected events.
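The cases above can be read as a small decision procedure. The following sketch is one possible reading, not an agreed design: the names `Consent`, `EventKind`, and `should_log` are hypothetical, case 4.2 is left out because it is decided case by case, and treating opted-out users as excluded from everything except case 5 is an assumption drawn from the wording of case 2.

```python
from enum import Enum


class Consent(Enum):
    OPTED_IN = "opted_in"        # case 1
    OPTED_OUT = "opted_out"      # case 2 (e.g. Do Not Track, or the apps default)
    UNSPECIFIED = "unspecified"  # case 3


class EventKind(Enum):
    GENERAL = "general"
    CONTRIBUTION = "contribution"               # case 4.1, in-feature
    SECURITY_PRIVACY_ERROR = "security_privacy_error"  # case 5


def should_log(consent: Consent, kind: EventKind, session_in_sample: bool) -> bool:
    """Decide whether to log an event (hypothetical reading of the cases)."""
    # Case 5: security, privacy, and error events may be collected unsampled.
    if kind is EventKind.SECURITY_PRIVACY_ERROR:
        return True
    # Case 2: no event logging for opted-out users (assumption: this also
    # takes precedence over case 4.1).
    if consent is Consent.OPTED_OUT:
        return False
    # Case 1: opted-in users may be logged without sampling.
    if consent is Consent.OPTED_IN:
        return True
    # Case 4.1: in-feature contribution behavior may be unsampled.
    if kind is EventKind.CONTRIBUTION:
        return True
    # Case 3: everything else follows the per-session sampling decision.
    return session_in_sample
```

Here `session_in_sample` stands for whatever per-session sampling decision is eventually defined under Technical Specifics, for example the recommended 1 in 10,000 sessions on the web.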
=== Technical Specifics ===
TODO: define session identifier and any hashing technique
=== Related ===
{{T201124}}
[[https://meta.wikimedia.org/wiki/Data_retention_guidelines|Data retention guidelines]]
[[https://foundation.wikimedia.org/wiki/Privacy_policy|Privacy policy]]