For technical and legal reasons, we should calculate the edit count bucket in the front-end, before recording to EventLogging. Otherwise, we can't preserve any information about the edit count beyond 90 days, because the fine-grained count is often uniquely identifying and gets purged. The bucketed count is privacy-respecting and safe to keep beyond the data retention window.
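This should be done for all metrics. While updating these schemas, let's also change our bucketing to match the intervals used by other extensions (QuickSurveys, WikimediaEvents, and Popups already reimplement this standard): {0, 1-4, 5-99, 100-999, 1000+}. A mismatch between our bucket boundaries and those of other projects might erode privacy, since the exact edit count could be estimated more accurately than designed, and would keep our analyses from being directly comparable to others.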
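As a rough sketch of the bucketing proposed above, assuming the five intervals listed and labels in the "1-4 edits" style: the function name `edit_count_bucket` is hypothetical, the "0 edits", "5-99 edits", and "100-999 edits" labels are extrapolated from the pattern, and Python is used only for illustration (the real front-end code would be JavaScript, ideally calling the core implementation rather than reimplementing it).

```python
from typing import Optional


def edit_count_bucket(edit_count: Optional[int]) -> Optional[str]:
    """Map an exact edit count to a coarse bucket label.

    Hypothetical helper. Returns None for anonymous users, who are
    represented here by edit_count=None.
    """
    if edit_count is None:
        return None
    if edit_count < 0:
        raise ValueError("edit count cannot be negative")
    if edit_count == 0:
        return "0 edits"
    if edit_count < 5:
        return "1-4 edits"
    if edit_count < 100:
        return "5-99 edits"
    if edit_count < 1000:
        return "100-999 edits"
    return "1000+ edits"
```

Only the returned label would be logged; the exact count never leaves the client in this sketch.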
Be cautious about the migrations required:
* EventLogging must be kept backwards-compatible: an old event should still validate under the new schema.
* Stale aggregations and Graphite metrics will conflict with the new enums, so they must be removed.
* Old events must not break aggregation, but should be skipped. This can be accomplished by checking the schema revision level.
* Each patch must be robust against a rollback of the others.
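Skipping old events by schema revision, as suggested in the list above, might look like the following sketch. It assumes each event is a dict carrying a numeric `revision` key; the cutoff `MIN_REVISION` and the function name `current_events` are placeholders, not existing code.

```python
from typing import Iterable, Iterator

# Hypothetical: first schema revision that includes the bucketed edit count.
MIN_REVISION = 2


def current_events(
    events: Iterable[dict], min_revision: int = MIN_REVISION
) -> Iterator[dict]:
    """Yield only events logged under the new schema, skipping older ones.

    Temporary migration code: remove once old events have aged out.
    """
    for event in events:
        # Events without a revision are treated as old and skipped.
        if event.get("revision", 0) >= min_revision:
            yield event
```

Filtering rather than raising keeps aggregation working if a logging patch is rolled back and old-revision events reappear.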
Table of components affected:
TBD.
Acceptance criteria:
[] Edit count bucket should always be sent along with front-end events. Try to re-use core bucketing code.
[] Front-end events use the MediaWiki-core bucket labels ("1-4 edits", "1000+ edits", etc.)
[] Some of the aggregations should be segmented by edit count bucket (TBD: document which)
[] Front-end should send a `null` bucketed edit count for anonymous users.
[] Aggregations should include anonymous as its own edit count bucket.
[] Aggregations skip events with old schemas (migration code should be marked as temporary). Old events may still be encountered, for example in the case that logging patches are reverted.
[] Cached aggregations and Graphite metrics should be purged, and the reporting start date pushed forward to correspond to new schema deployment.
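To illustrate the aggregation criteria, here is a minimal sketch of segmenting event counts by edit count bucket, with anonymous users as their own bucket. The field name `editCountBucket`, the `"anonymous"` label, and the function name are all assumptions made for the example; which aggregations are segmented is still TBD.

```python
from collections import Counter
from typing import Iterable

ANONYMOUS = "anonymous"  # hypothetical label for the null bucket


def count_by_bucket(events: Iterable[dict]) -> Counter:
    """Count events per edit count bucket.

    A null bucket (anonymous user) is counted under ANONYMOUS. A missing
    key is treated the same way, so old-schema events are assumed to have
    been filtered out before this step.
    """
    counts: Counter = Counter()
    for event in events:
        bucket = event.get("editCountBucket")
        counts[ANONYMOUS if bucket is None else bucket] += 1
    return counts
```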