
Add edit count bucketing to all metrics
Closed, Resolved, Public

Description

For technical and legal reasons, we should calculate the edit count bucket in frontend EventLogging. Otherwise, we can't preserve any information about the edit count beyond 90 days because the fine-grained information is often uniquely identifying. The bucketed count is safe to keep beyond the data retention window.

While updating these schemas, let's also change our bucketing to match the intervals used by other extensions: {0, 1-4, 5-99, 100-999, 1000+}
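As a rough illustration of the intervals above, here is a minimal sketch of a bucketing helper. The function name `editCountBucket` and its signature are hypothetical and are not taken from EventLogging or MediaWiki core. It returns the bare interval as a string.

```typescript
// Hypothetical helper, for illustration only: maps a raw edit count onto the
// intervals {0, 1-4, 5-99, 100-999, 1000+} named in the description.
function editCountBucket(editCount: number): string {
  if (!Number.isInteger(editCount) || editCount < 0) {
    throw new RangeError(`Invalid edit count: ${editCount}`);
  }
  if (editCount === 0) {
    return "0";
  }
  if (editCount <= 4) {
    return "1-4";
  }
  if (editCount <= 99) {
    return "5-99";
  }
  if (editCount <= 999) {
    return "100-999";
  }
  return "1000+";
}
```

Only the coarse bucket leaves the client in this sketch, which is the property that makes the value safe to retain.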

Be cautious about the migrations required:

  • EventLogging must be kept backwards-compatible: an old event should still validate under the new schema (for silly reasons).
  • Stale aggregations and Graphite metrics will conflict with the new enums, so they must be removed.
  • Old events must not break aggregation, but should be skipped. This can be accomplished by checking the schema revision level.
  • Each patch must be robust against a rollback of the others.
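The revision check mentioned in the list above could look roughly like the sketch below. It assumes each event record carries a numeric `revision` field naming the schema revision it was logged under. The function name and the `minRevision` parameter are hypothetical, and the real aggregations live in reportupdater queries rather than TypeScript.

```typescript
// Assumed shape: each logged event records the schema revision it was
// validated against. Only this field matters for the sketch.
interface RevisionedEvent {
  revision: number;
}

// Hypothetical filter: keep only events logged at or after the first schema
// revision that carries the bucketed edit count. Older events are skipped
// rather than treated as errors, so a rollback of the logging patch cannot
// break aggregation.
function skipOldEvents<T extends RevisionedEvent>(events: T[], minRevision: number): T[] {
  return events.filter((event) => event.revision >= minRevision);
}
```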

Components affected:

  • EventLogging
    • Reusable bucketing
  • CodeMirror
  • TemplateData editor
  • TemplateWizard
  • VisualEditor

Acceptance criteria:

  • Edit count bucket should always be sent along with front-end events. Try to re-use core bucketing code.
  • Front-end events use the MediaWiki-core bucket labels ("1-4 edits", "1000+ edits", etc.)
  • Some of the aggregations should be segmented by edit count bucket (TBD: document which)
  • Front-end should send a null bucketed edit count for anonymous users.
  • Aggregations should include anonymous as its own edit count bucket.
  • Aggregations skip events with old schemas (migration code should be marked as temporary). Old events may still be encountered, for example in the case that logging patches are reverted.
  • Cached aggregations and Graphite metrics should be purged, and the reporting start date pushed forward to correspond to new schema deployment.
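To make the anonymous-bucket criteria concrete, here is a small sketch of segmenting events by bucket, with a null bucket counted under its own "anonymous" key. The function name and the "anonymous" key are assumptions made for illustration, not part of any schema.

```typescript
// null stands for an event sent by an anonymous user, per the criteria above.
type EditCountBucket = string | null;

// Hypothetical aggregation: count events per bucket label, reporting
// anonymous users as their own segment instead of dropping them.
function countByBucket(buckets: EditCountBucket[]): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const bucket of buckets) {
    const key = bucket === null ? "anonymous" : bucket;
    counts[key] = (counts[key] || 0) + 1;
  }
  return counts;
}

const exampleCounts = countByBucket(["1-4 edits", null, "1-4 edits", "1000+ edits", null]);
```

Keeping null on the wire and naming the segment only at aggregation time means the front end never has to invent a label for users who have no edit count.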

Event Timeline

awight set the point value for this task to 2. Dec 12 2020, 1:12 AM
Lena_WMDE changed the point value for this task from 2 to 5.

Taking this out of the sprint; we've decided it needs to wait until January due to the code freeze.

Change 650237 had a related patch set uploaded (by Awight; owner: Awight):
[mediawiki/extensions/TemplateWizard@master] Don't send user fields when not logged in

https://gerrit.wikimedia.org/r/650237

I'm starting to think that "0" is an important category for user edit count bucketing, and that we could follow the precedent set by other extensions here. It *almost* fits with the elegant theme of our buckets, although 10^0 = 1, which would make the lowest bucket "over 1". I'm proposing that we instead use "0" and "over 0", since 1 is not an interesting category.

I'm motivated by the very large experience gap between a registered user who has made zero edits and one who has made 1 edit.

"0" and "anonymous" would remain separate categories.

Change 650237 merged by jenkins-bot:
[mediawiki/extensions/TemplateWizard@master] New event semantics for performer fields

https://gerrit.wikimedia.org/r/650237

After some discussion, we'll go ahead with aligning the buckets. This is probably best handled by a subtask for each set of metrics.

awight changed the point value for this task from 5 to 8.

Change 656146 had a related patch set uploaded (by Awight; owner: Awight):
[mediawiki/core@master] Core support for edit count bucketing

https://gerrit.wikimedia.org/r/656146

Change 656159 had a related patch set uploaded (by Awight; owner: Awight):
[mediawiki/extensions/CodeMirror@master] Log user edit count bucket

https://gerrit.wikimedia.org/r/656159

Change 656210 had a related patch set uploaded (by Awight; owner: Awight):
[analytics/reportupdater-queries@master] [WIP] Segment CodeMirror metrics by user edit count

https://gerrit.wikimedia.org/r/656210

Change 656146 abandoned by Awight:
[mediawiki/core@master] Core support for edit count bucketing

Reason:
Moving to EventLogging

https://gerrit.wikimedia.org/r/656146

Change 656427 had a related patch set uploaded (by Awight; owner: Awight):
[mediawiki/extensions/EventLogging@master] [WIP] User edit count bucketing

https://gerrit.wikimedia.org/r/656427

Change 656901 had a related patch set uploaded (by WMDE-Fisch; owner: WMDE-Fisch):
[schemas/event/secondary@master] [WIP] Update schema with core bucket labels

https://gerrit.wikimedia.org/r/656901

Change 656927 had a related patch set uploaded (by WMDE-Fisch; owner: WMDE-Fisch):
[mediawiki/extensions/TemplateWizard@master] Use schema with core bucket labels

https://gerrit.wikimedia.org/r/656927

Lena_WMDE changed the point value for this task from 8 to 5. Jan 20 2021, 9:29 AM

Change 657557 had a related patch set uploaded (by Awight; owner: Awight):
[mediawiki/extensions/VisualEditor@master] Add edit count bucket to VisualEditorTemplateDialogUse events

https://gerrit.wikimedia.org/r/657557

Change 657634 had a related patch set uploaded (by Awight; owner: Awight):
[analytics/reportupdater-queries@master] [WIP] Use edit count bucket sent by TemplateWizard

https://gerrit.wikimedia.org/r/657634

Change 657635 had a related patch set uploaded (by Awight; owner: Awight):
[analytics/reportupdater-queries@master] [WIP] Update event bucketing for visualeditor events

https://gerrit.wikimedia.org/r/657635

Ottomata added subscribers: mforns, Ottomata.

@awight seeking feedback from Analytics, perhaps @mforns has some thoughts?

Change 656427 merged by jenkins-bot:
[mediawiki/extensions/EventLogging@master] User edit count bucketing

https://gerrit.wikimedia.org/r/656427

Change 656159 merged by jenkins-bot:
[mediawiki/extensions/CodeMirror@master] Log user edit count bucket

https://gerrit.wikimedia.org/r/656159

Change 657557 merged by jenkins-bot:
[mediawiki/extensions/VisualEditor@master] Add edit count bucket to VisualEditorTemplateDialogUse events

https://gerrit.wikimedia.org/r/657557

Change 656210 abandoned by Awight:
[analytics/reportupdater-queries@master] [WIP] Segment CodeMirror metrics by user edit count

Reason:
became redundant

https://gerrit.wikimedia.org/r/656210

Change 656210 restored by Awight:
[analytics/reportupdater-queries@master] [WIP] Segment CodeMirror metrics by user edit count

https://gerrit.wikimedia.org/r/656210

awight removed awight as the assignee of this task. Feb 1 2021, 4:55 PM
awight moved this task from Doing to Watching on the WMDE-TechWish (Sprint-2021-01-20) board.
awight removed the point value for this task. Feb 2 2021, 12:06 PM

FYI, @awight https://meta.wikimedia.org/w/index.php?title=Schema%3ACodeMirrorUsage&type=revision&diff=20981443&oldid=20959169 was a backwards incompatible change and did indeed cause the new field to be added to the hive table as an integer. This is causing some refinement errors during ingestion about corrupt records.

I manually altered the Hive table to fix:

alter table event.codemirrorusage change  `event` `event` struct<editor:string,enabled:boolean,toggled:boolean,session_token:string,user_id:bigint,edit_start_ts_ms:bigint,user_edit_count_bucket:string>

Mentioned in SAL (#wikimedia-analytics) [2021-02-02T19:29:54Z] <ottomata> manually altered event.codemirrorusage to fix incompatible type change: https://phabricator.wikimedia.org/T269986#6797385


Wow, thank you for dealing with this and so sorry for the mess! If it's helpful going forward, I'm open to migrating any / all of our eventlogging schemas to the new infrastructure where this can't happen again.

Hm, @awight, I don't know why, but the schemas you mention in this ticket are not tracked in the EventLogging Schema Audit sheet, which is what we are using to plan T259163: Migrate legacy metawiki schemas to Event Platform.

I had thought all schemas were at least listed in that spreadsheet...OH, it is because they are new!

On October 5 @sdkim sent an email with the subject 'ACTION REQ -- Migration Plan: Modern Event Platform', and then another one on November 23 with the subject 'INFO -- Starting MEP Migration of Confirmed Schemas'.

Please add your schemas to the EventLogging Schema Audit sheet and mark them as 'Migrate' in the To Migrate or Deprecate column.

@Ottomata
Oh dear. Yeah I was worried that we are your recurring legacy nightmare :-)

Can you have me (first.last@wikimedia.de) added to the mailing list where these notices went out? Or if it's public, I can subscribe myself.

We'll add our new schemas to the audit sheet, and going forward will be careful to only create MEP schemas.

Change 656901 merged by Awight:
[schemas/event/secondary@master] Update schema with core bucket labels

https://gerrit.wikimedia.org/r/656901

Change 656927 merged by Awight:
[mediawiki/extensions/TemplateWizard@master] Use schema with core bucket labels

https://gerrit.wikimedia.org/r/656927

awight claimed this task.