
Measure the reference use and re-use in VE
Open, Stalled, Needs Triage, Public

Description

Some of the issues we're working on during the WMDE-References-FocusArea affect how users can interact with the VisualEditor.

Assumptions:

  • Users don't use the reference re-use dialog in VE because it's faulty and incomplete in some cases.
    • Fixing the issues around the re-use dialog will increase its usage.
    • Fixing use cases about the usage of references in VE will increase how often people add and re-use references there.
  • Users switch back and forth between visual and Wikitext mode to create/reuse/edit references
    • Fixing use cases about the usage of references in VE will decrease the need to switch back and forth

Indicators:

  • Users use the cite and re-use dialogs more often per session.
  • Users use the cite and re-use dialogs more successfully.
  • Users need to switch between Wikitext mode and visual mode less often per session.

Measurements:
T362347: Log events for some simple interactions in the VE cite dialogs:

  • Opens of the cite dialog. Make sure this works with/without Citoid.
  • Opens of the re-use tab
  • Active uses of the re-use dialog (defined by interacting with the search box or scrolling through the list). This is sticky per opening of the dialog, so we are recording either 0 or 1.
  • Left the re-use dialog after actively using it without adding a reference re-use (e.g. pressing Esc or clicking outside of the dialog).
  • Left the re-use dialog successfully by adding a reference re-use
  • Tries to add a new reference using the dialog. We don't care if this actually ended in a new reference being added to the article, only that a selection in the cite dialog was made.
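
The "sticky" active-use flag and the two ways of leaving the re-use dialog can be sketched as below. This is only a minimal model of the logic in Python, not the VisualEditor implementation; the class name and all event names (`dialog-open-reuse`, `dialog-active-use-reuse`, `dialog-abandon-reuse`, `dialog-success-reuse`) are hypothetical placeholders, not entries from the data dictionary.

```python
from dataclasses import dataclass, field


@dataclass
class ReuseDialogTracker:
    """Models event logging for openings of the re-use dialog (hypothetical)."""

    events: list = field(default_factory=list)
    _active: bool = False

    def open(self) -> None:
        # Every new opening resets the sticky flag.
        self._active = False
        self.events.append("dialog-open-reuse")

    def interact(self) -> None:
        # Search-box input or list scrolling: logged at most once per opening.
        if not self._active:
            self._active = True
            self.events.append("dialog-active-use-reuse")

    def close(self, added_reuse: bool) -> None:
        if added_reuse:
            self.events.append("dialog-success-reuse")
        elif self._active:
            # Only leaving *after* active use counts as an abandonment.
            self.events.append("dialog-abandon-reuse")


tracker = ReuseDialogTracker()
tracker.open()
tracker.interact()
tracker.interact()  # second interaction is not logged again
tracker.close(added_reuse=False)
print(tracker.events)
# ['dialog-open-reuse', 'dialog-active-use-reuse', 'dialog-abandon-reuse']
```

Keeping the flag on the tracker rather than counting raw interactions means each opening contributes either 0 or 1 to the active-use count, as described above.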

T362358: Log events for copy and paste action around references in VE

  • How often users paste from the clipboard and create a new reference by doing so.
  • How often users paste and create a reference re-use with that
  • Make sure we have a baseline for the number of edit sessions with VE to normalize the above values to.
  • Switches between VE and wikitext editing per edit session. Which wikitext editor is used is not relevant, but we must make sure they are all tracked.

See editor-switch in https://www.mediawiki.org/wiki/VisualEditor/FeatureUse_data_dictionary
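
As a rough illustration of the normalization step, the sketch below divides event counts by a baseline of VE edit sessions. It assumes the raw data can be reduced to `(session_id, action)` pairs; the function name and the `paste-new-ref` action are made up for the example, while `editor-switch` is the feature name referenced above.

```python
from collections import defaultdict


def per_session_rates(events, baseline_sessions):
    """Return the mean number of events per VE edit session, by action.

    events: iterable of (session_id, action) pairs.
    baseline_sessions: set of all VE edit session IDs in the period,
        including sessions that produced none of the events of interest.
    """
    counts = defaultdict(int)
    for _session_id, action in events:
        counts[action] += 1
    total = len(baseline_sessions)
    if total == 0:
        return {}
    return {action: count / total for action, count in counts.items()}


events = [
    ("s1", "editor-switch"),
    ("s1", "editor-switch"),
    ("s2", "editor-switch"),
    ("s3", "paste-new-ref"),  # hypothetical action name
]
rates = per_session_rates(events, {"s1", "s2", "s3", "s4"})
print(rates)
# {'editor-switch': 0.75, 'paste-new-ref': 0.25}
```

Session `s4` has no events but still counts toward the denominator, which is why a separate baseline of all edit sessions is needed.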

Also consider:

  • Was the "citation tool" in use, or the plain cite dialog? ( irrelevant here )
  • Track edit session ID in the raw data. While we don't care about individual users and only need trends over time, we need sessions for normalization and for grouping, most notably to see when users who switch a lot between editors start doing this less often.
  • VE events are normally sampled heavily, at 6.25%. However, wgWMESchemaVisualEditorFeatureUseSamplingRate and wgWMESchemaEditAttemptStepSamplingRate were set to 100% last year and seem to have been left this way.
  • Data retention should be considered: VE schemas are already configured to save "sanitized" data after 30 days, but if we add a new schema we will also need to configure sanitization in https://github.com/wikimedia/analytics-refinery/blob/master/static_data/sanitization/event_sanitized_analytics_allowlist.yaml
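
On the sampling note above: a rate of 6.25% is 1 in 16, so absolute counts from sampled data would have to be scaled by 16 to estimate totals, while a rate of 100% needs no scaling. A small arithmetic sketch, with a hypothetical helper name:

```python
from fractions import Fraction


def estimate_total(observed_count: int, sampling_rate: Fraction) -> Fraction:
    """Scale an observed event count up to an estimated total."""
    if not 0 < sampling_rate <= 1:
        raise ValueError("sampling_rate must be in (0, 1]")
    return observed_count / sampling_rate


print(estimate_total(50, Fraction(1, 16)))  # 6.25% sampling -> 800
print(estimate_total(50, Fraction(1)))      # 100% sampling  -> 50
```

Ratios such as switches per session should not need this correction as long as numerator and denominator are sampled at the same rate, which is an assumption to verify before comparing periods with different sampling settings.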

Event Timeline

thiemowmde subscribed.

Warning: These dialogs are potentially in 3 different extensions, including Cite and Citoid!

awight updated the task description.
WMDE-Fisch added a subscriber: ElineWMDE.

We should discuss the stats per edit session with @ElineWMDE before we do anything here.

WMDE-Fisch renamed this task from "Track how often people sucessfully use the VE reference dialogs" to "Meassure the reference use and re-use in VE". Mar 4 2024, 5:56 PM
WMDE-Fisch updated the task description.
WMDE-Fisch renamed this task from "Meassure the reference use and re-use in VE" to "Measure the reference use and re-use in VE". Mar 4 2024, 6:30 PM
WMDE-Fisch changed the task status from Open to Stalled. Apr 24 2024, 5:14 PM

Still waiting for L3SC review ...

Hi @WMDE-Fisch! I'll be conducting this review on phab (at the request of the WMF Legal team until there's a formal agreement between WMF and WMDE). Here's what I originally posted on the L3SC ticket:

So sorry this has taken ~6 weeks to be reviewed — I was just assigned this request and would love to get some more information on the following questions that I wasn't able to get a clear sense of in the linked phab tickets:

  • Are any user identifiers (IP address/UA, user ID, or username) going to be collected?
  • Are any geographic identifiers (derived from IP address) going to be collected?
  • Are any page identifiers (page title, page ID) going to be collected?
  • Given that events occur one at a time, how will the team plan to collect aggregated data? From what raw data source will they be aggregating it?
  • The team proposes a data retention timeframe of two years. Is this retention period for the aggregated data, or for the underlying source data? For underlying source data, the typical data retention period is usually 90 days (see: https://foundation.wikimedia.org/wiki/Legal:Data_retention_guidelines)

If you can get me the answers to these questions, I'm hoping I can get this reviewed by the end of next week!

Hey @Htriedman

Thanks for reaching out and giving us the chance to handle this here.

For the raw source data, we're using the VisualEditorFeatureUse event schema. Some actions are already tracked, and we'll add some others to the list in VisualEditor/FeatureUse_data_dictionary. Since we're using this schema, the raw data comes with some of the points mentioned. For the aggregation, though, we'll only use the sanitized version.

  • Are any user identifiers (IP address/UA, user ID, or username) going to be collected?
  • Are any geographic identifiers (derived from IP address) going to be collected?

Yes, the raw data collects these; the sanitized data only includes the user ID. But we won't keep them in the aggregated data.

  • Are any page identifiers (page title, page ID) going to be collected?

I don't see these collected in the raw data. Either way, we won't keep them in the aggregated data.

  • Given that events occur one at a time, how will the team plan to collect aggregated data? From what raw data source will they be aggregating it?

The current plan is to aggregate the raw data in Superset by creating a custom dataset there. As mentioned the raw source will be the event data from the VisualEditorFeatureUse schema.

We're aggregating data in Superset on demand and would use the caching there for around 1 or 4 weeks. For the sanitized source data, we're not planning to change anything about the retention periods that already apply. Currently it's kept indefinitely.

Got it! Since this is downstream of an existing event schema, and collects no user identifiers, granular geographic identifiers, or page identifiers, this data collection activity is lower risk. You can go ahead and proceed with building this.

If you are planning on making any of this data public, please consult the Data publication guidelines for guidance.