
Reading_depth: deactivate eventlogging instrumentation
Closed, ResolvedPublic

Description

Per my last conversation with @kzimmerman, we think this data is no longer needed/used. If that is the case, we should probably remove the instrumentation.

Pinging @phuedx, who *I think* instrumented this earlier.

This schema has a large volume of events: https://grafana.wikimedia.org/d/000000018/eventlogging-schema?orgId=1&var-schema=ReadingDepth

NOTE: schema disabled as of August 20, 2019

Original launch task: T155639

Event Timeline

Jdlrobson reassigned this task from Jdlrobson to phuedx.
Jdlrobson moved this task from Incoming to Needs Prioritization on the Web-Team-Backlog board.
Jdlrobson subscribed.

I do not think we should be in a rush to remove this instrumentation.

Rather, I think this data is useful in general for learning about Wikimedia audiences. I used this data for a technical report and research paper (with @ovasileva and @Tbayer) which we will present at OpenSym and Wikimania. You can read a draft of that report here.

In fact, I would like to explore ways that we might release an aggregation of the data from the report to the public. This can support a broader community of researchers interested in studying audiences.

If the data is too expensive to store, then we should consider lowering the sampling rate instead.

@Groceryheist storage is not the problem for us; the problem is the number of events coming through, which affects performance for the rest of the EL users. We have no problem with it if this data is going to be used at some point soon; otherwise we can disable it and maybe re-enable it in the future.

Or, it can be migrated to EventGate (sometime in Q2 or Q3) and we don't have to worry about it affecting other EL stuff. :)

Note there are many tasks to add features to it ( https://phabricator.wikimedia.org/T219212, https://phabricator.wikimedia.org/T200093, https://phabricator.wikimedia.org/T208594, https://phabricator.wikimedia.org/T207899) so depending on what happens here we will also want to review those open tickets.

I'm not against disabling the instrumentation for now (but keeping the dataset per T229042#5370440), given the… alarmingly high event rate. To be clear, disabling the instrumentation must include not sending it to the client.

My understanding is that the session depth metric itself is something that we wish to continue tracking but that the current implementation can be greatly simplified in such a way as to satisfy some of the feature requests that @Jdlrobson mentioned above. Product Analytics and/or Analytics Infrastructure are best positioned to do that work and own it moving forward.

Finally, with no sense of urgency for T208594: ReadingDepth: Add some new fields to the schema and T207899: ReadingDepth events should only be sent for pageviews and Readers Web's focus on the Desktop Refresh project, it's hard for me to argue that we should keep the current implementation in its current form.

@phuedx I also vote for disabling the instrumentation. Can we use this ticket for that purpose?

Change 531290 had a related patch set uploaded (by Jdlrobson; owner: Jdlrobson):
[operations/mediawiki-config@master] Disable Wikimedia ReadingDepth

https://gerrit.wikimedia.org/r/531290

Jdlrobson renamed this task from Reading_depth remove eventlogging instrumentation? to Reading_depth: remove eventlogging instrumentation. Aug 20 2019, 5:46 PM

+1 on disabling for now and keeping the dataset.

@Groceryheist here's the proposal for SessionLength, which we want to use to replace ReadingDepth: https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/SessionLength

The data exploration you undertook is interesting, particularly the analysis of geographic differences in reading times for pages. I think it would be even more helpful to understand overall session time, and the session length metric will create lower client-side loads.

Change 531290 merged by jenkins-bot:
[operations/mediawiki-config@master] Disable Wikimedia ReadingDepth

https://gerrit.wikimedia.org/r/531290

Mentioned in SAL (#wikimedia-operations) [2019-08-20T23:19:18Z] <urbanecm@deploy1001> Synchronized wmf-config/InitialiseSettings.php: SWAT: 13be059: Disable Wikimedia ReadingDepth (T229042) (duration: 00m 56s)

Change 531489 had a related patch set uploaded (by Nuria; owner: Nuria):
[operations/puppet@production] Removing loading of Reading_Depth into druid

https://gerrit.wikimedia.org/r/531489

Sorry, I lost track of this bug until today. I think it is really regrettable to turn off the instrumentation. The utility of the data is greatly lessened by gaps in the collection window. My understanding is that the instrument should only send two events for each page view. The sampling rate has been quite high at 10%, which explains the high number of events.

As I pointed out, if the volume of the events is a problem, would decreasing the sampling rate help?
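To make the trade-off concrete, here is a rough sketch of how the event rate scales with the sampling rate, assuming the two-events-per-sampled-page-view behaviour described above. The function name and the page view figure are hypothetical and chosen only for illustration; they are not measured traffic numbers.

```typescript
// Hypothetical helper: expected number of events for a given traffic level.
// Assumes every sampled page view emits exactly `eventsPerPageView` events.
function expectedEvents(
  pageViews: number,
  samplingRate: number,
  eventsPerPageView: number = 2
): number {
  if (samplingRate < 0 || samplingRate > 1) {
    throw new RangeError("samplingRate must be between 0 and 1");
  }
  return pageViews * samplingRate * eventsPerPageView;
}

// Illustrative figure only, not real traffic data.
const pageViewsPerSecond = 1000;

// Event rate at the current 10% sampling rate versus a hypothetical 1%.
const atTenPercent = expectedEvents(pageViewsPerSecond, 0.1);
const atOnePercent = expectedEvents(pageViewsPerSecond, 0.01);
```

Under these assumptions the event rate is linear in the sampling rate, so a tenfold reduction in sampling gives a tenfold reduction in events.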

Change 531489 merged by Elukey:
[operations/puppet@production] profile::analytics::refinery::job::druid_load: absent readingdepth job

https://gerrit.wikimedia.org/r/531489

Discussed with @kzimmerman today and decided the best way forward would be to transfer ownership of the schema to the product analytics team (@mpopov, @jlinehan).

@Groceryheist with our very limited resources (more so this year than in years past) we really cannot afford to maintain streams of this volume that have no immediate use, and, even at low volume, it is my impression that nobody at the WMF is actually relying on this data.

I understand the research community finds the data of interest, but ideally we would like to concentrate our very limited resources on data streams that have wider usage. As @kzimmerman mentioned above, Reading_depth has produced interesting findings in the desktop environment. From a brief inspection (and I think the team already knows this), its instrumentation code would need to be modified to work on mobile devices: the base assumption that you know when a user is done reading a page by means of the "beforeunload" event breaks on mobile [1] [2].
The more standard SessionLength metric (which does not include page information) will be a lot more lightweight and would work on both mobile and desktop. There are a couple of structural changes that need doing before this metric can be implemented, though; one of them is this one: https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/EventLogging/+/524575/

[1] https://www.igvita.com/2015/11/20/dont-lose-user-and-app-state-use-page-visibility/
[2] https://github.com/wikimedia/mediawiki-extensions-WikimediaEvents/blob/master/modules/all/ext.wikimediaEvents.readingDepth.js#L249
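To illustrate the approach recommended in [1]: a mobile-friendly instrument would typically report when the page becomes hidden rather than wait for "beforeunload". The following is a minimal sketch under stated assumptions, not the ReadingDepth implementation. `trackVisibleTime`, `VisibilitySource` and the `report` callback are hypothetical names, and the clock is injected so the logic can be exercised outside a browser.

```typescript
// Minimal shape of the parts of `document` this sketch relies on.
interface VisibilitySource {
  visibilityState: string;
  addEventListener(type: "visibilitychange", listener: () => void): void;
}

// Hypothetical helper: accumulates the time the page was visible and reports
// the running total every time the page becomes hidden, since on mobile that
// may be the last event the page ever receives.
function trackVisibleTime(
  source: VisibilitySource,
  report: (visibleMs: number) => void,
  now: () => number = () => Date.now()
): void {
  let visibleSince: number | null =
    source.visibilityState === "visible" ? now() : null;
  let totalVisibleMs = 0;

  source.addEventListener("visibilitychange", () => {
    if (source.visibilityState === "hidden" && visibleSince !== null) {
      // Page was visible and is now hidden: close the interval and report.
      totalVisibleMs += now() - visibleSince;
      visibleSince = null;
      report(totalVisibleMs);
    } else if (source.visibilityState === "visible" && visibleSince === null) {
      // Page came back into view: start a new visible interval.
      visibleSince = now();
    }
  });
}
```

In a browser, `document` should satisfy the `VisibilitySource` shape, and the `report` callback could pass the value to something like `navigator.sendBeacon`. Reporting a running total on every "hidden" transition means the consumer has to keep only the latest value per page view.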

Moving this off the Readers Web board, as the original request of the team has been fulfilled.

Nuria renamed this task from Reading_depth: remove eventlogging instrumentation to Reading_depth: deactivate eventlogging instrumentation. Aug 29 2019, 2:22 PM
Nuria closed this task as Resolved.
Krinkle added a project: Performance-Team.
Krinkle subscribed.

The production payload for readingDepth.js is still being transferred and parsed on all page views. Can this be removed?

Change 626116 had a related patch set uploaded (by Phuedx; owner: Phuedx):
[mediawiki/extensions/WikimediaEvents@master] ReadingDepth: Remove ReadingDepth instrument

https://gerrit.wikimedia.org/r/626116

The production payload for readingDepth.js is still being transferred and parsed on all page views. Can this be removed?

To be clear, disabling the instrumentation must include not sending it to the client.

Whoops… I'll submit a patch momentarily.

Change 626116 merged by jenkins-bot:
[mediawiki/extensions/WikimediaEvents@master] ReadingDepth: Remove ReadingDepth instrument

https://gerrit.wikimedia.org/r/626116

I've updated the schema's documentation/talk page on metawiki.

phuedx claimed this task.