Create reading depth schema
Closed, Resolved | Public | 5 Estimated Story Points

Assigned To

Authored By

	ovasileva
	Jan 18 2017, 5:18 PM

Description

Create reading depth schema defined as time spent per page.

Acceptance Criteria

Create reading depth schema which includes the following as defined within the Popups schema:

pageTitleSource
namespaceIdSource
pageToken
sessionToken
isAnon
(new) skin
actions:
- pageLoaded
- (new) pageUnloaded

As well as (only for the pageUnloaded action):

totalLength - interval of time during which article was open (strictly, from first paint to current time).
visibleLength - interval of time during which article was open and content was visible (i.e. both painted/interactive and shown in the open browser tab). Pause timer based on visibility:
- Start the timer when the page is visible.
- Pause the timer when the visibility changes to "hidden" or "unknown".
- When the page has loaded and is visible then start the timer; otherwise, pause the timer.
firstPaint (same as in NavigationTiming)
domInteractive (same as in NavigationTiming)
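The visibility rules for visibleLength amount to a stopwatch that only runs while the page is visible. The following is a minimal sketch of that idea, not the WikimediaEvents implementation: `createPausableTimer` and `onVisibilityChange` are hypothetical names, and the clock is injected so the sketch runs outside a browser.

```javascript
// Hypothetical helper: a stopwatch that only accumulates time while running.
// `now` is an injected clock function returning milliseconds.
function createPausableTimer( now ) {
  var accumulated = 0;
  var startedAt = null; // null while paused

  return {
    resume: function () {
      if ( startedAt === null ) {
        startedAt = now();
      }
    },
    pause: function () {
      if ( startedAt !== null ) {
        accumulated += now() - startedAt;
        startedAt = null;
      }
    },
    elapsed: function () {
      return accumulated + ( startedAt === null ? 0 : now() - startedAt );
    }
  };
}

// Map a visibility state onto the timer, following the rules listed above.
function onVisibilityChange( timer, visibilityState ) {
  if ( visibilityState === 'visible' ) {
    timer.resume();
  } else {
    // "hidden", "unknown", or any unrecognised state pauses the timer.
    timer.pause();
  }
}
```

In a browser, the clock could be `performance.now` and the handler could be attached to the `visibilitychange` event, reading `document.visibilityState`.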

An event is only logged if the UA is sendBeacon-capable
A single event is logged with the above information when the user navigates away from the page or closes it.
- i.e. Log an event in an onbeforeunload event handler using sendBeacon.
The sampling rate is configurable and defaults to 0.005%.
The sampling rate (and method) is documented at https://meta.wikimedia.org/wiki/Schema_talk:ReadingDepth#Sampling .
This is behind a feature flag.
This is implemented in the WikimediaEvents extension.
The schema documentation template is filled out at https://meta.wikimedia.org/wiki/Schema_talk:ReadingDepth .
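The logging preconditions in this list (feature flag, sendBeacon support, sampling) can be combined into a single gate. The sketch below is illustrative only: `shouldLogReadingDepth` and its option names are hypothetical, and a plain random draw stands in for the actual sampling method, which is documented on the schema talk page rather than here.

```javascript
// Hypothetical gate combining the acceptance criteria. All names are
// illustrative and not taken from the WikimediaEvents extension.
function shouldLogReadingDepth( options ) {
  var nav = options.navigator;
  var isBeaconCapable = !!nav && typeof nav.sendBeacon === 'function';

  if ( !options.featureEnabled || !isBeaconCapable ) {
    return false;
  }

  // `random` is a number in [0, 1); `samplingRate` is a fraction,
  // so the default of 0.005% is written as 0.00005.
  return options.random < options.samplingRate;
}
```

In a browser, the gate could be called with `window.navigator` and `Math.random()`.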

NOTE: browsers which do not support visibility will be paused throughout - these will be filtered out in analysis.

Implementation Notes

This implementation of reading depth instrumentation should be the source of truth for the page and session tokens in Reading-maintained extensions. This is so that the Page Previews instrumentation, for example, can be tied to the reading depth instrumentation.

page_previews.js

function logEvent( data ) {
  // The shared service that owns the page and session tokens.
  var instrumentation = mw.reading.instrumentation;

  if ( instrumentation.isUserInSample() ) {

    // Instrumentation#getBaseEventData would return an object that includes the page and session tokens.
    data = $.extend( true, {}, instrumentation.getBaseEventData(), data );

    mw.track( 'event.Popups', data );
  }
}

Details

Subject	Repo	Branch	Lines +/-
Introduce the reading depth schema	mediawiki/extensions/WikimediaEvents	master	+147 -1
Hygiene: Remove checkin instrumentation	mediawiki/extensions/Popups	master	+54 -800
Hygiene: Rename isSendBeaconCapable	mediawiki/extensions/WikimediaEvents	master	+7 -10
wme: Set ReadingDepth sampling rate to 0.1%	operations/mediawiki-config	master	+10 -0
Track document visibility in reading depth schema	mediawiki/extensions/WikimediaEvents	master	+122 -21
Hygiene: Use mw.track to log events	mediawiki/extensions/WikimediaEvents	master	+2 -3
Enable ReadingDepth logging on Wikipedias	operations/mediawiki-config	master	+5 -0
Log first paint in addition to other fields	mediawiki/extensions/WikimediaEvents	master	+30 -1


Related Objects

		Status	Subtype	Assigned	Task
		Resolved		kzimmerman	T148262 Vet and explore new readership engagement metric
		Resolved		• Tbayer	T155639 Create reading depth schema

Event Timeline

There are a very large number of changes, so older changes are hidden.

@Tbayer see above ^

To summarise the gif:

User navigates to wiki/1 (triggers page loaded)
As soon as page starts loading switches tab (domInteractive triggered @674ms while window is hidden)
User goes back to "1" tab (triggers first paint @4010ms) and clicks a link after 254ms (triggers page unload with totalLength of session of 254ms)

Observed: firstPaintTime: 4010ms, totalLength of session: 254ms, visibleLength of session: -3699ms (≈ -3.7s)

This is what we were struggling to explain in our conversation:

Visible time is currently measured from the start of the page
totalLength from the firstPaint time.

It's technically correct if you want to capture time hidden. The session was visible for 254ms after first paint, but was hidden for 4010ms (254-4010=-3756ms≈-3.7s) in a separate tab.
(FYI the discrepancy of 57ms is worth checking, but I believe it is due to the microsecond-precision calculation of chrome.loadTimes.)

Three options here:

this is okay
push for a timeHidden field to avoid this confusion
ignore all time hidden before first paint / domInteractive

What would you prefer?
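The arithmetic behind the negative value can be reproduced with a short sketch. It assumes, as a simplification, that the tab was hidden for the full 4010ms before first paint; the variable names are illustrative and the second calculation is only one reading of option three.

```javascript
// Numbers from the scenario above. Treating the entire period before
// first paint as hidden time is a simplifying assumption.
var firstPaint = 4010;              // ms after navigation start
var totalLength = 254;              // ms from first paint to unload
var hiddenBeforePaint = firstPaint; // ms hidden before first paint
var hiddenAfterPaint = 0;           // ms hidden after first paint

// Behaviour as described: all hidden time is subtracted from totalLength.
var visibleCurrent = totalLength - ( hiddenBeforePaint + hiddenAfterPaint );

// Option three: ignore any time hidden before first paint.
var visibleOptionThree = totalLength - hiddenAfterPaint;

console.log( visibleCurrent );     // -3756
console.log( visibleOptionThree ); // 254
```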

• Tbayer updated the task description. Feb 16 2017, 10:21 PM

I kind of thought we had sorted out that case already, but I just tried to clarify it in the task description (which already advised to "Start the timer when the page is visible" for visibleLength, not from the very start of the page load; I added that this "visible" should be interpreted to mean either firstPaint/domInteractive or the tab coming into focus for the first time, whichever comes later). I guess this is the same as option three.

I'm struggling to explain this but I'm hearing contradictory information here. I hope this set of examples will illustrate where the complexity lies.
There seems to be a conflict between:

Measuring from firstPaint / domInteractive
Trying to capture time paused while the tab is not focused

I think running through these examples would help everyone:
https://etherpad.wikimedia.org/p/sessionlength

(Note I'm keeping a timer for hidden time and subtracting this from total session length... I'm not tracking visible time so changing the description wording to reflect this would be helpful.)

I've tried to anticipate what we will want to do here and written patches for all the options I can foresee:

Should be possible to squash one or both of these once we've defined expectations.

In T155639#3034896, @Jdlrobson wrote:

I'm struggling to explain this but I'm hearing contradictory information here. I hope this set of examples will illustrate where the complexity lies.
There seems to be a conflict between:

Measuring from firstPaint / domInteractive

Trying to capture time paused while the tab is not focused

I think running through these examples would help everyone:
https://etherpad.wikimedia.org/p/sessionlength

As discussed on IRC, I filled out the Etherpad yesterday afternoon to state the expected outcome in each of the examples. From these, it looks like the definition of totalLength is clear, but that there is still confusion about the definition of visibleLength. Perhaps it helps to express it differently: It's the duration of the time between page load and unload where the reader is not prevented from seeing the content by either the tab being out of focus or by the content not yet having been painted.
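One way to turn this definition into calculation steps is to take the window from "content painted" to unload and subtract only the hidden time that falls inside it. The sketch below is an interpretation, not the merged patch: `computeVisibleLength` is a hypothetical name, the window is assumed to start at the later of firstPaint and domInteractive, and hidden intervals are assumed not to overlap each other.

```javascript
// Hypothetical helper. All times are in ms since navigation start.
// hiddenIntervals is an array of [ from, to ] pairs during which the tab
// was not visible; the pairs are assumed not to overlap.
function computeVisibleLength( firstPaint, domInteractive, unloadTime, hiddenIntervals ) {
  // Assumption: content counts as painted at the later of the two marks.
  var start = Math.max( firstPaint, domInteractive );
  var visible = Math.max( 0, unloadTime - start );

  hiddenIntervals.forEach( function ( interval ) {
    // Only the part of each hidden interval inside [ start, unloadTime ] counts.
    var from = Math.max( interval[ 0 ], start );
    var to = Math.min( interval[ 1 ], unloadTime );
    if ( to > from ) {
      visible -= to - from;
    }
  } );

  return visible;
}
```

With the numbers from the gif scenario (first paint at 4010ms, domInteractive at 674ms, unload 254ms after first paint, tab hidden until first paint), this returns 254 rather than a negative value.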

I understand from https://gerrit.wikimedia.org/r/#/c/336670/9 that this isn't resolved yet. @bmansurov and @Jdlrobson , please feel free to draft me again if there is further clarification needed about the definition of visibleLength, or also in case I can help to discuss in which form (calculation steps) to implement that definition. Or maybe the remaining work is entirely on the coding side at this point.

@Tbayer I think the definition of visibleLength is clear. While reviewing I wasn't able to figure out what was causing the issue of negative values (even after squashing the two follow-up patches). I'll take another look on Tuesday, but in the meantime I'd appreciate @phuedx's review when he's back to work too.

Thanks @bmansurov. It would make a great difference if we could still have the first data in the schema by the end of next week.

phuedx moved this task from Needs Code Review to Needs More Work on the Reading-Web-Sprint-92-🍜 board. Feb 21 2017, 10:54 AM

@Tbayer heads-up that @phuedx is enabling the first part of the patch today, i.e. the part without visibleLength.

Change 338966 had a related patch set uploaded (by Phuedx):
Enable "reading depth" logging

https://gerrit.wikimedia.org/r/338966

Nemo_bis subscribed. Feb 21 2017, 1:19 PM

@Tbayer: When I worked in the Growth team, we'd disable the instrumentation two weeks after it was deployed. Is two weeks long enough? Granted, we'll likely have to tweak the sampling rate but do we have a rough idea of when we're going to stop collecting data?

Change 338966 merged by jenkins-bot:
Enable ReadingDepth logging on Wikipedias

https://gerrit.wikimedia.org/r/338966

Mentioned in SAL (#wikimedia-operations) [2017-02-21T14:08:21Z] <hashar@tin> Synchronized wmf-config/InitialiseSettings.php: Enable ReadingDepth logging on Wikipedias - T148262 T155639 (duration: 00m 45s)

What is the status of this? Why is it in needs more work?

I have a minor issue with Separation of Concerns in the logEvent in rEWMV6b16213ac756: Track document visibility in reading depth schema. As we discussed in today's standup, I'll fix up the change.

In T155639#3043823, @Jhernandez wrote:

What is the status of this? Why is it in needs more work?

Sorry @Jhernandez. I should've updated the task when I moved it into Needs More Work.

Thanks for the update!

Change 339162 had a related patch set uploaded (by Phuedx):
Hygiene: Remove checkin instrumentation

https://gerrit.wikimedia.org/r/339162

In T155639#3042724, @phuedx wrote:

@Tbayer: When I worked in the Growth team, we'd disable the instrumentation two weeks after it was deployed. Is two weeks long enough? Granted, we'll likely have to tweak the sampling rate but do we have a rough idea of when we're going to stop collecting data?

So this is not about an individual experiment or product decision, but about a new general reader engagement metric that will be calculated on an ongoing basis, like pageviews, unique devices, or the NavigationTiming schema. It will take a while until that metric is fully defined, vetted and has its reporting mechanisms (see also T148262), but we will want to be able to calculate it retroactively for some weeks at the point when it is rolled out.

Change 339185 had a related patch set uploaded (by Phuedx):
Hygiene: Use mw.track to log events

https://gerrit.wikimedia.org/r/339185

Change 339190 had a related patch set uploaded (by Phuedx):
Hygiene: Rename isSendBeaconCapable

https://gerrit.wikimedia.org/r/339190

phuedx moved this task from Needs More Work to Needs Code Review on the Reading-Web-Sprint-92-🍜 board. Feb 22 2017, 3:37 PM

Changes that require review:

@Tbayer: One question that I had when working on the current implementation: why log a pageLoaded and pageUnloaded event?

In T155639#3046870, @phuedx wrote:

@Tbayer: One question that I had when working on the current implementation: why log a pageLoaded and pageUnloaded event?

We discussed this briefly above (on Jan 26): T155639#2974220 f.

Change 339185 merged by jenkins-bot:
Hygiene: Use mw.track to log events

https://gerrit.wikimedia.org/r/339185

ReleaseTaggerBot added a project: MW-1.29-release (WMF-deploy-2017-02-28_(1.29.0-wmf.14)). Feb 23 2017, 12:01 AM

Krinkle unsubscribed. Feb 23 2017, 4:14 AM

In T155639#3048119, @Tbayer wrote:

We discussed this briefly above (on Jan 26): T155639#2974220 f.

Hah! Thanks!

@Tbayer, @ovasileva: Just confirming that we're seeing ReadingDepth events being logged. This Grafana dashboard shows the number of ReadingDepth events being logged per minute, which is currently ~13.

@phuedx - good stuff. Do you know when the visibility stuff can get added?

@bmansurov: I've updated rEWMVf79e67edbb7f: Track document visibility in reading depth schema with an inline comment explaining why there's no need to be defensive in the resume function.

• bmansurov moved this task from Needs Code Review to Needs More Work on the Reading-Web-Sprint-92-🍜 board. Feb 24 2017, 5:30 PM

Change 336670 merged by jenkins-bot:
Track document visibility in reading depth schema

https://gerrit.wikimedia.org/r/336670

^ There was some confusion around the expected value of the firstPaintTime property in the scenario that @bmansurov wrote up in T155639#3034527. @bmansurov was observing Chrome's (and Chromium's?) behaviour of delaying the first paint if the page isn't visible.

phuedx moved this task from Needs More Work to Ready for Signoff on the Reading-Web-Sprint-92-🍜 board. Feb 25 2017, 8:09 AM

@Tbayer fyi ^

Change 340095 had a related patch set uploaded (by Phuedx):
wme: Set ReadingDepth sampling rate to 0.1%

https://gerrit.wikimedia.org/r/340095

In T155639#3049310, @ovasileva wrote:

@phuedx - good stuff. Do you know when the visibility stuff can get added?

rEWMV423d0749418c: Track document visibility in reading depth schema will start riding the train tomorrow and will be deployed to all of the Wikipedias on Thursday, 2nd March at ~8 PM UTC.

Change 340095 merged by jenkins-bot:
wme: Set ReadingDepth sampling rate to 0.1%

https://gerrit.wikimedia.org/r/340095

Mentioned in SAL (#wikimedia-operations) [2017-02-27T10:34:00Z] <hashar@tin> Synchronized wmf-config/InitialiseSettings.php: wme: Set ReadingDepth sampling rate to 0.1% - T155639 (duration: 00m 40s)

^ For context:

09:09:44 <phuedx> we're currently logging about 2.5k readingdepth events per minute
09:11:58 <phuedx> i'm going to ask hashar to deploy a change that'll drop the rate to 20% of that
09:22:12 <phuedx> maybe not so huge but still a higher number than we expected
09:22:27 <phuedx> i think analytics wasn't happy about ~100 events/second
09:22:44 <phuedx> we're a little below half of that
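The rates quoted in this log can be put on a common per-second footing. The snippet below only restates the figures from the log (about 2.5k events per minute, reduced to 20% of that); the variable names are illustrative.

```javascript
var eventsPerMinuteBefore = 2500; // "about 2.5k readingdepth events per minute"
var targetPercent = 20;           // "drop the rate to 20% of that"

var eventsPerSecondBefore = eventsPerMinuteBefore / 60;
var eventsPerMinuteAfter = eventsPerMinuteBefore * targetPercent / 100;
var eventsPerSecondAfter = eventsPerMinuteAfter / 60;

console.log( eventsPerSecondBefore.toFixed( 1 ) ); // "41.7"
console.log( eventsPerMinuteAfter );               // 500
console.log( eventsPerSecondAfter.toFixed( 1 ) );  // "8.3"
```

About 41.7 events per second is consistent with the remark that the rate was a little below half of ~100 events/second.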

Change 339190 merged by jenkins-bot:
Hygiene: Rename isSendBeaconCapable

https://gerrit.wikimedia.org/r/339190

Change 339162 merged by jenkins-bot:
Hygiene: Remove checkin instrumentation

https://gerrit.wikimedia.org/r/339162

Over to you @Tbayer! /cc @ovasileva

ovasileva added a project: Reading-Web-Sprint-93-🔍🔍🔍🔍🔍. Mar 1 2017, 6:03 PM

Thanks all! On Monday, @Zareenf began the work on vetting and exploring the data; she will share the results of some data quality checks soon (our internal scratchpad is here for those who are curious).

From which date on can the visibleLength field be considered to contain valid data?

ovasileva moved this task from Needs Analysis to Ready for Signoff on the Reading-Web-Sprint-93-🔍🔍🔍🔍🔍 board. Mar 1 2017, 6:04 PM

In T155639#3065174, @Tbayer wrote:

From which date on can the visibleLength field be considered to contain valid data?

Thursday, 2nd March.

In T155639#3006730, @Tbayer wrote:

....

We still need to fill out the SchemaDoc template on the talk page (cf. e.g. https://meta.wikimedia.org/wiki/Schema_talk:Popups ); I can do that once we have decided who to list as maintainers (Olga and Sam?). Also, if someone could document the sampling method there as well, that would be great.

These two things got lost - should have put them into the AC, I'll do that now. I have filled out the SchemaDoc template and put in @phuedx and myself as maintainers for now: https://meta.wikimedia.org/wiki/Schema_talk:ReadingDepth - feel free to make changes.

• Tbayer updated the task description. Mar 1 2017, 8:48 PM

Unfortunately it appears that the schema hasn't been recording new data for the last two days (since 07:11am UTC on February 27, to be precise).[1] We first thought we had missed a version update (i.e. a switch to a new table), but there is still only one table in the database,[2] i.e. no new table was created after these schema changes. On the other hand, the schema is still sending data per Grafana, and other schemas are recording data just fine.[3] Does anyone know what's going on?

[1]

SELECT MAX(timestamp) FROM log.ReadingDepth_16325045;
+----------------+
| MAX(timestamp) |
+----------------+
| 20170227071139 |
+----------------+
1 row in set (0.00 sec)

[2]

SHOW TABLES FROM log LIKE 'Reading%';
+--------------------------+
| Tables_in_log (Reading%) |
+--------------------------+
| ReadingDepth_16325045    |
+--------------------------+
1 row in set (0.00 sec)

[3]
e.g.:
SELECT MAX(timestamp) FROM log.NavigationTiming_16305090;
+----------------+
| MAX(timestamp) |
+----------------+
| 20170302011633 |
+----------------+
1 row in set (0.01 sec)

or (albeit with >2h of lag too right now):
SELECT MAX(timestamp) FROM log.Popups_15906495;
+----------------+
| MAX(timestamp) |
+----------------+
| 20170301224722 |
+----------------+
1 row in set (0.00 sec)

@Tbayer: Nope. Who's the best person to ping in Analytics/SRE? At least we've learned that a well-behaved Grafana dashboard isn't a proxy for "Logged events are ending up in the DB."

Update: Seems more events are coming in now, but still with a lag of more than 2 days (most recent recorded event right now is from 7:19pm UTC on February 27) . I'll ping the Analytics team on IRC.

SELECT MAX(timestamp) FROM log.ReadingDepth_16325045;
+----------------+
| MAX(timestamp) |
+----------------+
| 20170227191946 |
+----------------+
1 row in set (0.00 sec)

SELECT LEFT(timestamp, 10) AS hour, COUNT(*) FROM log.ReadingDepth_16325045 WHERE LEFT(timestamp, 8) = '20170227' GROUP BY hour ORDER BY hour;
+------------+----------+
| hour       | COUNT(*) |
+------------+----------+
| 2017022700 |    95922 |
| 2017022701 |    95076 |
| 2017022702 |    95736 |
| 2017022703 |    92388 |
| 2017022704 |    94482 |
| 2017022705 |    89635 |
| 2017022706 |    86900 |
| 2017022707 |    93640 |
| 2017022708 |    98195 |
| 2017022709 |   102968 |
| 2017022710 |    75357 |
| 2017022711 |    26952 |
| 2017022712 |    27525 |
| 2017022713 |    29107 |
| 2017022714 |    32138 |
| 2017022715 |    32388 |
| 2017022716 |    32342 |
| 2017022717 |    31287 |
| 2017022718 |    29438 |
| 2017022719 |    10442 |
+------------+----------+
20 rows in set (2.63 sec)

Update: I didn't yet get a reaction from the Analytics team after pinging them in #wikimedia-analytics. Only elukey from Ops weighed in, exploring the possibility that there might be an issue with timestamps on Kafka (if I understand correctly) - but it looks like this still needs attention from someone familiar with the EventLogging pipeline.

In the meantime, more data has been coming in, but still only from February 27 and 28 - MAX(timestamp) is at 20170228122436 right now.

<elukey> HaeB: I am not an expert but I grepped ReadingDepth in the EL m4-consumer (that should be the one pushing data to mysql) and I can see logs like
01:00:47 eventlogging_consumer-mysql-m4-master-00.log:2017-03-02 08:11:30,284 [9522] (MainThread) Inserted 625 ReadingDepth_16325045 events in 1.148235 seconds
01:01:04 that have a timestamp relatively recent (not sure about the data)
01:01:47 and I also see some Warning: Duplicate entry log for ReadingDepth, but it wouldn't explain this
01:04:18 at this point it might be related to what data is stored in kafka
01:05:18 <HaeB> elukey: interesting, any idea what "Duplicate entry log" might mean?
01:05:45 <elukey> nope, but it doesn't match the timeline that you put in the task..
01:05:58 so it probably is a red herring
[...]
01:12:40 <elukey> HaeB: I've started to tail the ReadingDepth topic on kafka and one of the last events has "timestamp": 1487880599
01:12:44 that is really weird
[...]
[...]
01:20:15 <elukey> so whatever is producing to the ReadingDepth topic is definitely adding events with a weird timestamp
[...]
01:29:32 <elukey> HaeB: all this theory is based on the assumption that my kafkacat command tails the Kafka topic logs. I *think* it does but I am not 100% sure
01:29:48 (otherwise I might be reading from the start of the log retention time)

Looks like an eventlogging (custom) replication issue; master has the latest events. It looks like events are rolling into the slave, but just lagging a lot. I'm not sure if we can do much about this other than wait for them to come in. The eventlogging MySQL setup is falling apart :( See also: T124307

In T155639#3068167, @Ottomata wrote:

Looks like an eventlogging (custom) replication issue; master has the latest events. It looks like events are rolling into the slave, but just lagging a lot. I'm not sure if we can do much about this other than wait for them to come in. The eventlogging MySQL setup is falling apart :( See also: T124307

As discussed on IRC - thanks for resolving this mystery! We'll just query master for now, but I hope the infrastructure can be fixed up even before the big transition that may happen next fiscal year.

Hi all. Below are initial results from the data quality checks (the queries and outputs are documented in the scratchpad @Tbayer linked). No significant anomalies have been found, so far, besides the ~16% of pageTokens appearing in only 1 row, which was kind of expected.

.01% of pageUnloaded actions have a negative value for totalLength

0% of pageUnloaded actions have a NULL value for totalLength

0% of duplicates in table (defined as having identical timestamp, event_totalLength , event_action, event_pageTitle, and event_pageToken fields)

.07% of pageUnloaded actions with implausibly large total length (over 14 days)

.62% of pageTokens appear in more than 2 rows (theoretically each pageToken should only appear in 2 rows - once in a pageLoaded action and once in a pageUnloaded action)

15.98% of pageTokens appear in only 1 row

For pageTokens which appear in only 1 row, 99% have a pageLoaded action (with no following pageUnloaded action) and 1% are pageUnloaded actions (with no prior pageLoaded action)
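The pageToken checks above boil down to counting rows per token and classifying the counts. The actual checks were run as SQL queries documented in the scratchpad; the following is only an illustrative sketch of the same logic over an in-memory array, with `summarisePageTokens` and the row shape as hypothetical names.

```javascript
// Hypothetical check over an array of { pageToken, action } rows.
function summarisePageTokens( rows ) {
  var counts = new Map();
  rows.forEach( function ( row ) {
    counts.set( row.pageToken, ( counts.get( row.pageToken ) || 0 ) + 1 );
  } );

  var single = 0;
  var moreThanTwo = 0;
  counts.forEach( function ( count ) {
    if ( count === 1 ) {
      single++;
    } else if ( count > 2 ) {
      moreThanTwo++;
    }
  } );

  var total = counts.size;
  return {
    totalTokens: total,
    // Tokens seen once: a pageLoaded without pageUnloaded, or vice versa.
    percentSingleRow: total ? 100 * single / total : 0,
    // Tokens seen more often than the expected loaded/unloaded pair.
    percentMoreThanTwoRows: total ? 100 * moreThanTwo / total : 0
  };
}
```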

In T155639#3069855, @Zareenf wrote:

.07% of pageUnloaded actions with implausibly large total length (over 14 days)

I used to work with folk who would have 50 or so tabs open on one monitor. Honestly, the thought of that many tabs open at once makes me 🤢 . They'd also never reboot their computers…

.62% of pageTokens appear in more than 2 rows (theoretically[,] each pageToken should only appear in 2 rows - once in a pageLoaded action and once in a pageUnloaded action)

For these events, are the timestamps of the events with duplicate pageTokens the same? The pageToken property is generated as follows:

var pageToken = mw.user.generateRandomSessionId()
  + Math.floor( mw.now() ).toString() // In the case of the ReadingDepth instrumentation, mw.now always returns a high resolution time <http://caniuse.com/#feat=high-resolution-time>.
  + mw.user.generateRandomSessionId();

The API requirements that ReadingDepth has – the UA must support the Beacon, Navigation Timing, and Page Visibility APIs – mean that mw.user.generateRandomSessionId will use [window.crypto.getRandomValues](http://caniuse.com/#feat=getrandomvalues) and not Math.random. IANAA (I Am Not An Analyst) but it feels like .62% is unusually high given how the pageToken is constructed.
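The intuition that accidental collisions should be rare can be checked with the standard birthday-bound approximation. The sketch below is generic: the thread does not state how many random bits each `mw.user.generateRandomSessionId` call contributes, so the 64 bits per call (128 in total) used in the usage note is purely an assumption, and `collisionProbability` is a hypothetical helper.

```javascript
// Birthday-bound approximation: probability that at least two of `n`
// uniformly random tokens with `bits` bits of entropy collide,
// p ≈ 1 - exp( -n(n-1) / (2 * 2^bits) ).
function collisionProbability( n, bits ) {
  var space = Math.pow( 2, bits );
  var exponent = -( n * ( n - 1 ) ) / ( 2 * space );
  // -expm1( x ) equals 1 - e^x and stays accurate for very small |x|.
  return -Math.expm1( exponent );
}
```

Under the assumed 128 bits, `collisionProbability( 1e9, 128 )` is far below one in a trillion, which suggests that repeated pageTokens at the 0.62% level would have to come from something other than random collisions, such as the same page sending several events.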

For pageTokens which appear in only 1 row, 99% have a pageLoaded action (with no following pageUnloaded action) and 1% are pageUnloaded actions (with no prior pageLoaded action)

I'm curious, is the first behaviour – pageLoaded not followed by pageUnloaded – isolated to a specific browser? Remembering that we only log events in UAs that support the Beacon API, this might be telling of a particular browser's implementation of the API.

Also, awesome!

Also, @Zareenf - have you had a chance to test the visibleLength? (the last portion of this task)

In T155639#3070304, @phuedx wrote:

In T155639#3069855, @Zareenf wrote:

.07% of pageUnloaded actions with implausibly large total length (over 14 days)

I used to work with folk who would have 50 or so tabs open on one monitor. Honestly, the thought of that many tabs open at once makes me 🤢 . They'd also never reboot their computers…

Right - in this case though, it would also have been implausible simply because the pageloaded event would have occurred before we even started to record data...

In T155639#3070519, @ovasileva wrote:

Also, @Zareenf - have you had a chance to test the visibleLength? (the last portion of this task)

Recall that this data only started to come in yesterday ;) But yes, the plan is to run essentially the same checks on this field too.

In T155639#3070304, @phuedx wrote:

In T155639#3069855, @Zareenf wrote:

...

For pageTokens which appear in only 1 row, 99% have a pageLoaded action (with no following pageUnloaded action) and 1% are pageUnloaded actions (with no prior pageLoaded action)

I'm curious, is the first behaviour – pageLoaded not followed by pageUnloaded – isolated to a specific browser? Remembering that we only log events in UAs that support the Beacon API, this might be telling of a particular browser's implementation of the API.

This is tricky to determine currently, but with T153207 getting closer to completion, we may soon be able to check this more easily.

• Tbayer added a comment. Mar 3 2017, 5:21 PM

This comment was removed by • Tbayer.

• Tbayer added a subscriber: leila. Mar 7 2017, 6:43 AM

Jdlrobson removed a project: Patch-For-Review. Mar 7 2017, 5:59 PM

In T155639#3070304, @phuedx wrote:

In T155639#3069855, @Zareenf wrote:

...

.62% of pageTokens appear in more than 2 rows (theoretically[,] each pageToken should only appear in 2 rows - once in a pageLoaded action and once in a pageUnloaded action)

For these events, are the timestamps of the events with duplicate pageTokens the same? The pageToken property is generated as follows:
var pageToken = mw.user.generateRandomSessionId()
  + Math.floor( mw.now() ).toString() // In the case of the ReadingDepth instrumentation, mw.now always returns a high resolution time <http://caniuse.com/#feat=high-resolution-time>.
  + mw.user.generateRandomSessionId();
The API requirements that ReadingDepth has – the UA must support the Beacon, Navigation Timing, and Page Visibility APIs – mean that mw.user.generateRandomSessionId will use [window.crypto.getRandomValues](http://caniuse.com/#feat=getrandomvalues) and not Math.random. IANAA (I Am Not An Analyst) but it feels like .62% is unusually high given how the pageToken is constructed.

This may still be worth a closer look, but note that the 0.62% includes cases where more than one unloaded event was sent. @Zareenf checked two outliers with 376 and 161 events for the same pagetoken, respectively, and in both cases there was a single loaded event with the rest being unloaded events (see the doc).

It's possible the user closes the tab while they are offline, no? In which case we will never record an unloaded event. I spend a lot of time closing my many tabs while I'm trying to obtain the wifi password at cafes, for example.

Likewise a dropped connection going through a tunnel could mean we see an unloaded event but no loaded event.

Here are some of the analogous checks for visibleLength instead of totalLength:

0.06% of pageUnloaded actions have a negative value for visibleLength.
0% of pageUnloaded actions have a NULL value for visibleLength.

Also, a concrete look at some values being sent, with a few (not entirely unexpected) outliers, e.g. 94 hours. Overall I would say these look plausible at first glance, will do a full histogram later.

SELECT 100*SUM(IF(event_visibleLength < 0,1,0))/SUM(1) AS percent_negative, SUM(1) AS total FROM log.ReadingDepth_16325045 WHERE timestamp LIKE '2017031%' AND event_action = 'pageUnloaded';
+------------------+--------+
| percent_negative | total  |
+------------------+--------+
|           0.0570 | 503212 |
+------------------+--------+
1 row in set (13.04 sec)

SELECT 100*SUM(IF(event_visibleLength IS NULL,1,0))/SUM(1) AS percent_null, SUM(1) AS total FROM log.ReadingDepth_16325045 WHERE timestamp LIKE '2017031%' AND event_action = 'pageUnloaded';
+--------------+--------+
| percent_null | total  |
+--------------+--------+
|       0.0000 | 503303 |
+--------------+--------+
1 row in set (8.85 sec)


SELECT event_visibleLength FROM log.ReadingDepth_16325045 WHERE timestamp LIKE '20170314%' AND event_action = 'pageUnloaded' LIMIT 100;
+---------------------+
| event_visibleLength |
+---------------------+
|               14099 |
|                9127 |
|               20164 |
|              144599 |
|                7800 |
|                1076 |
|                8255 |
|                1089 |
|             4714291 |
|               26608 |
|              195695 |
|               48969 |
|                4005 |
|               20888 |
|               84139 |
|                2596 |
|                7252 |
|               12644 |
|              117665 |
|               35160 |
|              253555 |
|               64707 |
|                2087 |
|                6403 |
|               99164 |
|               15694 |
|              163069 |
|               14177 |
|                3731 |
|              127416 |
|                4398 |
|                7959 |
|               34009 |
|              361663 |
|               70445 |
|               52429 |
|                8066 |
|               13722 |
|              161686 |
|               52060 |
|              440117 |
|               45374 |
|                6619 |
|               15529 |
|               70021 |
|               42383 |
|                9844 |
|               16319 |
|               18054 |
|               24527 |
|               23078 |
|               86738 |
|               33087 |
|               10788 |
|              135084 |
|            18293467 |
|                3370 |
|              260344 |
|               28035 |
|              105607 |
|            11307384 |
|               27704 |
|               30911 |
|               54168 |
|               20984 |
|              221244 |
|                2821 |
|               41124 |
|               62511 |
|               42242 |
|               41996 |
|               13823 |
|               72095 |
|              314622 |
|               51387 |
|                2373 |
|               52554 |
|              214603 |
|               31520 |
|              156229 |
|              165898 |
|               40175 |
|               18679 |
|                3170 |
|              506495 |
|               10323 |
|               20769 |
|               13163 |
|              211378 |
|               19101 |
|                8252 |
|               84892 |
|               36456 |
|              144721 |
|               11893 |
|                4922 |
|             1738815 |
|              660256 |
|               37412 |
|           338753445 |
+---------------------+
100 rows in set (0.00 sec)
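Since event_visibleLength is recorded in milliseconds, the outliers in this sample are easier to judge after conversion to hours. A small illustrative helper (`msToHours` is a hypothetical name) applied to the two largest values listed above:

```javascript
// Convert a duration in milliseconds to hours.
function msToHours( ms ) {
  return ms / ( 60 * 60 * 1000 );
}

console.log( msToHours( 338753445 ).toFixed( 1 ) ); // "94.1"
console.log( msToHours( 18293467 ).toFixed( 1 ) );  // "5.1"
```

The largest value in the sample corresponds to the roughly 94 hours mentioned above.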

In T155639#3098591, @Jdlrobson wrote:

It's possible the user closes the tab while they are offline, no? In which case we will never record an unloaded event. I spend a lot of time closing my many tabs while I'm trying to obtain the wifi password at cafes, for example.

Likewise a dropped connection going through a tunnel could mean we see an unloaded event but no loaded event.

Yes, absolutely - recall that this was part of our motivation to include the pageloaded events in the schema (T155639#2974220 etc.)

Per discussion with @ovasileva I'm closing this task now and opening a separate task for the remaining data quality checks, because they exceed a mere sign-off, and should be part of Reading-analysis rather than the web sprints.

To recap: The schema is up, sending data, and passed several initial data quality checks, with some minor problems uncovered - concretely, negative values being sent for visibleLength and totalLength in a small number of cases, and too many events being sent for the same pageToken (also rarely, although it may need to be taken into account during analysis).

• Tbayer mentioned this in T160492: Conduct further data quality checks on the ReadingDepth schema. Mar 15 2017, 3:55 AM

• Tbayer mentioned this in T156844: Decommission old dbstore hosts (db1046, db1047). Mar 17 2017, 10:58 PM

• Tbayer mentioned this in T174815: Schema:Popups suddenly stopped logging events in MariaDB, but they are still being sent according to Grafana. Sep 4 2017, 9:48 PM

Capturing a comment on the ticket from @Krinkle - are we seeing any issues with the pageToken ? It seems like we construct a pageToken from two calls to user.generateRandomSessionId() and the current timestamp which seems very strange.

@Jdlrobson Should this be a separate task?

@Krinkle yes: this task reappearing a year after its deploy would confuse people. I've created T191248 so it can be talked about.

MBinder_WMF moved this task from 2016-17 Q3 to 2017-18 Q4 on the Web-Team-Backlog board. Jul 4 2018, 6:03 PM

Krinkle mentioned this in T214444: Update ReadingDepth instrumentation to avoid deprecated schema module (blocks loads event). Jan 22 2019, 10:48 PM

ovasileva edited projects, added Reading Depth; removed MW-1.29-release (WMF-deploy-2017-02-28_(1.29.0-wmf.14)). Feb 25 2019, 4:33 PM

Krinkle mentioned this in T229042: Reading_depth: deactivate eventlogging instrumentation. Sep 9 2020, 1:14 AM

	F5642428: Reading depth schema whiteboarding.jpg
	Feb 15 2017, 4:53 PM

	F5626713: P_20170213_153856_vHDR_Auto.jpg
	Feb 14 2017, 12:45 AM

	F5658108: visibility.gif
	Feb 16 2017, 8:57 PM
