Conduct further data quality checks on the ReadingDepth schema
Closed, Resolved · Public

Description

Split off from T155639.
More details on this work are documented here and here.

E.g.

  • Having found that there are sometimes page tokens (pageviews) with > 2 events, check that at least there is only one pageloaded event among them each time

For 0.0111% of events, the page loaded event happened more than once.

  • Check if such pageviews (tokens) with multiple unloaded events will need to be filtered out during analysis, or whether their impact is likely small enough to be ignored

0.311% of pageviews have more than one unloaded event. We are filtering them out in the Reading time project.
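
The two checks above amount to counting events per page token and action, then dropping the tokens that repeat. A minimal sketch in Python, assuming each event is reduced to a (page token, action) pair and that the actions are named pageLoaded and pageUnloaded; the function name and the sample data are hypothetical and only illustrate the filtering step:

```python
from collections import Counter

def tokens_with_repeated_action(events, action):
    """Return the set of page tokens that have more than one event of `action`."""
    counts = Counter(token for token, act in events if act == action)
    return {token for token, n in counts.items() if n > 1}

# Hypothetical sample: token "b" has two pageUnloaded events.
sample_events = [
    ("a", "pageLoaded"), ("a", "pageUnloaded"),
    ("b", "pageLoaded"), ("b", "pageUnloaded"), ("b", "pageUnloaded"),
    ("c", "pageLoaded"),
]

repeated_unloads = tokens_with_repeated_action(sample_events, "pageUnloaded")
all_tokens = {token for token, _ in sample_events}

# Share of pageviews (tokens) affected, and the events that survive filtering.
share_affected = len(repeated_unloads) / len(all_tokens)
clean_events = [(t, a) for t, a in sample_events if t not in repeated_unloads]
```

Filtering by token rather than by individual event removes the whole pageview, so no partial pageview is left in the analysis set.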

  • Consistency checks: totalLength and visibleLength should be less (apart from rounding errors) than the difference between timestamps of the loaded and unloaded events (@Zareenf already worked on these)

In about 7% of cases, these lengths are more than 2 seconds greater than the difference between the timestamps, and in about 1.5% of cases they are more than 5 seconds greater.

We also observe a periodicity of about 40 seconds in the distribution of these errors. We are investigating this further.
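
The consistency check above can be sketched as follows. This is an illustration under stated assumptions, not the analysis code used for this task: it assumes the lengths are reported in milliseconds, that the loaded and unloaded timestamps are already parsed into datetime objects, and the helper names and sample rows are hypothetical.

```python
from datetime import datetime

def length_excess_seconds(loaded_at, unloaded_at, length_ms):
    """Seconds by which a reported length exceeds the loaded-to-unloaded gap."""
    elapsed = (unloaded_at - loaded_at).total_seconds()
    return length_ms / 1000.0 - elapsed

def share_exceeding(rows, threshold_s):
    """Fraction of rows whose reported length exceeds the gap by more than threshold_s."""
    excesses = [length_excess_seconds(*row) for row in rows]
    return sum(e > threshold_s for e in excesses) / len(excesses)

# Hypothetical rows: (loaded timestamp, unloaded timestamp, totalLength in ms)
rows = [
    (datetime(2017, 3, 1, 12, 0, 0), datetime(2017, 3, 1, 12, 0, 30), 29_500),  # consistent
    (datetime(2017, 3, 1, 12, 0, 0), datetime(2017, 3, 1, 12, 0, 30), 33_000),  # 3 s too long
    (datetime(2017, 3, 1, 12, 0, 0), datetime(2017, 3, 1, 12, 1, 0), 66_000),   # 6 s too long
    (datetime(2017, 3, 1, 12, 0, 0), datetime(2017, 3, 1, 12, 1, 0), 60_400),   # rounding-level
]

share_over_2s = share_exceeding(rows, 2)
share_over_5s = share_exceeding(rows, 5)
```

Using a threshold of a few seconds rather than zero keeps rounding-level differences from being counted as inconsistencies.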

  • Generate histogram for totalLength
  • Generate histograms for visibleLength

Here's a plot of the distribution of the lengths (logged)

  • Investigate cause of periodicity in discrepancies between total/visibleLength and event timestamps.

This was an artifact of a bug parsing datetimes.
This plot shows the distribution of discrepancies with the bug fixed.

  • Investigate cause of total/visibleLength values that are negative or very large.

We did not see any patterns of negative values. However, we do observe a still-unexplained periodicity in the frequency of negative values.
Only 0.0019% of events have a negative totalLength and 0.010% of events have a negative visibleLength.
Only 2.77% of events have a total length greater than 1 hour, and 1.03% of events have a visible length greater than 1 hour.
Only 0.189% of events have a total length greater than 12 hours, and 0.460% of events have a visible length greater than 12 hours.

As these numbers are quite small, especially in the negative cases, I do not feel that investigating this matter is urgent.
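
Shares like the ones above can be computed with a simple predicate count. A minimal sketch, assuming the lengths are in milliseconds; the `share` helper and the sample values are hypothetical and unrelated to the real data:

```python
HOUR_MS = 60 * 60 * 1000  # one hour, assuming lengths are in milliseconds

def share(values, predicate):
    """Fraction of values for which predicate(value) is true."""
    return sum(1 for v in values if predicate(v)) / len(values)

# Hypothetical totalLength values (ms): one negative, two above 1 hour, one above 12 hours.
sample_total_lengths_ms = [
    -120, 4_000, 95_000, 2 * HOUR_MS, 13 * HOUR_MS, 30_000, 600_000, 1_200,
]

share_negative = share(sample_total_lengths_ms, lambda v: v < 0)
share_over_1h = share(sample_total_lengths_ms, lambda v: v > HOUR_MS)
share_over_12h = share(sample_total_lengths_ms, lambda v: v > 12 * HOUR_MS)
```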

(See also notes)

Event Timeline

Restricted Application added a subscriber: Aklapper. · Mar 15 2017, 3:55 AM
chelsyx moved this task from Triage to Next Up on the Product-Analytics board. · May 24 2018, 8:23 PM

Since this task was filed, the data from this schema has occasionally been explored further in various ways. To leave another little result here that serves as a quality check: totalLength is always greater than or equal to visibleLength, and equal (i.e. tab was visible during the entire pageview) more than 2/3 of the time.

SELECT lengthsign, COUNT(*) AS views
FROM (
  SELECT SIGN(event.totalLength - event.visibleLength) AS lengthsign
  FROM event.readingdepth
  WHERE year = 2018 AND month = 8 AND day <= 28
    AND event.action = 'pageUnloaded'
) AS viewslist
GROUP BY lengthsign;

lengthsign	views
0.0	4011285
1.0	1851103
2 rows selected (34.339 seconds)
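
As a quick arithmetic check of the "more than 2/3" statement, the two counts from the query output can be turned into a share. The variable names are illustrative; the numbers are copied from the result above.

```python
# Counts copied from the query output above.
views_equal = 4011285         # lengthsign 0.0: totalLength == visibleLength
views_total_longer = 1851103  # lengthsign 1.0: totalLength > visibleLength

share_equal = views_equal / (views_equal + views_total_longer)
print(f"{share_equal:.1%}")  # 68.4%
```

The output contains no row for lengthsign -1.0, which is what supports the statement that totalLength was never smaller than visibleLength in this sample.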
Tbayer moved this task from Next Up to Doing on the Product-Analytics board. · Sep 27 2018, 8:25 PM
Groceryheist updated the task description. · Nov 17 2018, 9:59 PM
kzimmerman moved this task from Doing to Tracking on the Product-Analytics board. · Jun 26 2019, 11:41 PM
kzimmerman closed this task as Resolved. · Aug 20 2019, 9:35 PM
kzimmerman added a subscriber: kzimmerman.

@Groceryheist closing this as resolved, as it looks like the work laid out in the task has been done.