Conduct further data quality checks on the ReadingDepth schema
Closed, Resolved, Public

Description

Split off from T155639.
More details on this work are documented here and here.

E.g.

  • Having found that there are sometimes page tokens (pageviews) with > 2 events, check that at least there is only one pageloaded event among them each time

Page loaded happened more than once in 0.0111% of events.

  • Check if such pageviews (tokens) with multiple unloaded events will need to be filtered out during analysis, or whether their impact is likely small enough to be ignored

0.311% of pageviews have more than one unloaded event. We are filtering them out in the Reading time project.
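
The two checks above amount to counting events per page token and action. A minimal Python sketch, assuming each event is a dict with the hypothetical keys `pageToken` and `action`, and assuming the action names are `pageLoaded` and `pageUnloaded` (only the latter appears in a query on this task). The sample data is made up for illustration.

```python
from collections import Counter


def tokens_with_repeats(events, action):
    """Return the set of page tokens that have more than one event of `action`."""
    counts = Counter(e["pageToken"] for e in events if e["action"] == action)
    return {token for token, n in counts.items() if n > 1}


def drop_repeated_unloads(events):
    """Drop every event that belongs to a token with multiple pageUnloaded events."""
    bad = tokens_with_repeats(events, "pageUnloaded")
    return [e for e in events if e["pageToken"] not in bad]


# Made-up sample: token "b" has two pageUnloaded events.
sample = [
    {"pageToken": "a", "action": "pageLoaded"},
    {"pageToken": "a", "action": "pageUnloaded"},
    {"pageToken": "b", "action": "pageLoaded"},
    {"pageToken": "b", "action": "pageUnloaded"},
    {"pageToken": "b", "action": "pageUnloaded"},
]

print(tokens_with_repeats(sample, "pageUnloaded"))
print(len(drop_repeated_unloads(sample)))
```

Filtering by token rather than by event removes the whole pageview, which matches the idea of excluding such pageviews from analysis rather than picking one of the duplicate events.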

  • Consistency checks: totalLength and visibleLength should be less (apart from rounding errors) than the difference between timestamps of the loaded and unloaded events (@Zareenf already worked on these)

In about 7% of cases, these lengths are more than 2 seconds greater than the difference between the timestamps, and in about 1.5% of cases they are more than 5 seconds greater.

We also observe a periodicity of about 40 seconds in the distribution of errors. We are investigating this further.
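
The consistency check can be sketched as follows. This is an illustration only: it assumes the reported lengths and the loaded-to-unloaded timestamp differences have already been converted to seconds and paired up per pageview, and the function name and sample values are hypothetical.

```python
def share_exceeding(lengths_s, elapsed_s, threshold_s):
    """Fraction of pageviews whose reported length exceeds the
    loaded-to-unloaded timestamp difference by more than threshold_s."""
    if len(lengths_s) != len(elapsed_s):
        raise ValueError("inputs must have the same length")
    if not lengths_s:
        return 0.0
    over = sum(
        1 for length, elapsed in zip(lengths_s, elapsed_s)
        if length - elapsed > threshold_s
    )
    return over / len(lengths_s)


# Made-up values: reported lengths and timestamp differences, in seconds.
lengths = [10.0, 33.0, 7.5, 60.0]
elapsed = [10.2, 30.0, 1.0, 59.5]

print(share_exceeding(lengths, elapsed, 2))
print(share_exceeding(lengths, elapsed, 5))
```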

  • Generate histogram for totalLength
  • Generate histograms for visibleLength

Here's a plot of the distribution of the lengths (logged)
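
For readers without the plot, a logged histogram can be approximated by counting values per power of ten. A rough sketch, not the method used for the plot above: the function name is hypothetical, and non-positive values are skipped because their logarithm is undefined.

```python
import math
from collections import Counter


def log10_histogram(values):
    """Count positive values per decade: bin k holds values in [10**k, 10**(k+1))."""
    return Counter(math.floor(math.log10(v)) for v in values if v > 0)


# Made-up lengths; the negative value is ignored.
sample_lengths = [0.5, 3, 12, 45, 800, 2500, -4]
hist = log10_histogram(sample_lengths)

for decade in sorted(hist):
    print(f"10^{decade}: {hist[decade]}")
```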

  • Investigate cause of periodicity in discrepancies between total/visibleLength and event timestamps.

This was an artifact of a bug parsing datetimes.
This plot shows the distribution of discrepancies with the bug fixed.

  • Investigate cause of total/visibleLength values that are negative or very large.

We did not see any patterns of negative values. However, we do observe a still-unexplained periodicity in the frequency of negative values.
Only 0.0019% of events have a negative total length and 0.010% of events have a negative visible length.
Only 2.77% of events have a total length greater than 1 hour, and 1.03% of events have a visible length greater than 1 hour.
Only 0.189% of events have a total length greater than 12 hours, and 0.460% of events have a visible length greater than 12 hours.

As these numbers are quite small, especially in the negative cases, I do not feel that investigating this matter is urgent.
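
Shares like the ones quoted above can be computed along these lines. A minimal sketch, assuming the lengths are expressed in seconds; the function name, dictionary keys, and sample values are hypothetical.

```python
HOUR_S = 3600  # seconds per hour


def out_of_range_shares(lengths_s):
    """Fractions of lengths that are negative, over 1 hour, and over 12 hours."""
    n = len(lengths_s)
    if n == 0:
        raise ValueError("need at least one value")
    return {
        "negative": sum(v < 0 for v in lengths_s) / n,
        "over_1h": sum(v > HOUR_S for v in lengths_s) / n,
        "over_12h": sum(v > 12 * HOUR_S for v in lengths_s) / n,
    }


# Made-up lengths in seconds.
sample = [-1.0, 5.0, 120.0, 4000.0, 50000.0, 30.0, 12.0, 8.0]
shares = out_of_range_shares(sample)
print(shares)
```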

(See also notes)

Event Timeline

Since this task was filed, the data from this schema has occasionally been explored further in various ways. To leave another little result here that serves as a quality check: totalLength is always greater than or equal to visibleLength, and equal (i.e. tab was visible during the entire pageview) more than 2/3 of the time.

SELECT lengthsign, COUNT(*) AS views
FROM (
  SELECT SIGN(event.totalLength - event.visibleLength) AS lengthsign
  FROM event.readingdepth
  WHERE year = 2018 AND month = 8 AND day <= 28
    AND event.action = 'pageUnloaded'
) AS viewslist
GROUP BY lengthsign;

lengthsign	views
0.0	4011285
1.0	1851103
2 rows selected (34.339 seconds)
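
The "more than 2/3" figure can be checked directly from the two counts in the result. The absence of a row with lengthsign -1.0 is what shows that totalLength was never smaller than visibleLength in this sample. The variable names below are illustrative.

```python
equal_views = 4011285    # lengthsign = 0.0: totalLength == visibleLength
greater_views = 1851103  # lengthsign = 1.0: totalLength > visibleLength

total_views = equal_views + greater_views
share_equal = equal_views / total_views

print(total_views)
print(f"{share_equal:.1%}")  # about 68.4%, which is above 2/3
```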
kzimmerman subscribed.

@Groceryheist closing this as resolved as it looks like the work laid out in the task has been done.