Split off from T155639.
More details on this work are documented [[ https://meta.wikimedia.org/wiki/Research_talk:Reading_time/Work_log/2018-09-25#Consistency_Checks | here ]] and [[https://meta.wikimedia.org/wiki/Research_talk:Reading_time | here]].
E.g.
[x] Having found that there are sometimes page tokens (pageviews) with > 2 events, check that at least there is only one pageloaded event among them each time
In 0.0111% of events, pageloaded happened more than once.
[X] Check if such pageviews (tokens) with multiple unloaded events will need to be filtered out during analysis, or whether their impact is likely small enough to be ignored
0.311% of pageviews have more than one unloaded event. We are filtering them out in the [[ https://meta.wikimedia.org/wiki/Research:Reading_time | Reading time project]].
[X] Consistency checks: `totalLength` and `visibleLength` should be less (apart from rounding errors) than the difference between timestamps of the loaded and unloaded events (@zareenf already worked on these)
In about 7% of cases these lengths are more than 2 seconds greater than the difference between the timestamps, and in about 1.5% of cases they are more than 5 seconds greater.
We also [[ https://meta.wikimedia.org/wiki/Research_talk:Reading_time/Work_log/2018-09-25#/media/File:LengthErrorDistribution.png | observe periodicity ]] in the distribution of these errors, with a period of about 40 seconds. We are investigating this further.
[X] Generate histogram for `totalLength`
[X] Generate histograms for `visibleLength`
[[ https://meta.wikimedia.org/wiki/Research_talk:Reading_time/Work_log/2018-09-25#/media/File:WP_ReadingTime_Distributions.png | Here's a plot of the distribution of the lengths (logged) ]]
[ ] Investigate cause of periodicity in discrepancies between total/visibleLength and event timestamps.
[ ] Investigate cause of total/visibleLength values that are negative or very large.
(See also [[https://docs.google.com/document/d/15PDd09AbFlrcr9hYWWcrqi7hup71jSUjxfywjXAZ2QU/edit | notes]])
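
For anyone re-running the checks above, here is a minimal sketch of how they could be expressed. It is not the code used for the numbers reported in this task. The event layout is an assumption: the keys `token`, `action` (`loaded` / `unloaded`) and `timestamp` are hypothetical names, and `totalLength` / `visibleLength` are assumed to be in the same unit (seconds) as the timestamps, which may not match the real schema.

```python
from collections import defaultdict


def check_pageviews(events, tolerance=2.0):
    """Group events by page token and flag inconsistent pageviews.

    Each event is a dict with hypothetical keys: 'token', 'action'
    ('loaded' or 'unloaded'), 'timestamp' (seconds) and, on unloaded
    events, 'totalLength' and 'visibleLength' (assumed to be seconds).
    """
    by_token = defaultdict(list)
    for ev in events:
        by_token[ev["token"]].append(ev)

    multi_loaded = []    # tokens with more than one loaded event
    multi_unloaded = []  # tokens with more than one unloaded event
    length_errors = {}   # token -> seconds by which a length exceeds elapsed time

    for token, evs in by_token.items():
        loaded = [e for e in evs if e["action"] == "loaded"]
        unloaded = [e for e in evs if e["action"] == "unloaded"]
        if len(loaded) > 1:
            multi_loaded.append(token)
        if len(unloaded) > 1:
            multi_unloaded.append(token)
        # Only compare lengths for clean pageviews: one loaded, one unloaded.
        if len(loaded) == 1 and len(unloaded) == 1:
            elapsed = unloaded[0]["timestamp"] - loaded[0]["timestamp"]
            longest = max(unloaded[0]["totalLength"], unloaded[0]["visibleLength"])
            excess = longest - elapsed
            if excess > tolerance:
                length_errors[token] = excess

    return multi_loaded, multi_unloaded, length_errors


# Tiny made-up sample: 'a' is consistent, 'b' reports a totalLength that is
# 5 seconds longer than the elapsed time, 'c' has two unloaded events.
sample = [
    {"token": "a", "action": "loaded", "timestamp": 0.0},
    {"token": "a", "action": "unloaded", "timestamp": 10.0,
     "totalLength": 9.5, "visibleLength": 8.0},
    {"token": "b", "action": "loaded", "timestamp": 0.0},
    {"token": "b", "action": "unloaded", "timestamp": 10.0,
     "totalLength": 15.0, "visibleLength": 9.0},
    {"token": "c", "action": "loaded", "timestamp": 0.0},
    {"token": "c", "action": "unloaded", "timestamp": 5.0,
     "totalLength": 5.0, "visibleLength": 5.0},
    {"token": "c", "action": "unloaded", "timestamp": 6.0,
     "totalLength": 6.0, "visibleLength": 6.0},
]
multi_loaded, multi_unloaded, length_errors = check_pageviews(sample)
```

The `tolerance` parameter corresponds to the 2 second and 5 second thresholds mentioned above. Pageviews with multiple loaded or unloaded events are excluded from the length comparison, since it is not clear which pair of timestamps to use for them.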