Split off from T155639.
More details on this work are documented [[ https://meta.wikimedia.org/wiki/Research_talk:Reading_time/Work_log/2018-09-25#Consistency_Checks | here ]] and [[https://meta.wikimedia.org/wiki/Research_talk:Reading_time | here]].
E.g.
[x] Having found that there are sometimes page tokens (pageviews) with > 2 events, check that at least there is only one pageloaded event among them each time
Page loaded happened more than once in 0.0111% of events.
[X] Check if such pageviews (tokens) with multiple unloaded events will need to be filtered out during analysis, or whether their impact is likely small enough to be ignored
0.311% of pageviews have more than one unloaded event. We are filtering them out in the [[ https://meta.wikimedia.org/wiki/Research:Reading_time | Reading time project]].
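A minimal sketch of how the two checks above could be computed, not the code actually used for these numbers. It assumes each event is a dict with a `pageToken` field and an `action` field whose values are `pageLoaded` or `pageUnloaded`; those field names and values, and the helpers `share_with_repeats` and `drop_repeated_unloads`, are assumptions made for illustration.

```python
from collections import Counter


def share_with_repeats(events, action):
    """Fraction of page tokens that have more than one event of `action`.

    `events` is a list of dicts with hypothetical keys "pageToken" and "action".
    """
    tokens = {e["pageToken"] for e in events}
    counts = Counter(e["pageToken"] for e in events if e["action"] == action)
    repeated = sum(1 for n in counts.values() if n > 1)
    return repeated / len(tokens) if tokens else 0.0


def drop_repeated_unloads(events):
    """Remove every event belonging to a token with more than one unload event."""
    counts = Counter(
        e["pageToken"] for e in events if e["action"] == "pageUnloaded"
    )
    return [e for e in events if counts[e["pageToken"]] <= 1]
```

Note that this computes the share per page token (pageview), which matches how the unloaded figure is reported; a per-event share would need a different denominator.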
[X] Consistency checks: `totalLength` and `visibleLength` should be less (apart from rounding errors) than the difference between timestamps of the loaded and unloaded events (@zareenf already worked on these)
In about 7% of cases these lengths are more than 2 seconds greater than the difference between the timestamps, and in about 1.5% of cases they are more than 5 seconds greater.
We also [[ https://meta.wikimedia.org/wiki/Research_talk:Reading_time/Work_log/2018-09-25#/media/File:LengthErrorDistribution.png | observe periodicity ]] in the distribution of errors, with a period of about 40 seconds. We are investigating this further.
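The consistency check can be sketched roughly as follows. This is an illustration under stated assumptions, not the analysis code itself: it assumes the lengths are recorded in milliseconds and that the loaded and unloaded timestamps have already been parsed into `datetime` objects. The helper names `length_excess_seconds` and `share_exceeding` are hypothetical.

```python
from datetime import datetime, timezone


def length_excess_seconds(loaded_at, unloaded_at, length_ms):
    """Seconds by which a reported length exceeds the elapsed time.

    A positive result means the length (assumed to be in milliseconds) is
    longer than the gap between the loaded and unloaded timestamps.
    """
    elapsed = (unloaded_at - loaded_at).total_seconds()
    return length_ms / 1000.0 - elapsed


def share_exceeding(excesses, threshold_seconds):
    """Fraction of excess values strictly greater than the threshold."""
    if not excesses:
        return 0.0
    over = sum(1 for x in excesses if x > threshold_seconds)
    return over / len(excesses)
```

Calling `share_exceeding` with thresholds of 2 and 5 seconds would give figures comparable to the percentages quoted above, provided the unit assumption holds.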
[X] Generate histogram for `totalLength`
[X] Generate histograms for `visibleLength`
[[ https://meta.wikimedia.org/wiki/Research_talk:Reading_time/Work_log/2018-09-25#/media/File:WP_ReadingTime_Distributions.png | Here's a plot of the distribution of the lengths (logged) ]]
[X] Investigate cause of periodicity in discrepancies between total/visibleLength and event timestamps.
This was an artifact of a bug parsing datetimes.
[[https://upload.wikimedia.org/wikipedia/commons/thumb/9/95/Plot_reading_time_discrepencies_overall.png/600px-Plot_reading_time_discrepencies_overall.png | This plot shows the distribution of discrepancies with the bug fixed.]]
[X] Investigate cause of total/visibleLength values that are negative or very large.
We did not see any patterns of negative values. However, we do observe a still-unexplained periodicity in the frequency of negative values.
Only 0.0019% of events have a negative `totalLength` and 0.010% of events have a negative `visibleLength`.
Only 2.77% of events have a total length greater than 1 hour, and 1.03% of events have a visible length greater than 1 hour.
Only 0.189% of events have a total length greater than 12 hours, and 0.460% of events have a visible length greater than 12 hours.
As these numbers are quite small, especially in the negative cases, I do not feel that investigating this matter is urgent.
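For reference, shares like the ones above could be tabulated with something along these lines. This is only a sketch: it assumes the length values are in milliseconds, and `summarize_lengths` is a hypothetical helper, not part of the project code.

```python
HOUR_MS = 60 * 60 * 1000  # one hour, assuming lengths are in milliseconds


def summarize_lengths(lengths_ms):
    """Return the share of values that are negative, over 1 hour, or over 12 hours."""
    n = len(lengths_ms)
    if n == 0:
        return {"negative": 0.0, "over_1h": 0.0, "over_12h": 0.0}
    return {
        "negative": sum(v < 0 for v in lengths_ms) / n,
        "over_1h": sum(v > HOUR_MS for v in lengths_ms) / n,
        "over_12h": sum(v > 12 * HOUR_MS for v in lengths_ms) / n,
    }
```

The same function would be applied separately to the `totalLength` and `visibleLength` columns.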
(See also [[https://docs.google.com/document/d/15PDd09AbFlrcr9hYWWcrqi7hup71jSUjxfywjXAZ2QU/edit | notes]])