- median visibleLength
- average visibleLength
- other quartiles and the 95th percentile of visibleLength
- average logarithm of visibleLength
aggregated per hour or day (TBD).
(This selection is based on @Groceryheist's recommendations from the exploration of this data in the Reading time project and may still be refined. The purpose is to make these metrics available to be viewed as dashboards/charts in Turnilo - see T205562 - and Superset.)
The same metrics for the totalLength field would be nice to have as well. On the other hand, we can ignore the schema's other quantitative fields (firstPaintTime etc.) for this purpose.
Dimensions to be carried over:
- all relevant ones from the event capsule, including:
- browser (family, major, minor)
- OS (family, major, minor)
- webhost and wiki (project)
- country (but no smaller subdivisions)
Background: In T205562, we found that Druid does not provide the capability to calculate these aggregates directly while ingesting the original EventLogging table (event.readingdepth), hence the need for this task.
- Ignore/drop the high-cardinality fields: pageTitle (and pageID and revisionID once they have been added), pageToken, sessionToken
- This is based on the action = "pageUnloaded" events only (i.e. we can ignore pageLoaded events).
- Only use the data from the default sample (default_sample = true)
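
For illustration only, here is a minimal Python sketch of the aggregation described in this task; it is not a proposal for the actual job. It keeps only the pageUnloaded events from the default sample, buckets them per hour, and computes the listed visibleLength statistics per combination of dimensions. The key names `dt`, `wiki` and `country`, the choice of linear interpolation for percentiles, and the decision to skip non-positive values when averaging the logarithm are all assumptions made for this sketch.

```python
import math
from collections import defaultdict
from datetime import datetime
from statistics import mean, median


def percentile(sorted_values, p):
    """Linearly interpolated p-th percentile (0-100) of a sorted, non-empty list."""
    rank = (len(sorted_values) - 1) * p / 100
    lower = math.floor(rank)
    upper = math.ceil(rank)
    fraction = rank - lower
    return sorted_values[lower] + (sorted_values[upper] - sorted_values[lower]) * fraction


def aggregate(events, dimensions=("wiki", "country")):
    """Group filtered events by hour and dimensions, then summarize visibleLength.

    `events` is an iterable of dicts; the `dt` (ISO timestamp), `wiki` and
    `country` keys are hypothetical names chosen for this sketch.
    """
    groups = defaultdict(list)
    for event in events:
        # Only pageUnloaded events from the default sample are used.
        if event["action"] != "pageUnloaded" or not event["default_sample"]:
            continue
        # Truncate the timestamp to the hour (daily buckets would work the same way).
        hour = datetime.fromisoformat(event["dt"]).replace(
            minute=0, second=0, microsecond=0
        )
        key = (hour,) + tuple(event[d] for d in dimensions)
        groups[key].append(event["visibleLength"])

    result = {}
    for key, values in groups.items():
        values.sort()
        # Assumption: the logarithm is only averaged over strictly positive values.
        logs = [math.log(v) for v in values if v > 0]
        result[key] = {
            "count": len(values),
            "median": median(values),
            "mean": mean(values),
            "p25": percentile(values, 25),
            "p75": percentile(values, 75),
            "p95": percentile(values, 95),
            "mean_log": mean(logs) if logs else None,
        }
    return result
```

The same function could be applied to totalLength by swapping the field name. High-cardinality fields such as pageToken never enter the grouping key, so they are dropped implicitly.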