The Web team has deployed instrumentation to track reading depth on talk pages (T294777). The resulting events are stored in the mediawiki_reading_depth schema. The sample rate is 0.1% on English Wikipedia. We need to add some fields to the allowlist so that the data is not purged after 90 days.
Schema Documentation
Here is our proposal for event sanitization.
Proposal
What NOT to keep
- http
- meta
  - id
  - request_id
  - stream
- user_agent_map
What to keep
- access_method
- action: pageLoaded or pageUnloaded
- dom_interactive_time: Time (in milliseconds) until the DOM interactive event, the point at which the browser has finished parsing the HTML and DOM construction is complete.
- dt
- first_paint_time: Time (in milliseconds) until first paint, the point at which the first pixels are displayed to the user.
- is_anon
- meta
  - dt
  - domain
- page_length: Total length of the page text, rounded down so that only the leading digit is kept. Example values: 90000, 20000.
- page_namespace
- session_token
- skin
- total_length: Time (in milliseconds) from visibility_listeners_time to the point at which the page is unloaded.
- visibility_listeners_time: Time (in milliseconds) until the visibility event listeners were added to the document. These listeners enable tracking of the page's visibility as defined by the visibilitychange event.
- visible_length: Time (in milliseconds) from visibility_listeners_time to the point at which the page was unloaded, excluding the time during which the document was "hidden" as defined by the Page Visibility API.
- normalized_host
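
To make the intended effect of the proposal concrete, here is a minimal Python sketch of an allowlist filter applied to a single event. It is an illustration only and not the format of the actual sanitization allowlist. The names KEEP, sanitize and round_down_to_leading_digit are hypothetical, and the sketch assumes that the dt and domain entries after meta are sub-fields of meta (so meta.id, meta.request_id and meta.stream are dropped while meta.dt and meta.domain are kept). The rounding helper shows one possible reading of how page_length is bucketed, consistent with the example values 90000 and 20000.

```python
from typing import Any

# Hypothetical allowlist mirroring the "What to keep" list above.
# A nested dict selects sub-fields of a struct (assumed here for meta).
KEEP: dict[str, Any] = {
    "access_method": True,
    "action": True,
    "dom_interactive_time": True,
    "dt": True,
    "first_paint_time": True,
    "is_anon": True,
    "meta": {"dt": True, "domain": True},
    "page_length": True,
    "page_namespace": True,
    "session_token": True,
    "skin": True,
    "total_length": True,
    "visibility_listeners_time": True,
    "visible_length": True,
    "normalized_host": True,
}


def sanitize(event: dict[str, Any], keep: dict[str, Any] = KEEP) -> dict[str, Any]:
    """Return a copy of the event containing only allowlisted fields."""
    out: dict[str, Any] = {}
    for field, rule in keep.items():
        if field not in event:
            continue
        value = event[field]
        if isinstance(rule, dict):
            # Recurse into structs, keeping only their allowlisted sub-fields.
            if isinstance(value, dict):
                out[field] = sanitize(value, rule)
        else:
            out[field] = value
    return out


def round_down_to_leading_digit(n: int) -> int:
    """Keep only the leading digit of n, e.g. 94321 becomes 90000."""
    if n <= 0:
        return 0
    magnitude = 10 ** (len(str(n)) - 1)
    return (n // magnitude) * magnitude
```

With this sketch, an event that carries http, user_agent_map, or meta.request_id comes back without those fields, while fields such as action, session_token, and meta.domain are preserved.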