This is the backend task corresponding to the client-side instrumentation work described in T184793: [EPIC] Instrument page interactions.
As decided in the recent Analytics-l thread ("How best to accurately record page interactions in Page Previews"), we want to record and aggregate the events generated by this instrumentation in a form consistent with our existing content consumption measurement, concretely: the fields available in the https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/Pageview_hourly table. (Opinions still differ on whether we should extend the existing table - as e.g. outlined here - or create a separate one.)
Here is what this means in detail: we will record the fields corresponding to the following from pageview_hourly:
project string Project name from requests hostname
language_variant string Language variant from requests path (not set if present in project name)
page_title string Page Title from requests path and query
(for page previews, this will be extracted from the Eventlogging event instead)
access_method string Method used to access the pages, can be desktop, mobile web, or mobile app
This will always be desktop for now, although we may want to keep the field in case we want to extend this to e.g. measure link previews on the apps.
zero_carrier string Zero carrier if pageviews are accessed through one, null otherwise
This one is probably not essential, given that most zero pageviews happen on mobile web, and other considerations.
agent_type string Agent accessing the pages, can be spider or user
continent string Continent of the accessing agents (computed using maxmind GeoIP database)
country_code string Country iso code of the accessing agents (computed using maxmind GeoIP database)
country string Country (text) of the accessing agents (computed using maxmind GeoIP database)
subdivision string Subdivision of the accessing agents (computed using maxmind GeoIP database)
city string City iso code of the accessing agents (computed using maxmind GeoIP database)
user_agent_map map<string,string> User-agent map with device_family, browser_family, browser_major, os_family, os_major, os_minor and wmf_app_version keys and associated values
record_version string Keeps track of changes in the table content definition - https://wikitech.wikimedia.org/wiki/Analytics/Data/Pageview_hourly
view_count bigint number of pageviews
(i.e. here number of page previews, mutatis mutandis)
page_id int MediaWiki page_id for this page title. For redirects this could be the page_id of the redirect or the page_id of the target. This may not always be set, even if the page is actually a pageview.
namespace_id int MediaWiki namespace_id for this page title. [...]
year int Unpadded year of pageviews
month int Unpadded month of pageviews
day int Unpadded day of pageviews
hour int Unpadded hour of pageviews
The basic principle is that if a reader visits a page and then uses the page preview feature on that page to read preview cards, all the above metadata fields should have identical values for both the preview and the pageview (save for obvious exotic cases, like the reader travelling to another country between opening the page and opening the preview).
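As a rough illustration of this principle, here is a minimal Python sketch of a consistency check between a pageview record and a preview record. The function name `mismatched_fields`, the plain-dict record shape, and the choice of which fields count as request-derived are all assumptions made for the example; page-specific fields such as page_title are left out because the previewed page differs from the page being read.

```python
# Fields assumed (for illustration only) to come from the request itself
# rather than from the page being consumed.
REQUEST_DERIVED_FIELDS = (
    "project", "language_variant", "access_method", "zero_carrier",
    "agent_type", "continent", "country_code", "country",
    "subdivision", "city", "user_agent_map",
)


def mismatched_fields(pageview: dict, preview: dict) -> list:
    """Return the request-derived fields whose values differ between the two records."""
    return [
        name
        for name in REQUEST_DERIVED_FIELDS
        if pageview.get(name) != preview.get(name)
    ]


# Hypothetical records: same reader and request context, different page consumed.
pageview = {
    "project": "en.wikipedia",
    "access_method": "desktop",
    "agent_type": "user",
    "country_code": "DE",
    "page_title": "Berlin",
}
preview = dict(pageview, page_title="Spree")

# The exotic case from above: the reader changed country in between.
travelled = dict(preview, country_code="FR")
```

Here `mismatched_fields(pageview, preview)` returns an empty list, while `mismatched_fields(pageview, travelled)` reports only country_code.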
To the above, we will want to add the information about the page from which the preview is being viewed (basically the referrer), in the same format as for the page being consumed in preview form; say as source_page_title, source_page_id, and source_namespace_id.
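To make the intended shape of the aggregated data more concrete, here is a minimal Python sketch that rolls individual preview events up into hourly rows with a view_count, including the proposed source_* fields. The event dicts, the `aggregate_previews` name, and the subset of dimensions are assumptions for illustration, not the actual pipeline; user_agent_map is omitted only because a map cannot be used directly as a grouping key in this sketch.

```python
from collections import Counter

# Illustrative subset of grouping dimensions (an assumption, not the final schema).
DIMENSIONS = (
    "project", "page_title", "page_id", "namespace_id",
    "source_page_title", "source_page_id", "source_namespace_id",
    "access_method", "agent_type", "country_code",
    "year", "month", "day", "hour",
)


def aggregate_previews(events):
    """Count events per unique combination of dimension values."""
    counts = Counter(tuple(event.get(d) for d in DIMENSIONS) for event in events)
    return [dict(zip(DIMENSIONS, key), view_count=n) for key, n in counts.items()]


# Hypothetical events: two previews of "Spree" from "Berlin", one of "Havel".
base = {
    "project": "en.wikipedia", "access_method": "desktop", "agent_type": "user",
    "country_code": "DE", "source_page_title": "Berlin", "source_page_id": 1,
    "source_namespace_id": 0, "namespace_id": 0,
    "year": 2018, "month": 1, "day": 15, "hour": 9,
}
events = [
    dict(base, page_title="Spree", page_id=2),
    dict(base, page_title="Spree", page_id=2),
    dict(base, page_title="Havel", page_id=3),
]
rows = aggregate_previews(events)
```

With these sample events, `rows` contains two entries: the "Spree" row with a view_count of 2 and the "Havel" row with a view_count of 1, both carrying "Berlin" as source_page_title.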