While looking into missing events from MP data from android_product_metrics_article_link_preview_interaction vs MEP data from android_article_link_preview_interaction (Superset chart) @cjming and I wanted to ascertain why we are getting event data with NULL values in page column (where action = navigate these events should have associated page values).
Additionally, In comparing events we can see that MEP tracks events by wiki in column wiki_db and all action = navigate events have this value populated, the corresponding MP column/value should be mediawiki.database which also is populated even when page column values are NULL.
However, when querying to match event counts by this column value between MP and MEP we find lots of anomalies and missing values. Hypothetically where we have events between the 2 datasets these values should match. Meaning we get counts from MEP for wikis that are not present in MP mediawiki.database. This could just be a symptom of missing data but should be part of what we look into - i.e. are we missing data from specific wikis and why? There are 366 distinct wiki_dbs in the MEP event data and only 209 in the MP event data (WHERE action = navigate)
There multiple threads to investigate because of these findings.
Questions to answer:
- What is the cause for events missing values in page column where action = navigate and/or is this related tangentially to missing events from MP dataset?
- Where is the origination of the value in the Android event populated in MEP field wiki_db (it is my understanding that it is based on the wiki language) and how does it relate to the value populated in page.content_language? Is the value in MP mediawiki.database from the same origin as MEP field wiki_db?
Related query:
Show all rows from android_product_metrics_article_link_preview_interaction WHERE page values are NULL
SELECT * FROM event.android_product_metrics_article_link_preview_interaction WHERE year=2024 AND month >= 03 AND day >= 01 AND regexp_like(agent.app_version_name, '2.7.50481-r-') AND action = 'navigate' AND page.content_language IS NULL AND DATE(from_iso8601_timestamp(dt)) < current_date AND DATE(from_iso8601_timestamp(dt)) > DATE '2024-04-03'
Count for these rows is 5850594 for latest production version since release.
SELECT COUNT(*) FROM event.android_product_metrics_article_link_preview_interaction WHERE year=2024 AND month >= 03 AND day >= 01 AND regexp_like(agent.app_version_name, '2.7.50481-r-') AND action = 'navigate' AND page.content_language IS NULL AND DATE(from_iso8601_timestamp(dt)) < current_date AND DATE(from_iso8601_timestamp(dt)) > DATE '2024-04-03'
Notably this query (WHERE mediawiki.database IS NULL) yields 0 results, so even while page is NULL there is record of event-related mediawiki.database - why is there a record for wiki but not actual page?
SELECT * FROM event.android_product_metrics_article_link_preview_interaction WHERE year=2024 AND month >= 03 AND day >= 01 AND regexp_like(agent.app_version_name, '2.7.50481-r-') AND action = 'navigate' AND page.content_language IS NULL AND mediawiki.database IS NULL AND DATE(from_iso8601_timestamp(dt)) < current_date AND DATE(from_iso8601_timestamp(dt)) > DATE '2024-04-03'
Related Android task: https://phabricator.wikimedia.org/T361267