
Investigate MP `page` column NULL values for events where we expect that data to be populated
Closed, Resolved · Public · 5 Estimated Story Points

Description

While looking into events missing from the MP table android_product_metrics_article_link_preview_interaction compared with the MEP table android_article_link_preview_interaction (Superset chart), @cjming and I wanted to determine why we are getting event data with NULL values in the page column. Where action = navigate, these events should have associated page values.

Additionally, in comparing events we can see that MEP tracks events by wiki in the column wiki_db, and all action = navigate events have this value populated. The corresponding MP column should be mediawiki.database, which is also populated even when page column values are NULL.

However, when we query to match event counts by this column between MP and MEP, we find many anomalies and missing values. In principle, where both datasets record the same events these values should match, yet we get counts from MEP for wikis that are not present in MP mediawiki.database. This could just be a symptom of missing data, but it should be part of what we look into: are we missing data from specific wikis, and if so, why? There are 366 distinct wiki_db values in the MEP event data and only 209 in the MP event data (WHERE action = navigate).

There are multiple threads to investigate because of these findings.

Questions to answer:

  • What is the cause for events missing values in page column where action = navigate and/or is this related tangentially to missing events from MP dataset?
  • Where is the origination of the value in the Android event populated in MEP field wiki_db (it is my understanding that it is based on the wiki language) and how does it relate to the value populated in page.content_language? Is the value in MP mediawiki.database from the same origin as MEP field wiki_db?

Related query:
Show all rows from android_product_metrics_article_link_preview_interaction WHERE page values are NULL

SELECT *
  FROM event.android_product_metrics_article_link_preview_interaction
  WHERE year=2024 AND month >= 03 AND day >= 01
  -- dots are escaped so the version string is matched literally
  AND regexp_like(agent.app_version_name, '2\.7\.50481-r-')
  AND action = 'navigate'
  AND page.content_language IS NULL
  AND DATE(from_iso8601_timestamp(dt)) < current_date
  AND DATE(from_iso8601_timestamp(dt)) > DATE '2024-04-03'

Count for these rows is 5850594 for latest production version since release.

SELECT COUNT(*)
  FROM event.android_product_metrics_article_link_preview_interaction
  WHERE year=2024 AND month >= 03 AND day >= 01
  -- dots are escaped so the version string is matched literally
  AND regexp_like(agent.app_version_name, '2\.7\.50481-r-')
  AND action = 'navigate'
  AND page.content_language IS NULL
  AND DATE(from_iso8601_timestamp(dt)) < current_date
  AND DATE(from_iso8601_timestamp(dt)) > DATE '2024-04-03'

Notably, this query (adding WHERE mediawiki.database IS NULL) yields 0 results, so even when page is NULL the event still records mediawiki.database. Why is there a record for the wiki but not for the actual page?

SELECT *
  FROM event.android_product_metrics_article_link_preview_interaction
  WHERE year=2024 AND month >= 03 AND day >= 01
  -- dots are escaped so the version string is matched literally
  AND regexp_like(agent.app_version_name, '2\.7\.50481-r-')
  AND action = 'navigate'
  AND page.content_language IS NULL
  AND mediawiki.database IS NULL
  AND DATE(from_iso8601_timestamp(dt)) < current_date
  AND DATE(from_iso8601_timestamp(dt)) > DATE '2024-04-03'

Related Android task: https://phabricator.wikimedia.org/T361267

Event Timeline

SNowick_WMF renamed this task from Investigate MP `page.content_language` value origin to match MEP column `wiki_db` to Investigate MP `page` column NULL values for events where we expect that data to be populated. Apr 26 2024, 10:11 PM

Thanks @SNowick_WMF for the details of the page data issue.

I did a little preliminary digging; some thoughts are inline below:

What is the cause for events missing values in page column where action = navigate and/or is this related tangentially to missing events from MP dataset?

This will require more digging on my part and likely some conferring with @Dbrant to deduce what's going on.

Where is the origination of the value in the Android event populated in MEP field wiki_db (it is my understanding that it is based on the wiki language) and how does it relate to the value populated in page.content_language?

The Android event populated in MEP field wiki_db looks like it's derived from viewModel.pageTitle.wikiSite.dbName() in the case of LinkPreviewDialog and pageTitle.wikiSite.dbName() in the case of PageActivity.

Whereas in MP, the "corresponding" field (in quotes because of the questions below) is sourced from pageTitle.wikiSite.languageCode; there are 3 overloaded methods that all use this:

Some questions of my own:

  • Should the MP field page.content_language be updated to be sourced from pageTitle.wikiSite.dbName() (to try to match MEP wiki_db) instead of pageTitle.wikiSite.languageCode?
  • Aren't these different pieces of data though? Your question is my question - should MEP wiki_db be related to MP page.content_language?
  • If the MP field page.content_language were updated to be sourced from pageTitle.wikiSite.dbName(), would this account for at least part of the discrepancies we're seeing in the data?

Is the value in MP mediawiki.database from the same origin as MEP field wiki_db?

MP mediawiki.database is sourced from WikipediaApp.instance.wikiSite.dbName() (I'm not sure whether this should be dynamic; at the time of development, I thought not), while MEP wiki_db comes from pageTitle.wikiSite.dbName(), as explained above. But maybe these are the same? @Dbrant, would you know whether these two should return the same value?

cjming set the point value for this task to 5.

We are currently investigating data variance (event and unique user) counts between MP android_product_metrics_article_link_preview_interaction and MEP android_article_link_preview_interaction where MP counts are lower than MEP counts in certain conditions. Discussion with cjming and @Dbrant led to the following further investigation:

Looking at comparisons between the datasets in more detail, the most notable variance is in events where action = navigate (there are only 3 distinct possible action values in this dataset). The smaller variances for action = linkclick and action = cancel are within the acceptable range, but action = navigate is clearly the source of our problem with missing events/uniques. See below for more details.

Next steps are to identify specific action = navigate events/variations where we can narrow down possible problems in the app or MP layer to resolve the count variances. Will track progress here.


Comparing daily average variance by events shows the following (the percentage by which MP counts are lower than MEP counts, indicating missing data). This is visualized in the MP v MEP Counts Daily Compare Dashboard.

| Column action value | Avg. Unique User Count % Change MPvMEP | Avg. Event Count % Change MPvMEP |
| ----- | ----- | ----- |
| action = navigate | -41.70% | -27.75% |
| action = linkclick | -1.30% | -1.05% |
| action = cancel | -0.79% | -0.65% |
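
For reference, the percentages above can be read as the relative change of the MP count against the MEP count. The sketch below assumes the formula (MP - MEP) / MEP x 100, which is a reading of the column headers rather than the confirmed dashboard definition. The function name and the sample counts are made up for illustration and are not taken from the real tables.

```python
def pct_change_mp_vs_mep(mp_count: float, mep_count: float) -> float:
    """Relative change of MP against MEP, in percent (assumed formula)."""
    if mep_count == 0:
        raise ValueError("MEP count must be non-zero")
    return 100.0 * (mp_count - mep_count) / mep_count


# Hypothetical daily event counts, chosen only to illustrate the arithmetic.
print(round(pct_change_mp_vs_mep(7225, 10000), 2))  # -27.75
```

A negative value means MP recorded fewer events (or unique users) than MEP for the same period.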

I believe I may have an answer to the question of missing events:
TLDR: a race condition in the Metrics Platform client library.

  1. When the app is launched, it initializes the MP client library. And when it's initialized, the library kicks off a network request to fetch the list of active stream configurations. This network call may take a considerable amount of time.
  2. At the same time, the user might proceed to navigate to an article. Or in fact, the app might have been launched as the result of an external link from a browser, which would navigate to that article immediately.
  3. This is a problem because: any MP events that are attempted before the stream configuration is received are dropped.

∎.
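
The sequence above can be sketched as a small simulation. This is not the Metrics Platform client code: `ToyClient`, its methods, and the stream name are hypothetical stand-ins that only mimic the behavior described, namely that events submitted before the stream configuration arrives are dropped.

```python
class ToyClient:
    """Hypothetical stand-in for a client that drops events until configs load."""

    def __init__(self):
        self.stream_configs = None  # None until the (slow) fetch completes
        self.sent = []
        self.dropped = []

    def on_configs_fetched(self, configs):
        self.stream_configs = configs

    def submit(self, stream, event):
        # Without a known configuration for the stream, the event is discarded.
        if self.stream_configs is None or stream not in self.stream_configs:
            self.dropped.append(event)
            return False
        self.sent.append(event)
        return True


STREAM = "example.stream"  # placeholder name, not a real stream

client = ToyClient()
# App launched from an external link: navigate fires before configs arrive.
client.submit(STREAM, {"action": "navigate", "source": 3})
# The configuration fetch completes a moment later.
client.on_configs_fetched({STREAM: {}})
# A later navigation (e.g. switching the article language) is recorded.
client.submit(STREAM, {"action": "navigate", "source": 6})

print(len(client.sent), len(client.dropped))  # 1 1
```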


Here is some evidence to back this up:
Let's query the table for navigate events (comparing both MEP and MP tables), but let's separate them by source.

For a source of 6 (which means "switching the language of an article"), the numbers are actually very close (within ~3%). For a source of 9 (which means "originating from the Places feature"), the numbers are nearly identical! This is because, in both of these cases, the user has already spent at least a few seconds in the app, which would be enough time for stream configurations to get populated.

Whereas, for a source of 3 (which means "link from external browser app"), the numbers are wildly different, >70%. This is because the app opens and navigates to the link immediately, not giving enough time for stream configurations to be received.


I would propose a two-pronged solution (to be discussed further in T365179):

  • The MP client library should provide a way for a consumer to cache the stream configurations locally, via callback or similar, so that they're persisted across the app's lifecycle.
  • If the stream configurations have not been received yet, the library should enqueue events instead of dropping them.
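
A minimal sketch of how the two proposals could fit together, assuming a callback-based design. The class, parameter, and stream names are hypothetical and do not reflect the actual client library API; the bounded queue size is also an assumption, added so memory stays capped if the fetch never completes.

```python
from collections import deque


class QueueingClient:
    """Hypothetical sketch: buffer events until stream configs are known."""

    def __init__(self, cached_configs=None, on_configs_updated=None, max_pending=100):
        # The consumer may pass configs persisted from a previous session.
        self.stream_configs = cached_configs
        # Callback letting the consumer persist fresh configs (assumed API shape).
        self.on_configs_updated = on_configs_updated
        # Bounded buffer: when full, the oldest pending event is discarded.
        self.pending = deque(maxlen=max_pending)
        self.sent = []

    def submit(self, stream, event):
        if self.stream_configs is None:
            self.pending.append((stream, event))  # enqueue instead of dropping
            return
        self._send_if_configured(stream, event)

    def on_configs_fetched(self, configs):
        self.stream_configs = configs
        if self.on_configs_updated is not None:
            self.on_configs_updated(configs)
        while self.pending:  # flush everything that arrived early
            self._send_if_configured(*self.pending.popleft())

    def _send_if_configured(self, stream, event):
        if stream in self.stream_configs:
            self.sent.append(event)


saved = {}  # stands in for persistent storage owned by the app
client = QueueingClient(on_configs_updated=saved.update)
client.submit("example.stream", {"action": "navigate", "source": 3})
client.on_configs_fetched({"example.stream": {}})
print(len(client.sent))  # 1
```

On a later launch, constructing the client with `cached_configs=saved` would let an immediate navigate event be sent without waiting for the network request.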

Leaving this ticket in the Sign Off column so that we can track the effects of remediations after Android's next release of the updated Java client library 2.7.