Page MenuHomePhabricator

Odd behavior in unique device counts
Closed, DuplicatePublic

Description

Quick summary: I noticed some anomalous data in the unique device counts and think it's related to the redirects_to_pageview webrequests (and maybe CentralNotice banners) but hit the end of my ability to figure out what was going on. Figured I'd pass it up in case it was a) unexpected behavior, and b) could be fixed.

Longer description: I was looking into whether the unique device counts broken out by domain (e.g., en.m.wikipedia.org as in wmf.unique_devices_per_domain_monthly) and those broken out by project_family (e.g., wikipedia as in wmf.unique_devices_per_project_family_monthly) had similar patterns in response to a question about how much double-counting occurs when you aggregate the device counts that are broken down by domain. Essentially, if you want to know monthly unique devices for e.g., the United States + Wikipedia, how much higher is the number that you get from aggregating all the different Wikipedia domains in the US from the domain_monthly table vs. just taking the count directly from the project_family_monthly table. When examining the data, however, I noticed that there were a few countries where the sum of the monthly domain data was somehow less than the monthly project_family number (which shouldn't be possible).

In particular, this undercounting in the project_family data seemed to be highest in the wikipedia and wikimedia domains and mostly around the uniques_underestimate number. I looked into it a little further and the only large difference between the queries generating the project_family and domain data is that the domain data includes just pageviews while the project_family data also includes redirects_to_pageviews. I grabbed data from a random day to see how common it was that a given actor_signature_per_project_family had redirects_to_pageviews but not pageviews and found some really interesting patterns. For the day in question (Feb 20), there were a bunch of countries where almost 50% of the actors had only redirects_to_pageviews and all of them were in Africa or South America. Then most other countries were <5% actors with pageviews to redirects only. While I assume it's technically possible to have legitimate redirects that cross projects, I had assumed that this should be incredibly rare.

So what's going on? I don't actually know but I looked at one example of an actor with only redirects_to_pageviews and the redirects were to Central Notice code on meta and were referred by normal pageviews from eswiki. So I'm not certain what's going but I have at least one hypothesis that CentralNotice on the wikis is somehow triggering redirects when it pulls resources from Meta without triggering a pageview. This squares with the weird geographic patterns in the data. I didn't try to verify that this is what was causing the overall behavior though so might be other mechanisms at play (or maybe I'm just misunderstanding the data).

Queries

# Comparing direct project_family with aggregated domain counts:
WITH disaggregated_sum AS (
    SELECT
      country,
      SUBSTRING_INDEX(SUBSTR(domain, 1, LENGTH(domain) - 4), ".", -1) AS project_family,
      SUM(uniques_estimate) AS sum_unique_disagg
    FROM wmf.unique_devices_per_domain_monthly
    WHERE
      year = '2021'
      AND month = '1'
      AND domain LIKE '%wikipedia%'
    GROUP BY
      country,
      SUBSTRING_INDEX(SUBSTR(domain, 1, LENGTH(domain) - 4), ".", -1)
),
aggregated_sum AS (
    SELECT
      country,
      project_family,
      SUM(uniques_estimate) AS sum_unique_agg
    FROM wmf.unique_devices_per_project_family_monthly
    WHERE
      year = '2021'
      AND month = '1'
    GROUP BY
      country,
      project_family
)
SELECT
  d.country,
  d.project_family,
  d.sum_unique_disagg AS unique_device_disagg,
  a.sum_unique_agg AS unique_device_agg
FROM disaggregated_sum d
INNER JOIN aggregated_sum a
  ON (d.country = a.country
      AND d.project_family = a.project_family)
# Looking at actors with just redirects_from_pageviews by country
spark.sql(f"""
WITH pv_rd_counts AS (
    SELECT
      actor_signature_per_project_family,
      geocoded_data['country'] AS country,
      SUM(IF(is_pageview, 1, 0)) AS num_pageviews,
      SUM(IF(is_redirect_to_pageview, 1, 0)) AS num_red_pageviews
    FROM wmf.pageview_actor
    WHERE
      year = 2021
      AND month = 2
      AND day = 20
      AND agent_type = 'user'
    GROUP BY
      actor_signature_per_project_family,
      geocoded_data['country']
)
SELECT
  country,
  COUNT(1) AS num_actors,
  SUM(IF(num_red_pageviews > 0 AND num_pageviews = 0,1,0)) AS num_red_only,
  SUM(IF(num_red_pageviews = 0 AND num_pageviews > 0,1,0)) AS num_pv_only,
  SUM(IF(num_red_pageviews > 0 AND num_pageviews > 0,1,0)) AS num_both
FROM pv_rd_counts
GROUP BY
  country
"""
).show(500, False)

Event Timeline

Milimetric subscribed.

Honestly I don't think I'll have time to look at this anytime soon, gotta finish up some quarter stuff and start two other big projects (Airflow and Data Governance). I'm unlicking this cookie and letting anyone else raise it in severity if needed.

Milimetric triaged this task as Medium priority.May 10 2021, 4:15 PM

ping Product-Analytics, any interest in taking a look at this?