
Triage and Report on Unique Devices Data Issue
Closed, Resolved · Public · 5 Estimated Story Points

Description

During the migration of the unique devices data pipeline jobs from Oozie/Hive to Airflow/Spark (T305841), we found differences between the data produced by Hive and the data produced by Spark. We discovered that a Hive bug (HIVE-21778) was affecting the computation of unique devices (code). Because of this bug, unique devices metrics have been overcounted since Feb 2021, when the Hadoop cluster was upgraded from CDH to BigTop (T244499). The differences appear only in the unique_devices_per_project_family data sets, and only in ~50% of the data (reference). The overcount is larger in smaller buckets like “Wikiversity in Taiwan” (51.95%), but still significant in bigger buckets like “Wikipedia in the US” (5.99%).
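As a rough illustration of how an overcount percentage like the ones above might be computed, here is a minimal Python sketch. It assumes the percentage is the relative difference between the Hive figure and the Spark figure, taking Spark as the correct baseline; the task does not state the exact formula. The function name, the bucket keys, and all counts are hypothetical and are not real unique devices data.

```python
def overcount_pct(hive_count: int, spark_count: int) -> float:
    """Relative overcount of the Hive figure versus the Spark figure, in percent.

    Assumption: the Spark figure is treated as the correct baseline.
    """
    if spark_count <= 0:
        raise ValueError("spark_count must be positive")
    return (hive_count - spark_count) / spark_count * 100.0


# Hypothetical counts per (project family, country) bucket, for illustration only.
hive_counts = {("wikiversity", "TW"): 1500, ("wikipedia", "US"): 1060}
spark_counts = {("wikiversity", "TW"): 1000, ("wikipedia", "US"): 1000}

# Overcount per bucket, rounded to two decimals.
report = {
    bucket: round(overcount_pct(hive_counts[bucket], spark_counts[bucket]), 2)
    for bucket in spark_counts
}

for bucket, pct in sorted(report.items()):
    print(bucket, f"{pct}%")
```

With a baseline-relative definition like this, a small bucket can show a large percentage from a modest absolute difference, which is consistent with the pattern described above.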

This task is to track work associated with triaging and reporting the data issue.

Event Timeline

EChetty set the point value for this task to 5. (Sep 5 2022, 4:15 PM)

@odimitrijevic The documentation is ready, what is the next step here? Where should this be published to make sure the community will have access?

I have found some interesting information that supports our hypothesis that the problem started occurring after we migrated to BigTop (2021-02-19):

This leads me to conclude that the problem was introduced with our BigTop migration.