During the migration of the unique devices data pipeline jobs from Oozie/Hive to Airflow/Spark (T305841), we found differences between the data produced by Hive and the data produced by Spark. We discovered that a Hive bug (HIVE-21778) was affecting the computation of unique devices (code). Because of this bug, unique devices metrics have been overcounted since Feb 2021, when the Hadoop cluster was upgraded from CDH to BigTop (T244499). The differences appear only in the unique_devices_per_project_family data sets, and only in ~50% of the data (reference). The overcount is larger in smaller buckets such as “Wikiversity in Taiwan” (51.95%), but still significant in bigger buckets such as “Wikipedia in the US” (5.99%).
This task tracks the work of triaging and reporting the data issue.