Triage and Report on Unique Devices Data Issue
Closed, Resolved · Public · 5 Estimated Story Points
Authored By

	odimitrijevic
	Aug 29 2022, 5:15 PM

Description

During the migration of the unique devices data pipeline jobs from Oozie/Hive to Airflow/Spark (T305841), we found differences between the data produced by Hive and the data produced by Spark. We traced them to a Hive bug (HIVE-21778) that affects the computation of unique devices (code). Because of this bug, unique devices metrics have been overcounted since Feb 2021, when the Hadoop cluster was upgraded from CDH to BigTop (T244499). The differences appear only in the unique_devices_per_project_family data sets, and only in ~50% of the data (reference). The overcount is larger in smaller buckets such as “Wikiversity in Taiwan” (51.95%), but still significant in bigger buckets such as “Wikipedia in the US” (5.99%).
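
For readers who want to reproduce this kind of comparison, here is a minimal sketch of how a per-bucket overcount percentage and the share of affected buckets could be computed. It assumes the overcount is measured relative to the Spark value, which this task does not state explicitly. The function names, the bucket keys, and all the numbers are made up for illustration and are not taken from the real data sets.

```python
def overcount_pct(hive_count, spark_count):
    """Overcount of the Hive value relative to the Spark value, in percent.

    Assumption: the Spark result is treated as the correct baseline.
    """
    if spark_count <= 0:
        raise ValueError("spark_count must be positive")
    return 100.0 * (hive_count - spark_count) / spark_count


def compare(hive, spark):
    """Compare two {bucket: count} mappings over the buckets they share.

    Returns the overcount percentage for every bucket whose counts differ,
    and the fraction of shared buckets that differ.
    """
    shared = sorted(set(hive) & set(spark))
    diffs = {
        bucket: overcount_pct(hive[bucket], spark[bucket])
        for bucket in shared
        if hive[bucket] != spark[bucket]
    }
    share = len(diffs) / len(shared) if shared else 0.0
    return diffs, share


# Hypothetical counts keyed by (project family, country); not real data.
hive = {
    ("wikipedia", "US"): 1060,
    ("wikiversity", "TW"): 152,
    ("wikibooks", "FR"): 300,
    ("wikinews", "DE"): 80,
}
spark = {
    ("wikipedia", "US"): 1000,
    ("wikiversity", "TW"): 100,
    ("wikibooks", "FR"): 300,
    ("wikinews", "DE"): 80,
}

diffs, share = compare(hive, spark)
for bucket, pct in sorted(diffs.items()):
    print(bucket, f"{pct:.2f}%")
print(f"affected share of buckets: {share:.0%}")
```

If the published figures use the Hive value as the denominator instead, only the divisor in `overcount_pct` needs to change.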

This task is to track work associated with triaging and reporting the data issue.

Related Objects

Mentioned Here: T244499: Upgrade the Hadoop test cluster to BigTop
T305841: Migrate unique devices jobs

Event Timeline

odimitrijevic created this task. Aug 29 2022, 5:15 PM

Restricted Application added a subscriber: Aklapper. Aug 29 2022, 5:15 PM

odimitrijevic assigned this task to mforns. Aug 29 2022, 5:15 PM

Mayakp.wiki subscribed. Aug 29 2022, 6:16 PM

The process has been documented in
https://docs.google.com/document/d/1Aj2oOZvwb6D6lm89XU_T9brtXApz5b7R00UtbzEiYkI/edit

• EChetty set the point value for this task to 5. Sep 5 2022, 4:15 PM

• EChetty edited projects, added Data Pipelines (Sprint 01); removed Data Pipelines (Sprint 00). Sep 6 2022, 10:00 AM

• EChetty moved this task from Ready to In Progress on the Data Pipelines (Sprint 01) board. Sep 6 2022, 10:02 AM

@odimitrijevic The documentation is ready. What is the next step here? Where should it be published so that the community has access?

I have found some interesting information that supports our idea that the problem started occurring after we migrated to BigTop (2021-02-19):

The problem is caused by a bug in the Hive Cost Based Optimizer (CBO) - see https://issues.apache.org/jira/browse/HIVE-21778
The CBO component has a switch that turns it on or off - see https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties#ConfigurationProperties-hive.cbo.enable
The switch was set to false in our CDH distros (the link is for the latest version we used, 5.16, and I have also checked previous versions) - see https://docs.cloudera.com/documentation/enterprise/6/properties/6.3/topics/cm_props_cdh5160_hive.html#concept_6.3.x_hiveserver2_props__section_performance_props
We have not manually changed the switch in our Hive configuration - see https://github.com/wikimedia/puppet/search?q=cbo.enable
We disabled the switch for one job just after the migration, as it was causing issues - see https://gerrit.wikimedia.org/r/c/analytics/refinery/+/668236/3/oozie/cassandra/daily/pageview_top_percountry.hql#29

This leads me to conclude that the problem was introduced by our BigTop migration.

Thank you @JAllemandou

• EChetty moved this task from In Progress to Blocked/Paused on the Data Pipelines (Sprint 01) board. Sep 12 2022, 4:05 PM

• EChetty edited projects, added Data Pipelines (Sprint 02); removed Data Pipelines (Sprint 01). Sep 26 2022, 12:56 PM

• EChetty moved this task from Ready to Next Up on the Data Pipelines (Sprint 02) board. Sep 29 2022, 12:24 PM

• EChetty moved this task from Next Up to Blocked/Paused on the Data Pipelines (Sprint 02) board. Oct 11 2022, 4:32 PM

JArguello-WMF reassigned this task from mforns to odimitrijevic. Oct 13 2022, 4:06 PM

JArguello-WMF added a subscriber: mforns.

• EChetty edited projects, added Data Pipelines (Sprint 03); removed Data Pipelines (Sprint 02). Oct 17 2022, 11:09 AM

• EChetty moved this task from Ready to Blocked/Paused on the Data Pipelines (Sprint 03) board. Oct 17 2022, 11:11 AM

• EChetty moved this task from Sprint 03 to Discussed (Radar) on the Data Pipelines board. Oct 17 2022, 3:11 PM

• EChetty edited projects, added Data Pipelines; removed Data Pipelines (Sprint 03).

The report has been published on: https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Data_Issues/2021-02-09_Unique_Devices_By_Family_Overcount

odimitrijevic closed this task as Resolved. Jan 4 2023, 9:58 PM
