
Migrate unique devices jobs
Closed, ResolvedPublic3 Estimated Story Points

Description

Primary Task
Migrate the 4 unique_devices jobs to Airflow
Task Details:
Input: Hive Table
Processing: Hive
Output: Hive & Archive

Success Criteria:

  • Have the 2 Daily Jobs Migrated (SLA 6 Hours)
  • Have the 2 Monthly Jobs Migrated (SLA 1 Day)
  • Backfill the data to Cassandra, Druid and Hive

Event Timeline

mforns renamed this task from Migrate the Unique_devices jobs to Migrate unique devices jobs. May 4 2022, 12:45 PM
mforns added a project: Data-Engineering.
mforns updated the task description.
EChetty set the point value for this task to 3. Aug 16 2022, 3:16 PM
EChetty moved this task from Discussed (Radar) to Sprint 00 on the Data Pipelines board.
EChetty edited projects, added Data Pipelines (Sprint 00); removed Data Pipelines.

During the initial phase of this task we ran into some issues: the queries that currently compute unique devices metrics in Hive (via the Oozie job) didn't work in Spark 3. The data is quite skewed, and Spark has difficulty with it since it computes everything in memory (as opposed to Hive, which handles large skewed datasets more robustly).
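The task doesn't say which query optimizations were applied, but a common way to handle skewed keys in an in-memory engine like Spark is key salting: spread each hot key across several buckets, aggregate the buckets in parallel, then merge the partial results per original key. The sketch below illustrates the idea in plain Python (the data and names are hypothetical, not taken from the actual refinery queries):

```python
import random
from collections import Counter

def salted_count(records, num_salts=4, seed=42):
    """Two-phase count over a skewed key distribution.

    Phase 1 spreads each key across `num_salts` buckets, so no single
    partition would receive all rows for one hot key; phase 2 merges the
    partial counts back into one total per original key.
    """
    rng = random.Random(seed)
    # Phase 1: partial counts per (key, salt) bucket.
    partial = Counter()
    for key in records:
        partial[(key, rng.randrange(num_salts))] += 1
    # Phase 2: fold the salted buckets back into per-key totals.
    total = Counter()
    for (key, _salt), n in partial.items():
        total[key] += n
    return dict(total)

# A skewed distribution: one "hot" key dominates.
records = ["en.wikipedia"] * 1000 + ["eu.wikipedia"] * 3
print(salted_count(records))  # {'en.wikipedia': 1000, 'eu.wikipedia': 3}
```

The salting step changes how work is distributed, not the final result, which is why it is a safe rewrite for skew problems.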

With some query optimization we managed to make them work. However, the results differed from the ones computed with Hive. That made us think the optimization was incorrect, but after a couple of days of troubleshooting we found that the issue was actually in Hive: a Hive bug that fails to properly evaluate NULL checks on struct fields. The failing condition is the following: WHERE x_analytics_map IS NOT NULL. Hive was not evaluating this condition correctly and was not filtering out records with x_analytics_map=NULL. So the issue lies with the existing Oozie/Hive job, and the unique_devices metrics that we currently have are not precise.

Here are the links to the Hive bug and the corresponding Stack Exchange discussion:
https://issues.apache.org/jira/browse/HIVE-21778
https://dba.stackexchange.com/questions/271571/testing-a-hive-array-for-is-null-says-not-null

How (and how much) this affects the results of the unique devices metrics computation has yet to be detailed (TO DO), but we think the current metrics are somewhat higher than they should be, given that the base data used to compute them contains more rows than it should (because of the bug). Let's discuss this at stand-up and decide on the next steps.
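To make the effect concrete, here is a small plain-Python sketch (with hypothetical rows, not real webrequest data) of what HIVE-21778 amounts to: the `IS NOT NULL` check on a complex-typed column effectively never filters anything, so the base data, and hence any counts derived from it, are inflated:

```python
# Hypothetical rows: x_analytics_map is None for requests that should be
# excluded from the unique-devices base data.
rows = [
    {"uri_host": "en.wikipedia.org", "x_analytics_map": {"nocookies": "1"}},
    {"uri_host": "en.wikipedia.org", "x_analytics_map": None},
    {"uri_host": "fr.wikipedia.org", "x_analytics_map": {"loggedIn": "1"}},
]

# Correct semantics of `WHERE x_analytics_map IS NOT NULL`
# (what Spark computes).
correct = [r for r in rows if r["x_analytics_map"] is not None]

# HIVE-21778: the NULL check on the complex-typed column evaluates as if it
# were always true, so NULL rows slip through the filter.
buggy = [r for r in rows if True]  # filter effectively disabled

print(len(correct), len(buggy))  # 2 3 -> Hive's base data keeps extra rows
```

This is consistent with the observation above that the Hive-computed metrics come out somewhat higher than the Spark-computed ones.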

Change 829862 had a related patch set uploaded (by Mforns; author: Mforns):

[analytics/refinery@master] Migrate unique devices queries to SparkSql and move to /hql

https://gerrit.wikimedia.org/r/829862

Change 829862 merged by Joal:

[analytics/refinery@master] Migrate unique devices queries to SparkSql and move to /hql

https://gerrit.wikimedia.org/r/829862

Change 836813 had a related patch set uploaded (by Joal; author: Joal):

[analytics/refinery@master] Fix unique-devices per project-family HQL

https://gerrit.wikimedia.org/r/836813

Change 836813 merged by Mforns:

[analytics/refinery@master] Fix unique-devices per project-family HQL

https://gerrit.wikimedia.org/r/836813

The unique devices jobs have been migrated to Airflow successfully!

There's one small issue regarding how timestamps are parsed in Spark vs Hive, which occurs for approx. 1 out of 1B records. We already have a fix for this, and it will be deployed soon, probably tomorrow in our weekly deployment train. After that we can call the migration of the unique devices jobs done.
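The exact parsing discrepancy isn't spelled out in this task, but the general failure mode is that two engines disagree on an edge-case timestamp: a strict parser rejects (or NULLs) a value that a lenient parser silently accepts or coerces. A purely illustrative Python sketch of the two strategies (not the actual Spark/Hive code paths):

```python
from datetime import datetime

def strict_parse(ts):
    """Reject anything that doesn't match the format exactly."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%S")

def lenient_parse(ts):
    """Return None instead of failing, mimicking an engine that silently
    NULLs an unparsable value rather than erroring out."""
    try:
        return strict_parse(ts)
    except ValueError:
        return None

good = "2022-07-01T12:34:56"
bad = "2022-07-01T24:00:00"  # hour 24 is rejected by %H (valid range 00-23)

assert strict_parse(good).hour == 12
assert lenient_parse(bad) is None  # the two strategies disagree on this row
```

When two engines sit on opposite sides of a disagreement like this, the mismatch only surfaces on the rare malformed record, which fits the roughly 1-in-1B rate mentioned above.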

Regarding the back-filling of the correct computation of unique devices data:
We have already re-run (back-filled) the corrected metrics since July 1st. That is the earliest we can back-fill, since older source data is no longer available.

However, the corrected data lives in Hive, and has not yet been loaded to Druid or Cassandra, the tools which serve the community via AQS and Wikistats2. We should do this next.

We'll create subtasks to re-load (back-fill) Cassandra and Druid.