
Migrate unique devices jobs
Closed, ResolvedPublic3 Estimated Story Points

Description

Primary Task
Migrate the 4 unique_devices jobs to Airflow
Task Details:
Input: Hive Table
Processing: Hive
Output: Hive & Archive

Success Criteria:

  • Have the 2 Daily Jobs Migrated (SLA 6 Hours)
  • Have the 2 Monthly Jobs Migrated (SLA 1 Day)
  • Backfill the data to Cassandra, Druid and Hive

Event Timeline

mforns renamed this task from Migrate the Unique_devices jobs to Migrate unique devices jobs. May 4 2022, 12:45 PM
mforns added a project: Data-Engineering.
mforns updated the task description.
EChetty set the point value for this task to 3. Aug 16 2022, 3:16 PM
EChetty moved this task from Discussed (Radar) to Sprint 00 on the Data Pipelines board.
EChetty edited projects, added Data Pipelines (Sprint 00); removed Data Pipelines.

During the initial phase of this task we ran into some issues: the queries that currently compute unique devices metrics in Hive (via the Oozie job) didn't work in Spark 3. The data is quite skewed, and Spark has difficulty with it since it computes everything in memory (as opposed to Hive, which handles large skewed datasets more robustly).
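The task doesn't say which query optimizations were applied, but a common way to handle skewed keys in an in-memory engine like Spark is key salting: spread each hot key across several buckets, aggregate the buckets in parallel, then merge the partial results per original key. The sketch below illustrates the idea in plain Python (the data and names are hypothetical, not taken from the actual refinery queries):

```python
import random
from collections import Counter

def salted_count(records, num_salts=4, seed=42):
    """Two-phase count over a skewed key distribution.

    Phase 1 spreads each key across `num_salts` buckets, so no single
    partition would receive all rows for one hot key; phase 2 merges the
    partial counts back into one total per original key.
    """
    rng = random.Random(seed)
    # Phase 1: partial counts per (key, salt) bucket.
    partial = Counter()
    for key in records:
        partial[(key, rng.randrange(num_salts))] += 1
    # Phase 2: fold the salted buckets back into per-key totals.
    total = Counter()
    for (key, _salt), n in partial.items():
        total[key] += n
    return dict(total)

# A skewed distribution: one "hot" key dominates.
records = ["en.wikipedia"] * 1000 + ["eu.wikipedia"] * 3
print(salted_count(records))  # {'en.wikipedia': 1000, 'eu.wikipedia': 3}
```

The salting step changes how work is distributed, not the final result, which is why it is a safe rewrite for skew problems.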

With some query optimization we managed to make them work. However, the results differed from the ones computed with Hive. That made us think the optimization was incorrect, but after a couple of days of troubleshooting we found that the issue was actually in Hive: a Hive bug that fails to properly evaluate NULL checks on struct fields. The failing condition is the following: WHERE x_analytics_map IS NOT NULL. Hive was not evaluating this condition correctly and was not filtering out records with x_analytics_map=NULL. So the issue lies with the existing Oozie/Hive job, and the unique_devices metrics that we currently have are not precise.

Here are the links to the Hive bug and the corresponding Stack Exchange discussion:
https://issues.apache.org/jira/browse/HIVE-21778
https://dba.stackexchange.com/questions/271571/testing-a-hive-array-for-is-null-says-not-null

How (and how much) this affects the results of the unique devices metrics computation has yet to be detailed (TO DO), but we think the current metrics are somewhat higher than they should be, given that the base data used to compute them contains more rows than it should (because of the bug). Let's discuss this at stand-up and decide on the next steps.
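To make the effect concrete, here is a small plain-Python sketch (with hypothetical rows, not real webrequest data) of what HIVE-21778 amounts to: the `IS NOT NULL` check on a complex-typed column effectively never filters anything, so the base data, and hence any counts derived from it, are inflated:

```python
# Hypothetical rows: x_analytics_map is None for requests that should be
# excluded from the unique-devices base data.
rows = [
    {"uri_host": "en.wikipedia.org", "x_analytics_map": {"nocookies": "1"}},
    {"uri_host": "en.wikipedia.org", "x_analytics_map": None},
    {"uri_host": "fr.wikipedia.org", "x_analytics_map": {"loggedIn": "1"}},
]

# Correct semantics of `WHERE x_analytics_map IS NOT NULL`
# (what Spark computes).
correct = [r for r in rows if r["x_analytics_map"] is not None]

# HIVE-21778: the NULL check on the complex-typed column evaluates as if it
# were always true, so NULL rows slip through the filter.
buggy = [r for r in rows if True]  # filter effectively disabled

print(len(correct), len(buggy))  # 2 3 -> Hive's base data keeps extra rows
```

This is consistent with the observation above that the Hive-computed metrics come out somewhat higher than the Spark-computed ones.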

Change 829862 had a related patch set uploaded (by Mforns; author: Mforns):

[analytics/refinery@master] Migrate unique devices queries to SparkSql and move to /hql

https://gerrit.wikimedia.org/r/829862

Change 829862 merged by Joal:

[analytics/refinery@master] Migrate unique devices queries to SparkSql and move to /hql

https://gerrit.wikimedia.org/r/829862

Change 836813 had a related patch set uploaded (by Joal; author: Joal):

[analytics/refinery@master] Fix unique-devices per project-family HQL

https://gerrit.wikimedia.org/r/836813

Change 836813 merged by Mforns:

[analytics/refinery@master] Fix unique-devices per project-family HQL

https://gerrit.wikimedia.org/r/836813

The unique devices jobs have been migrated to Airflow successfully!

There's one small issue regarding how timestamps are parsed in Spark vs Hive, which occurs for approx. 1 out of 1B records. We already have a fix for this, and it will be deployed soon, probably tomorrow in our weekly deployment train. After that we can call the migration of the unique devices jobs done.
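The exact parsing discrepancy isn't spelled out in this task, but the general failure mode is that two engines disagree on an edge-case timestamp: a strict parser rejects (or NULLs) a value that a lenient parser silently accepts or coerces. A purely illustrative Python sketch of the two strategies (not the actual Spark/Hive code paths):

```python
from datetime import datetime

def strict_parse(ts):
    """Reject anything that doesn't match the format exactly."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%S")

def lenient_parse(ts):
    """Return None instead of failing, mimicking an engine that silently
    NULLs an unparsable value rather than erroring out."""
    try:
        return strict_parse(ts)
    except ValueError:
        return None

good = "2022-07-01T12:34:56"
bad = "2022-07-01T24:00:00"  # hour 24 is rejected by %H (valid range 00-23)

assert strict_parse(good).hour == 12
assert lenient_parse(bad) is None  # the two strategies disagree on this row
```

When two engines sit on opposite sides of a disagreement like this, the mismatch only surfaces on the rare malformed record, which fits the roughly 1-in-1B rate mentioned above.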

Regarding the back-filling of the correct computation of unique devices data:
We have already re-run (back-filled) the corrected metrics since July 1st. That is the earliest we can back-fill, since older source data is no longer available.

However, the corrected data lives in Hive, and has not yet been loaded to Druid or Cassandra, the tools which serve the community via AQS and Wikistats2. We should do this next.

We'll create subtasks to re-load (back-fill) Cassandra and Druid.