
[Oozie Migration] Migrate edit hourly job
Closed, ResolvedPublic

Description

Migrate Oozie's edit_hourly job to Airflow.
It consists of an HQL query (SparkSQLOperator)
and the generation of a success file (URLTouchOperator).
It's a monthly job. Its SLA should be 10 days.
Hopefully no gotchas :]
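
A rough sketch of what the migrated DAG could look like, assuming Airflow 2.x. The two operator names (SparkSQLOperator, URLTouchOperator) come from the ticket, but their import paths, parameters, and the HDFS path below are illustrative guesses, not the real airflow-dags code:

```python
# Illustrative sketch only -- operator imports and parameters are assumptions.
from datetime import datetime, timedelta

from airflow import DAG
# Hypothetical import paths for the custom operators named in the ticket:
# from wmf_airflow_common.operators.spark import SparkSQLOperator
# from wmf_airflow_common.operators.url import URLTouchOperator

with DAG(
    dag_id="edit_hourly",
    start_date=datetime(2023, 2, 1),
    schedule="@monthly",                       # monthly job, per the ticket
    default_args={"sla": timedelta(days=10)},  # 10-day SLA, per the ticket
) as dag:
    compute = SparkSQLOperator(                # runs the edit_hourly HQL query
        task_id="compute_edit_hourly",
        sql="edit_hourly.hql",                 # illustrative file name
    )
    touch = URLTouchOperator(                  # writes the success flag file
        task_id="write_success_file",
        url="hdfs:///wmf/data/edit_hourly/_SUCCESS",  # illustrative path
    )
    compute >> touch                           # query first, then success file
```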

Details

Title: Add analytics edit_hourly dag
Reference: repos/data-engineering/airflow-dags!237
Author: joal
Source Branch: add_analytics_edit_hourly
Dest Branch: main

Related Objects

Status: Resolved, Assigned: None
Status: Resolved, Assigned: JAllemandou

Event Timeline

EChetty renamed this task from Migrate edit hourly job to [Oozie Migration] Migrate edit hourly job.Feb 7 2023, 5:04 PM
EChetty assigned this task to JAllemandou.

I found two causes of data discrepancy between generating the edit_hourly dataset in Hive and in Spark3. While the first one is good to know for other migrations, the second one is actually a bug in the Hive version of our code.

  1. The ARRAY_CONTAINS function in Spark3 returns NULL if the provided array is NULL, while the Hive version returns false. To obtain the same behavior, I wrapped the array parameter in COALESCE(array_parameter, ARRAY()).
  2. We parsed timestamps in Hive using the format yyyy-MM-dd HH:mm:ss.sss. Spark3 raises an error for this format, as it doesn't accept the s symbol (seconds) being reused for milliseconds. This error made me change the parsing format to yyyy-MM-dd HH:mm:ss.SSS, with capital S for milliseconds, which is the correct version. But it also means that in Hive, every parsed timestamp had the value parsed for milliseconds written into its seconds field. For the edit_hourly dataset all timestamps have 0 milliseconds, so every timestamp was rounded down to the previous whole minute! For instance, 2023-02-16 11:23:30.0 was parsed as 2023-02-16 11:23:00.0.
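
Both discrepancies can be simulated in plain Python. The helper names below are illustrative only; the actual fix lives in the HQL query itself:

```python
from datetime import datetime

# 1. NULL-array semantics: Hive's ARRAY_CONTAINS returns false for a NULL
#    array, while Spark3's returns NULL. COALESCE(arr, ARRAY()) aligns them.
def hive_array_contains(arr, value):
    return False if arr is None else value in arr

def spark3_array_contains(arr, value):
    return None if arr is None else value in arr

def spark3_array_contains_fixed(arr, value):
    # COALESCE(array_parameter, ARRAY()) expressed in Python:
    return spark3_array_contains(arr if arr is not None else [], value)

# 2. Timestamp pattern bug: with "yyyy-MM-dd HH:mm:ss.sss", Hive re-reads the
#    fractional digits as seconds, so the milliseconds value (0 in this
#    dataset) clobbers the seconds field.
def hive_parse_buggy(ts):
    base, frac = ts.split(".")
    dt = datetime.strptime(base, "%Y-%m-%d %H:%M:%S")
    return dt.replace(second=int(frac))  # millisecond value lands in seconds

def parse_correct(ts):
    # The corrected "yyyy-MM-dd HH:mm:ss.SSS" pattern keeps seconds intact.
    return datetime.strptime(ts, "%Y-%m-%d %H:%M:%S.%f")
```

With these, hive_parse_buggy("2023-02-16 11:23:30.0") yields 11:23:00 while parse_correct keeps 11:23:30, matching the example in point 2.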

The version of edit_hourly released next month will have the bug corrected, changing ~30k rows out of ~70m. The only impacted field is user_tenure, in cases where the user's tenure is close enough to the next bucket that the corrected timestamp computation pushes it past the threshold.

Change 889766 had a related patch set uploaded (by Joal; author: Joal):

[analytics/refinery@master] Update hql edit hourly computation script

https://gerrit.wikimedia.org/r/889766

Change 889766 merged by Mforns:

[analytics/refinery@master] Update hql edit hourly computation script

https://gerrit.wikimedia.org/r/889766