[Iceberg] Update Refine Sanitize to insert into Iceberg tables
Open, Needs Triage · Public · 5 Estimated Story Points

Assigned To

	JAllemandou

Authored By

	• EChetty
	Jun 30 2022, 3:27 PM

Description

Refine currently uses a DataFrameToHive class that automates schema evolution and insertion into Hive tables from a Spark DataFrame. We want to implement similar functionality for Iceberg, most likely as a DataFrameToIceberg class. In the future we could unify this kind of connector interface, but for now a standalone Iceberg implementation will be fine.
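To make the schema-evolution part concrete, here is a minimal sketch in plain Python of the idea: compare the columns of an incoming DataFrame with those of the target table and emit one `ALTER TABLE ... ADD COLUMN` statement per missing column. It is not the DataFrameToHive or DataFrameToIceberg code. The function name, the flat name-to-type dictionaries, and the table name are all hypothetical, and nested fields and type changes are ignored.

```python
from typing import Dict, List


def add_column_statements(
    table: str,
    table_schema: Dict[str, str],
    df_schema: Dict[str, str],
) -> List[str]:
    """Return one ALTER TABLE statement per column that exists in the
    DataFrame schema but not in the table schema.

    Both schemas are flat {column_name: sql_type} mappings (an assumption
    made for illustration; real schemas are nested).
    """
    missing = [
        (name, sql_type)
        for name, sql_type in df_schema.items()
        if name not in table_schema
    ]
    return [
        f"ALTER TABLE {table} ADD COLUMN {name} {sql_type}"
        for name, sql_type in missing
    ]


if __name__ == "__main__":
    # Hypothetical table and columns, for illustration only.
    statements = add_column_statements(
        "event_sanitized_iceberg.some_table",
        {"dt": "string"},
        {"dt": "string", "wiki": "string"},
    )
    for statement in statements:
        print(statement)
```

In a real job the generated statements would be run through Spark SQL before the insert, so that the write never fails on a column the table does not yet know about.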

We will then use this to make a new (or adapted) RefineSanitize job that will read from event Hive tables and write to new Iceberg tables in a new event_sanitized_iceberg database. This database will eventually replace event_sanitized.

Details

	Subject	Repo	Branch	Lines +/-
	Update refine to use Iceberg for event_sanitize	analytics/refinery/source	master	+1 K -134


Related Objects

		Status	Subtype	Assigned	Task
		Open		None	T311743 [Iceberg] Epic: Icebergify event_sanitized database
		Open		JAllemandou	T311739 [Iceberg] Update Refine Sanitize to insert into Iceberg tables

Event Timeline

• EChetty created this task. Jun 30 2022, 3:27 PM

Restricted Application added a subscriber: Aklapper. Jun 30 2022, 3:27 PM

Ottomata updated the task description. Jun 30 2022, 3:46 PM

Ottomata mentioned this in T311737: [Iceberg] Migrate event_sanitized_iceberg to event_sanitized. Jun 30 2022, 3:51 PM

Ottomata renamed this task from [Iceburg] Update Refine Sanitize to insert into Iceberg tables to [Iceberg] Update Refine Sanitize to insert into Iceberg tables. Jun 30 2022, 3:53 PM

Ottomata added a parent task: T311743: [Iceberg] Epic: Icebergify event_sanitized database. Jun 30 2022, 3:55 PM

• EChetty moved this task from Backlog to To be Discussed on the Data-Engineering-Planning board. Jun 30 2022, 5:48 PM

• EChetty set the point value for this task to 5. Jun 30 2022, 6:18 PM

• EChetty moved this task from To be Discussed to Estimated/Discussed on the Data-Engineering-Planning board.

• EChetty moved this task from Estimated/Discussed to Sprint 01 on the Data-Engineering-Planning board.

• EChetty edited projects, added Data-Engineering-Planning (Sprint 01); removed Data-Engineering-Planning.

• EChetty moved this task from Ready to Next Up on the Data-Engineering-Planning (Sprint 01) board. Jul 4 2022, 11:39 AM

• EChetty assigned this task to JAllemandou. Jul 4 2022, 3:11 PM

• EChetty moved this task from Next Up to In progress on the Data-Engineering-Planning (Sprint 01) board.

Change 811212 had a related patch set uploaded (by Joal; author: Joal):

[analytics/refinery/source@master] [WIP] Update refine to use Iceberg for event_sanitize

https://gerrit.wikimedia.org/r/811212

gerritbot added a project: Patch-For-Review. Jul 5 2022, 8:46 AM

• EChetty moved this task from In progress to In code review on the Data-Engineering-Planning (Sprint 01) board. Jul 11 2022, 9:46 AM

JAllemandou moved this task from In code review to Ready to deploy on the Data-Engineering-Planning (Sprint 01) board. Jul 15 2022, 2:42 PM

• EChetty edited projects, added Data-Engineering-Planning; removed Data-Engineering-Planning (Sprint 01). Jul 25 2022, 2:07 PM

• EChetty moved this task from Backlog to Estimated/Discussed on the Data-Engineering-Planning board. Jul 25 2022, 6:22 PM

• EChetty moved this task from Estimated/Discussed to Pipelines on the Data-Engineering-Planning board. Aug 15 2022, 8:42 AM

• EChetty moved this task from Pipelines to Estimated/Discussed on the Data-Engineering-Planning board. Aug 15 2022, 8:49 AM

• EChetty moved this task from Estimated/Discussed to Pipelines on the Data-Engineering-Planning board.

• EChetty added a project: Data Pipelines.

• EChetty moved this task from Backlog to Sprint 00 on the Data Pipelines board. Aug 15 2022, 8:52 AM

• EChetty edited projects, added Data Pipelines (Sprint 00); removed Data Pipelines.

• EChetty moved this task from Ready to In Progress on the Data Pipelines (Sprint 00) board. Aug 15 2022, 9:01 AM

• EChetty moved this task from In Progress to Ready to Deploy on the Data Pipelines (Sprint 00) board. Aug 16 2022, 2:30 PM

• EChetty moved this task from Ready to Deploy to Blocked/Paused on the Data Pipelines (Sprint 00) board.

• EChetty edited projects, added Data Pipelines (Sprint 01); removed Data Pipelines (Sprint 00). Sep 6 2022, 10:00 AM

• EChetty moved this task from Ready to Blocked/Paused on the Data Pipelines (Sprint 01) board. Sep 6 2022, 10:05 AM

• EChetty edited projects, added Data Pipelines; removed Data Pipelines (Sprint 01). Sep 6 2022, 10:07 AM

• EChetty moved this task from Backlog to Next Up (revisit every 2 sprints) on the Data Pipelines board. Sep 6 2022, 10:09 AM

• EChetty moved this task from Next Up (revisit every 2 sprints) to Sprint 02 on the Data Pipelines board. Sep 26 2022, 4:02 PM

• EChetty edited projects, added Data Pipelines (Sprint 02); removed Data Pipelines.

• EChetty moved this task from Sprint 02 to Next Up (revisit every 2 sprints) on the Data Pipelines board. Oct 17 2022, 11:16 AM

• EChetty edited projects, added Data Pipelines; removed Data Pipelines (Sprint 02).

There are currently two issues with moving the sanitized data to Iceberg:

The meta.dt field is sanitized, which prevents using it as the official event timestamp, even though it is more reliable than dt.
The meta.uuid field is sanitized, which prevents using it as a merge key (when rerunning a job we want to overwrite existing data, which in Iceberg is done using a merge on an ID).
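
For context on the second point, the sketch below shows the shape of an Iceberg `MERGE INTO` statement keyed on an ID, which is what makes a rerun overwrite existing rows instead of duplicating them. It only builds the SQL string in plain Python. The helper name, the table and view names, and the use of meta.uuid as the key are assumptions for illustration, and it presumes the key survives sanitization, which is exactly what is not the case today.

```python
def merge_statement(target: str, source_view: str, key: str) -> str:
    """Build a Spark SQL MERGE INTO statement for an Iceberg table.

    Rows from `source_view` whose `key` matches a row in `target` replace
    that row; all other rows are inserted. All names are hypothetical.
    """
    return (
        f"MERGE INTO {target} t "
        f"USING {source_view} s "
        f"ON t.{key} = s.{key} "
        "WHEN MATCHED THEN UPDATE SET * "
        "WHEN NOT MATCHED THEN INSERT *"
    )


if __name__ == "__main__":
    # Hypothetical table and view names, for illustration only.
    print(
        merge_statement(
            "event_sanitized_iceberg.some_table",
            "sanitized_batch",
            "meta.uuid",
        )
    )
```

Without a stable, unsanitized key such as meta.uuid, the `ON` clause has nothing reliable to match on, which is why the sanitization of that field blocks this approach.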

@Ottomata are there old eventlogging schemas not yet migrated to eventgate that would not have those two fields?

• EChetty moved this task from Next Up (revisit every 2 sprints) to Sprint 03 on the Data Pipelines board. Oct 17 2022, 4:01 PM

• EChetty edited projects, added Data Pipelines (Sprint 03); removed Data Pipelines.

Ya, there are some: T282131: Determine which remaining legacy EventLogging schemas need to be migrated or decommissioned. It looks like a couple of the ones mentioned here do have some sanitize allowlist entries in analytics/refinery.

Also all the ones in the 'unknown' tab of the Audit Spreadsheet, which are mostly Mobile Apps schemas. IIRC, they are not migrating these, but will be decommissioning them and making new ones based on Metrics Platform. I don't know the timeline. I do see MobileWikiApp schemas mentioned in the sanitize allowlist too.

Eventually, every schema will be either migrated or decommissioned. Once migrated, a schema will have those fields. Right now, though, ya, there are some that don't.

• EChetty moved this task from Ready to Next Up on the Data Pipelines (Sprint 03) board. Oct 18 2022, 4:01 PM

• EChetty moved this task from Next Up to Blocked/Paused on the Data Pipelines (Sprint 03) board. Nov 2 2022, 4:57 PM

• EChetty moved this task from Sprint 03 to Next Up (revisit every 2 sprints) on the Data Pipelines board. Nov 7 2022, 10:26 AM

• EChetty edited projects, added Data Pipelines; removed Data Pipelines (Sprint 03).

xcollazo subscribed. Nov 23 2022, 9:40 PM

Problem statement for this use-case here: https://docs.google.com/document/d/1HVO4m8JG5mrYX9ltdvJdt8N3QVscspdNVbpUY2kKtyo/edit

JArguello-WMF removed a project: Data-Engineering-Planning. Jun 29 2023, 9:59 PM

JArguello-WMF moved this task from Next Up (revisit every 2 sprints) to Backlog on the Data Pipelines board. Jun 30 2023, 5:48 PM

JArguello-WMF added a project: Data Engineering and Event Platform Team. Jun 30 2023, 5:53 PM

lbowmaker edited projects, added Data-Engineering; removed Data Engineering and Event Platform Team. Nov 10 2023, 2:49 PM

lbowmaker moved this task from Incoming (new tickets) to Icebox (not considered in current quarter) on the Data-Engineering board. Feb 16 2024, 8:52 PM