
[Iceberg] Update Refine Sanitize to insert into Iceberg tables
Open, Needs Triage, Public, 5 Estimated Story Points

Description

Refine currently uses a DataFrameToHive class that automates schema evolution and insertion into Hive tables from a Spark DataFrame. We want to implement similar functionality for Iceberg, most likely as a DataFrameToIceberg class. In the future, we could unify this kind of connector interface, but for now, a standalone Iceberg implementation will be fine.
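To make the schema-evolution half of this concrete, here is a minimal sketch in plain Python of the kind of step such a class might perform: compare the table's columns with the incoming DataFrame's columns and generate DDL for the ones that are missing. The function names, the flat name-to-type dictionaries, and the table name `event_sanitized_iceberg.example` are all hypothetical; this is not the DataFrameToHive logic, and it ignores nested fields and type changes.

```python
from typing import Dict, List


def missing_columns(table_schema: Dict[str, str], df_schema: Dict[str, str]) -> Dict[str, str]:
    """Return columns present in the incoming schema but absent from the table.

    Both schemas are hypothetical flat mappings of column name -> SQL type.
    Names are compared case-insensitively.
    """
    existing = {name.lower() for name in table_schema}
    return {name: dtype for name, dtype in df_schema.items() if name.lower() not in existing}


def add_columns_ddl(table: str, new_columns: Dict[str, str]) -> List[str]:
    """Build an ALTER TABLE ... ADD COLUMNS statement, or nothing if no columns are new."""
    if not new_columns:
        return []
    cols = ", ".join(f"{name} {dtype}" for name, dtype in new_columns.items())
    return [f"ALTER TABLE {table} ADD COLUMNS ({cols})"]


if __name__ == "__main__":
    table = {"dt": "string", "wiki": "string"}
    incoming = {"dt": "string", "wiki": "string", "page_id": "bigint"}
    # Table name is made up for illustration.
    for stmt in add_columns_ddl("event_sanitized_iceberg.example", missing_columns(table, incoming)):
        print(stmt)
```

In a real job the generated statements would be run through Spark SQL before the insert; here they are only built as strings so the sketch stays self-contained.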

We will then use this to make a new (or adapted) RefineSanitize job that will read from event Hive tables and write to new Iceberg tables in a new event_sanitized_iceberg database. This database will eventually replace event_sanitized.

Event Timeline

Ottomata renamed this task from [Iceburg] Update Refine Sanitize to insert into Iceberg tables to [Iceberg] Update Refine Sanitize to insert into Iceberg tables. Jun 30 2022, 3:53 PM

Change 811212 had a related patch set uploaded (by Joal; author: Joal):

[analytics/refinery/source@master] [WIP] Update refine to use Iceberg for event_sanitize

https://gerrit.wikimedia.org/r/811212

There are currently two issues with moving the sanitized data to Iceberg:

  • The meta.dt field is sanitized, which prevents using it as the official event timestamp even though it is more reliable than dt
  • The meta.uuid field is sanitized, which prevents using it as a merge key (when rerunning a job, we want to overwrite existing data, which in Iceberg is done with a merge on an ID)
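For illustration of the second point, a rerun that overwrites existing rows could be expressed as a Spark SQL `MERGE INTO` keyed on the ID field. The sketch below only builds the statement as a string; the helper name, the table name, and the temporary view name are hypothetical, and it assumes meta.uuid survives sanitization, which is exactly what is currently not the case.

```python
def merge_sql(target: str, source_view: str, key: str = "meta.uuid") -> str:
    """Build a MERGE statement that upserts rows from source_view into target.

    Rows whose key already exists in the target are overwritten, all others
    are inserted. The key must be present and unsanitized on both sides.
    """
    return (
        f"MERGE INTO {target} AS t "
        f"USING {source_view} AS s "
        f"ON t.{key} = s.{key} "
        "WHEN MATCHED THEN UPDATE SET * "
        "WHEN NOT MATCHED THEN INSERT *"
    )


if __name__ == "__main__":
    # Both names are made up for illustration.
    print(merge_sql("event_sanitized_iceberg.example", "sanitized_batch"))
```

If the key is nulled out or dropped by sanitization, the `ON` clause has nothing to match on, so a rerun would append duplicates instead of replacing rows.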

@Ottomata are there old EventLogging schemas not yet migrated to EventGate that would not have those two fields?

Ya there are some: T282131: Determine which remaining legacy EventLogging schemas need to be migrated or decommissioned. It looks like a couple of the ones mentioned there do have some sanitize allowlist entries in analytics/refinery.

Also all the ones in the 'unknown' tab of the Audit Spreadsheet, which is mostly Mobile Apps. IIRC, they are not migrating these, but will be decommissioning them and making new ones based on metrics platform. Don't know the timeline. I do see MobileWikiApp schemas mentioned in the sanitize allowlist too.

Eventually, a schema will be migrated or decommissioned. Once migrated, they will have those fields. Right now, though, ya there are some that don't.