
[Iceberg Migration] Define sensor concept and implementation plan
Closed, Resolved | Public | 5 Estimated Story Points

Description

We need to define and finalize how we implement Iceberg table sensors.
The current proposal is to use a central table state store.

The task is to:

  • verify whether a state store is a scalable and practical approach and a common Iceberg practice
  • define the requirements for a state store
  • outline what a practical state store implementation would look like

Event Timeline

Ahoelzl renamed this task from [Iceberg Migration] Define sensor concept to [Iceberg Migration] Define sensor concept and implementation plan. Jan 11 2024, 2:08 AM
Ahoelzl updated the task description.
lbowmaker set the point value for this task to 5. Jan 17 2024, 12:48 PM

@BTullis, @brouberol, @Stevemunene, I would like your feedback on this subject:

We are going to set up a new system, a status-store, which would describe each fragment imported into a dataset.
The Hive metastore was taking care of this job for Hive tables.
But now that we have begun the migration to Iceberg tables, where the notion of time partitions is different, we need to find a place to track what has been imported into our tables.

This would solve the problem of triggering (Airflow) jobs on completion of an import into a dataset.
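
To make that concrete, here is a minimal sketch of what such a sensor could look like, assuming the state store ends up being queryable over SQL (as proposed just below). Everything in it is a placeholder for illustration, not an agreed interface: the StateStoreSensor class, the iceberg_import_state table, its columns, and the 'complete' status value.

```
# A minimal sketch only, not a finished design. It assumes the state store is
# reachable through a SQL connection; the table name, column names and the
# 'complete' status value are hypothetical.
from airflow.sensors.base import BaseSensorOperator
from sqlalchemy import create_engine, text


class StateStoreSensor(BaseSensorOperator):
    """Succeeds once the expected fragment is marked complete in the state store."""

    def __init__(self, *, db_uri: str, dataset: str, fragment: str, **kwargs):
        super().__init__(**kwargs)
        self.db_uri = db_uri      # SQLAlchemy URI of the PoC database (placeholder)
        self.dataset = dataset    # logical dataset / Iceberg table name
        self.fragment = fragment  # imported unit, e.g. an hourly partition label

    def poke(self, context) -> bool:
        engine = create_engine(self.db_uri)
        try:
            with engine.connect() as conn:
                row = conn.execute(
                    text(
                        "SELECT 1 FROM iceberg_import_state "
                        "WHERE dataset = :dataset AND fragment = :fragment "
                        "AND status = 'complete'"
                    ),
                    {"dataset": self.dataset, "fragment": self.fragment},
                ).first()
            return row is not None
        finally:
            engine.dispose()
```

A downstream DAG would then simply place this sensor upstream of its first task, and the scheduler would re-poke it until the import it waits for has been recorded.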

The most likely solution for storing this data is an SQL DB. A justification document is in preparation, but I prefer to open the subject now.
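
For discussion, here is a first guess at what the table behind that store could look like, sketched with SQLAlchemy Core so it stays portable between MariaDB and PostgreSQL. The table and column names are purely illustrative and match the placeholders used in the sensor sketch above.

```
# A rough sketch only: one small row per imported fragment per dataset.
from sqlalchemy import (
    Column, DateTime, MetaData, String, Table, UniqueConstraint, func,
)

metadata = MetaData()

iceberg_import_state = Table(
    "iceberg_import_state",
    metadata,
    # Logical dataset / Iceberg table the fragment belongs to.
    Column("dataset", String(255), nullable=False),
    # The imported unit, e.g. an hourly partition label such as "2024-01-17T14".
    Column("fragment", String(255), nullable=False),
    # Import lifecycle, e.g. 'running', 'complete', 'failed'.
    Column("status", String(32), nullable=False),
    # Last time this row was touched.
    Column("updated_at", DateTime, server_default=func.now(), nullable=False),
    UniqueConstraint("dataset", "fragment", name="uq_dataset_fragment"),
)
```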

The specs for this DB are tiny. Maybe 8GB memory, 20GB SSD, and 20 concurrent connections.

What's easier for you to set up in a first POC: a Maria or a PG?

Hi @Antoine_Quhen - we would be happy to support this.

We have a couple of shared database platforms that we use to provide state storage to Data-Platform services already. These are:

  • What is commonly referred to as the Analytics Meta MariaDB service - Currently running MariaDB version 10.4
  • Our bare-metal PostgreSQL service - Currently running PostgreSQL version 13

It would be easy and quick to set up a named database on either of these systems for you.

The concern I have, which is somewhat orthogonal to the requirements of this task, is that neither of these shared systems is yet highly available.
Both of these clusters (for want of a better word) comprise two bare-metal servers, one of which is the primary and the other a replica.
However, we do not yet have systems in place for automating failover from the primary to the replica, neither in response to an incident nor to assist with planned operations.

I mention this now, since I would like you to be aware that this is the current situation with our shared database platforms. Scheduled downtime is currently a necessity.
All of our existing services that depend on our shared relational database systems (i.e. Airflow, Hive, Druid, Superset, DataHub) currently have this limitation, so it's not that this 'state-store' would be an outlier; it would simply be in the same situation as our other services.

We do intend to improve our shared database systems in order to provide high availability, as well as offering container-based databases (backed by Ceph), so the problem may well go away before you get out of the PoC phase. But I thought it worth mentioning for discussion, whilst you are looking at setting up a new service.