Currently, users depend on datasets in Airflow via our custom dataset abstraction in wmf_airflow_common. This abstraction allows users to declare and configure datasets in datasets config files, and then refer to them by name when instantiating sensors in their DAGs.
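For illustration, a minimal sketch of how this abstraction is used today. The config keys, the `dataset()` helper, and its `get_sensor_for()` method are assumptions standing in for the actual wmf_airflow_common API:

```python
# Hypothetical datasets.yaml entry (names and keys are illustrative only):
#
#   hive_wmf_webrequest:
#     datastore: hive
#     table_name: wmf.webrequest

from datetime import datetime

from airflow import DAG

# Assumed import path; the real helper lives somewhere under wmf_airflow_common.
from wmf_airflow_common.datasets import dataset

with DAG(
    dag_id="webrequest_consumer",
    start_date=datetime(2023, 1, 1),
    schedule="@hourly",
) as dag:
    # Look the dataset up by the name declared in datasets.yaml and ask the
    # library for a sensor that waits until the data is ready.
    sensor = dataset("hive_wmf_webrequest").get_sensor_for(dag)
```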
RestExternalTaskSensor was primarily developed to make it possible to sense readiness of Iceberg data.
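For reference, a rough sketch of instantiating the sensor directly, assuming it targets a remote Airflow instance's REST API by endpoint plus external DAG/task IDs. Every parameter name and the import path below are assumptions, not the sensor's actual signature:

```python
from datetime import timedelta

# Assumed import path.
from wmf_airflow_common.sensors.rest_external_task_sensor import RestExternalTaskSensor

# Wait for a task in another Airflow instance to succeed, polling over REST.
# The endpoint URL is a placeholder.
wait_for_iceberg_write = RestExternalTaskSensor(
    task_id="wait_for_iceberg_write",
    airflow_api_endpoint="https://analytics-airflow.example.org/api/v1",
    external_dag_id="produce_iceberg_table",
    external_task_id="write_iceberg_partition",
    poke_interval=timedelta(minutes=5).total_seconds(),
)
```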
It would be nice if users could continue to use the same dataset config and library to depend on Iceberg tables. However, this may be a little awkward with DatasetRegistry as it stands, since DatasetRegistry expects datasets to live in 'datastores', while RestExternalTaskSensor has no knowledge of the data output of the tasks it senses on; it only knows when a referenced task is complete. On a cursory read, this limitation appears to be one of naming concepts only (e.g. datastore_to_dataset_map), not a functional one.
It seems possible to implement a 'RestExternalTaskSensorDataset' class (even though it isn't a dataset sensor proper) that would provide the desired functionality: using our dataset library to sense readiness of Iceberg tables. If we wanted, we could abstract away the 'RestExternalTaskSensor' part of this Dataset implementation and call it 'IcebergDataset', but this might be awkward, given that the configuration for an Iceberg table dataset would have to target an Airflow instance and task rather than the table itself.
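A minimal sketch of what such a class might look like, assuming a Dataset base class that exposes a `get_sensor_for(dag)` hook. The base class name, its interface, the import paths, and the sensor's parameters are all assumptions:

```python
from airflow import DAG
from airflow.models import BaseOperator

# Assumed import paths; the real classes live somewhere under wmf_airflow_common.
from wmf_airflow_common.datasets import Dataset
from wmf_airflow_common.sensors.rest_external_task_sensor import RestExternalTaskSensor


class RestExternalTaskSensorDataset(Dataset):
    """A 'dataset' whose readiness is defined as the success of a task in
    another Airflow instance, rather than as the presence of data in a
    datastore."""

    def __init__(
        self,
        airflow_api_endpoint: str,
        external_dag_id: str,
        external_task_id: str,
        **kwargs,
    ):
        super().__init__(**kwargs)
        self.airflow_api_endpoint = airflow_api_endpoint
        self.external_dag_id = external_dag_id
        self.external_task_id = external_task_id

    def get_sensor_for(self, dag: DAG) -> BaseOperator:
        # Delegate readiness entirely to RestExternalTaskSensor: the dataset
        # counts as 'ready' once the producing task reports success over the
        # remote instance's REST API.
        return RestExternalTaskSensor(
            task_id=f"wait_for_{self.external_dag_id}__{self.external_task_id}",
            airflow_api_endpoint=self.airflow_api_endpoint,
            external_dag_id=self.external_dag_id,
            external_task_id=self.external_task_id,
            dag=dag,
        )
```

If we went the 'IcebergDataset' route instead, the same class could simply be renamed, but its config fields would still point at an Airflow instance and task rather than at the table, which is the awkwardness noted above.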
We should explore our options here and implement a solution.
Also: currently, datasets.yaml config files are isolated in each airflow instance. We do not have a global datasets.yaml file (or files). We'll need one if we want to sense on Iceberg tables created by a different airflow instance, e.g. as done here.
To do this, we could make DatasetRegistry read every datasets.yaml file, or we could move dataset configuration up to a global-level config file in the airflow-dags repo. A rough sketch of the first option follows.
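This sketch assumes each instance's config lives at a predictable path inside airflow-dags (`<instance>/config/datasets.yaml`); the layout, the namespacing scheme, and how DatasetRegistry would consume the merged mapping are all assumptions:

```python
from pathlib import Path

import yaml  # PyYAML

# Assumed repo layout: <repo_root>/<instance>/config/datasets.yaml
REPO_ROOT = Path(__file__).resolve().parent


def load_all_dataset_configs(repo_root: Path = REPO_ROOT) -> dict:
    """Merge every instance's datasets.yaml into one global mapping, so a
    DAG in one instance can reference datasets declared in another."""
    merged: dict = {}
    for config_file in sorted(repo_root.glob("*/config/datasets.yaml")):
        instance = config_file.parts[-3]
        datasets = yaml.safe_load(config_file.read_text()) or {}
        for name, config in datasets.items():
            # Namespace by instance to avoid collisions between instances
            # that declare datasets with the same name.
            merged[f"{instance}.{name}"] = config
    return merged
```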