
Define how we vet code & data for ongoing, automated ingestion in Druid
Closed, Declined (Public)

Description

As Product Analysts, we will need access to automatically updated data in Druid. This data will be consumed by a mix of analysts and stakeholders, and data issues can direct people toward the wrong decisions and erode trust in our insights.

We want to mitigate fallout from data issues by defining and following a process to vet the code that pipes data into Druid, as well as the data itself once it's in Druid.

Event Timeline

Ottomata edited subscribers, added: Pchelolo, mforns, Nuria and 4 others; removed: Tbayer.

In T214093: Modern Event Platform: Schema Guidelines and Conventions, we are discussing how to annotate event schemas with Druid ingestion information to ease automated ingestion into Druid.

We are considering the following:

Each event schema field can be annotated with its Druid ingestion type. Something like:


page_namespace:
  type: integer
  annotations:
    cube_type: [dimension]

time_visible_ms:
  type: integer
  annotations:
    cube_type: [time_measure]

bytes_added:
  type: integer
  annotations:
    cube_type: [measure]

(The cube_type/olap_type naming is still to be bikeshedded. I am trying to make this Druid-agnostic, as we might also be able to use this information in other OLAP-ey systems, like Prometheus.)

I'd like it if we eventually got to a place where any event data with a schema that has cube_type annotated fields would automatically be ingested into Druid, both in realtime and later in batch (Lambda Architecture style). For now though, we'd have a manual list of event streams / tables that would be used to specify which datasets should be ingested into Druid. The ingestion job would look up the schema and generate the appropriate Druid ingestion spec.

Note that this approach would not cover all Druid event ingestion use cases. It would only be able to generate the simplest ingestion specs. Any dataset that needs extra features (like Druid transforms or custom aggregations) would need a custom ingestion spec.
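To make the idea concrete, here is a minimal sketch of how an ingestion job might turn annotated schema fields into the dataSchema portion of a simple Druid ingestion spec. It is an illustration only, not the actual job: the function name build_data_schema, the ALLOWED_STREAMS list, the stream name, the "dt" timestamp column, and the choice to sum both measure kinds are all assumptions made for this example.

```python
# Hypothetical sketch. The field names mirror the annotation example above;
# everything else (allowlist, function name, timestamp column) is assumed.

ALLOWED_STREAMS = {"example_stream"}  # hypothetical manually maintained list

EXAMPLE_SCHEMA = {
    "page_namespace": {"type": "integer", "annotations": {"cube_type": ["dimension"]}},
    "time_visible_ms": {"type": "integer", "annotations": {"cube_type": ["time_measure"]}},
    "bytes_added": {"type": "integer", "annotations": {"cube_type": ["measure"]}},
    "comment": {"type": "string"},  # no annotation, so it is not ingested
}


def build_data_schema(stream, fields, timestamp_column="dt"):
    """Build the dataSchema part of a simple Druid ingestion spec."""
    if stream not in ALLOWED_STREAMS:
        raise ValueError(f"stream {stream!r} is not on the ingestion list")

    dimensions, metrics = [], []
    for name, field in fields.items():
        cube_types = field.get("annotations", {}).get("cube_type", [])
        if "dimension" in cube_types:
            dimensions.append(name)
        if "measure" in cube_types or "time_measure" in cube_types:
            # Assumption: both measure kinds are summed; integers use longSum.
            agg = "longSum" if field["type"] == "integer" else "doubleSum"
            metrics.append({"type": agg, "name": name, "fieldName": name})

    return {
        "dataSource": stream,
        "timestampSpec": {"column": timestamp_column, "format": "iso"},
        "dimensionsSpec": {"dimensions": dimensions},
        "metricsSpec": metrics,
    }
```

Calling build_data_schema("example_stream", EXAMPLE_SCHEMA) yields one dimension (page_namespace) and two longSum metrics. Anything beyond this, such as transforms or custom aggregations, is deliberately out of scope for the sketch.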

An issue with keeping the ingestion annotation in the event schema is that anyone who has merge rights to the analytics event schema repository will have the ability to change the ingestion information. For now, the Analytics team mitigates this by controlling the list of event streams to be ingested. Perhaps in the future we could make auto-ingestion self-serve somehow? I guess it depends on who actually has merge rights to change schemas.

An issue with keeping the ingestion annotation in the event schema is that anyone who has merge rights to the analytics event schema repository will have the ability to change the ingestion information.

Changing cube_type should likely be treated as a backwards-incompatible change, just like changing a field's type, so that will at least help with this potential problem.
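As a rough illustration of that rule, a schema compatibility check could flag a changed cube_type the same way it flags a changed type. The sketch below is hypothetical: incompatible_changes is not an existing function in any schema tooling, and it only compares fields present in both versions.

```python
def incompatible_changes(old_fields, new_fields):
    """Return names of shared fields whose type or cube_type differs."""

    def signature(field):
        cube_types = field.get("annotations", {}).get("cube_type", [])
        # Sort so that reordering the cube_type list is not treated as a change.
        return (field.get("type"), tuple(sorted(cube_types)))

    # Only fields present in both versions are compared; added or removed
    # fields are out of scope for this sketch.
    return sorted(
        name
        for name in old_fields.keys() & new_fields.keys()
        if signature(old_fields[name]) != signature(new_fields[name])
    )


old_version = {
    "page_namespace": {"type": "integer", "annotations": {"cube_type": ["dimension"]}},
    "bytes_added": {"type": "integer", "annotations": {"cube_type": ["measure"]}},
}
new_version = {
    "page_namespace": {"type": "integer", "annotations": {"cube_type": ["dimension"]}},
    # Same type, but the cube_type annotation was changed.
    "bytes_added": {"type": "integer", "annotations": {"cube_type": ["dimension"]}},
}
```

Here incompatible_changes(old_version, new_version) reports bytes_added, so a review step could reject the schema change even though the field's type is untouched.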

We won't be automating ingestion of event data into Druid now that it is queryable via Presto and Superset. Further discussion of schema/stream annotations is in T263672: Figure out where stream/schema annotations belong (for sanitization and other use cases).