Context
Metrics Platform is building an Instrument Configurator (MPIC), which can dynamically declare Event Platform event streams in EventStreamConfig.
Currently, any stream declared in the production EventStreamConfig (dynamically or statically) automatically gets a Hive table created in the event database.
Data Engineering is building a Datasets Config for datasets in the analytics data lake (Hadoop) to manage declaration and maintenance of tables there.
- The current plan is for this to be GitOps-style config: it will live in a git repository and be served via an HTTP API.
- Tooling / jobs will apply this configuration, including creating and evolving tables.
- The Refine job is being refactored to handle automatic insertion of data, but not evolution of tables.
If event stream declaration is not integrated with Datasets Config (that is, if EventStreamConfig stays separate as is), event Hive tables will no longer be created automatically.
Questions to answer
- Are the Instrument Configurator user needs for dynamic event stream declaration worth the complexity?
- If not, how to proceed?
- If so, how to proceed?
- Are we sure we want to remove automatic event Hive table creation? There are unanswered questions in the Refine System Refactoring Google Doc.
To answer these questions, we need to understand the complexity involved in auto-syncing dynamic configuration to the static Datasets Config git repo, auto-deploying that config, and auto-applying the Hive table creation and evolution.
The Data Engineering team also needs a better understanding of how the MPIC will use dynamically created streams, and why they are needed. (This is surely documented somewhere; we just need links and a better understanding.)
Done is
- The above questions are answered and documented on the wiki
Possible solutions to investigate
TODO: update these as we learn more
One possible solution is to bring event stream configuration into Datasets Config. This would allow us to automate declaration of Hive tables in Datasets Config (via git tooling & CI) from event streams.
Doing so will mean that Dynamic EventStreamConfig will no longer work as planned by Metrics Platform for Instrument Configurator.
Also, EventStreamConfig is multi-DC (as part of MediaWiki) and allows streams to be configured in beta before they are tested. People also use it in their MediaWiki development environments (via local configuration). It may be difficult to bring all of this into a centralized system.
We might work around this in prod by making EventStreamConfig look up streams in the Datasets Config API via the EventStreamConfig hook developed for MPIC, while allowing EventStreamConfig (ESC) to work as is in other cases.
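The prod workaround can be pictured as a lookup with a fallback. The following is only a rough sketch in Python (EventStreamConfig itself is a MediaWiki extension, so the real hook would not look like this). The function name, its parameters, and the assumption that prod falls back to local configuration when Datasets Config has no entry are all hypothetical and not part of any existing API.

```python
def resolve_stream_config(stream_name, local_streams, datasets_config_streams, in_production):
    """Return the settings for a stream, or None if it is not declared.

    Hypothetical behavior, for illustration only:
    - in production, prefer the entry served by Datasets Config;
    - everywhere else (beta, local development), use local configuration as today.
    """
    if in_production:
        remote = datasets_config_streams.get(stream_name)
        if remote is not None:
            return remote
    # Assumption: prod also falls back to local configuration for streams
    # that Datasets Config does not know about.
    return local_streams.get(stream_name)


# Example with made-up stream names and settings.
local = {"test.local_only": {"schema_title": "test/local"}}
remote = {"test.instrument": {"schema_title": "test/instrument"}}

print(resolve_stream_config("test.instrument", local, remote, in_production=True))
print(resolve_stream_config("test.instrument", local, remote, in_production=False))
```

Outside production the second call returns None, because the stream only exists in Datasets Config; this is the "ESC works as is" case.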
Another solution may be to automate Hive table declaration in Datasets Config by polling the EventStreamConfig HTTP API, making commits to the Datasets Config repo, and then auto-merging and auto-deploying those commits.
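To get a feel for the polling approach, here is a minimal sketch of the diff step only: given the streams returned by one poll and the tables already declared in Datasets Config, work out which declarations a sync job would need to commit. The data shapes, the function names, and the table naming rule are assumptions made for illustration. Fetching from the HTTP API, committing, merging and deploying are deliberately left out.

```python
import re


def hive_table_name(stream_name):
    """Hypothetical naming rule: lowercase, with anything that is not
    a letter, digit or underscore replaced by an underscore."""
    return re.sub(r"[^a-z0-9_]", "_", stream_name.lower())


def missing_table_declarations(stream_configs, declared_tables, database="event"):
    """Return one entry per stream that has no table declared yet.

    stream_configs: dict of stream name -> settings (assumed shape of one poll result)
    declared_tables: set of "database.table" names already in Datasets Config
    """
    missing = []
    for stream_name in sorted(stream_configs):
        qualified = f"{database}.{hive_table_name(stream_name)}"
        if qualified not in declared_tables:
            missing.append({"stream": stream_name, "table": qualified})
    return missing


# Example with made-up data: one stream already has a table, one does not.
polled = {"mediawiki.page-create": {}, "test.instrument": {}}
declared = {"event.mediawiki_page_create"}

print(missing_table_declarations(polled, declared))
# [{'stream': 'test.instrument', 'table': 'event.test_instrument'}]
```

Each returned entry would become one commit (or one change within a commit) to the Datasets Config repo. How often to poll, and how to handle streams that are later removed, are open questions that this sketch does not address.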