
[Datasets Config][Spike] Understand and document the details and conflicts between Datasets Config, Refine refactor, Dynamic EventStreamConfig, and Metrics Platform Instrumentation Configurator
Open, Needs Triage | Public | 5 Estimated Story Points

Description

Context

Metrics Platform is building an Instrument Configurator, which can dynamically declare Event Platform event streams in EventStreamConfig.

Currently, any stream declared in the production EventStreamConfig (dynamically or statically) automatically gets a Hive table created in the event database.
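To make the stream-to-table relationship concrete, here is a minimal sketch of how a stream name might map to a table in the event database. The normalization rule (lowercase, with characters that are not valid in Hive identifiers replaced by underscores), the function name, and the example stream name are all assumptions for illustration, not a description of the actual Refine code.

```python
import re


def hive_table_for_stream(stream_name: str, database: str = "event") -> str:
    """Return a fully qualified Hive table name for an event stream.

    Assumption: any character that is not a letter, digit, or underscore
    is replaced with an underscore, and the result is lowercased.
    """
    table = re.sub(r"[^A-Za-z0-9_]", "_", stream_name).lower()
    return f"{database}.{table}"


# Hypothetical stream name, used only for illustration.
print(hive_table_for_stream("example.button-click"))
```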

Data Engineering is building a Datasets Config for datasets in the analytics data lake (Hadoop) to manage declaration and maintenance of tables there.

If event stream declaration is not integrated with Datasets Config (that is, if EventStreamConfig stays separate as it is today), event Hive tables will no longer be created automatically.

Questions to answer

  • Are the Instrumentation Configurator user needs for dynamic event stream declaration worth the complexity?
    • If not, how to proceed?
    • If so, how to proceed?

To answer these questions, we need to understand the complexity involved in auto-syncing dynamic configuration to the static Datasets Config git repo, auto-deploying that config, and auto-applying the Hive table creation and evolution.

The Data Engineering team also needs a better understanding of how the Metrics Platform Instrument Configurator (MPIC) will use dynamically created streams, and why they are needed. (This is surely documented somewhere; we just need links and a better understanding.)

Done is

  • Above questions are answered and documented on wiki

Possible solutions to investigate

TODO: update these as we learn more


A possible solution may be to bring event stream configuration into Datasets Config. This would allow us to automate declaration of Hive tables in Datasets Config (via git tooling & CI) from event streams.

Doing so will mean that Dynamic EventStreamConfig will no longer work as planned by Metrics Platform for Instrument Configurator.

Also, EventStreamConfig is multi-DC (as part of MediaWiki), and it allows streams to be configured in beta before testing them. People also use it in their MediaWiki development environments (via local configuration). It may be difficult to bring this into a centralized system.

We might work around this in prod by making EventStreamConfig (ESC) look up streams in the Datasets Config API via the EventStreamConfig hook developed for MPIC, while allowing ESC to work as is in other cases.
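The workaround above could look roughly like the following sketch. It only illustrates the control flow: when a Datasets Config lookup is wired in (prod), it is the source of stream config; otherwise the locally declared config is used unchanged. The function name, the parameters, and the choice that prod does not fall back to local config are assumptions, not the real EventStreamConfig hook interface.

```python
from typing import Callable, Mapping, Optional

StreamConfig = Mapping[str, object]


def resolve_stream(
    name: str,
    local_streams: Mapping[str, StreamConfig],
    datasets_lookup: Optional[Callable[[str], Optional[StreamConfig]]] = None,
) -> Optional[StreamConfig]:
    """Resolve the config for one stream.

    Assumption: in prod a lookup against the Datasets Config API is
    provided and is authoritative. In beta and development environments
    no lookup is provided, so local configuration is used as is.
    """
    if datasets_lookup is not None:
        return datasets_lookup(name)
    return local_streams.get(name)


# Hypothetical data standing in for local config and for the remote API.
local = {"example.stream": {"source": "local"}}
remote = {"example.stream": {"source": "datasets_config"}}

dev_config = resolve_stream("example.stream", local)
prod_config = resolve_stream("example.stream", local, remote.get)
```

Whether prod should also fall back to statically declared streams when the lookup finds nothing is an open design question that this sketch does not settle.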


Another solution may be to automate Hive table declaration in Datasets Config by polling the EventStreamConfig HTTP API, making commits to the Datasets Config repo, and auto-merging and auto-deploying those commits.
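The core of such a poller would be a comparison between the streams currently declared in EventStreamConfig and the datasets already declared in the repo. The sketch below covers only that comparison step and assumes both sides can be reduced to a list of names; fetching from the HTTP API, writing the commit, and the merge and deploy steps are left out. `SyncPlan` and `plan_sync` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class SyncPlan:
    to_add: List[str]     # streams with no dataset declaration yet
    to_remove: List[str]  # declarations whose stream no longer exists


def plan_sync(stream_names: Iterable[str], declared_names: Iterable[str]) -> SyncPlan:
    """Compare polled stream names with names already declared in the repo."""
    streams = set(stream_names)
    declared = set(declared_names)
    return SyncPlan(
        to_add=sorted(streams - declared),
        # Removals are only reported here. Dropping a table automatically
        # is risky, so a real tool would likely leave them for human review.
        to_remove=sorted(declared - streams),
    )


# Hypothetical names, for illustration only.
plan = plan_sync(
    stream_names=["example.a", "example.b"],
    declared_names=["example.b", "example.c"],
)
```

An empty plan would mean no commit is needed, which keeps repeated polling idempotent.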

Event Timeline

Ahoelzl set the point value for this task to 5. Thu, Apr 4, 3:53 PM
Ottomata renamed this task from [Event Platform] [Spike] Develop a concept to apply Metrics Platform configurations to event stream configurations to [Event Platform] [Spike] Understand and document the details and conflicts between Datasets Config, Refine refactor, Dynamic EventStreamConfig, and Metrics Platform Instrumentation Configurator. Thu, Apr 4, 6:13 PM
Ottomata updated the task description.
Ottomata renamed this task from [Event Platform] [Spike] Understand and document the details and conflicts between Datasets Config, Refine refactor, Dynamic EventStreamConfig, and Metrics Platform Instrumentation Configurator to [Datasets Config][Spike] Understand and document the details and conflicts between Datasets Config, Refine refactor, Dynamic EventStreamConfig, and Metrics Platform Instrumentation Configurator. Thu, Apr 4, 6:17 PM