
[Datasets Config][Spike] Understand and document the details and conflicts between Datasets Config, Refine refactor, Dynamic EventStreamConfig, and Metrics Platform Instrumentation Configurator
Open, Needs Triage · Public · 5 Estimated Story Points

Description

Context

Metrics Platform is building an Instrument Configurator, which can dynamically declare Event Platform event streams in EventStreamConfig.

Currently, any stream declared in production EventStreamConfig config (dynamically or statically) will automatically have a Hive table created in the event database.

Data Engineering is building a Datasets Config for datasets in the analytics data lake (Hadoop) to manage declaration and maintenance of tables there.

Without integrating event stream declaration into Datasets Config (i.e. keeping EventStreamConfig separate as is), event Hive tables will no longer be automatically created.

Questions to answer

  • Are the Instrumentation Configurator user needs for dynamic event stream declaration worth the complexity?
    • If not, how to proceed?
    • If so, how to proceed?

To answer these questions, we need to understand the complexity involved in auto-syncing dynamic configuration to the static Datasets Config git repo, auto deploying that config, and auto-applying the Hive table creation and evolution.

The Data Engineering team also needs a better understanding of how the MPIC will use dynamically created streams, and why they are needed. (This is surely documented somewhere, we just need links and better understanding).

Done is

  • Above questions are answered and documented on wiki

Possible solutions to investigate

TODO: update these as we learn more


A possible solution may be to bring event stream configuration into Datasets Config. This would allow us to automate declaration of Hive tables in Datasets Config (via git tooling & CI) from event streams.

Doing so will mean that Dynamic EventStreamConfig will no longer work as planned by Metrics Platform for the Instrument Configurator.

Also, EventStreamConfig is multi-DC (as part of MediaWiki), and allows configuring streams in beta before testing them. People also use it in their MediaWiki development environments (via local configuration). It may be difficult to bring this into a centralized system.

We might work around this in prod by making EventStreamConfig look up streams via the Datasets Config API (using the EventStreamConfig hook developed for MPIC), and letting ESC work as is in other cases.


Another solution may be to automate Hive table declaration in Datasets Config by polling EventStreamConfig HTTP API and making commits to Datasets Config repo, and auto-merge and auto deploy the commits.
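A minimal sketch of what this polling approach could look like, split into a pure planning step and an impure sync driver. The API endpoint, repo path, and dataset file layout are all illustrative assumptions, not an agreed design; CI on the repo would then handle auto-merge, deploy, and Hive DDL application.

```python
# Hypothetical sketch: periodically fetch stream declarations from the
# EventStreamConfig HTTP API and commit any new ones to a local clone of
# the Datasets Config repo. Endpoint URL, repo path, and file layout are
# illustrative assumptions only.
import json
import subprocess
from pathlib import Path
from urllib.request import urlopen

ESC_API = "https://meta.wikimedia.org/w/api.php?action=streamconfigs&format=json"  # assumed shape
REPO = Path("/srv/datasets-config")  # hypothetical local clone

def plan_new_declarations(streams: dict, existing_names: set) -> dict:
    """Pure step: map each not-yet-declared stream to a dataset config file body."""
    return {
        f"datasets/event/{name}.json": json.dumps({"stream": name, "config": cfg}, indent=2)
        for name, cfg in streams.items()
        if name not in existing_names
    }

def sync_once() -> None:
    """Impure driver: poll ESC, write any new declaration files, commit them."""
    with urlopen(ESC_API) as resp:
        streams = json.load(resp).get("streams", {})
    existing = {p.stem for p in (REPO / "datasets" / "event").glob("*.json")}
    plan = plan_new_declarations(streams, existing)
    for rel_path, body in plan.items():
        target = REPO / rel_path
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(body)
    if plan:
        subprocess.run(["git", "-C", str(REPO), "add", "."], check=True)
        subprocess.run(
            ["git", "-C", str(REPO), "commit", "-m", f"Auto-sync {len(plan)} stream(s) from ESC"],
            check=True,
        )
```

The pure/impure split keeps the "which streams are new" decision testable without network or git access.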

Event Timeline

Ahoelzl set the point value for this task to 5.Apr 4 2024, 3:53 PM
Ottomata renamed this task from [Event Platform] [Spike] Develop a concept to apply Metrics Platform configurations to event stream configurations to [Event Platform] [Spike] Understand and document the details and conflicts between Datasets Config, Refine refactor, Dynamic EventStreamConfig, and Metrics Platform Instrumentation Configurator.Apr 4 2024, 6:13 PM
Ottomata updated the task description. (Show Details)
Ottomata renamed this task from [Event Platform] [Spike] Understand and document the details and conflicts between Datasets Config, Refine refactor, Dynamic EventStreamConfig, and Metrics Platform Instrumentation Configurator to [Datasets Config][Spike] Understand and document the details and conflicts between Datasets Config, Refine refactor, Dynamic EventStreamConfig, and Metrics Platform Instrumentation Configurator.Apr 4 2024, 6:17 PM

The Data Engineering team also needs a better understanding of how the MPIC will use dynamically created streams, and why they are needed. (This is surely documented somewhere, we just need links and better understanding).

Pasting some comments from a slack convo that expands on this question.

They [Metrics Platform] want to decouple deployments of code (instruments) from deployments of its config. They also want short feedback loops.

What I don’t understand is why they need dynamic stream config to do this.

  • Instrumentation is code, and requires code deployment, no matter what (unless I’m wrong and they have something very fancy). E.g. if they make a new feature, they will also have to deploy code that instruments that feature. The feedback loop is not shortened by not having to deploy config.
  • However, once instrumentation is deployed, perhaps people want to easily change parameters of that instrumentation. That sounds nice. MPIC justified. ✅
  • One can make a UI to change these parameters used by instrumentation to produce events, including the value of meta.stream. But, in order to change meta.stream, the UI needs to be able to dynamically declare that stream in ESC.
  • But what value does this actually provide? For any given instrument, keeping meta.stream the same (or choosing from a set of limited streams declared in ESC) seems sufficient to me. Short feedback loops can be provided by varying a different parameter that causes the instrumentation to set a field in their event. This field can be filtered on during analysis.
  • It is pretty neat to be able to make the instrument send to any stream, which means the instrument can populate any Hive table.
  • But is it required to support their goal of shorter feedback loops? And if it is not required, is it worth the complexity of dynamic stream config?

This is what I’d love to understand and have documented, before we decide to support it! (even though it is probably too late because they already implemented it!)

It [MPIC] needs ESC for producer configs, sample rate being the example I usually see cited.

This makes sense. But I thought that sample rate and other instrumentation/producer/client config was going to be handled by MPIC, and removed from ESC.

@gmodena wrote:

You raise valid points. But tbh I don't think any of these were factors they considered for dynamic streams. That design choice was driven by the wish to make stream creation and deletion (emphasis mine) more user friendly:
https://docs.google.com/document/d/1U-CweFJ-5Px7u8yGMU_3hVKePxdn0OXAKKuhHIwmejA/edit#heading=h.fw5awziuoj1r
Event Stream Creation & Configuration CP enables:

  • the ability to deploy and configure EventStreamConfig without bespoke deployments
  • the ability to set sampling rates per stream
  • the ability to turn on/off and renew expired streams

The last two points make sense, and could be done without dynamic ESC. However, I don't see why the first point is required for their needs; it seems more like a solution, and I think there are other solutions (as stated above) that introduce less complexity.

+1 on moving producer/client configs outside of ESC, but I am not a priori against their approach. If they wanted to keep sample rate in ESC for backward compat, without refactoring ESC, there could have been a risk of duplicating stream settings.

Hm, but they are already introducing the ability to configure instruments outside of ESC, and they have access to ESC, so I'm not sure why they couldn't keep backwards compatibility without dynamic ESC.


I am surely missing something! Perhaps this is all documented in some doc or Miro board or conversation, but I haven't seen it yet. I feel like I've been asking this question for at least 4 months (perhaps longer if we go back to the Control Plane discussions we had last June), and have not received an answer.

In short:

  • Goal: shorter feedback loops.
  • Solution 1: Dynamic ESC: every new experiment(?) goes to a new stream
  • Solution 2: Static ESC with pre-declared streams: new experiments are ID-ed by a field in the event, and analysis for that experiment filters on that field.

Why is Solution 1 better than Solution 2? From my perspective, Solution 2 is better because it is less complex. Solution 2 also seems better because experiments from the same instrumentation can be compared using filters, rather than joins between different streams & tables.
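To make the contrast concrete, a minimal sketch of the Solution 2 analysis pattern, where one pre-declared stream carries all experiments and an event field identifies each one (the stream and field names here are hypothetical):

```python
# Solution 2 sketch: all experiments share a single pre-declared stream, and
# each event carries an identifying field. Stream/field names are made up.
events = [
    {"meta": {"stream": "product_metrics.web_ui_actions"}, "experiment_id": "exp-a", "action": "click"},
    {"meta": {"stream": "product_metrics.web_ui_actions"}, "experiment_id": "exp-b", "action": "click"},
    {"meta": {"stream": "product_metrics.web_ui_actions"}, "experiment_id": "exp-a", "action": "scroll"},
]

def events_for_experiment(events: list, experiment_id: str) -> list:
    """Analysis filters on the field instead of joining across per-experiment tables."""
    return [e for e in events if e.get("experiment_id") == experiment_id]

def compare_experiments(events: list, a: str, b: str) -> tuple:
    """Experiments from the same instrumentation compare with two filters, no joins."""
    return len(events_for_experiment(events, a)), len(events_for_experiment(events, b))
```

The same shape works as a `WHERE experiment_id = ...` clause against a single Hive table.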

This comment is the result of some time spent collecting info from various stakeholders, and reviewing documentation and decision records.
It was initially shared as a Google doc (now moved to read only to preserve comment history).

Questions to answer

Are the Instrumentation Configurator user needs for dynamic event stream declaration worth the complexity?

Outcome: I would not block MP if dynamic streams are a requirement that was previously agreed upon by all parties.
The hook has been implemented in https://gerrit.wikimedia.org/r/c/mediawiki/extensions/EventStreamConfig/+/1010607
Deployment of data collection tags is decoupled from their configuration. The need for dynamic stream configuration is to avoid re-deploying ESC every time an analyst sets up or updates an instrument. A common use case is updating the sampling rate for a stream/instrument. Right now this property is declared in ESC, and changing its value would require re-deploying the extension.
If data collection code is embedded in MediaWiki, it will be deployed as part of the weekly deployment train. If it is embedded in the Android or iOS apps, it will first be submitted to Google/Apple and then rolled out progressively.
Alternative approaches to a MP hook to declare dynamic streams have been explored in https://phabricator.wikimedia.org/T331514.
Right now Event Platform does not support this type of dynamic configuration. While there is a risk of MP becoming a dependency of EP (via the dynamic config hook), I think it's a bridge we already crossed (Gobblin configs are an EP dependency via consumer settings). ESC is in a bit of a weird state and we are considering a refactoring (Stream Registry: WIP), so I would not be able to provide alternative paths right now.

Are we sure we want to remove automatic event Hive table creation?

Outcome: Refine won’t remove support for automatic event table creation.
Functionally, the system should not change behavior for end users post-refactoring. Decision record and implementation considerations on how to support this have been discussed in slack.

Context

EventStreamConfig, DE Dataset Config Store integration and Refine refactoring.

We SPIKEd integration of ESC and config store in https://phabricator.wikimedia.org/T361017#9754921

Outcome: At least for the initial iteration, we'll keep event and stream configurations outside of https://gitlab.wikimedia.org/tchin/config-store-poc.

One of the goals of embedding ESC and event schemas in Dataset Config Store was to improve audit capabilities and Hive DDL change management for tables automatically created from streams at refine time, and to streamline the ingestion process. The current process is documented in https://wikitech.wikimedia.org/wiki/User:Gmodena/Event_Platform/Refine_integration.

We (Andrew, Gabriele, Joseph) broadly agree on a path to auto-generate Hive DDL from event schemas, and keep schema changes under revision control: https://wikimedia.slack.com/archives/C05RHK7PS6Q/p1714133754563449?thread_ts=1714075600.590239&cid=C05RHK7PS6Q . This would support requirements for introspection. We should flesh out options in DPE Data Engineering - Refine System Refactoring.
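A minimal sketch of what auto-generating Hive DDL from a JSON event schema could look like. The type mapping, table naming, and `event` database prefix are my assumptions for illustration, not the agreed design:

```python
# Hypothetical JSON Schema -> Hive DDL generator. The type mapping and
# table naming convention are illustrative assumptions.
JSON_TO_HIVE = {
    "string": "STRING",
    "integer": "BIGINT",
    "number": "DOUBLE",
    "boolean": "BOOLEAN",
}

def hive_type(prop_schema: dict) -> str:
    """Map one JSON Schema property to a Hive column type (recursing into
    objects and arrays)."""
    t = prop_schema.get("type")
    if t == "object":
        fields = ", ".join(
            f"{k}: {hive_type(v)}" for k, v in prop_schema.get("properties", {}).items()
        )
        return f"STRUCT<{fields}>"
    if t == "array":
        return f"ARRAY<{hive_type(prop_schema.get('items', {}))}>"
    return JSON_TO_HIVE.get(t, "STRING")

def create_table_ddl(table: str, schema: dict) -> str:
    """Emit a CREATE TABLE statement for one event schema. Keeping the
    generated DDL in git gives the revision control / audit trail discussed."""
    cols = ",\n  ".join(
        f"`{name}` {hive_type(prop)}"
        for name, prop in schema.get("properties", {}).items()
    )
    return f"CREATE TABLE IF NOT EXISTS `event`.`{table}` (\n  {cols}\n)"
```

Committing the generated DDL (rather than applying it silently at refine time) is what makes schema evolution reviewable.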

MPIC dynamic stream creation

See Instrument Configurator – Design Document and Instrument Configurator – Decision Record – Where Do We Build the Instrument Configurator?.

An earlier design doc for this project further clarifies the requirements for this design choice
Design Document - Metrics Platform Stream Configuration Management Control Plane (see Proposed Solution section) .

MPIC requirements for dynamic stream configs come from the need to let users create, expire, and version new streams per instrument via a web UI. EP currently does not support this capability; stream configs are updated via gitops on the EventStreamConfig extension repo. Changes are then deployed following MediaWiki trains (and backport windows).

To support dynamic streams, MP developed an extension that piggybacks on MediaWiki hooks to fetch configs from the MPIC backend and merge them with ESC responses. Implementation details are discussed in Instrument Configurator – Design Document.

Concerns:

  • latency: the hook will return if a stream is found in ESC, and perform a callback to MPIC in case of a miss. We should carefully instrument this behaviour, since we had issues with ESC latency in the past (when retrieving large payloads)
  • provenance: we should be able to differentiate between streams coming from ESC and from MPIC.
  • resource allocation: if we move away from gitops to more dynamic configs, we lose control over data inflow into Kafka and HDFS. We should have quotas (or safeguards) in place to avoid accidentally hammering systems.
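The lookup-with-fallback behaviour raised in the concerns above can be sketched as follows: check ESC first, call back to MPIC on a miss, and record both latency and provenance for each lookup. Function and field names are illustrative, not the extension's actual API:

```python
# Hypothetical sketch of the ESC-first, MPIC-on-miss lookup, with latency
# and provenance recording. Names are illustrative assumptions.
import time

def get_stream_config(name, esc_lookup, mpic_lookup, metrics):
    """Return the config for `name`, trying ESC first, then MPIC."""
    start = time.monotonic()
    config = esc_lookup(name)
    source = "esc"
    if config is None:
        config = mpic_lookup(name)  # callback to the MPIC backend on a miss
        source = "mpic"
    # Record latency so regressions (cf. past ESC latency issues) are visible.
    metrics.append({"stream": name, "source": source,
                    "latency_s": time.monotonic() - start})
    if config is not None:
        config = {**config, "provenance": source}  # tag where the config came from
    return config
```

Tagging provenance in the merged response addresses the second concern without changing the payload shape consumers rely on.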

Outcome: if the dynamic hook mechanism is a non-negotiable requirement and it was agreed upon, I would not block it. I would however like to understand implementation details of stream changes wrt instrument config changes. In the long term I’d like to refactor the hook mechanism, and have MPIC configs (dynamic or static) served by a refactored stream config service (see below). I think EP could support with hands on implementation, if all parties agree.

Refactoring of ESC to support MP and other use cases

Stream Registry: WIP is a high level proposal to refactor the existing ESC into a service oriented architecture that could support (or integrate better with) MP and DE use cases.

I have a PoC for a small service, built atop a config store, that supports stream retrieval (read) / creation (write) of git objects over HTTP. Discussed in slack.

I domain squatted https://stream-registry-poc.toolforge.org/api/stream/. This basic PoC runs atop the same service that @tchin@wikimedia.org developed to serve airflow/datasets configs, modified to support writes on /stream/ routes.

Outcome: We can produce ESC-compatible payloads with the config store and jsonschema. I am not convinced that git (GitLab APIs) is the best backend for this type of operation, but in principle it's a possible path (though maybe not desirable). Next step for me would be to expand Stream Registry: WIP into a design doc (covering implementation details).

Outcome: Refine won’t remove support for automatic event table creation.

FWIW, with this outcome, then dynamic ESC as implemented is fine with me :)

Before resolving this ticket, I'd like to answer and document why MPIC needs dynamic stream declaration, as asked at T361853#9783572 and above.

This answer would inform what capabilities a refactored "stream registry" (and/or Datasets Config?) really need.

FWIW, with this outcome, then dynamic ESC as implemented is fine with me :)

Ack. Thanks for the f/up @Ottomata. Let's capture this in a decision record.

Before resolving this ticket, I'd like to answer and document why MPIC needs dynamic stream declaration, as asked at T361853#9783572 and above.

I don't have answers for the points you raised (I was not involved in those threads), but I'd be more than happy to sit in a sync session if useful.

cc @phuedx @WDoranWMF @cjming @Ahoelzl