
[SPIKE] Can we express Event Platform configs in Datasets Config?
Open, Needs Triage · Public · 5 Estimated Story Points · Spike

Description

Right now, event stream and processing pipeline configuration is spread across a few services (the MW stream config extension, Gobblin, schema repos, deployment-charts).

The goal of this spike is to prototype a dataset- or tool-oriented config for streams and streaming applications.

We need to investigate:

  • tool- vs dataset-oriented config layout.
  • interop and implications with k8s service deployments.
  • ...

Event Timeline

Restricted Application changed the subtype of this task from "Task" to "Spike". · Mar 26 2024, 1:57 PM
Restricted Application added a subscriber: Aklapper.
Ottomata renamed this task from [SPIKE] Can we express Event Platform configs in config store? to [SPIKE] Can we express Event Platform configs in Datasets Config?. · Thu, Apr 25, 6:14 PM

Summary

Tl;dr:

We can easily express a stream config as JSON Schema and expose it via datasets-config-service.
I am opposed to a monorepo for all configurations and suggest focusing current efforts on Airflow and Airflow-produced datasets. For integration with Metrics Platform and MediaWiki, I lean towards a service mesh approach. The service developed by @tchin could serve as a template.
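For illustration only (the schema and field names below are made up, not the actual EventStreamConfig schema), a stream config entry could be described and validated along these lines with python-jsonschema:

import jsonschema

# Hypothetical JSON Schema for a single stream config entry, as it could be
# exposed by datasets-config-service. Field names are illustrative only.
STREAM_CONFIG_SCHEMA = {
    "$schema": "https://json-schema.org/draft-07/schema#",
    "type": "object",
    "required": ["stream", "topics"],
    "properties": {
        "stream": {"type": "string"},
        "topics": {"type": "array", "items": {"type": "string"}},
        "is_wmf_production": {"type": "boolean"},
    },
}

stream_config = {
    "stream": "webrequest_frontend_rc0",
    "topics": ["eqiad.webrequest_frontend_rc0", "codfw.webrequest_frontend_rc0"],
    "is_wmf_production": True,
}

# Raises jsonschema.ValidationError if the entry does not match the schema.
jsonschema.validate(instance=stream_config, schema=STREAM_CONFIG_SCHEMA)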

In my opinion, the approach I've seen so far is really good for Airflow DAG parametrization. IMHO it should be explicitly stated that the system we are building is the Airflow Dataset Config store/service, not just a generic configuration repository.

An Airflow GobblinOperator (which declares job settings) would serve as an API boundary between Airflow-managed datasets and Event Platform services. Its parametrization should be declared in the config store. This would also allow us to remove specific configurations in ESC (currently, a stream configuration declares when Gobblin is expected to consume a stream, which would not be necessary if the requirement were specified downstream in the config store).

See the Prototype section for an example of such config.

Broader considerations

For this spike I investigated the current EventStreamConfig use cases (beyond EventGate), tool- vs dataset-oriented config layouts, and the interop and implications with k8s service deployments.

Based on my findings:

  1. My preference is to keep EventStreamConfig and event stream declarations outside of a mono-repo config store. We could re-use the webservice business logic for a more general-purpose database of stream configs. I will explore this concept further in https://phabricator.wikimedia.org/T361853 and Stream Registry: WIP, building atop https://gitlab.wikimedia.org/repos/data-engineering/dataset-config-store-service.
  2. There will always be config bits that live outside of DPE config store (mono) repos. k8s declares configs in deployment-charts. While we could parametrize Flink applications via the config store, we won't be able to fully replace the k8s (system) configs (e.g. memory allocation per pod). On k8s we also prefer to avoid HTTP calls to third-party config services, so the config store would have to be embedded in e.g. Flink apps and merged with the helmfiles; I don't see a lot of benefit in the added complexity. The same considerations apply to puppet. Effectively, both puppet and deployment-charts are API boundaries with other teams (SRE, mw).
  3. The approach I've seen makes sense for parametrizing Airflow operators. Effectively, I see a coupling between the definition of a dataset and an Airflow DAG. In this sense, for DE / batch configs I think a dataset "view" of the schema makes more sense than a tooling-oriented one. Having everything in one place will help with developing tools for introspection (e.g. detecting unused artifact deps becomes a simple graph traversal over the value files). In this scenario, an Airflow GobblinOperator (that declares job settings) would act as an API boundary between Airflow-managed datasets and Event Platform services (see the sketch after this list).
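To make that boundary concrete, here is a rough sketch of what such a GobblinOperator could look like. None of this code exists yet; the parameter names are hypothetical and mirror the mock value file in the Prototype section below.

from airflow.models.baseoperator import BaseOperator


class GobblinOperator(BaseOperator):
    """Hypothetical operator that declares Gobblin job settings as parameters,
    acting as the API boundary between Airflow-managed datasets and Event
    Platform services. Sketch only; nothing here exists yet."""

    def __init__(
        self,
        *,
        job: str,
        topics: list,
        data_publisher_final_dir: str,
        mr_job_max_mappers: int = 48,
        bootstrap_with_offset: str = "latest",
        writer_partition_timestamp_columns: str = "dt",
        **kwargs,
    ):
        super().__init__(**kwargs)
        self.job = job
        self.topics = topics
        self.data_publisher_final_dir = data_publisher_final_dir
        self.mr_job_max_mappers = mr_job_max_mappers
        self.bootstrap_with_offset = bootstrap_with_offset
        self.writer_partition_timestamp_columns = writer_partition_timestamp_columns

    def execute(self, context):
        # Here the operator would render a Gobblin job properties file from the
        # parameters above and submit/monitor the job via an appropriate hook.
        raise NotImplementedError("sketch only")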

More thoughts can be found in https://wikimedia.slack.com/archives/C05RHK7PS6Q/p1713959682367569?thread_ts=1713956256.058079&cid=C05RHK7PS6Q

Prototype

I prototyped a (mock) config for webrequest ingestion. Code is available in gitlab https://gitlab.wikimedia.org/gmodena/config-store-poc/-/commit/37297c7af469fbc28bebf4de75236645a697ac27#783823b4ff48e4b442f3ed75eb4787e6657fc883_0_1.

This example adds a set of fragment schemas.

Functional details are embedded in the description field of the linked schemas.

For simplicity, I added the fragments to the existing refine schema (but it’s just an example).
I materialized the schema in the following value file, which moves the ingestion bits of the webrequest_frontend pipeline: https://gitlab.wikimedia.org/gmodena/config-store-poc/-/blob/37297c7af469fbc28bebf4de75236645a697ac27/values/airflow/webrequest_frontend/values.yaml

$schema: /airflow/refine/1.0.0
description: >
  This is a mock for code that does not exist yet.
  This value file represents a parametrization for the analytics/webrequest
  airflow dag. It assumes a Gobblin operator exists, that can be configured
  to produce the webrequest_raw dataset.
  This dataset config contains only business logic, the deployment of system (common)
  specific files is delegated to puppet and refinery.
datasets:
  - name: webrequest_raw
    fully_qualified_table_name: wmf_raw.webrequest
    gobblin:
      job: webrequest_frontend_rc0
      data_publisher_final_dir: /wmf/data/raw/webrequest
      mr_job_max_mappers: 48
      bootstrap_with_offset: latest
      writer_partition_timestamp_columns: dt
      event_platform:
        stream: webrequest_frontend_rc0
        is_wmf_production: true
    # - name: webrequest
    # <add config for airflow dag params>

This example assumes a GobblinOperator, which the webrequest_raw entry parametrizes.

The example piggybacks on an EventStreamConfig endpoint. This is not strictly necessary, since Gobblin is stream-agnostic. If needed, we could integrate Gobblin with Airflow directly by expressing only a subset of EventStreamConfig settings, namely a list of topics to ingest. Gobblin does not perform schema validation, so event schema labels are only metadata.
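For completeness, a hedged sketch of how the webrequest_raw entry could parametrize such an operator in a DAG (the GobblinOperator import path is made up, and the topics value stands in for either a topics list declared directly in the config or an EventStreamConfig lookup):

import pendulum
import yaml
from airflow import DAG

# Hypothetical import of the GobblinOperator sketched earlier in this task.
from gobblin_operator_sketch import GobblinOperator

# Load the mock value file from the prototype.
with open("values/airflow/webrequest_frontend/values.yaml") as f:
    values = yaml.safe_load(f)

dataset = next(d for d in values["datasets"] if d["name"] == "webrequest_raw")
gobblin = dataset["gobblin"]

with DAG(
    dag_id="webrequest_frontend_ingestion",
    start_date=pendulum.datetime(2024, 4, 1, tz="UTC"),
    schedule="@hourly",
    catchup=False,
):
    GobblinOperator(
        task_id="ingest_webrequest_raw",
        job=gobblin["job"],
        # A topics list could be declared directly in the value file instead of
        # being resolved via EventStreamConfig; here the stream name stands in.
        topics=[gobblin["event_platform"]["stream"]],
        data_publisher_final_dir=gobblin["data_publisher_final_dir"],
        mr_job_max_mappers=gobblin["mr_job_max_mappers"],
        bootstrap_with_offset=gobblin["bootstrap_with_offset"],
        writer_partition_timestamp_columns=gobblin["writer_partition_timestamp_columns"],
    )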

> IMHO it should be explicitly stated that the system we are building is the Airflow Dataset Config store/service, not just a generic configuration repository.

@gmodena @JAllemandou, if this is the case, do we need an external service and datastore? The config is all in git.

It would be simpler if the config was just deployed alongside Airflow and we provided a Python library to access it. Would that satisfy the requirements then?

>> IMHO it should be explicitly stated that the system we are building is the Airflow Dataset Config store/service, not just a generic configuration repository.
>
> @gmodena @JAllemandou, if this is the case, do we need an external service and datastore? The config is all in git.

I might be fuzzy on terminology for this project, but in this case git (cli or gitlab api) is the datastore.

> It would be simpler if the config was just deployed alongside Airflow and we provided a Python library to access it. Would that satisfy the requirements then?

+1 to keep things embeddable, but it would be nice if we had programmatic ways to clean Airflow (values?) caches. My understanding is that the current approach would facilitate that.

Personally, I do like the idea of exposing configs via an HTTPS endpoint (e.g. similar to etcd, or ChartMuseum), at least for read operations; for writes I'm not so sure.
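To illustrate the embeddable option (all paths and function names below are hypothetical), a small accessor library deployed alongside Airflow could look roughly like this, with a cache that can be cleared programmatically:

from functools import lru_cache
from pathlib import Path

import yaml

# Hypothetical path where value files would be deployed next to Airflow.
CONFIG_ROOT = Path("/srv/airflow/dataset-config/values")


@lru_cache(maxsize=None)
def load_values(instance: str, dataset: str) -> dict:
    """Load and cache the value file for a dataset, e.g. ('airflow', 'webrequest_frontend')."""
    path = CONFIG_ROOT / instance / dataset / "values.yaml"
    return yaml.safe_load(path.read_text())


def clear_cache() -> None:
    """Programmatic cache invalidation, e.g. after a new config deploy."""
    load_values.cache_clear()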

> we had programmatic ways to clean Airflow (values?) caches. My understanding is that the current approach would facilitate that.

@JAllemandou can you explain this one? I don't know what this means or what the problem is. Thank you!