
[SPIKE] Can we express Event Platform configs in Datasets Config?
Open, Needs Triage · Public · 5 Estimated Story Points · Spike

Description

Right now, event stream and processing pipeline configuration is spread across a few services (the MW stream config extension, Gobblin, schema repos, deployment-charts).

The goal of this spike is to prototype a dataset- or tool-oriented config for streams and streaming applications.

We need to investigate:

  • tool- vs dataset-oriented config layout.
  • interop and implications with k8s service deployments.
  • ...

Event Timeline

Restricted Application changed the subtype of this task from "Task" to "Spike". · Mar 26 2024, 1:57 PM
Restricted Application added a subscriber: Aklapper.
Ottomata renamed this task from [SPIKE] Can we express Event Platform configs in config store? to [SPIKE] Can we express Event Platform configs in Datasets Config?. · Thu, Apr 25, 6:14 PM

Summary

Tl;dr:

We can easily express a stream config as JSON Schema and expose it via datasets-config-service.
I am opposed to a monorepo for all configurations and suggest focusing current efforts on Airflow and Airflow-produced datasets. For integration with Metrics Platform and MediaWiki, I lean towards a service mesh approach. The service developed by @tchin could serve as a template.
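For illustration only (the schema and field names below are made up, not the actual EventStreamConfig schema), a stream config entry could be described and validated along these lines with python-jsonschema:

import jsonschema

# Hypothetical JSON Schema for a single stream config entry, as it could be
# exposed by datasets-config-service. Field names are illustrative only.
STREAM_CONFIG_SCHEMA = {
    "$schema": "https://json-schema.org/draft-07/schema#",
    "type": "object",
    "required": ["stream", "topics"],
    "properties": {
        "stream": {"type": "string"},
        "topics": {"type": "array", "items": {"type": "string"}},
        "is_wmf_production": {"type": "boolean"},
    },
}

stream_config = {
    "stream": "webrequest_frontend_rc0",
    "topics": ["eqiad.webrequest_frontend_rc0", "codfw.webrequest_frontend_rc0"],
    "is_wmf_production": True,
}

# Raises jsonschema.ValidationError if the entry does not match the schema.
jsonschema.validate(instance=stream_config, schema=STREAM_CONFIG_SCHEMA)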

In my opinion, the approach I've seen so far is really good for Airflow DAG parametrization. IMHO it should be explicitly stated that the system we are building is the Airflow Dataset Config store/service, not just a generic configuration repository.

An Airflow GobblinOperator (which declares job settings) would serve as an API boundary between Airflow-managed datasets and Event Platform services. Its parametrization should be declared in the config store. This would also allow us to remove specific configurations in ESC (currently, a stream configuration declares when Gobblin is expected to consume a stream, which would not be necessary if the requirement were specified downstream in the config store).

See the Prototype section for an example of such config.

Broader considerations

For this spike I investigated the current EventStreamConfig use cases (beyond EventGate), tool- vs dataset-oriented config layouts, and the interop and implications with k8s service deployments.

Based on my findings:

  1. My preference is to keep EventStreamConfig and event stream declarations outside of a mono-repo config store. We could re-use the webservice business logic for a more general-purpose database of stream configs. I will explore this concept further in https://phabricator.wikimedia.org/T361853 and Stream Registry: WIP, building atop https://gitlab.wikimedia.org/repos/data-engineering/dataset-config-store-service.
  2. There will always be config bits that live outside of DPE config store (mono) repos. k8s declares configs in deployment-charts. While we could parametrize Flink applications via the config store, we won't be able to fully replace the k8s (system) configs (e.g. memory allocation per pod). On k8s we also prefer to avoid HTTP calls to third-party config services, so the config store would have to be embedded in e.g. Flink apps and merged with the helmfiles; I don't see a lot of benefit in the added complexity. The same considerations apply to puppet. Effectively, both puppet and deployment-charts are API boundaries with other teams (SRE, mw).
  3. The approach I've seen makes sense for parametrizing Airflow operators. Effectively, I see a coupling between the definition of a dataset and an Airflow DAG. In this sense, for DE / batch configs I think a dataset "view" of the schema makes more sense than a tooling-oriented one. Having everything in one place will help with developing tools for introspection (e.g. detecting unused artifact deps becomes a simple graph traversal over the value files). In this scenario, an Airflow GobblinOperator (that declares job settings) would act as an API boundary between Airflow-managed datasets and Event Platform services (see the sketch after this list).
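To make that boundary concrete, here is a rough sketch of what such a GobblinOperator could look like. None of this code exists yet; the parameter names are hypothetical and mirror the mock value file in the Prototype section below.

from airflow.models.baseoperator import BaseOperator


class GobblinOperator(BaseOperator):
    """Hypothetical operator that declares Gobblin job settings as parameters,
    acting as the API boundary between Airflow-managed datasets and Event
    Platform services. Sketch only; nothing here exists yet."""

    def __init__(
        self,
        *,
        job: str,
        topics: list,
        data_publisher_final_dir: str,
        mr_job_max_mappers: int = 48,
        bootstrap_with_offset: str = "latest",
        writer_partition_timestamp_columns: str = "dt",
        **kwargs,
    ):
        super().__init__(**kwargs)
        self.job = job
        self.topics = topics
        self.data_publisher_final_dir = data_publisher_final_dir
        self.mr_job_max_mappers = mr_job_max_mappers
        self.bootstrap_with_offset = bootstrap_with_offset
        self.writer_partition_timestamp_columns = writer_partition_timestamp_columns

    def execute(self, context):
        # Here the operator would render a Gobblin job properties file from the
        # parameters above and submit/monitor the job via an appropriate hook.
        raise NotImplementedError("sketch only")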

More thoughts can be found in https://wikimedia.slack.com/archives/C05RHK7PS6Q/p1713959682367569?thread_ts=1713956256.058079&cid=C05RHK7PS6Q

Prototype

I prototyped a (mock) config for webrequest ingestion. Code is available in gitlab https://gitlab.wikimedia.org/gmodena/config-store-poc/-/commit/37297c7af469fbc28bebf4de75236645a697ac27#783823b4ff48e4b442f3ed75eb4787e6657fc883_0_1.

This example adds a set of fragment schemas.

Functional details are embedded in the description field of the linked schemas.

For simplicity, I added the fragments to the existing refine schema (but it’s just an example).
I materialized the schema in the following value file, which moves the ingestion bits of the webrequest_frontend pipeline: https://gitlab.wikimedia.org/gmodena/config-store-poc/-/blob/37297c7af469fbc28bebf4de75236645a697ac27/values/airflow/webrequest_frontend/values.yaml

$schema: /airflow/refine/1.0.0
description: >
  This is a mock for code that does not exist yet.
  This value file represents a parametrization for the analytics/webrequest
  airflow dag. It assumes a Gobblin operator exists, that can be configured
  to produce the webrequest_raw dataset.
  This dataset config contains only business logic, the deployment of system (common)
  specific files is delegated to puppet and refinery.
datasets:
  - name: webrequest_raw
    fully_qualified_table_name: wmf_raw.webrequest
    gobblin:
      job: webrequest_frontend_rc0
      data_publisher_final_dir: /wmf/data/raw/webrequest
      mr_job_max_mappers: 48
      bootstrap_with_offset: latest
      writer_partition_timestamp_columns: dt
      event_platform:
        stream: webrequest_frontend_rc0
        is_wmf_production: true
    # - name: webrequest
    # <add config for airflow dag params>

This example assumes a GobblinOperator, which the webrequest_raw entry parametrizes.

The example piggybacks on an EventStreamConfig endpoint. This is not strictly necessary, since Gobblin is stream-agnostic. If needed, we could integrate Gobblin with Airflow directly by expressing only a subset of EventStreamConfig settings, namely a list of topics to ingest. Gobblin does not perform schema validation, so event schema labels are only metadata.
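For completeness, a hedged sketch of how the webrequest_raw entry could parametrize such an operator in a DAG (the GobblinOperator import path is made up, and the topics value stands in for either a topics list declared directly in the config or an EventStreamConfig lookup):

import pendulum
import yaml
from airflow import DAG

# Hypothetical import of the GobblinOperator sketched earlier in this task.
from gobblin_operator_sketch import GobblinOperator

# Load the mock value file from the prototype.
with open("values/airflow/webrequest_frontend/values.yaml") as f:
    values = yaml.safe_load(f)

dataset = next(d for d in values["datasets"] if d["name"] == "webrequest_raw")
gobblin = dataset["gobblin"]

with DAG(
    dag_id="webrequest_frontend_ingestion",
    start_date=pendulum.datetime(2024, 4, 1, tz="UTC"),
    schedule="@hourly",
    catchup=False,
):
    GobblinOperator(
        task_id="ingest_webrequest_raw",
        job=gobblin["job"],
        # A topics list could be declared directly in the value file instead of
        # being resolved via EventStreamConfig; here the stream name stands in.
        topics=[gobblin["event_platform"]["stream"]],
        data_publisher_final_dir=gobblin["data_publisher_final_dir"],
        mr_job_max_mappers=gobblin["mr_job_max_mappers"],
        bootstrap_with_offset=gobblin["bootstrap_with_offset"],
        writer_partition_timestamp_columns=gobblin["writer_partition_timestamp_columns"],
    )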

> IMHO it should be explicitly stated that the system we are building is the Airflow Dataset Config store/service, not just a generic configuration repository.

@gmodena @JAllemandou, if this is the case, do we need an external service and datastore? The config is all in git.

It would be simpler if the config was just deployed alongside Airflow and we provided a Python library to access it. Would that satisfy the requirements then?

>> IMHO it should be explicitly stated that the system we are building is the Airflow Dataset Config store/service, not just a generic configuration repository.
>
> @gmodena @JAllemandou, if this is the case, do we need an external service and datastore? The config is all in git.

I might be fuzzy on terminology for this project, but in this case git (cli or gitlab api) is the datastore.

> It would be simpler if the config was just deployed alongside Airflow and we provided a Python library to access it. Would that satisfy the requirements then?

+1 to keep things embeddable, but it would be nice if we had programmatic ways to clean Airflow (values?) caches. My understanding is that the current approach would facilitate that.

Personally, I do like the idea of exposing configs via an HTTPS endpoint (e.g. similar to etcd, or ChartMuseum), at least for read operations; for writes I'm not so sure.
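To illustrate the embeddable option (all paths and function names below are hypothetical), a small accessor library deployed alongside Airflow could look roughly like this, with a cache that can be cleared programmatically:

from functools import lru_cache
from pathlib import Path

import yaml

# Hypothetical path where value files would be deployed next to Airflow.
CONFIG_ROOT = Path("/srv/airflow/dataset-config/values")


@lru_cache(maxsize=None)
def load_values(instance: str, dataset: str) -> dict:
    """Load and cache the value file for a dataset, e.g. ('airflow', 'webrequest_frontend')."""
    path = CONFIG_ROOT / instance / dataset / "values.yaml"
    return yaml.safe_load(path.read_text())


def clear_cache() -> None:
    """Programmatic cache invalidation, e.g. after a new config deploy."""
    load_values.cache_clear()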

> we had programmatic ways to clean Airflow (values?) caches. My understanding is that the current approach would facilitate that.

@JAllemandou can you explain this one? I don't know what this means or what the problem is. Thank you!