[Metrics Platform] Define stream configuration syntax relevant to v1 release
Closed, Resolved (Public)

Assigned To

Authored By

	• jlinehan
	Jan 28 2021, 10:06 PM

Description

This task is for determining the basic stream configuration syntax. We can collect requirements and do any bikeshedding here, so the discussion doesn't get spread out over too many other tasks. Will update the description with ideas and proposals.

An example of the current syntax is documented in the README file for the EventStreamConfig extension:
https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/extensions/EventStreamConfig/+/refs/heads/master/README.md#mediawiki-config

The syntax currently in use is in InitialiseSettings.php in mediawiki-config.

Design document focusing on producer syntax can be found here.

Related Objects

Status	Assigned	Task
Open	None	T276378 EPIC: Release Metrics Platform v1
Resolved	• DAbad	T273235 [Metrics Platform] Define stream configuration syntax relevant to v1 release
Resolved	• Mholloway	T271456 Enable 'skin' dimension using stream configuration

Event Timeline

• jlinehan created this task. Jan 28 2021, 10:06 PM

• jlinehan added a subtask: T271456: Enable 'skin' dimension using stream configuration.

• jlinehan mentioned this in T271456: Enable 'skin' dimension using stream configuration. Jan 28 2021, 10:09 PM

• Mholloway updated the task description. Jan 28 2021, 10:10 PM

Proposal from @Ottomata in T271456#6785277 for configuring supplemental data injection from the intake service and/or client library:

event_hydrations:
  producer_client: 
    mediawiki_skin:
      enabled: true
      field: dimensions.skin
  intake_service:
    request_headers:
      user-agent:
        enabled: true
        field: http.request_headers.user-agent
      x-client-ip:
        enabled: true
        field: http.client_ip

My first thought is that we could simply include/omit fields rather than having an enabled flag for each of them.
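To illustrate the include/omit idea, here is a minimal Python sketch. It assumes the same nesting as the proposal above with the enabled flags dropped, so a hydration is active simply because it is listed. The `target_fields` helper is hypothetical and is not part of EventStreamConfig or any client library.

```python
# Hypothetical shape: same nesting as the proposal, without 'enabled' flags.
# Listing a hydration turns it on; omitting it turns it off.
event_hydrations = {
    "producer_client": {
        "mediawiki_skin": {"field": "dimensions.skin"},
    },
    "intake_service": {
        "request_headers": {
            "user-agent": {"field": "http.request_headers.user-agent"},
            "x-client-ip": {"field": "http.client_ip"},
        },
    },
}


def target_fields(config):
    """Collect every 'field' value found anywhere in the nested config."""
    fields = []
    for value in config.values():
        if isinstance(value, dict):
            if "field" in value:
                # Leaf entry: this hydration is listed, so it is active.
                fields.append(value["field"])
            else:
                # Intermediate grouping level: keep descending.
                fields.extend(target_fields(value))
    return fields
```

Dropping `x-client-ip` from the map would then be all it takes to stop populating `http.client_ip` for that stream.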

Even then, if we end up putting a structure like this into the stream configuration for many streams, it could easily become quite verbose and repetitive (and unwelcoming to non-developers, although that ship may have already sailed). Maybe that's just the price of flexibility.

The more I think about it, the more I come around to the idea of annotating the schema itself to indicate that a field should be injected by the intake service or the client library. I was against it at first because schemas are for validation and this is a step beyond validation, but it's arguably cleaner to do it that way where possible.

I'm not sure whether it's best to treat data supplementation configuration for the intake service and the client libraries the same way. The intake service could potentially lean solely on schema annotations to tell it where to inject supplemental data that's enabled in the stream config. For the client libraries that's not true, since we need to be able to handle essentially arbitrary unknown schemas, unless we want to enforce that supplemental data from the client library must live in the same specific place for any schema that wants it. (Then I guess we're more or less back at the event capsule? Not that that's a bad thing per se...)

Ottomata updated the task description. Feb 1 2021, 2:09 PM

"I'm not sure whether it's best to treat data supplementation configuration for the intake service and the client libraries the same way. The intake service could potentially lean solely on schema annotations to tell it where to inject supplemental data that's enabled in the stream config. For the client libraries that's not true, since we need to be able to handle essentially arbitrary unknown schemas, unless we want to enforce that supplemental data from the client library must live in the same specific place for any schema that wants it."

After our discussions about T263672 and T263466, I was convinced that schemas are not the correct place for this information. The schemas are a datatype, and can have multiple instantiations (streams). Some schema annotations may make sense (maybe privacy/PII fields), but others, like the ones we are discussing, are pretty tricky:

What if we wanted to have two streams with the same schema behave differently?
What if we wanted to stop populating e.g. client_ip for a stream? Do we edit schema annotations for all versions of a schema?

I think your point about how client libraries don't necessarily have the schema is valid for this idea in general. EventGate has the schemas because it validates, but not all clients will use EventGate. Internal ones (that are not PHP) are more likely to produce using a Kafka client directly, and as such need to have richer client libraries that do much of what EventGate does, and probably some of what EventBus or EventLogging do with respect to event hydrations/augmentations.
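As a rough illustration of what a richer client library would need to do without access to the schema, here is a minimal Python sketch that applies hydrations driven only by stream config. It assumes the `enabled`/`field` settings from the proposal above; `set_field`, `hydrate`, and the `providers` map are hypothetical names, not an existing API.

```python
def set_field(event, dotted_path, value):
    """Set a nested key such as 'dimensions.skin', creating dicts as needed."""
    *parents, leaf = dotted_path.split(".")
    node = event
    for key in parents:
        # Assumes any existing intermediate value is itself a dict.
        node = node.setdefault(key, {})
    node[leaf] = value
    return event


def hydrate(event, hydrations, providers):
    """Apply each enabled hydration for which the client has a value provider.

    hydrations: mapping of hydration name -> {'enabled': bool, 'field': str}
    providers:  mapping of hydration name -> zero-argument callable (hypothetical)
    """
    for name, settings in hydrations.items():
        if settings.get("enabled", False) and name in providers:
            set_field(event, settings["field"], providers[name]())
    return event
```

For example, with `{"mediawiki_skin": {"enabled": True, "field": "dimensions.skin"}}` as the hydrations and a provider that returns the current skin name, the event gains a `dimensions.skin` value without the client ever reading the schema.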

• fdans moved this task from Incoming to Event Platform on the Analytics board. Feb 1 2021, 4:55 PM

kzimmerman moved this task from Triage to Tracking on the Product-Analytics board. Feb 2 2021, 6:34 PM

• Mholloway closed subtask T271456: Enable 'skin' dimension using stream configuration as Resolved. Feb 10 2021, 7:07 PM

Ottomata mentioned this in T273901: Automate event stream ingestion into HDFS for streams that don't use EventGate. Mar 3 2021, 4:51 PM

Perhaps it would be better to add top-level settings specific to different producers and consumers, as described in https://phabricator.wikimedia.org/T273901#6879350. E.g.

producers:
  mediawiki_client:
    hydrations:
      mediawiki_skin:
        enabled: true
        field: dimensions.skin
  ...
consumers:
  # Analytics will use this to automate ingestion of this data into Hadoop.
  analytics-hadoop:
    job_name: general

• jlinehan renamed this task from Define event stream configuration syntax to [Metrics Platform] Define stream configuration syntax relevant to v1 release. Mar 3 2021, 6:54 PM

• jlinehan added a parent task: T276378: EPIC: Release Metrics Platform v1.

• jlinehan triaged this task as Medium priority. Mar 8 2021, 1:36 PM

• jlinehan moved this task from Inbox to Doing on the Better Use Of Data board.

• jlinehan claimed this task. Mar 10 2021, 3:17 PM

• jlinehan updated the task description.

• jlinehan moved this task from Doing to QA/Review on the Better Use Of Data board. Mar 24 2021, 6:01 PM

ldelench_wmf moved this task from QA/Review to Sign-off on the Better Use Of Data board. Apr 21 2021, 6:30 PM

Assigning to @DAbad for sign off

Ottomata mentioned this in T277193: wgEventStreams (EventStreamConfig) should support per wiki overrides. Jun 22 2021, 2:54 PM

Hey all, we are now moving forward with the consumers stream config setting to do T273901: Automate event stream ingestion into HDFS for streams that don't use EventGate. I think we are in agreement at least on the general format of how client configuration will work, right? Top-level consumers and producers map settings, in which keys are client names mapping to client-specific settings.

https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/668124/2/wmf-config/InitialiseSettings.php
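A minimal Python sketch of how a client might read its own settings under that format. The example values are copied from the proposal earlier in this thread; the `client_settings` helper is hypothetical and only illustrates the "role, then client name" lookup.

```python
# One stream's config, using the producers/consumers layout proposed above.
stream_config = {
    "producers": {
        "mediawiki_client": {
            "hydrations": {
                "mediawiki_skin": {"enabled": True, "field": "dimensions.skin"},
            },
        },
    },
    "consumers": {
        # Analytics would use this to automate ingestion into Hadoop.
        "analytics-hadoop": {"job_name": "general"},
    },
}


def client_settings(config, role, client_name):
    """Return the settings for one client, or {} if the stream omits it.

    role is 'producers' or 'consumers'; client_name is a key under that map.
    """
    return config.get(role, {}).get(client_name, {})
```

Each client only looks at its own key, so adding settings for a new producer or consumer does not affect how existing clients read the same stream config.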

Ottomata mentioned this in T288853: Migrated Server-side EventLogging events recording http.client_ip as 127.0.0.1. Aug 30 2021, 1:40 PM

Adding Metrics Platform Backlog so Metrics Platform tasks can be found under Metrics Platform Backlog.

Closing this as we've moved on to more specific tasks and have de facto formats in the library code now.

• jlinehan closed this task as Resolved. Oct 6 2021, 2:23 PM

Ottomata mentioned this in T303602: Generate $wgEventLoggingStreamNames from $wgEventStreams. Mar 14 2022, 11:51 AM

Iflorez mentioned this in T343246: Document data engineering items for Campaigns Product. Sep 4 2023, 7:00 PM

Ottomata mentioned this in T318863: [Event Platform] Event Platform and DataHub Integration. Oct 2 2023, 5:56 PM
