
[Metrics Platform] Define stream configuration syntax relevant to v1 release
Closed, Resolved · Public

Description

This task is for determining basic stream configuration syntax. We can collect requirements and do any bikeshedding here, to the extent that we need to, so we don't spread the discussion out over too many other tasks. Will update the description with ideas and proposals.

An example of the current syntax is documented in the README file for the EventStreamConfig extension:
https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/extensions/EventStreamConfig/+/refs/heads/master/README.md#mediawiki-config

The syntax currently in use is in InitialiseSettings.php in mediawiki-config.

Design document focusing on producer syntax can be found here.

Event Timeline

Proposal from @Ottomata in T271456#6785277 for configuring supplemental data injection from the intake service and/or client library:

event_hydrations:
  producer_client: 
    mediawiki_skin:
      enabled: true
      field: dimensions.skin
  intake_service:
    request_headers:
      user-agent:
        enabled: true
        field: http.request_headers.user-agent
      x-client-ip:
        enabled: true
        field: http.client_ip

My first thought is that we could simply include/omit fields rather than having an enabled flag for each of them.
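A minimal sketch of that include/omit idea, in Python purely for illustration: a hydration counts as enabled when its key is present, so the enabled flag disappears and each entry just maps a hydration name to a target field. The helper names set_path and apply_hydrations, and the available map of values the client can supply, are hypothetical and not part of any existing client library.

```python
from typing import Any


def set_path(event: dict, dotted: str, value: Any) -> None:
    """Set a nested key from a dotted path, e.g. "dimensions.skin".

    Intermediate dicts are created as needed. This sketch assumes any
    existing intermediate values are dicts.
    """
    *parents, leaf = dotted.split(".")
    node = event
    for key in parents:
        node = node.setdefault(key, {})
    node[leaf] = value


def apply_hydrations(event: dict, hydrations: dict, available: dict) -> dict:
    """Inject every hydration listed in the config; omitted ones are skipped."""
    for name, target_field in hydrations.items():
        if name in available:
            set_path(event, target_field, available[name])
    return event


# Hypothetical config: the presence of a key means "enabled".
hydrations = {"mediawiki_skin": "dimensions.skin"}

# Values the producer could supply; x-client-ip is not listed in the
# config above, so it is not injected.
available = {"mediawiki_skin": "vector", "x-client-ip": "203.0.113.7"}

event = apply_hydrations({"action": "click"}, hydrations, available)
```

The trade-off is that disabling a hydration temporarily means deleting its entry rather than flipping a flag.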

Even then, if we end up putting a structure like this into the stream configuration for many streams, it could easily become quite verbose and repetitive (and unwelcoming to non-developers, although that ship may have already sailed). Maybe that's just the price of flexibility.

The more I think about it, the more I come around to the idea of annotating the schema itself to indicate that a field should be injected by the intake service or the client library. I was against it at first because schemas are for validation and this is a step beyond validation, but it's arguably cleaner to do it that way where possible.

I'm not sure whether it's best to treat data supplementation configuration for the intake service and the client libraries the same way. The intake service could potentially lean solely on schema annotations to tell it where to inject supplemental data that's enabled in the stream config. For the client libraries that's not true, since we need to be able to handle essentially arbitrary unknown schemas, unless we want to enforce that supplemental data from the client library must live in the same specific place for any schema that wants it. (Then I guess we're more or less back at the event capsule? Not that that's a bad thing per se...)

After our discussions about T263672 and T263466, I was convinced that schemas are not the correct place for this information. The schemas are a datatype, and can have multiple instantiations (streams). Some schema annotations may make sense (maybe privacy/PII fields), but others, like the ones we are discussing, are pretty tricky:

  • What if we wanted to have two streams with the same schema behave differently?
  • What if we wanted to stop populating e.g. client_ip for a stream? Do we edit schema annotations for all versions of a schema?
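To make the two questions above concrete, here is a rough Python sketch in which per-stream behavior lives in the stream config rather than in the schema. The stream names, the schema title, and the hydrations_for helper are all made up for illustration; only the idea of several streams pointing at one schema comes from this discussion.

```python
# Hypothetical stream configs: both streams instantiate the same schema,
# but only the first one asks for the client IP to be populated.
STREAM_CONFIGS = {
    "example.stream_a": {
        "schema_title": "example/schema",
        "hydrations": {"x-client-ip": "http.client_ip"},
    },
    "example.stream_b": {
        "schema_title": "example/schema",
        "hydrations": {},
    },
}


def hydrations_for(stream_name: str) -> dict:
    """Return the hydrations configured for a stream (empty if none)."""
    return STREAM_CONFIGS.get(stream_name, {}).get("hydrations", {})


# Stopping client_ip collection for stream_a is a one-line config change
# and touches no schema version.
with_ip = hydrations_for("example.stream_a")
without_ip = hydrations_for("example.stream_b")
```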

I think your point about how client libraries don't necessarily have the schema is valid for this idea in general. EventGate has the schemas because it validates, but not all clients will use EventGate. Internal ones (that are not PHP) are more likely to produce using a Kafka client directly, and as such need richer client libraries that do much of what EventGate does, and probably some of what EventBus or EventLogging do with respect to event hydrations/augmentations.

Perhaps it would be better to add top level settings specific to different producers and consumers, as described in https://phabricator.wikimedia.org/T273901#6879350. E.g.

producers:
  mediawiki_client:
    hydrations:
      mediawiki_skin:
        enabled: true
        field: dimensions.skin
  ...
consumers:
  # Analytics will use this to automate ingestion of this data into Hadoop.
  analytics-hadoop:
    job_name: general
jlinehan renamed this task from Define event stream configuration syntax to [Metrics Platform] Define stream configuration syntax relevant to v1 release. Mar 3 2021, 6:54 PM
jlinehan moved this task from Inbox to Doing on the Better Use Of Data board.
kzimmerman added a subscriber: DAbad.

Assigning to @DAbad for sign off

Hey all, we are now moving forward with the consumers stream config setting in order to do T273901 (Automate event stream ingestion into HDFS for streams that don't use EventGate). I think we are in agreement at least on the general format of how client configuration will work, right? A consumers and a producers top-level map setting, in which keys are client names mapping to client-specific settings.

https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/668124/2/wmf-config/InitialiseSettings.php
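As a rough sketch of how a client might read that format (in Python, with a hypothetical client_settings helper, and example values copied from the proposal earlier in this thread): each client looks up only its own name under producers or consumers and ignores everything else.

```python
def client_settings(stream_config: dict, role: str, client_name: str) -> dict:
    """Return the settings for one named client, or {} if none are configured.

    role is "producers" or "consumers", the two top-level maps.
    """
    if role not in ("producers", "consumers"):
        raise ValueError(f"unknown role: {role!r}")
    return stream_config.get(role, {}).get(client_name, {})


# Example stream config mirroring the structure proposed above.
stream_config = {
    "producers": {
        "mediawiki_client": {
            "hydrations": {
                "mediawiki_skin": {"enabled": True, "field": "dimensions.skin"},
            },
        },
    },
    "consumers": {
        "analytics-hadoop": {"job_name": "general"},
    },
}

hadoop = client_settings(stream_config, "consumers", "analytics-hadoop")
unknown = client_settings(stream_config, "consumers", "some-other-client")
```

Returning an empty dict for an unconfigured client is just one possible choice; a real client might prefer to treat a missing entry as "do not consume this stream".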

Closing this as we've moved on to more specific tasks and have de facto formats in the library code now.