
Modern Event Platform: Stream Configuration
Open, Normal, Public, 0 Story Points

Description

User Stories

  • As an engineer, I want to specify the topic to schema mapping so that it is clear that a topic always uses a particular schema.
  • As an engineer, I want to specify the stream to topic mapping so it is clear which composite topics should be included in the same stream (e.g. (eqiad|codfw).mediawiki.revision-create)
  • As a product manager/analyst/engineer, I want to set the sampling settings of a stream without code deployment so I can easily adjust to changes in usage.
  • As a product manager/analyst/engineer, I want to set the privacy whitelist settings of a stream's event fields so that I can retain non-PII data for longer than 90 days.
  • As an analyst, I want to know the schema, sampling, and other metadata settings that an event was emitted with so that I can account for these changes in analysis.
  • As a product manager/analyst/engineer, I want to set and discover the ownership of schemas and streams so I can track governance over time and know when a stream can be decommissioned.
  • As a community member, I want a high level view of what data is being collected so I have better transparency into WMF's use of data.
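A minimal sketch of what the topic-to-schema, stream-to-topic, and sampling stories above might look like as configuration, assuming a simple in-memory list. The `StreamConfig` class, its field names, the schema name, and the sample rate are all hypothetical illustrations, not an agreed design; only the `(eqiad|codfw).mediawiki.revision-create` topic pattern comes from this task.

```python
import re
import zlib
from dataclasses import dataclass


@dataclass(frozen=True)
class StreamConfig:
    """Hypothetical settings for one stream; all field names are assumptions."""
    stream: str
    topic_pattern: str   # regex matching every topic that belongs to the stream
    schema: str          # the single schema that all of the stream's topics use
    sample_rate: float   # fraction of sessions to keep, between 0.0 and 1.0


STREAMS = [
    StreamConfig(
        stream="mediawiki.revision-create",
        topic_pattern=r"(eqiad|codfw)\.mediawiki\.revision-create",
        schema="mediawiki/revision/create",  # hypothetical schema name
        sample_rate=0.1,                     # hypothetical value
    ),
]


def config_for_topic(topic, streams=STREAMS):
    """Return the config of the stream whose pattern fully matches the topic."""
    for cfg in streams:
        if re.fullmatch(cfg.topic_pattern, topic):
            return cfg
    return None


def in_sample(cfg, session_id):
    """Deterministically decide whether a session falls inside the sample."""
    bucket = zlib.crc32(session_id.encode("utf-8")) % 10_000
    return bucket < cfg.sample_rate * 10_000
```

Because the sample rate lives in data rather than in code, changing it would not need a code deployment, and recording the matched `StreamConfig` alongside each event would let an analyst see which settings the event was emitted with.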

Event Timeline

Ottomata triaged this task as Normal priority. Sep 24 2018, 6:23 PM
Ottomata created this task.
Ottomata updated the task description.

As an engineer, I want to specify concrete settings for different topics like the number of partitions or the retention interval. T157092

Ottomata moved this task from Backlog to Parent Tasks on the Event-Platform board. Dec 5 2018, 10:05 PM

The output of this ticket should be a design document describing how these use cases are satisfied, also addressing client-side issues such as "how does the JS client fetch a new, updated sample rate?" (Varnish TTL expirations, etc.).
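To make the TTL concern concrete, here is a rough sketch of a client-side cache that refetches its config only after a TTL expires. The `CachedStreamConfig` class and the `fetch` callable are hypothetical, and this is written in Python for illustration rather than as the actual JS client; the point is only that a changed sample rate is not seen until the cached copy expires.

```python
import time


class CachedStreamConfig:
    """Caches a fetched config for ttl_seconds, much like a TTL-bound HTTP cache."""

    def __init__(self, fetch, ttl_seconds, clock=time.monotonic):
        self._fetch = fetch        # hypothetical callable returning the current config
        self._ttl = ttl_seconds
        self._clock = clock        # injectable so expiry can be tested without sleeping
        self._value = None
        self._expires_at = None    # None means nothing has been fetched yet

    def get(self):
        """Return the cached config, refetching only once the TTL has elapsed."""
        now = self._clock()
        if self._expires_at is None or now >= self._expires_at:
            self._value = self._fetch()
            self._expires_at = now + self._ttl
        return self._value
```

With a TTL of 300 seconds, for example, a client could keep sampling at the old rate for up to five minutes after the setting changes, and any intermediate cache with its own TTL would add to that delay.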

  • As a product manager/analyst/engineer, I want to set the privacy whitelist settings of a stream's event fields so that I can retain non-PII data for longer than 90 days.
  • As a product manager/analyst/engineer, I want to set and discover the ownership of schemas and streams so I can track governance over time and know when a stream can be decommissioned.

I've recently been wondering whether we should separate the governance and privacy use cases from this Stream Configuration Service. Adding dataset metadata makes the system much more complex, and there are open source tools like Apache Atlas that probably do this better than we can.

Perhaps the system we should aim to build over the next few quarters should just be focused on the more stream/client setting specific use cases:

  • As an engineer, I want to specify the topic to schema mapping so that it is clear that a topic always uses a particular schema.
  • As an engineer, I want to specify the stream to topic mapping so it is clear which composite topics should be included in the same stream (e.g. (eqiad|codfw).mediawiki.revision-
  • As a product manager/analyst/engineer, I want to set the sampling settings of a stream without code deployment so I can easily adjust to changes in usage.
  • As an analyst, I want to know the schema, sampling, and other metadata settings that an event was emitted with so that I can account for these changes in analysis.

@Nuria, @jlinehan thoughts?

FYI, I wanted to know more about Apache Atlas, so I set up a standalone instance on stat1004 and ran the Hive import process for the wmf and event databases. I added some glossary terms for 'user_agent' and 'ip', classified them as PII, tagged related fields, etc.

You can check it out via ssh tunnel:

ssh -N stat1004.eqiad.wmnet -L 21000:stat1004.eqiad.wmnet:21000

Then e.g. http://localhost:21000/#!/tag/tagAttribute/PII

It's pretty cool actually!

Check out:

https://atlas.apache.org/1.2.0/Glossary.html
and
https://atlas.apache.org/1.2.0/ClassificationPropagation.html

This would allow us to classify certain fields as PII, and then have derived tables with the same fields keep their PII classification. webrequest.user_agent_map -> pageview_hourly.user_agent_map would get PII too. We could then use the Atlas API for the whitelist input for the sanitization jobs instead of manually maintaining it.
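As a rough illustration of the propagation idea (this does not use the Atlas API; the lineage dictionary and both function names are hypothetical), classifications can be pushed along field lineage and the remaining fields used as the sanitization whitelist:

```python
from collections import deque

# Hypothetical lineage map: source field -> fields derived from it.
# Only the user_agent_map edge is taken from the comment above.
LINEAGE = {
    "webrequest.user_agent_map": ["pageview_hourly.user_agent_map"],
}


def propagate(tagged, lineage):
    """Return every field that carries the tag, following lineage transitively."""
    seen = set(tagged)
    queue = deque(tagged)
    while queue:
        field = queue.popleft()
        for derived in lineage.get(field, []):
            if derived not in seen:
                seen.add(derived)
                queue.append(derived)
    return seen


def sanitization_whitelist(all_fields, pii_fields):
    """Fields that are safe to retain: everything not classified as PII."""
    return sorted(set(all_fields) - set(pii_fields))
```

Tagging only `webrequest.user_agent_map` would then mark `pageview_hourly.user_agent_map` as PII as well, so the whitelist for the derived table would not need to be maintained by hand.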

Nuria added a subscriber: chasemp. Jul 2 2019, 10:52 PM

Nice, seems like a fit for data governance (cc @chasemp), but for stream config? How would, say, sample rates be represented in the system?

@Nuria, see comment https://phabricator.wikimedia.org/T205319#5300239. I'm trying to isolate the stream config use cases from the larger problem of data governance. Part of the upcoming projects will include use cases from this as well as T201063: Modern Event Platform: Schema Registry, including things like schema UIs. Atlas has a 'schema' UI and a search engine for schema and dataset discovery. There's overlap with stream config, but I'm not sure if stream config itself fits into something like Atlas... maybe we could use it for the UI components of stream config? Really not sure.

Ottomata updated the task description. Jul 3 2019, 7:34 PM
Ottomata renamed this task from Modern Event Platform: Stream Configuration Service to Modern Event Platform: Stream Configuration. Jul 17 2019, 8:19 PM