Figure out where stream/schema annotations belong (for sanitization and other use cases)
Closed, Resolved · Public

Authored By

	Ottomata
	Sep 23 2020, 5:40 PM

Description

T263466 started a discussion that we've long put off, which can be broken down into two related parts:

Where should sanitization configuration for a stream live? (In the schema? In stream config? Somewhere else?)
How to automate setting of dynamic defaults in events; particularly from HTTP headers.

These seem like different issues, but it turns out they are related. In T248884, the idea was to add defined properties to a map type field. These properties would be ignored in the schema (since a map schema does not have struct sub fields; it has undefined map keys and values), but would be used by EventGate to automatically set map entries in http.request_headers if that field in the schema had defined properties. This would be almost equivalent to this annotations idea for sanitization settings. I was proposing to use defined properties to figure out something about the entries in the map field, but we could just as easily use some custom annotations that are not part of the JSONSchema spec.
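To make the T248884 idea concrete, here is a minimal sketch in Python. It is not EventGate's actual implementation: the function name `set_header_defaults`, the schema fragment, and the header values are all hypothetical, and the map field is assumed to be expressed with `additionalProperties` plus a `properties` block that acts as the annotation.

```python
# Hypothetical sketch of schema-driven header defaults; not EventGate code.
def set_header_defaults(event, headers_schema, request_headers):
    """Copy into event['http']['request_headers'] only those incoming
    headers that the map field's schema names under `properties`."""
    wanted = headers_schema.get("properties", {})
    # HTTP header names are case-insensitive, so compare them lowercased.
    incoming = {k.lower(): v for k, v in request_headers.items()}
    target = event.setdefault("http", {}).setdefault("request_headers", {})
    for name in wanted:
        key = name.lower()
        # Only fill in a header the producer has not already set.
        if key in incoming and key not in target:
            target[key] = incoming[key]
    return event


# Assumed schema fragment for the http.request_headers map field.
headers_schema = {
    "type": "object",
    "additionalProperties": {"type": "string"},  # makes this a map type
    "properties": {"user-agent": {"type": "string"}},  # the "annotation"
}

event = set_header_defaults(
    {"action": "click"},
    headers_schema,
    {"User-Agent": "curl/7.64", "Cookie": "secret"},
)
```

In this sketch only `user-agent` ends up in the event; the `Cookie` header is ignored because the schema does not name it.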

The discussion that this started became more general: Where does this type of information belong? Sanitization annotations seemingly belong in a schema, but there might be cases where a schema is used by multiple stream datasets, and whether or not a field should be sanitized might be dependent on the use case.

Example:

Perhaps there is a schema that has

properties:
  user_name:
    type: string

And two streams that use this schema: public_user_actions and private_user_actions. If we were to store sanitization configuration in the schema, we wouldn't have a way of applying different sanitization rules for these two datasets.
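A minimal sketch of the alternative, where sanitization settings are keyed by stream rather than stored in the schema. The `STREAM_SANITIZATION` mapping, the `keep`/`hash` action names, the extra `action` field, and the choice to drop unlisted fields are all assumptions made for illustration, not an existing configuration format.

```python
import hashlib

# Hypothetical per-stream sanitization settings, kept outside the schema.
# Both streams share one schema but treat user_name differently.
STREAM_SANITIZATION = {
    "public_user_actions": {"user_name": "keep", "action": "keep"},
    "private_user_actions": {"user_name": "hash", "action": "keep"},
}


def sanitize(stream, event):
    """Return a sanitized copy of event according to the stream's rules."""
    rules = STREAM_SANITIZATION.get(stream, {})
    out = {}
    for field, value in event.items():
        rule = rules.get(field, "drop")  # assumption: unlisted fields are dropped
        if rule == "keep":
            out[field] = value
        elif rule == "hash":
            out[field] = hashlib.sha256(str(value).encode("utf-8")).hexdigest()
        # "drop" (or any unknown rule): leave the field out
    return out


event = {"user_name": "Alice", "action": "edit", "debug": "x"}
public_event = sanitize("public_user_actions", event)
private_event = sanitize("private_user_actions", event)
```

The same event keeps `user_name` as-is in `public_event` and carries only a SHA-256 digest in `private_event`, which is the kind of per-stream variation a schema-only annotation could not express.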

This is similar to the HTTP header defaults idea in T263466. I wanted to use the schema to determine if EventGate should set headers, like X-Client-IP. But, again, if there are two streams that use the same schema, but one wants X-Client-IP and the other doesn't, we wouldn't have a way to vary EventGate's behavior accordingly.
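The header case could be handled the same way if the setting lived per stream. The sketch below is hypothetical: the stream names, the `STREAM_HEADER_DEFAULTS` mapping, and the target field `http.client_ip` are chosen for illustration and are not an existing EventGate or stream config syntax.

```python
# Hypothetical per-stream config: request header -> dotted event field path.
STREAM_HEADER_DEFAULTS = {
    "stream_with_ip": {"x-client-ip": "http.client_ip"},
    "stream_without_ip": {},
}


def apply_header_defaults(stream, event, request_headers):
    """Set event fields from request headers, as configured for the stream."""
    incoming = {k.lower(): v for k, v in request_headers.items()}
    for header, path in STREAM_HEADER_DEFAULTS.get(stream, {}).items():
        if header not in incoming:
            continue
        *parents, leaf = path.split(".")
        node = event
        for part in parents:
            node = node.setdefault(part, {})
        # setdefault keeps any value the producer already supplied.
        node.setdefault(leaf, incoming[header])
    return event


headers = {"X-Client-IP": "203.0.113.7"}
with_ip = apply_header_defaults("stream_with_ip", {"action": "click"}, headers)
without_ip = apply_header_defaults("stream_without_ip", {"action": "click"}, headers)
```

Both streams can share one schema here; only the per-stream mapping decides whether the client IP is recorded.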

This task should be used to discuss this issue and try to come to a consensus about how we want to do this type of thing. Whatever we choose should probably be the same for both of these use cases, as well as future ones that need to add some information about a stream/schema.

Perhaps this all just belongs in event stream config? Perhaps it belongs in a data governance system like Atlas?

It sure would be simpler to implement if it was just in the schema. Do we need to support varying these types of annotations? Maybe not? Maybe if a stream needs e.g. different sanitization settings, it can just make and use a different schema.

Related Objects

Status	Assigned	Task
Resolved	Ottomata	T185233 Modern Event Platform
Resolved	Ottomata	T214093 Modern Event Platform: Schema Guidelines and Conventions
Resolved	mpopov	T214129 Provide Product Analytics input on Modern Event Platform schema conventions
Resolved	Ottomata	T215442 Make Refine use JSONSchemas of event data to support Map types and proper types for integers vs decimals
Resolved	Ottomata	T218617 Fix EventLogging schemas that use array for items type
Declined	None	T218347 Ingest cirrussearchrequest data into druid
Declined	None	T222656 Fix active EventLogging schemas that added backwards incompatable required fields.
Resolved	Ottomata	T212529 Standardize datetimes/timestamps in the Data Lake
Resolved	Ottomata	T217040 Add UTC 'Z' suffix to webrequest `dt` field.
Resolved	Ottomata	T217041 Use Z UTC suffix in EventBus emitted events rather than +00:00
Resolved	Ottomata	T233329 Write and update Event Platform instrumentation documentation for Product teams
Resolved	mpopov	T253269 Product Analytics to review & provide feedback for Event Platform Instrumentation How-To
Resolved	mpopov	T254810 Mikhail's review of Event Platform Instrumentation How-To
Declined	nettrom_WMF	T254811 Morten's review of Event Platform Instrumentation How-To
Declined	Mayakp.wiki	T254812 Maya's review of Event Platform Instrumentation How-To
Resolved	• jlinehan	T254813 Jason's review of Event Platform Instrumentation How-To
Declined	None	T210012 Define how we vet code & data for ongoing, automated ingestion in Druid
Resolved	Ottomata	T263672 Figure out where stream/schema annotations belong (for sanitization and other use cases)

Event Timeline

Ottomata created this task. Sep 23 2020, 5:40 PM

Restricted Application added a subscriber: Aklapper. Sep 23 2020, 5:40 PM

Ottomata added parent tasks: T214093: Modern Event Platform: Schema Guidelines and Conventions, T185233: Modern Event Platform. Sep 23 2020, 5:41 PM

Ottomata updated the task description.

Ottomata updated the task description. Sep 23 2020, 6:37 PM

I think we've discussed this before, but just for the record:
I think one important aspect of the sanitization config is that changes to those configs can only take effect after a +2 from the analytics/security team.
Otherwise, privacy-sensitive data might be stored for more than 90 days, which would then require back-filling, auditing and discussions.
So, I believe there must be some centralized control over sanitization configs. Maybe this fact helps us choose which way to go.

+2 to @mforns's comments

nshahquinn-wmf subscribed. Sep 25 2020, 1:57 PM

• fdans moved this task from Incoming to Event Platform on the Analytics board. Oct 8 2020, 5:13 PM

Ottomata mentioned this in T263466: EventGate idea: use presence of schema properties in http.(request|response)_headers to automatically set header values in event data. Oct 14 2020, 2:54 PM

CDanis subscribed. Oct 26 2020, 1:14 PM

Ottomata mentioned this in T210012: Define how we vet code & data for ongoing, automated ingestion in Druid. Oct 27 2020, 3:59 PM

CDanis mentioned this in T262626: Remove http.client_ip from EventGate default schema (again). Oct 28 2020, 3:23 PM

Ottomata mentioned this in T267592: Updated schema strategy for analytics events. Dec 2 2020, 6:07 PM

Ottomata mentioned this in T271456: Enable 'skin' dimension using stream configuration. Jan 26 2021, 9:43 PM

Ottomata mentioned this in T273293: Define acceptable usage of the `meta` object in event schemas. Jan 29 2021, 5:46 PM

Ottomata mentioned this in T273235: [Metrics Platform] Define stream configuration syntax relevant to v1 release. Feb 1 2021, 2:17 PM

I'm going to close this task. We certainly have determined we don't think this type of info belongs in the schemas.
