
Figure out where stream/schema annotations belong (for sanitization and other use cases)
Closed, Resolved (Public)

Description

T263466 started a discussion that we've long put off, which can be broken down into two related parts:

  • Where should sanitization configuration for a stream live? (In the schema? In stream config? Somewhere else?)
  • How should we automate the setting of dynamic defaults in events, particularly from HTTP headers?

These seem like different issues, but it turns out they are related. In T248884, the idea was to add defined properties to a map type field. These properties would be ignored in the schema (since a map schema does not have struct sub fields; it has undefined map keys and values), but would be used by EventGate to automatically set map entries in http.request_headers if that field in the schema had defined properties. This would be almost equivalent to the annotations idea for sanitization settings. I was proposing to use defined properties to figure out something about the entries in the map field, but we could just as easily use custom annotations that are not part of the JSONSchema spec.
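To make the T248884 idea concrete, here is a minimal Python sketch. It is an illustration only: EventGate is a separate service and this is not its code. The schema fragment and the helper name set_header_defaults are hypothetical; only the http.request_headers field and the "defined properties on a map field" idea come from the description above.

```python
from typing import Any, Dict

# Hypothetical schema fragment: a map field that also lists "properties".
# Under the proposal, the listed keys only tell the producer-side service
# which map entries it should fill in from the incoming HTTP request.
SCHEMA: Dict[str, Any] = {
    "properties": {
        "http": {
            "type": "object",
            "properties": {
                "request_headers": {
                    "type": "object",
                    "additionalProperties": {"type": "string"},
                    "properties": {"user-agent": {"type": "string"}},
                }
            },
        }
    }
}


def set_header_defaults(
    event: Dict[str, Any],
    schema: Dict[str, Any],
    request_headers: Dict[str, str],
) -> Dict[str, Any]:
    """Return a copy of event with http.request_headers defaults filled in.

    Only headers named under the map field's "properties" are copied, and
    values already present in the event are never overwritten.
    """
    field = schema["properties"]["http"]["properties"]["request_headers"]
    wanted = field.get("properties", {})
    if not wanted:
        return dict(event)

    # HTTP header names are case-insensitive, so compare in lower case.
    incoming = {name.lower(): value for name, value in request_headers.items()}

    result = dict(event)
    http = dict(result.get("http", {}))
    headers = dict(http.get("request_headers", {}))
    for name in wanted:
        if name.lower() in incoming and name not in headers:
            headers[name] = incoming[name.lower()]
    http["request_headers"] = headers
    result["http"] = http
    return result
```

Calling set_header_defaults({"action": "click"}, SCHEMA, {"User-Agent": "curl/8", "Cookie": "x"}) copies only the user-agent header into the event, because that is the only key the schema lists. The limitation discussed in this task follows directly: the behavior is fixed per schema, not per stream.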

The discussion that this started became more general: Where does this type of information belong? Sanitization annotations seemingly belong in a schema, but there might be cases where a schema is used by multiple stream datasets, and whether or not a field should be sanitized might be dependent on the use case.

Example:

Perhaps there is a schema that has

properties:
  user_name:
    type: string

And two streams that use this schema: public_user_actions and private_user_actions. If we were to store sanitization configuration in the schema, we wouldn't have a way of applying different sanitization rules for these two datasets.
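As a rough illustration of the alternative, the sketch below keeps sanitization rules in a per-stream config instead of in the shared schema. Everything except the stream names and the user_name field is an assumption: the STREAM_CONFIG layout, the "hash" and "drop" actions, and the choice of which stream gets which rule are made up for the example and do not describe any existing config format.

```python
import hashlib
from typing import Any, Dict

# Hypothetical per-stream settings. Both streams use the same schema, but
# each one carries its own sanitization rules. Which stream hashes
# user_name is an arbitrary choice for illustration.
STREAM_CONFIG: Dict[str, Dict[str, Dict[str, str]]] = {
    "public_user_actions": {"sanitize": {"user_name": "hash"}},
    "private_user_actions": {"sanitize": {}},
}


def sanitize(stream: str, event: Dict[str, Any]) -> Dict[str, Any]:
    """Return a sanitized copy of event using the rules for the given stream."""
    rules = STREAM_CONFIG[stream]["sanitize"]
    out = dict(event)
    for field, action in rules.items():
        if field not in out:
            continue
        if action == "hash":
            # Replace the value with its SHA-256 hex digest.
            out[field] = hashlib.sha256(str(out[field]).encode("utf-8")).hexdigest()
        elif action == "drop":
            del out[field]
        else:
            raise ValueError(f"unknown sanitization action: {action}")
    return out
```

With this layout, sanitize("private_user_actions", event) leaves user_name untouched while sanitize("public_user_actions", event) hashes it, even though both events validate against the same schema.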

This is similar to the HTTP header defaults idea in T263466. I wanted to use the schema to determine whether EventGate should set headers, like X-Client-IP. But, again, if there are two streams that use the same schema, and one wants X-Client-IP and the other doesn't, we wouldn't have a way to vary EventGate's behavior accordingly.


This task should be used to discuss this issue and try to come to a consensus about how we want to do this type of thing. Whatever we choose should probably be the same for both of these use cases, as well as future ones that need to add some information about a stream/schema.

Perhaps this all just belongs in event stream config? Perhaps it belongs in a data governance system like Atlas?

It sure would be simpler to implement if it was just in the schema. Do we need to support varying these types of annotations? Maybe not? Maybe if a stream needs e.g. different sanitization settings, it can just make and use a different schema.

Related Objects

Status    Assigned
Resolved  Ottomata
Resolved  Ottomata
Resolved  mpopov
Resolved  Ottomata
Resolved  Ottomata
Declined  None
Declined  None
Resolved  Ottomata
Resolved  Ottomata
Resolved  Ottomata
Resolved  Ottomata
Resolved  mpopov
Resolved  mpopov
Declined  nettrom_WMF
Declined  Mayakp.wiki
Resolved  jlinehan
Declined  None
Resolved  Ottomata

Event Timeline

I think we've discussed this before, but just for the record:
I think one important aspect of the sanitization config is that changes to it can only take effect after a +2 from the analytics/security team.
Otherwise, a change might cause privacy-sensitive data to be stored for more than 90 days, which would then require back-filling, auditing and discussions.
So, I believe there must be some centralized control over sanitization configs. Maybe this fact helps us choose which way to go.

Ottomata claimed this task.
Ottomata triaged this task as Medium priority.

I'm going to close this task. We certainly have determined we don't think this type of info belongs in the schemas.