Page MenuHomePhabricator

Modern Event Platform: Schema Guidelines and Conventions
Closed, ResolvedPublic0 Estimated Story Points

Description

This task will be used to track work to develop good schema guidelines for schemas within the new Modern Event Platform system.

T201063: Modern Event Platform: Schema Repostories is used to track actual implementation work of schema registries (git repositories, CI pipelines, HTTP servers, etc.), whereas this ticket should be used to collaborate and build good schema guidelines and conventions that the schema registry CI can enforce.

We need both style guides, as well as conventions for commonly used fields. We already have some good conventions for some fields in production mediawiki/event-schemas. We also need some conventions for fields we haven't needed yet for production services.

Related Objects

StatusSubtypeAssignedTask
OpenNone
ResolvedOttomata
Resolvedmpopov
ResolvedOttomata
ResolvedOttomata
DeclinedNone
DeclinedNone
ResolvedOttomata
ResolvedOttomata
ResolvedOttomata
ResolvedOttomata
Resolvedmpopov
Resolvedmpopov
Declinednettrom_WMF
DeclinedMayakp.wiki
Resolvedjlinehan
DeclinedNone
ResolvedOttomata

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

Ok, current proposal for standardized http info field:

http:
  type: object
  properties:

    uri:
      type: string
      description: The full URI of this HTTP request

    client_ip:
      type: string
      description: The http client's IP address

    request_headers:
      type: object
      description: Request headers sent by the client.
      additionalProperties:
        type: string

    response_headers:
      type: object
      description: Response headers sent by the server.
      additionalProperties:
        type: string

I decided to use a single uri field, rather than split up the uri into parts (like we do for the webrequest schema). It'll be simpler for clients to emit this rather than each know how to separate it.

Thoughts?

Overall the http structure looks perfectly reasonable. Adding a has_cookies boolean field could be useful as well. We likely don't want to record the full cookies header that was sent over http for privacy purposes, but knowing the client sent any cookies at all is a potential bot signal (wmf infra always sets the WMF-Last-Access cookie).

For the lowercasing headers, perhaps we need some sort of data-loading process than sanitizes the input from the raw events into hadoop? For example we have a client_ip here, but I can imagine a number of use cases where we instead want the approximate geo of the ip like webrequests. This seems like something that would naturally be handled during import. Perhaps that same import process could normalize http headers?

I like the [a-z0-9_] restriction being up-front. It's predictable and a very fair limitation considering the many systems this data is expected to flow through.

For the lowercasing headers ... seems like something that would naturally be handled during import. Perhaps that same import process could normalize http headers?

I think you are right. Requiring that clients sanitize headers is not easy to do, and there will be a refine process for all of this data (that does things like geocoding like you say). We can normalize the headers then and there, great idea!

I need to check with @JAllemandou to see how/if we can handle map types in our Refine code. Joseph, I think if (for now) we precreate Hive tables with map<string,string> in the schema, we can make Refine be smart enough. Not exactly sure how...but we might have to use existing Hive schema when reading the input JSON data...

We need to do some work in Refine (and hopefully eventually with Kafka Connect) to support map types, and also properly support integers vs. decimal types in JSONSchema -> Hive. Until we do, we won't be able to use the map types in Hadoop.

I think that's ok though, we should assume that we will be able to use them and not make ugly schemas just because of Hadoop here.

@mobrovac, @Pchelolo any thoughts on the http field schema convention I outlined in https://phabricator.wikimedia.org/T214093#4918832 ?

Adding a has_cookies boolean field could be useful as well. We likely don't want to record the full cookies header that was sent over http for privacy purposes, but knowing the client sent any cookies at all is a potential bot signal (wmf infra always sets the WMF-Last-Access cookie).

@EBernhardson, could we add a cookies (map) field and only populate it with safe cookies, e.g. WMF-Last-Access? I should note that the headers won't necessarily include ALL http headers for the request, only ones that the client decides to set.

Adding a has_cookies boolean field could be useful as well. We likely don't want to record the full cookies header that was sent over http for privacy purposes, but knowing the client sent any cookies at all is a potential bot signal (wmf infra always sets the WMF-Last-Access cookie).

@EBernhardson, could we add a cookies (map) field and only populate it with safe cookies, e.g. WMF-Last-Access? I should note that the headers won't necessarily include ALL http headers for the request, only ones that the client decides to set.

That seems perfectly reasonable and should cover my use case.

The http schema looks good, much simpler than the first version yet clear enough. For lowercasing headers, seems to me like the perfect place to do it is the entry point, i.e. EventGate, so that all parts of the system have a standard representation of the headers.

+1 on the idea of parsing the cookies and adding them into a separate map. That was we can control easily which cookies to white- or blacklist.

EventGate, so that all parts of the system have a standard representation of the headers.

Very much disagreed... How would you flag properties that should be lowercased versus those that should not. With the schema, we can enforce lowercase names when needed and reject improperly formatted events.

That would require that clients emit the maps all lower case and with hyphens (and/or other bad SQLish characters) converted to underscores.

That seems too much! Lowercasing - yes, replacing characters.... IMHO - no. Any character except US_ASCII 0-31 and Del can appear in the HTTP header name, so technically there could be 2 different headers that differ only by _ vs -.

Very much disagreed... How would you flag properties that should be lowercased versus those that should not. With the schema, we can enforce lowercase names when needed and reject improperly formatted events.

Rejecting events that haven't lowercased the headers is another option, albeit not as user-friendly.

That seems too much! Lowercasing - yes, replacing characters.... IMHO - no. Any character except US_ASCII 0-31 and Del can appear in the HTTP header name, so technically there could be 2 different headers that differ only by _ vs -.

+1, actually changing the data should not occur.

Rejecting events that haven't lowercased the headers is another option, albeit not as user-friendly.

No user should come into our clean and beautiful system with their dirty non-lowercased header names! :)

Rejecting events that haven't lowercased the headers is another option, albeit not as user-friendly.

No user should come into our clean and beautiful system with their dirty non-lowercased header names! :)

Perhaps at least worth considering https://en.wikipedia.org/wiki/Robustness_principle

Be conservative in what you do, be liberal in what you accept from others

I'd really prefer if we kept as much transformation logic out of EventGate as possible. Right now it will fill in default values from the schema, but that's it. I think requiring clients to do header/cookie sanitization is a little too much. For the analytics use case in Hadoop, we do some event augmentation and sanitization during the Refine step. We could alternatively do these either in pre stream processing or during ingestion in Hadoop (with Kafka Connect and single message transforms), but in either case I don't think we need to solve this here, especially since we are going to use map types and the only validation will be that the keys are strings.

I guess an annoying bit about ^ is that clients will have to fill in the header information manually. This is easy enough for the ApiAction and CirrusSearchRequestSet stuff on the Mediawiki side, but it is a bit annoying from the browser/Javascript client. Things like HTTP headers are already sent with the HTTP request, and putting them in the event like this forces the client to also manually put them as entries in http.request_headers field, including simple things like User-Agent. Hm.

technically there could be 2 different headers that differ only by _ vs -.

Good thing we are doing this as map type with string keys then. Specific (object/struct) field names should never have '-' in them. :)

@EBernhardson, could we add a cookies (map) field and only populate it with safe cookies, e.g. WMF-Last-Access? I should note that the headers won't necessarily include ALL http headers for the request, only ones that the client decides to set.

@EBernhardson would your case be satisfied with having just nocookies=1 when cookies are not sent? This would be easier to support and much more useful for bot detection, as you know this is what we do on the webrequest data.

@Nuria yeah that was his original suggestion (i.e. has_cookies: true), but I suggested making it a map so that we had a generic way to collect any (non-private) cookies from the get go. If we do indeed set the WMF-Last-Access cookie with every request, then an empty cookies map would mean the same as has_cookies: false, right?

@Ottomata If we are trying to store cookie info to better identify bots we do not need the map at all which could become a privacy problem due to private info. While I can see that http headers will be useful per request, I really cannot see cookies being useful that way in other than bot detection for which the map is not needed. Thoughts @EBernhardson ?

I don't think the actual date inside the WMF-Last-Access header makes any difference. Every request made sets it to todays date, so it only has a value other than the current date for 1 request per person per day. The chances of a particular event stream capturing that request is not something we can rely on, if we wanted to know the users last access date it would have to be aggregated out of webrequests table somehow.

I don't think the actual date inside the WMF-Last-Access header makes any difference.

I assume the discussion here is confined to the particular CirrusSearch use case, correct?
(The actual date is definitely being used elsewhere currently and could be relevant to various other use cases of MEP.)

I assume the discussion here is confined to the particular CirrusSearch use case, correct?

Naw, this discussion is around standardizing some places to store http request info. We don't have to include cookies in right now, I don't think either of the two schemas in question atm use them.

I don't think the actual date inside the WMF-Last-Access header makes any difference.

I assume the discussion here is confined to the particular CirrusSearch use case, correct?
(The actual date is definitely being used elsewhere currently and could be relevant to various other use cases of MEP.)

Sure, but the context was important. All pages visited will set your WMF-Last-Access data to today. Unless you can guarantee that your event stream will fire on the first possible web request that a user makes in a day the WMF-Last-Access data will be useless outside of the webrequest table. It will simply say "today" (but with a date).

So for example a user comes to wikipedia. They type something into the search bar. All of those autocomplete requests are guaranteed to have "today" as the WMF-Last-Access cookie (almost, there are edge cases around the date switchover) because their last access was to request the page they are searching from.

I don't think the actual date inside the WMF-Last-Access header makes any difference.

I assume the discussion here is confined to the particular CirrusSearch use case, correct?
(The actual date is definitely being used elsewhere currently and could be relevant to various other use cases of MEP.)

Sure, but the context was important. All pages visited will set your WMF-Last-Access data to today. Unless you can guarantee that your event stream will fire on the first possible web request that a user makes in a day the WMF-Last-Access data will be useless outside of the webrequest table. It will simply say "today" (but with a date).

That's correct of course; I was referring to the "any difference" sentence I quoted, which seemed a bit too general as a statement in itself. Agreed that it makes sense when confined to the context of particular event streams. (Still, I thought that the long-term vision for MEP is to also incorporate pageviews or even webrequests as events in general.)

technically there could be 2 different headers that differ only by _ vs -.

Good thing we are doing this as map type with string keys then. Specific (object/struct) field names should never have '-' in them. :)

Note that (as we found out last week with @mforns in T216096#4953210 ff.) some current EL schemas do indeed have field names with dashes in them, which the EL refine process normalizes to underscores.

@Ottomata let's not include cookie info and just have a nocookie marker that is useful in this context to better identify bots.

Ok! Until we need it i'm going to leave it out of schemas, but most likely it will be a has_cookie: true/false type deal when we need it.

BTW, want to make sure this is clear and ok with yall. By having http.request_headers, I'm suggesting that this is where we default to storing any request header information, e.g. User-Agent and Content-Type. These will not be top level event fields, they will be map keys. This will make it simpler to standardize these types of fields in all the schemas that might use them. I think if we need certain fields to be pulled out to a top level e.g. in a Hive table, we can do that during refinement.

Until we need it i'm going to leave it out of schemas

Oh, sorry, I forgot this is actually used in CirrusSearchRequestSet now. I will add it to the schema there. I wonder if we should just add it in general, it might be nice to have in Api request logging too. I'll do that.

Should be has_cookie(s): true/false (PLURAL)

Change 487154 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[mediawiki/event-schemas@master] Add test/event/0.0.3 with test_map example

https://gerrit.wikimedia.org/r/487154

Change 487154 merged by Ottomata:
[mediawiki/event-schemas@master] Add test/event/0.0.3 with test_map example

https://gerrit.wikimedia.org/r/487154

@Pchelolo

We should make all schemas set additionalProperties: false eh? Perhaps we can do this in our WMF meta schema?

Hm. On one hand I would agree given that additional properties is basically a go for anything, so it can easily break refinery for example. On the other hand having additional properties: false makes it much harder to add new fields - we need to first add field as optional, then start producing it and then make it required if it is required. However, I think it's a minor problem.

We definitely should only make the additionalProperties: false required for the root object cause some of the nested objects in the event are only possible with additional properties: true.

I think the schema evolution rule is that you can only add optional fields anyway.

Yeah, we'd set it to false in the root. For those schemas (like the job queue ones) that use additionalProperties true: we should switch them over to using the map type, e.g.

additionalProperties:
  type: string

Another thought @Pchelolo. Perhaps we should get rid of (or deprecate) meta.uri. I don't think it belongs in meta, as many events won't have a unique 'resource' uri.

We also need to consider how we are going to store governance information about schemas/streams, e.g. https://meta.wikimedia.org/wiki/Schema_talk:PageIssues. Lkelfy this should be in the stream config service, as schemas might be reused between different streams.

We are finally starting to work on T206814: CI Support for Schema Registry, so we can start to think about some other conventions we might want to adopt.

I'd like to start thinking about how to add some semantic annotation to fields for use in downstream systems. The analytics use cases are:

  • Privacy (PII) / sanitization annotations. E.g. fieldA should be kept, fieldB should be hashed, fieldC should be nullified (the default for historical data).

The goal of these annotations is to help automate (Druid) ingestion and sanitization/deletion of event data.

(Ping @mforns)

The fewer annotation we make up the better. Would an annotations object annotation suffice? Something like:

page_title:
  type: string
  annotations:
    sanitize: keep

country:
  type: string
  annotations:
    sanitize: hash

edit_count:
  type: integer
  annotations:
    metric: true

time_to_paint_ms:
  type: integer
  annotations:
    time_metric: true
    sanitize: keep

Q. Do we need to indicate that certain fields are PII only in combination with each other, like country + user_agent? If so, how could this be accomplished in the schema?

Q. Are we sure the schema is the right place to do this? Are there cases where different event streams (that use the same schema) would need e.g. different privacy sanitization settings?

Q. Should annotation changes be considered by backwards compatibility checks? (This would be really annoying to implement.

Note that these annotations would not show up in any event data. They would only be used by schema aware ingestion and sanitization code.

Q. Do we need to indicate that certain fields are PII only in combination with each other, like country + user_agent?

I do not think we want to make it that complicated. For example, we do not keep user, page pairs so if both appear in the schema one of them needs to be nullified. This measure would not be effective if with the remaining information you can build up a unique identifier for a user, thus, the nature of the data in the schema calls for different sanitization strategies and rather that building a config system that is complex, is best to leave up to humans to look at schema and see what sanitization is needed and tag individual fields accordingly.

Q. Are we sure the schema is the right place to do this? Are there cases where different event streams (that use the same schema) would need e.g. different privacy sanitization settings?

I think so, it is the nature of the data what drives sanitization nor its production or consumption. Since the schema describes the data is the perfect place to describe the sanitization/ingestion strategy.

Let's have in mind that not all schema fields are ingestable into druid as they are. Also, druid provides loads of capabilities when it comes to preprocess data (there are issues now with applying those while data is stored on parquet format and that is why we do not use them at this time but we will). See: https://druid.apache.org/docs/latest/ingestion/transform-spec.html , https://druid.apache.org/docs/latest/misc/math-expr.html

While sanitization is a clear fit for per scheme/per field annotations I am not sure that is the case for druid ingestion transforms, we certainly do not want to replicate on our end druid's transform spec so maybe it is worth thinking of ingestion guidelines as being another document rather than them being specified in the schema?

maybe it is worth thinking of ingestion guidelines as being another document rather than them being specified in the schema?

This is what we have now: https://github.com/wikimedia/puppet/blob/production/modules/profile/manifests/analytics/refinery/job/druid_load.pp#L47-L49

Let's have in mind that not all schema fields are ingestable into druid as they are

Indeed! But one of the reasons for doing this was to enable and encourage folks to design schemas that are. If they do, loading data into druid would be as simple as adding a Hive table to a list of tables for the druid ingestion pipeline to load. And/or possibly would enable realtime inserts + hive ingestion (lambda arch style).

we certainly do not want to replicate on our end druid's transform spec

Agree.
Certainly some data will need specific druid ingestion specs to load, but my hope is that a lot won't! Also, the field annotations I'm talking about might be useful outside of druid as well. Prometheus (and other?) time series datastores have simliar concept of dimension/label and metrics. Heckay we might even be able to use this for T180105: Set up a statsv-like endpoint for Prometheus.

It sounds like the Druid use case needs a little more thought. I still think this is a good idea, but let's hold for now.

Let's decide on the privacy annotations and keep other annotations like Druid in mind as we do it.

I guess the decision we need to make is: do we want a single annotations object on fields like:

field_name:
  type: string
  annotations:
    sanitize: keep

Or do we want to just leave whatever annotations/tags we want to add in field definition directly like:

field_name:
  type: integer
  sanitize: keep
  metric: true

Multiple annotations directly in definition looks better.

Hm @mforns, just considered how map types would look if we use annotations on schemas for privacy and for druid ingestion (Re T208589).

The map keys are not present in the schema, so it wouldn't be possible to annotate them directly on the field in the way we are suggesting.

Can we apply sanitization to a map field on the whole? E.g. if sanitize: hash on a map field, just hash all values?

If not we could make up some annotation notation for map field keys.


time_visible_ms:
  type: string
  annotations:
    sanitize: keep
    olap_properties: [time_measure]

geocoded_data:
  type: object
  additionalProperties:
    type: string
  annotations:
    sanitize: country:keep,state:hash
    olap_properties: [country:dimension,state:dimension]

So, if the field is a map, then parse the annotation values to discover what to do with the map values.

Yargh, this gets even more complicated because map values can be complex object with specific schemas themselves! (Example). I suppose in that case, the map value type is explicitly declared as an object schema, so the map value's schema field annotations could still go in on the fields. Hm.

I just spent an hour trying to write an schema example and a potential algorithm for this. It is certainly possible to do, but not easy at all.
I think we can support annotations on map types with primitive values via a sanitize: key1:rule1,key2:rule2 string. Perhaps it would be acceptable to not support annotations on complex map value types for now?

We can note as such in our Schema/Guidelines and say that while complex map value types are supported, they will always be purged during sanitization, and cannot be used in druid auto ingestion?

@Ottomata Hm indeed...
Specifying sanitization for the whole map can be dangerous, because a non-sensitive map field marked as 'keep' could in the future be added a sensitive field, and it would be 'kept' by default.
How will it be with geocoded_data and user_agent fields? Are they structs or maps in MEP? Some EL schemas do specify to keep parts of them.
We could maybe do:

geocoded_data:
  type: object
  additionalProperties:
    type: string
  annotations:
    sanitize:
        country: keep
        state: hash
        blah:
            foo: keep
            bar: hash
    olap_properties: [country:dimension,state:dimension]

Hm, your example isn't quite right. geocoded_data is defined to be a map of string -> string, and blah: ... looks like an object (or another map?).

What do we do with olap_properties for nested objects in maps? Same thing?

I was hoping to keep the annotations very simple; it's fields would also be a string -> string map. Petr and I have dropped the idea (for now) of defining a custom WMF meta JSONSchema for our event schemas, but we should still try to apply the same rules we have for our event schemas to our schemas. In this case: no polymorphic types. We need to be able to define the annotations (and sanitize, and olap_properties) with static monomorphic types.

How will it be with geocoded_data and user_agent fields? Are they structs or maps in MEP?

It really depends on how they are defined in the schema. Hm. Oh. As we have them, geocoded_data and user_agent_map are NOT ACTUALLY in the schemas!!! We augment them into the data at refine time...

This is kind of a problem, since we'll need sanitization and druid ingestion info for these extra fields!

Perhaps we should add them into the event schema, even if they are never set by event producers? Oof. Perhaps this is a bad idea after all...

Perhaps we should add them into the event schema, even if they are never set by event producers?

The capsule is coming back!

Hi @Ottomata - I like the annotations subobject - Explicit for the win. I however have no good reason not to use the other solution.

The capsule is coming back!

Haha, not quite. The fields would still be defined explicitly. But yes, I also don't like this idea.

Another idea:
We can very easily support 'extending' schemas using jsonschema-tools. We could maintain a separate set of extended schemas for analytics/ingestion/hive/etl/refine/whatever. Something like:

title: refined/mediawiki/revision/create
$id: /refined/mediawiki/revision/create/1.0.0
allOf:
- $ref: /mediawiki/revision/create/1.0.0
- properties:
    geocoded_data:
      type: object
      properties:
        ...
      annotations:
        sanitize:
          ...
    user_agent_map:
      type: object
      additionalProperties:
        type: string
      annotations: ...

Or even more DRY:

title: analytics/mediawiki/revision/create
$id: /analytics/mediawiki/revision/create/1.0.0
allOf:
- $ref: /mediawiki/revision/create/1.0.0
- $ref: /analytics/common/1.0.0

With analytics/common defining common fields we augment event data with.

Talked with @Milimetric today and we think this approach makes sense. We don't need to move forward with it until A. we have a new 'analytics' schema repository and B. we have the time to start implementing uses of these annotations.

Just had some more thoughts about annotations. In the examples above, we have annotations directly on leaf fields as well as object fields. This is a source of confusion, since map type fields don't have their sub fields explicitly declared.

As another idea, we could take an example from JSONSchema itself. Draft-4 had the required keyword directly on fields. Newer drafts changed this to declaring the required fields by name in a required list on the object type field. We could do the same for our annotations.

title: my/great/schema
type: object
properties:

  # Primitive string type
  time_visible_ms:
    type: string

  # Map type, annotations are specified as part of object type declaration
  geocoded_data:
    type: object
    additionalProperties:
      type: string
    # annotations for geocoded_data map fields
    annotations:
      country:
        sanitize: keep
        olap_properties: [dimension]
      state:
        sanitize: hash
        olap_properties: [dimension]

# annotations for top level fields
annotations:
  time_visible_ms:
    sanitize: keep
    olap_properties: [time_measure]
  geocoded_data:
    sanitize: keep

required:
  - time_visible_ms
  - geocoded_data

This solves the problem of not having to reference sub field names in the annotations themselves as the annotations are always applied to the direct fields of an object. It also makes the algorithm simpler, as applying any annotation logic is the same at every level of the schema. However, as you can see in the example, both the top level geocoded_data field and its sub field's sanitization annotations must be specified separately. This would be true for every nested level, which could get confusing for schema authors.

+1, seems very straight forward