Page MenuHomePhabricator

Modern Event Platform: Schema Guidelines and Conventions
Open, NormalPublic0 Story Points

Description

This task will be used to track work to develop good schema guidelines for schemas within the new Modern Event Platform system.

T201063: Modern Event Platform: Schema Registry is used to track actual implementation work of schema registries (git repositories, CI pipelines, HTTP servers, etc.), whereas this ticket should be used to collaborate and build good schema guidelines and conventions that the schema registry CI can enforce.

We need both style guides, as well as conventions for commonly used fields. We already have some good conventions for some fields in production mediawiki/event-schemas. We also need some conventions for fields we haven't needed yet for production services.

Event Timeline

Ottomata triaged this task as Normal priority.Jan 17 2019, 9:38 PM
Ottomata created this task.

I'm opening this task now, as we are starting to design some new schemas as part of our goals for Q3 2018-2019. We plan to migrate the Mediawiki monolog avro schemas and events over to JSON Schema and use the new EventGate stream intake service. T214080: Rewrite Avro schemas (ApiAction, CirrusSearchRequestSet) as JSONSchema and produce to EventGate tracks this work.

The Avro CirrusSearchRequestSet schema contains some common HTTP fields that the EventLogging capsule also has, e.g. ip, userAgent, http queryString, referrer, etc. I imagine that most analytics EventLogging uses will need to have these fields standardized, as well as likely other HTTP request details (like other headers?).

I know the Product-Analytics team has thought some about this. This also fits in with the Better Use of Data program.

Since we actively need to write new schemas for T214080 now, we should get together and think about how we want to standardize them for the future.

I'd like to first tackle standardizing HTTP request information fields. E.g. user_agent, ip, accept_language, other request and response headers, cookies? uri_host, uri_query, etc. since T214080 includes this data.

Should these be top level field names or should these be in a subobject?

An example of a standardized subobject is the performer object we use in mediawiki events. This object contains information about the user performing an action that triggers an event. E.g. user-blocks-change performer is the user that modified the block, NOT the blocked user.

We could make an http_info (name TBD) subobject to contain pertinent (and mostly optional) http related fields. Something like:

http_info:
  uri_protocol: https://
  uri_host: en.wikipedia.org
  uri_path: ...
  uri_query: ...
  client_ip: ...
  request_headers:
    accept_language: ...
    user_agent: ...
  response_headers:
    x_analytics: ...

(Headers to be discussed below.) Alternatively these could all be top level fields added into schemas as needed, with standardized names in the 'data dictionary' (part of the Better Use of Data program(?)).

What to do about header maps?

A difficulty of JSONSchema is that there is no built in support for map types. That is, everything is an object, which when brought into strongly typed systems like Hadoop, means that they are structs with predefined field names. This is fine for 'maps' that have a small number of keys. We can create schemas that explicitly specify the keys as fields. For a large keyspace, this isn't really practical. I'd like to eventually support maps as a schema type annotation that our Stream Connectors would know how to interpret, but that is a way off.

For the problem of HTTP headers, the question is this: are the number of HTTP headers that we want to include in events (by default?) static enough (and small enough) to explicitly define and standardize them?

If yes, then we can just include them in whatever standard for http_info (described above) we decide on. If no, i.e. events should be able to include any header they want without it being defined in the schema, then we will have a problem until we figure out how to support map types.

FYI, the patch for review for CirrusSearchRequestSet is here: https://gerrit.wikimedia.org/r/#/c/mediawiki/event-schemas/+/485885/

Change 487154 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[mediawiki/event-schemas@master] Add test/event/0.0.3 with test_map example

https://gerrit.wikimedia.org/r/487154

After some more thoughts, I think we can support map types in JSONSchema via additionalProperties:

If additionalProperties is an object, that object is a schema that will be used to validate any additional properties not listed in properties.

Keys will always be strings, but we can specify that any key is allowed and all values must have the same type. This will allow us to support arbitrary http headers like:

http_headers:
  type: object
  additionalProperties:
    type: string

Hm, a downside of using a map for headers is that the keys won't necessarily be normalized. E.g. we might get User-Agent or user-agent or UsEr-AgeNT. We could restrict the keys in maps using patternProperties, e.g.

request_headers:
  type: object
  patternProperties:
    "[a-z0-0_]": 
      type: string

That would require that clients emit the maps all lower case and with hyphens (and/or other bad SQLish characters) converted to underscores.

This might be a bad idea and be too annoying/difficult for clients to implement, but it would make things much saner for downstream datastores. Hive will actually support any string keys, but users will get different results if e.g. different clients use different casings for the same header (they'll have to query for user-agent and User-Agent).

Ok, current proposal for standardized http info field:

http:
  type: object
  properties:

    uri:
      type: string
      description: The full URI of this HTTP request

    client_ip:
      type: string
      description: The http client's IP address

    request_headers:
      type: object
      description: Request headers sent by the client.
      additionalProperties:
        type: string

    response_headers:
      type: object
      description: Response headers sent by the server.
      additionalProperties:
        type: string

I decided to use a single uri field, rather than split up the uri into parts (like we do for the webrequest schema). It'll be simpler for clients to emit this rather than each know how to separate it.

Thoughts?

Overall the http structure looks perfectly reasonable. Adding a has_cookies boolean field could be useful as well. We likely don't want to record the full cookies header that was sent over http for privacy purposes, but knowing the client sent any cookies at all is a potential bot signal (wmf infra always sets the WMF-Last-Access cookie).

For the lowercasing headers, perhaps we need some sort of data-loading process than sanitizes the input from the raw events into hadoop? For example we have a client_ip here, but I can imagine a number of use cases where we instead want the approximate geo of the ip like webrequests. This seems like something that would naturally be handled during import. Perhaps that same import process could normalize http headers?

Milimetric added a comment.EditedFeb 5 2019, 9:51 PM

I like the [a-z0-9_] restriction being up-front. It's predictable and a very fair limitation considering the many systems this data is expected to flow through.

For the lowercasing headers ... seems like something that would naturally be handled during import. Perhaps that same import process could normalize http headers?

I think you are right. Requiring that clients sanitize headers is not easy to do, and there will be a refine process for all of this data (that does things like geocoding like you say). We can normalize the headers then and there, great idea!

I need to check with @JAllemandou to see how/if we can handle map types in our Refine code. Joseph, I think if (for now) we precreate Hive tables with map<string,string> in the schema, we can make Refine be smart enough. Not exactly sure how...but we might have to use existing Hive schema when reading the input JSON data...

We need to do some work in Refine (and hopefully eventually with Kafka Connect) to support map types, and also properly support integers vs. decimal types in JSONSchema -> Hive. Until we do, we won't be able to use the map types in Hadoop.

I think that's ok though, we should assume that we will be able to use them and not make ugly schemas just because of Hadoop here.

@mobrovac, @Pchelolo any thoughts on the http field schema convention I outlined in https://phabricator.wikimedia.org/T214093#4918832 ?

Adding a has_cookies boolean field could be useful as well. We likely don't want to record the full cookies header that was sent over http for privacy purposes, but knowing the client sent any cookies at all is a potential bot signal (wmf infra always sets the WMF-Last-Access cookie).

@EBernhardson, could we add a cookies (map) field and only populate it with safe cookies, e.g. WMF-Last-Access? I should note that the headers won't necessarily include ALL http headers for the request, only ones that the client decides to set.

Adding a has_cookies boolean field could be useful as well. We likely don't want to record the full cookies header that was sent over http for privacy purposes, but knowing the client sent any cookies at all is a potential bot signal (wmf infra always sets the WMF-Last-Access cookie).

@EBernhardson, could we add a cookies (map) field and only populate it with safe cookies, e.g. WMF-Last-Access? I should note that the headers won't necessarily include ALL http headers for the request, only ones that the client decides to set.

That seems perfectly reasonable and should cover my use case.

The http schema looks good, much simpler than the first version yet clear enough. For lowercasing headers, seems to me like the perfect place to do it is the entry point, i.e. EventGate, so that all parts of the system have a standard representation of the headers.

+1 on the idea of parsing the cookies and adding them into a separate map. That was we can control easily which cookies to white- or blacklist.

EventGate, so that all parts of the system have a standard representation of the headers.

Very much disagreed... How would you flag properties that should be lowercased versus those that should not. With the schema, we can enforce lowercase names when needed and reject improperly formatted events.

That would require that clients emit the maps all lower case and with hyphens (and/or other bad SQLish characters) converted to underscores.

That seems too much! Lowercasing - yes, replacing characters.... IMHO - no. Any character except US_ASCII 0-31 and Del can appear in the HTTP header name, so technically there could be 2 different headers that differ only by _ vs -.

Very much disagreed... How would you flag properties that should be lowercased versus those that should not. With the schema, we can enforce lowercase names when needed and reject improperly formatted events.

Rejecting events that haven't lowercased the headers is another option, albeit not as user-friendly.

That seems too much! Lowercasing - yes, replacing characters.... IMHO - no. Any character except US_ASCII 0-31 and Del can appear in the HTTP header name, so technically there could be 2 different headers that differ only by _ vs -.

+1, actually changing the data should not occur.

Rejecting events that haven't lowercased the headers is another option, albeit not as user-friendly.

No user should come into our clean and beautiful system with their dirty non-lowercased header names! :)

EBernhardson added a comment.EditedFeb 19 2019, 9:40 PM

Rejecting events that haven't lowercased the headers is another option, albeit not as user-friendly.

No user should come into our clean and beautiful system with their dirty non-lowercased header names! :)

Perhaps at least worth considering https://en.wikipedia.org/wiki/Robustness_principle

Be conservative in what you do, be liberal in what you accept from others

I'd really prefer if we kept as much transformation logic out of EventGate as possible. Right now it will fill in default values from the schema, but that's it. I think requiring clients to do header/cookie sanitization is a little too much. For the analytics use case in Hadoop, we do some event augmentation and sanitization during the Refine step. We could alternatively do these either in pre stream processing or during ingestion in Hadoop (with Kafka Connect and single message transforms), but in either case I don't think we need to solve this here, especially since we are going to use map types and the only validation will be that the keys are strings.

I guess an annoying bit about ^ is that clients will have to fill in the header information manually. This is easy enough for the ApiAction and CirrusSearchRequestSet stuff on the Mediawiki side, but it is a bit annoying from the browser/Javascript client. Things like HTTP headers are already sent with the HTTP request, and putting them in the event like this forces the client to also manually put them as entries in http.request_headers field, including simple things like User-Agent. Hm.

technically there could be 2 different headers that differ only by _ vs -.

Good thing we are doing this as map type with string keys then. Specific (object/struct) field names should never have '-' in them. :)

Nuria added a comment.EditedFeb 20 2019, 5:28 AM

@EBernhardson, could we add a cookies (map) field and only populate it with safe cookies, e.g. WMF-Last-Access? I should note that the headers won't necessarily include ALL http headers for the request, only ones that the client decides to set.

@EBernhardson would your case be satisfied with having just nocookies=1 when cookies are not sent? This would be easier to support and much more useful for bot detection, as you know this is what we do on the webrequest data.

@Nuria yeah that was his original suggestion (i.e. has_cookies: true), but I suggested making it a map so that we had a generic way to collect any (non-private) cookies from the get go. If we do indeed set the WMF-Last-Access cookie with every request, then an empty cookies map would mean the same as has_cookies: false, right?

Nuria added a comment.Feb 20 2019, 5:36 PM

@Ottomata If we are trying to store cookie info to better identify bots we do not need the map at all which could become a privacy problem due to private info. While I can see that http headers will be useful per request, I really cannot see cookies being useful that way in other than bot detection for which the map is not needed. Thoughts @EBernhardson ?

@Nuria what about WMF-Last-Access?

I don't think the actual date inside the WMF-Last-Access header makes any difference. Every request made sets it to todays date, so it only has a value other than the current date for 1 request per person per day. The chances of a particular event stream capturing that request is not something we can rely on, if we wanted to know the users last access date it would have to be aggregated out of webrequests table somehow.

I don't think the actual date inside the WMF-Last-Access header makes any difference.

I assume the discussion here is confined to the particular CirrusSearch use case, correct?
(The actual date is definitely being used elsewhere currently and could be relevant to various other use cases of MEP.)

I assume the discussion here is confined to the particular CirrusSearch use case, correct?

Naw, this discussion is around standardizing some places to store http request info. We don't have to include cookies in right now, I don't think either of the two schemas in question atm use them.

EBernhardson added a comment.EditedFeb 20 2019, 10:28 PM

I don't think the actual date inside the WMF-Last-Access header makes any difference.

I assume the discussion here is confined to the particular CirrusSearch use case, correct?
(The actual date is definitely being used elsewhere currently and could be relevant to various other use cases of MEP.)

Sure, but the context was important. All pages visited will set your WMF-Last-Access data to today. Unless you can guarantee that your event stream will fire on the first possible web request that a user makes in a day the WMF-Last-Access data will be useless outside of the webrequest table. It will simply say "today" (but with a date).

So for example a user comes to wikipedia. They type something into the search bar. All of those autocomplete requests are guaranteed to have "today" as the WMF-Last-Access cookie (almost, there are edge cases around the date switchover) because their last access was to request the page they are searching from.

Tbayer added a comment.EditedFeb 20 2019, 10:38 PM

I don't think the actual date inside the WMF-Last-Access header makes any difference.

I assume the discussion here is confined to the particular CirrusSearch use case, correct?
(The actual date is definitely being used elsewhere currently and could be relevant to various other use cases of MEP.)

Sure, but the context was important. All pages visited will set your WMF-Last-Access data to today. Unless you can guarantee that your event stream will fire on the first possible web request that a user makes in a day the WMF-Last-Access data will be useless outside of the webrequest table. It will simply say "today" (but with a date).

That's correct of course; I was referring to the "any difference" sentence I quoted, which seemed a bit too general as a statement in itself. Agreed that it makes sense when confined to the context of particular event streams. (Still, I thought that the long-term vision for MEP is to also incorporate pageviews or even webrequests as events in general.)

technically there could be 2 different headers that differ only by _ vs -.

Good thing we are doing this as map type with string keys then. Specific (object/struct) field names should never have '-' in them. :)

Note that (as we found out last week with @mforns in T216096#4953210 ff.) some current EL schemas do indeed have field names with dashes in them, which the EL refine process normalizes to underscores.

Nuria added a comment.Feb 21 2019, 1:10 AM

@Ottomata let's not include cookie info and just have a nocookie marker that is useful in this context to better identify bots.

Ok! Until we need it i'm going to leave it out of schemas, but most likely it will be a has_cookie: true/false type deal when we need it.

BTW, want to make sure this is clear and ok with yall. By having http.request_headers, I'm suggesting that this is where we default to storing any request header information, e.g. User-Agent and Content-Type. These will not be top level event fields, they will be map keys. This will make it simpler to standardize these types of fields in all the schemas that might use them. I think if we need certain fields to be pulled out to a top level e.g. in a Hive table, we can do that during refinement.

Until we need it i'm going to leave it out of schemas

Oh, sorry, I forgot this is actually used in CirrusSearchRequestSet now. I will add it to the schema there. I wonder if we should just add it in general, it might be nice to have in Api request logging too. I'll do that.

Nuria added a comment.Feb 22 2019, 4:32 AM

Should be has_cookie(s): true/false (PLURAL)

Ottomata moved this task from Backlog to In Progress on the EventBus board.Mar 25 2019, 6:36 PM

Change 487154 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[mediawiki/event-schemas@master] Add test/event/0.0.3 with test_map example

https://gerrit.wikimedia.org/r/487154

Change 487154 merged by Ottomata:
[mediawiki/event-schemas@master] Add test/event/0.0.3 with test_map example

https://gerrit.wikimedia.org/r/487154

Restricted Application added a project: Analytics. · View Herald TranscriptApr 29 2019, 3:41 PM

@Pchelolo

We should make all schemas set additionalProperties: false eh? Perhaps we can do this in our WMF meta schema?

Hm. On one hand I would agree given that additional properties is basically a go for anything, so it can easily break refinery for example. On the other hand having additional properties: false makes it much harder to add new fields - we need to first add field as optional, then start producing it and then make it required if it is required. However, I think it's a minor problem.

We definitely should only make the additionalProperties: false required for the root object cause some of the nested objects in the event are only possible with additional properties: true.

I think the schema evolution rule is that you can only add optional fields anyway.

Yeah, we'd set it to false in the root. For those schemas (like the job queue ones) that use additionalProperties true: we should switch them over to using the map type, e.g.

additionalProperties:
  type: string

Another thought @Pchelolo. Perhaps we should get rid of (or deprecate) meta.uri. I don't think it belongs in meta, as many events won't have a unique 'resource' uri.

We also need to consider how we are going to store governance information about schemas/streams, e.g. https://meta.wikimedia.org/wiki/Schema_talk:PageIssues. Lkelfy this should be in the stream config service, as schemas might be reused between different streams.