Write and update Event Platform instrumentation documentation for Product teams
Closed, Resolved (Public)

Event Timeline

nshahquinn-wmf claimed this task.
nshahquinn-wmf triaged this task as Medium priority.
nshahquinn-wmf moved this task from Triage to Next Up on the Product-Analytics board.

So far this mediawiki:Event Platform/Schemas draft is mostly documentation about how to create and modify schemas in a git repository. We'll also need some documentation about fields and naming conventions, TBD.

Thank you for putting all this effort into documentation so early, @Ottomata!

I really like where things are at (particularly the context section and the focus on giving step-by-step instructions for the main tasks users will do), but of course all of these comments focus on the things I noticed could be improved. At a few points, I asked questions and then later realized the answers, but I left the original questions there so you could see my thought process :)

JSONSchema provides powerful data validation, but unlike Avro, it does not provide schema evolution. That is, each schema is distinct, and there is no way to explicitly declare that a given schema is just a new version of another. Schema evolution is necessary to be able to reliably upgrade producer and consumer code.

So...we can't reliably upgrade producer and consumer code? That seems bad; did we work around this limitation somehow?

A schema repository is a git repository with a hierarchy of versioned JSONSchema files, with a layout something like:

I had to think for a little bit before I realized this was the file layout. Also, it would be helpful to use a concrete example because it's harder to think in terms of "namespaces", "verbs", and "entities" (the button click schema is a good specific example).

Creating a new schema repository

How often are we going to be creating a new schema repository? I thought all the analytics stuff would be in one big repo and we'd just add individual directories and files, so this seems mostly irrelevant (particularly to an analyst).

Oh, on second thought, maybe this is meant for developers who want to test on their own machine? Giving more context about when you'd follow these steps would be helpful.

Materializing the schema

"Materializing" seems unnecessarily jargony. From the user's perspective, it's just creating a file.

There are a few more common meta fields that WMF defines, but we don't need to explain them all here. For now we will write out just these 2 example meta fields.

Why are we manually writing out the meta fields in this example when (if I understand correctly) for analytics schemas, we're always going to use a reference to the common schema?

These are mocha tests, so all we need to do is run npm test.

Before we commit, I assume? Maybe this should be integrated into each section so, for example, it's in the "modifying schemas" instructions at the appropriate point.

Changing an existent version should not be done (unless you really know what you are doing).

It would be helpful to include an example of when it should be done, so people can check if they really do know what they're doing. Is it that you shouldn't modify unless you're sure that the schema isn't being used by any producers or consumers yet?

Also every event should belong to a certain dataset or stream of events. Each event needs to specify which stream it belongs to.

So, yeah, I've heard that in the new system a stream can have separate schemas for separate events, but I don't understand exactly what this means. Maybe you could give an example of events in the same stream that should be represented as different schemas. Also, does this mean there's no longer a one-to-one correspondence between database tables and streams? Because the only point of having separate schemas would be that the schemas would be different, so they couldn't be combined into one table.

jsonschema-tools can be configured with multiple baseSchemaUris, the default of which is just the schemaBasePath

This is the first mention of these properties. It took me a while to realize that they're probably meant to go into package.json.

Oh, no, I see it's supposed to go in .jsonschema-tools.yaml. I think I missed that because it's kinda hidden in the middle of a code snippet. Explaining explicitly how jsonschema-tools is supposed to be configured and what the options are would be helpful.

jsonschema-tools is a NodeJS library and CLI for managing JSONSchema git repositories. To create a new schema repository, you'll create a package.json file, install and configure jsonschema-tools, and set up jsonschema-tools tests.

Installing jsonschema-tools is only mentioned in the context of creating a schema. Do we not need it for, say, running tests after modifying a schema too?

And a few more comments here (because I accidentally saved that last comment too early):

jsonschema-tools is a NodeJS library and CLI for managing JSONSchema git repositories.

It would be helpful to explain what the user should install if they don't have any NodeJS tools on their system already. I just checked and I don't have npm available in the terminal of my Mac.

Creating a new schema

This section doesn't explain which field/validation types are possible with a JSON schema. Maybe you could just link to a good documentation page that explains it?

Okay, that's everything I can think of right now. Feel free to ping me again if you have other questions or new drafts you'd like to review!

Thanks for comments!

So...we can't reliably upgrade producer and consumer code? That seems bad; did we work around this limitation somehow?

Read on! This is covered, no?
"WMF has developed the [https://github.com/wikimedia/jsonschema-tools jsonschema-tools] library to aid in developing versioned and backwards compatible schemas in git."

I changed the part you quoted to:
"...it does not have schema evolution built in. That is, to JSONSchema, each schema is distinct..."

I had to think for a little bit before I realized this was the file layout. Also, it would be helpful to use a concrete example

Ok, changed it to use concrete examples.

How often are we going to be creating a new schema repository?

Not often! But I wanted to show how it is done, and that there is nothing special about our schema repositories other than that they use jsonschema-tools during development and testing. Aside from that, they are just git repos.

I added "Most likely you will already be working with a schema repository. If so, skip to [[Event_Platform/Schemas#Creating_a_new_schema|Creating a new schema]] or [[Event_Platform/Schemas#Modifying_schemas|Modifying schemas]]."

"Materializing" seems unnecessarily jargony. From the user's perspective, it's just creating a file.

Hm, it's more than that. Materializing takes the current.yaml file and dereferences and merges it to create a canonical dereferenced version file. The process of materializing is like rendering a template into a static file. Or compiling. We chose to use the term 'materializing' in jsonschema-tools to differentiate. I added some information on what jsonschema-tools means by 'materializing'.
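
To make the "rendering a template into a static file" comparison concrete, here is a minimal Python sketch of dereferencing and merging. It is not how jsonschema-tools is implemented: the in-memory `COMMON` repository, the `/common/1.0.0` URI, the `button/click` schema, and the `merge` and `materialize` helpers are all hypothetical, and the sketch assumes references only appear under `allOf`.

```python
import copy

# Hypothetical in-memory "repository" keyed by schema URI (an assumption for
# illustration; a real schema repository stores these as files in git).
COMMON = {
    "/common/1.0.0": {
        "type": "object",
        "properties": {
            "meta": {
                "type": "object",
                "properties": {"stream": {"type": "string"}},
            }
        },
    }
}


def merge(base, override):
    """Recursively merge two dicts; values in `override` win on conflict."""
    result = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(result.get(key), dict):
            result[key] = merge(result[key], value)
        else:
            result[key] = copy.deepcopy(value)
    return result


def materialize(schema, repo):
    """Inline every $ref listed under allOf and merge into one static schema."""
    out = {k: v for k, v in schema.items() if k != "allOf"}
    for part in schema.get("allOf", []):
        resolved = repo[part["$ref"]] if "$ref" in part else part
        # Fields already collected in `out` take precedence over merged-in parts.
        out = merge(materialize(resolved, repo), out)
    return out


# A stand-in for current.yaml: it refers to the common schema instead of
# repeating the meta fields.
current = {
    "title": "button/click",
    "allOf": [
        {"$ref": "/common/1.0.0"},
        {"properties": {"button_name": {"type": "string"}}},
    ],
}

materialized = materialize(current, COMMON)
```

The resulting `materialized` dict no longer contains `allOf` or `$ref`; it holds both `meta` and `button_name` under `properties`, which is the sense in which the versioned file is a static rendering of current.yaml.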

Oh, no, I see it's supposed to go in .jsonschema-tools.yaml

Added this explicitly, and also linked to https://github.com/wikimedia/jsonschema-tools#jsonschema-tools-config-files

Creating a new schema

This section doesn't explain which field/validation types are possible with a JSON schema. Maybe you could just link to a good documentation page that explains it?

I'm not sure that is relevant here? There are links to JSONSchema documentation. I added a link to a JSONSchema tutorial here.

Why are we manually writing out the meta fields in this example when (if I understand correctly) for analytics schemas, we're always going to use a reference to the common schema?

Read on! This is just a simple example; <tt>$ref</tt> is explained later.

These are mocha tests, so all we need to do is run npm test.

Before we commit, I assume? Maybe this should be integrated into each section so, for example, it's in the "modifying schemas" instructions at the appropriate point.

Not necessarily! You can commit or run npm test. If you want to run npm test on your schema modifications in current.yaml before committing, you'll need to materialize manually instead of relying on the git pre-commit hook to do it:

$(npm bin)/jsonschema-tools materialize-modified

Will do it. I'm not sure what the best development practice is here. Do you think I should change the docs so it explicitly says to run jsonschema-tools materialize-modified and then npm test before committing, and also just note that materialize-modified gets run on git commit? Or hm. I could make both materialize-modified AND npm test be run by the pre-commit hook. If the tests fail, the commit will fail too.

Changing an existent version should not be done (unless you really know what you are doing).

It would be helpful to include an example of when it should be done

I'd rather just tell people not to do it in docs, but work with them if they really have to.

I've heard that in the new system a stream can have separate schemas for separate events, but I don't understand exactly what this means.

Added some more info and some examples.

Also, does this mean there's no longer a one-to-one correspondence between database tables and streams?

No, the meta.stream will end up being the Hive table name. There may be multiple Kafka topics underlying any given stream name. This just means that there is no longer a one-to-one correspondence between event schemas and database tables.
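
A small Python sketch of that point: the table is chosen by `meta.stream`, not by the schema. The only thing taken from the reply above is that the stream name becomes the table name; the sample events, the stream names, the `$schema` URIs, and the `schemas_by_table` helper are hypothetical.

```python
from collections import defaultdict

# Hypothetical events: the same schema is used by two different streams.
events = [
    {"$schema": "/analytics/button/click/1.0.0", "meta": {"stream": "search_button_clicks"}},
    {"$schema": "/analytics/button/click/1.0.0", "meta": {"stream": "editor_button_clicks"}},
    {"$schema": "/analytics/page/view/1.0.0", "meta": {"stream": "page_views"}},
]


def schemas_by_table(events):
    """Group schema URIs by the table each event lands in (named after meta.stream)."""
    tables = defaultdict(set)
    for event in events:
        tables[event["meta"]["stream"]].add(event["$schema"])
    return dict(tables)


tables = schemas_by_table(events)
```

Here `tables` has three keys, one per stream, even though only two schemas are involved, so counting schemas no longer tells you how many tables exist.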

Installing jsonschema-tools is only mentioned in the context of creating a schema. Do we not need it for, say, running tests after modifying a schema too?

Added "[https://github.com/wikimedia/jsonschema-tools jsonschema-tools] will be used in the rest of this documentation to set up and develop schemas in a git schema repository. Please skim the [https://github.com/wikimedia/jsonschema-tools#jsonschema-tools jsonschema-tools README] before proceeding." near the top.

It would be helpful to explain what the user should install if they don't have any NodeJS tools on their system already.

Added a link to https://nodejs.org/en/.

Thanks for comments!

Thanks for working on this! 😁

So...we can't reliably upgrade producer and consumer code? That seems bad; did we work around this limitation somehow?

Read on! This is covered, no?
"WMF has developed the [https://github.com/wikimedia/jsonschema-tools jsonschema-tools] library to aid in developing versioned and backwards compatible schemas in git."

I changed the part you quoted to:
"...it does not have schema evolution built in. That is, to JSONSchema, each schema is distinct..."

Well, I think the answer was there but the link wasn't explicit. If the reader has the background knowledge they can make that connection themselves, but it's nice to spell it out explicitly.

And maybe the larger point I was groping towards was that in places that section focuses on how we made the choices we did and what the alternatives were, whereas what the user most needs to understand is why the architecture is the way it is now. It is valuable to record that history, but I would say a user-facing document isn't the best place for it.

So maybe that section could say something like: "Schemas are essential to a data streaming platform because....Our data streaming platform uses JSONSchema, but JSONSchema does not have any built-in features for schema evolution. Therefore, each change (even a small one) requires the creation of a totally separate JSONSchema file. To make that easy, we've developed the jsonschema-tools library..."
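
Since JSONSchema itself does not relate one version file to the next, compatibility between versions has to be checked by tooling. A minimal Python sketch of such a check follows, under an assumed and deliberately simplified definition of "backwards compatible": no field is removed, no field changes type, and no new field becomes required. This is an illustration only, not the check jsonschema-tools performs; the helper name and the example schemas are hypothetical.

```python
def is_backwards_compatible(old, new):
    """Return True if `new` keeps every old field and type and adds no required fields."""
    old_props = old.get("properties", {})
    new_props = new.get("properties", {})
    for name, spec in old_props.items():
        if name not in new_props:
            return False  # a field was removed
        if new_props[name].get("type") != spec.get("type"):
            return False  # a field changed type
    # A newly required field would reject events from producers on the old version.
    return set(new.get("required", [])) <= set(old.get("required", []))


# Hypothetical versions of one schema, each of which would live in its own file.
v1_0_0 = {
    "properties": {"button_name": {"type": "string"}},
    "required": ["button_name"],
}
v1_1_0 = {
    "properties": {
        "button_name": {"type": "string"},
        "page_title": {"type": "string"},  # new optional field
    },
    "required": ["button_name"],
}
v_broken = {
    "properties": {"button_name": {"type": "integer"}},  # type changed
    "required": ["button_name"],
}

ok = is_backwards_compatible(v1_0_0, v1_1_0)
broken = is_backwards_compatible(v1_0_0, v_broken)
```

Adding an optional field passes the check, while changing the type of an existing field does not.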

Creating a new schema

This section doesn't explain which field/validation types are possible with a JSON schema. Maybe you could just link to a good documentation page that explains it?

I'm not sure that is relevant here? There are links to JSONSchema documentation. I added a link to a JSONSchema tutorial here.

Well, yes, the JSONSchema site is definitely easy to find! But I think the basic question I had here was: when writing a schema, which data types and validation constraints can I use? I just spent some time with the docs trying to find the best page to answer that. All the pages linked from the learn page are examples, so they use some data types and constraints but aren't an exhaustive list. Then (choosing from the options in the top toolbar), I checked out the specifications themselves, which do have the full list but are very dense.

Then, from another tab I had open (because it's not linked from the main JSONSchema site) I found the reference section of "Understanding JSONSchema", which was exactly what I wanted. You've linked there to explain specific properties (which is very helpful), but didn't call it out as a general overview/reference list, so I didn't realize it was there 😁

Why are we manually writing out the meta fields in this example when (if I understand correctly) for analytics schemas, we're always going to use a reference to the common schema?

Read on! This is just a simple example; <tt>$ref</tt> is explained later.

Yeah, definitely. But in terms of sequencing, I don't think it's necessary to relegate the part on referencing to the end of the document, since the basic concept of $ref isn't that complex. Here, the simple example isn't that much simpler, so why not show the version that's closer to the kind readers will actually be writing?

I'm not sure what the best development practice is here. Do you think I should change the docs so it explicitly says to run jsonschema-tools materialize-modified and then npm test before committing, and also just note that materialize-modified gets run on git commit? Or hm. I could make both materialize-modified AND npm test be run by the pre-commit hook. If the tests fail, the commit will fail too.

I do like the idea of adding the tests to the pre-commit hook! Very easy from the user's point of view. We would probably need a way to bypass tests though; if we do a mass import of current schemas there's going to be a lot that doesn't pass (like camel case field names).

Changing an existent version should not be done (unless you really know what you are doing).

It would be helpful to include an example of when it should be done

I'd rather just tell people not to do it in docs, but work with them if they really have to.

Yeah, fair point. In that case maybe change it to "should not be done (if you think you need to do it, get in touch with Analytics)."

I've heard that in the new system a stream can have separate schemas for separate events, but I don't understand exactly what this means.

Added some more info and some examples.

I read it—thank you, very helpful!

Also, one final high-level thought: the fundamental processes of creating and modifying a schema are largely similar. The bigger difference is between working on a schema (which requires understanding JSONSchema, the required meta properties, when and how to use $ref, and so on) and committing your work (which requires understanding Git, jsonschema-tools, when to run npm test and so on). Maybe that distinction would be a better organizing principle for that part of the doc?

So maybe that section could say something like [...]

Nice I like it! Modified.

I don't think it's necessary to relegate the part on referencing to the end of the document, since the basic concept of $ref isn't that complex. Here, the simple example isn't that much simpler, so why not show the version that's closer to the kind readers will actually be writing?

Hm. Indeed. I think I wanted this document to be more of a generic 'how it all works' document. The examples here are assuming any generic schema repository, that may or may not have the same /common schema that we do. Perhaps when we create the new analytics schema repository, I can write up another sub page with some more practical examples?

I do like the idea of adding the tests to the pre-commit hook! Very easy from the user's point of view. We would probably need a way to bypass tests though; if we do a mass import of current schemas there's going to be a lot that doesn't pass (like camel case field names).

Schemas can be skipped by [[ https://github.com/wikimedia/jsonschema-tools#jsonschema-tools-config-files | configuring .jsonschema-tools.yaml ]] with ignoreSchemas: ['/bad/schema/1.0.0'].

However, I'm not sure. I'm a little worried that git commit then failing because of an unexpected automated test might be confusing to users. But maybe they'd get used to it?

Yeah, fair point. In that case maybe change it to "should not be done (if you think you need to do it, get in touch with Analytics)."

Done.

Maybe that distinction would be a better organizing principle for that part of the doc?

Hm, that is a good idea. I think perhaps an 'analytics event platform tutorial' page will be better for this. How's that sound?

This is clearly still open—maybe I shouldn't assume that my recommendations are immaculate and will be implemented without discussion 😂

I don't think it's necessary to relegate the part on referencing to the end of the document, since the basic concept of $ref isn't that complex. Here, the simple example isn't that much simpler, so why not show the version that's closer to the kind readers will actually be writing?

Hm. Indeed. I think I wanted this document to be more of a generic 'how it all works' document. The examples here are assuming any generic schema repository, that may or may not have the same /common schema that we do. Perhaps when we create the new analytics schema repository, I can write up another sub page with some more practical examples?

Hmm, interesting. The description of that /common schema is "common schema fields for all WMF schemas". So when you say "generic", are you thinking about people using our tooling (like jsonschema-tools and eventgate) outside Wikimedia?

If so, I think this documentation should be focused on the system as it exists at Wikimedia. In many ways, it already is (for example, this $ref section is very Wikimedia-specific no matter where we put it) and even if generic use develops, Wikitech would be a bad place to point generic users since it's so intentionally Wikimedia-specific. It seems like you're already writing good generic documentation in the repo readmes, so why not optimize the Wikitech documentation for using the tools within the Wikimedia context?

I do like the idea of adding the tests to the pre-commit hook! Very easy from the user's point of view. We would probably need a way to bypass tests though; if we do a mass import of current schemas there's going to be a lot that doesn't pass (like camel case field names).

Schemas can be skipped by [[ https://github.com/wikimedia/jsonschema-tools#jsonschema-tools-config-files | configuring .jsonschema-tools.yaml ]] with ignoreSchemas: ['/bad/schema/1.0.0'].

However, I'm not sure. I'm a little worried that git commit then failing because of an unexpected automated test might be confusing to users. But maybe they'd get used to it?

Yeah, there might be a little extra confusion for us analysts. But in comparison to the (ultimately worthwhile) confusion of adding Git and jsonschema-tools to the equation, it seems pretty minor and I think the benefit of having our work automatically checked for us, without having to read instructions on how to test, more than balances it out. The most important thing would be making the failure messages as clear as possible.

Yeah, fair point. In that case maybe change it to "should not be done (if you think you need to do it, get in touch with Analytics)."

Done.

Maybe that distinction would be a better organizing principle for that part of the doc?

Hm, that is a good idea. I think perhaps an 'analytics event platform tutorial' page will be better for this. How's that sound?

Well, this page already _is_ a tutorial, right? A page like that seems like it would contain a lot of duplicate information, and that would mean more confusion for users about which page to follow and more work updating information that exists in multiple places.

Neil, good points. I think you are right here. I will change this page to be more about using our schema repositories. Right now we just have the one for production events. When we get the analytics one up, I'll edit this page to use specific examples from our analytics schema repo.

Hm, that is a good idea. I think perhaps an 'analytics event platform tutorial' page will be better for this. How's that sound?

Ok @nshahquinn-wmf, I have done what I said I would do 8 months ago!

https://wikitech.wikimedia.org/wiki/Event_Platform/Instrumentation_How_To

Still a WIP, but it assumes a bit of knowledge of EventLogging and explains how to use the Event Platform components with it to create a new instrumentation stream. Please review and let me know what you think!
Also ping @Mayakp.wiki and @jlinehan and @mpopov for comments too.

I've still got tons of old EventLogging docs to update.
I'm going to re-title this task to be about reviewing Event Platform docs for product in general.

Ottomata renamed this task from Review draft Modern Event Platform schema guidelines to Review Event Platform instrumentation documentation for Product teams. May 20 2020, 8:26 PM
Ottomata updated the task description.
Ottomata added a project: Analytics-Kanban.
Ottomata renamed this task from Review Event Platform instrumentation documentation for Product teams to Write and update Event Platform instrumentation documentation for Product teams. May 20 2020, 8:28 PM
Ottomata moved this task from Next Up to In Progress on the Analytics-Kanban board.
Ottomata moved this task from In Progress to In Code Review on the Analytics-Kanban board.

Today I joined the Tech Documentation office hours. Notes: