
Updated schema strategy for analytics events
Open, LowPublic

Description

Having experimented with several approaches, we have concluded that a slightly tweaked convention for analytics schemas will help us grow and evolve our capabilities more smoothly as we transition to the new platform. These changes build on prior work on defining individual fragments and formulating a common schema, and tie them together in a way that should make our abstractions clearer and our conventions cleaner.

Until now, our general convention has been that:

  • every analytics schema must include a schema fragment called 'analytics/common'
  • an analytics schema may include a number of individual schema fragments for particular sets of fields.

Our updated convention will be:

  • Every analytics stream must either use a schema called analytics/base,
  • or use a schema that extends analytics/base with additional top-level fields.

This new analytics base schema does the following:

  • Includes a schema fragment called fragment/analytics/dimensions which contains all the standard 'dimension' fields that can be collected by the client library.
  • Contains a field for the event topic
  • Contains a field for the event action
  • Contains a multipurpose string:string map field for additional data
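
As a rough illustration of the shape such an event might take, the sketch below assembles an object with a topic, an optional action, a string:string map, and some dimension fields. The field names (`topic`, `action`, `custom_data`) and the dimension `page_title` are assumptions made for the example; the task does not state the actual property names used in analytics/base.

```javascript
// Illustrative only: the field names 'topic', 'action' and 'custom_data',
// and the dimension 'page_title', are assumptions, not taken from the schema.
function buildBaseEvent( topic, action, customData, dimensions ) {
	// The multipurpose map is string:string, so coerce every value to a string.
	const stringMap = {};
	for ( const [ key, value ] of Object.entries( customData || {} ) ) {
		stringMap[ key ] = String( value );
	}

	// Dimensions are copied first so they cannot overwrite the core fields.
	const event = Object.assign( {}, dimensions || {}, {
		topic: topic,
		custom_data: stringMap
	} );

	// The action is treated as optional here (an assumption).
	if ( action !== undefined ) {
		event.action = action;
	}
	return event;
}

const exampleEvent = buildBaseEvent(
	'my_action_topic',
	'my_action',
	{ message: 'Hello!', attempt: 2 },
	{ page_title: 'Main_Page' }
);
```

Coercing values to strings keeps the extra data within a string:string map; anything that needs a richer type would presumably belong in a schema that extends analytics/base with its own top-level fields.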

The concept of an event topic is designed to replace the idea of instrumentation producing events to a particular stream. Our system allows multiple streams to subscribe to the same events, and we have made this explicit by recognizing that events are not sent to streams; rather, streams subscribe to topics and receive the events for those topics.

The concept of an event action is unchanged from its historical usage. It is convenient to group related events together, especially when they are different steps in the same funnel or workflow, or components of the same product feature. This is what the action field is for. We considered eliminating it entirely and making every event its own topic, but were convinced that it was better to leave this pattern in place.

Streams will be able to:

  • Subscribe to topics (or certain actions in a topic), in order to pick out events of interest and receive them into a database table named after the stream.
  • Specify which of the dimensions from the dimensions fragment will be filled out by the client library.
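
The two stream capabilities above can be sketched as a small routing function. This is a minimal illustration, not the actual stream configuration format: the configuration shape (`subscriptions`, `actions`, `dimensions`), the stream names, and the dimension names are all hypothetical.

```javascript
// Hypothetical stream configuration; its shape and all names are assumptions.
const streamConfigs = {
	my_funnel_stream: {
		// Subscribes only to one action within a topic.
		subscriptions: [ { topic: 'my_action_topic', actions: [ 'my_action' ] } ],
		dimensions: [ 'page_title' ]
	},
	all_simple_events: {
		// No 'actions' list: subscribes to every event in the topic.
		subscriptions: [ { topic: 'my_simple_topic' } ],
		dimensions: []
	}
};

function isSubscribed( config, topic, action ) {
	return config.subscriptions.some( ( sub ) =>
		sub.topic === topic && ( !sub.actions || sub.actions.includes( action ) )
	);
}

// Returns one copy of the event per subscribed stream, filling in only the
// dimensions that the stream has asked for.
function routeEvent( configs, event, availableDimensions ) {
	const deliveries = {};
	for ( const [ streamName, config ] of Object.entries( configs ) ) {
		if ( !isSubscribed( config, event.topic, event.action ) ) {
			continue;
		}
		const copy = Object.assign( {}, event );
		for ( const name of config.dimensions ) {
			if ( Object.prototype.hasOwnProperty.call( availableDimensions, name ) ) {
				copy[ name ] = availableDimensions[ name ];
			}
		}
		deliveries[ streamName ] = copy;
	}
	return deliveries;
}
```

Because routing is keyed on the topic rather than on a stream name, several streams can receive the same event without the instrumentation code knowing about any of them.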

The calling interface will not change, but what we call the "stream name" today will become the "topic":

mw.eventLog.submit( 'my_simple_topic', { message: 'Hello!' } );

We will also support an optional argument naming the "action", when one is to be used:

mw.eventLog.submit( 'my_action_topic', 'my_action', { message: 'Hello!' } );
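
One way the two call forms could be told apart is by the type of the second argument. The sketch below is an assumption about how that might work, not the actual mw.eventLog implementation; `normalizeSubmitArgs` is a hypothetical helper.

```javascript
// Hypothetical helper: not part of mw.eventLog. It normalizes both call
// forms, submit( topic, data ) and submit( topic, action, data ), into a
// single object.
function normalizeSubmitArgs( topic, actionOrData, maybeData ) {
	if ( typeof actionOrData === 'string' ) {
		// Three-argument form: the second argument names the action.
		return { topic: topic, action: actionOrData, data: maybeData || {} };
	}
	// Two-argument form: the second argument is the event data.
	return { topic: topic, action: undefined, data: actionOrData || {} };
}

const simpleCall = normalizeSubmitArgs( 'my_simple_topic', { message: 'Hello!' } );
const actionCall = normalizeSubmitArgs( 'my_action_topic', 'my_action', { message: 'Hello!' } );
```

Dispatching on the argument type keeps existing two-argument callers working unchanged, which fits the stated goal that the calling interface will not change.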

Event Timeline

jlinehan created this task. Nov 9 2020, 8:03 PM

Change 640208 had a related patch set uploaded (by Jason Linehan; owner: Jason Linehan):
[schemas/event/secondary@master] analytics/base: Adds base analytics schema analytics/fragment/dimensions: Adds dimensions analytics schema.

https://gerrit.wikimedia.org/r/640208

LGoto moved this task from Triage to Tracking on the Product-Analytics board. Nov 10 2020, 6:09 PM
sdkim moved this task from Inbox to Doing on the Better Use Of Data board. Nov 16 2020, 5:33 PM

The concept of an event topic is designed to replace the idea of instrumentation producing events to a particular stream. Our system allows multiple streams to subscribe to the same events, and we have made this explicit by recognizing that events are not sent to streams; rather, streams subscribe to topics and receive the events for those topics.

Hm, we might want to find a different word to use than 'topic' to describe this. I think we'll get very confused with Kafka's concept of a topic. In our case, a stream is made up of multiple Kafka topics, which can be subscribed to by Kafka consumers. Perhaps 'subject' or something else is better?

Ottomata added a comment. Edited Dec 2 2020, 6:07 PM

Related: T263672: Figure out where stream/schema annotations belong (for sanitization and other use cases). @mpopov mentioned that you might want to use stream config to enable or disable the setting of particular base fields by client libraries; we might want to do something similar for server-side default settings, such as automatically filling in HTTP header values (see T263466).

sdkim moved this task from Doing to Sign-off on the Better Use Of Data board. Jan 5 2021, 8:46 PM
sdkim moved this task from Sign-off to Backlog on the Better Use Of Data board. Jan 13 2021, 7:23 PM
LGoto triaged this task as Low priority. Mon, Feb 8, 7:17 PM