Modern Event Platform (TEC2)
Open · Normal · Public · 0 Story Points

Description

This is a parent task for the work to be done for the FY2018-2019 Modern Event Platform Program (Q1 goals).

EventLogging is home grown, and was not designed for purposes other than low-volume analytics in MySQL databases. However, the ideas it was based on are solid and have independently become an industry standard, often called a Stream Data Platform. In the last two years, we have been developing the EventBus sub-system with the aim of standardizing events, so that they can be used internally to propagate changes to dependent artifacts as well as be exposed to clients. While this has been a success, integrating these events with different systems still requires a lot of custom, cumbersome glue code. Open source technologies exist for integrating and processing streams of events.

Engineering teams should be able to quickly develop features that are easy to instrument and measure, and those features should be able to react to events from other systems.

As a way to begin the process of understanding existing challenges with EventLogging, we have created the following document: https://docs.google.com/spreadsheets/d/1M1A4YEdlF0T79KgQO7g4_jpzNSe-XCn3lO0_TzhO6yQ/edit?ts=5ae7bc8a#gid=0. This document is meant to list out all the steps to instrumenting and analyzing with EventLogging, indicate which ones are the most time-consuming and error-prone, identify which teams participate, and be specific about the challenges in each step.

This program also overlaps with the Better Use of Data program. See also https://docs.google.com/spreadsheets/d/16cALJVeql2euSad3GgXJjDCOVYsBRC64ietw8oRzsbI/edit#gid=0

Background Reading

Components

Each of the components described below is a unit of technical output of this program: either a deployed service/tool, or documentation and policy guidelines.

Let's first define a couple of terms before the individual technical components are detailed below.

  • Event - A strongly typed and schemaed piece of data, usually representing something happening at a definite time. E.g. revision-create, user-button-click, page-load, etc.
  • Stream - A contiguous (often unending) collection of events (loosely) ordered by time.
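For illustration, a hypothetical user-button-click event might look like the following. The field names here are illustrative only, not a committed schema:

// A hypothetical event; field names and values are illustrative only.
const event = {
    meta: {
        schema: 'user-button-click',    // which schema this event conforms to
        stream: 'ui.button-click',      // which stream it is produced to
        dt: '2018-09-24T17:35:00Z',     // event time, ISO 8601
        id: 'f7a1c2e0-c02a-11e8-a355-529269fb1459'  // unique event id
    },
    button: 'save',
    page_id: 12345
};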
Stream Intake Service

Intake of event streams from internal and external clients (browsers & apps). EventLogging + EventBus do some of this already, but are limited in scope and scale. We will either refactor EventLogging, or use something new.
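As a rough sketch of what producing to such a service could look like from a browser, assuming an HTTP endpoint that accepts batches of JSON events (the URL and payload shape are made up; defining the real intake API is part of this component's work):

// Produce a batch of events (e.g. the example event above) over HTTP.
fetch( 'https://intake.example.org/v1/events', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify( [ event ] )  // batching several events per request should be supported
} ).then( function ( res ) {
    if ( !res.ok ) {
        // Invalid events should be rejected here, ideally with a useful error.
        console.error( 'Event intake rejected the batch:', res.status );
    }
} );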

Event Schema Registry

This should combine the existing EventLogging schemas with mediawiki/event-schemas. It might be something new, or it might be improvements to the existing MediaWiki-based schema repository.
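For example, a minimal JSONSchema for the hypothetical user-button-click event above might look like this (illustrative only; real schemas live on the EventLogging schema wiki and in mediawiki/event-schemas):

// A hypothetical JSONSchema, written here as a JavaScript object literal.
const buttonClickSchema = {
    title: 'user-button-click',
    type: 'object',
    properties: {
        button: { type: 'string' },
        page_id: { type: 'integer' }
    },
    required: [ 'button' ]
};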

Event Schema Guidelines

Some guidelines already exist for analytics purposes, and some for mediawiki/event-schemas. We should unify these.

Stream Connectors for ingestion to and from various state stores

(MySQL, Redis, Druid, Cassandra, HDFS, etc.) This will likely be Kafka Connect. We will need to adapt Kafka Connect to work with JSONSchemas and our Event Schema Repository.

Stream Configuration Service

Product needs the ability to have more dynamic control over how client-side producers of events are configured. This includes things like sampling rate, time-based event-producing windows, etc. (This component was originally conceived of as part of the Event Schema Repository component. It is complex and architecturally different enough to warrant its own component here.)
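A very rough sketch of how a client-side producer might consult such a service before emitting events; the endpoint and config fields are hypothetical, since this component is not designed yet:

// Fetch the configuration for a stream, then decide whether to emit an event.
fetch( 'https://streamconfig.example.org/v1/streams/ui.button-click' )
    .then( function ( res ) { return res.json(); } )
    .then( function ( config ) {
        // e.g. config = { sampling_rate: 0.01, active_until: '2019-07-01T00:00:00Z' }
        var active = new Date() < new Date( config.active_until );
        if ( active && Math.random() < config.sampling_rate ) {
            // produce the event via the Stream Intake Service
        }
    } );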

Stream Processing system (with dependency tracking system conceptual design)

Engineers should have a standardized way to build, deploy, and maintain stream processing jobs, for both analytics and production purposes. A very common use of stream processing at WMF is change-propagation, which, to do well, requires a dependency tracking mechanism (a very long term goal). We want to choose stream processing technologies that work toward this goal.

This component is the lowest priority of the Modern Event Platform, and as such will have more thought and planning towards the end of the program.

Timeline

NOTE: This timeline is very WIP, and is only a guess of when progress will be made on different components.

FY2017-2018
  • Q4: Interview product and technology stakeholders to collect desires, use cases, and requirements.
FY2018-2019
  • Q1: Survey and choose technologies and solutions with input from Services and Operations.
  • Q2: Begin implementation and deployment of some chosen techs.
  • Q3: Deployment of Stream Intake service for production use.
  • Q4: Deployment of Stream Ingestion service for Hadoop intake (Kafka Connect with JSONSchema support).
  • Q4: Deployment of Stream Intake Service for analytics use.
FY2019-2020
  • Development of Stream Intake Service JavaScript client for remote clients.
  • Development of Stream Config Service and GUI.
  • Stream processing development and deployment?

Use case collection

  • JADE for ORES
  • Fundraising banner impressions pipeline
  • WDQS state updates
  • Job Queue (implementation ongoing)
  • Frontend Cache (varnish) invalidation
  • Scalable EventLogging (with automatic visualization in tools (Pivot, etc.))
  • Realtime SQL queries and state store updates. Can be used to verify in real time that events are valid and contain what they should.
  • Trending pageviews & edits
  • Mobile App Events
  • ElasticSearch index updates incorporating new revisions & ORES scores
  • Automatic Prometheus metric transformation and collection
  • Dependency tracking transport and stream processing
  • Stream of reference/citation events: https://etherpad.wikimedia.org/p/RefEvents
  • JavaScript errors

(...add more as collected!)

WIP Diagram Here: https://www.lucidchart.com/documents/view/2f3d5337-d393-4229-8e26-2f49283fc63a/0

Nuria edited projects, added Analytics-Kanban; removed Analytics.Mar 8 2018, 6:39 PM
Ottomata renamed this task from Eventlogging of the Future to Modern Event Platform (AKA Event(Logging) of the Future (EoF)).Apr 27 2018, 7:00 PM
Ottomata claimed this task.
Ottomata updated the task description. (Show Details)
Ottomata updated the task description. (Show Details)
Ottomata updated the task description. (Show Details)
Ottomata added subscribers: Nuria, Milimetric, mobrovac and 3 others.
Ottomata updated the task description. (Show Details)Apr 27 2018, 7:11 PM
Ottomata updated the task description. (Show Details)
Ottomata triaged this task as Normal priority.Apr 27 2018, 8:02 PM
elukey added a subscriber: elukey.Apr 28 2018, 5:32 AM
Milimetric moved this task from Pageview API and AQS to Incoming on the Analytics board.
Pchelolo moved this task from Backlog to watching on the Services board.Apr 30 2018, 8:42 PM
Pchelolo edited projects, added Services (watching); removed Services.
Ottomata renamed this task from Modern Event Platform (AKA Event(Logging) of the Future (EoF)) to Modern Event Platform (with EventLogging of the Future (EoF)).May 1 2018, 7:40 PM
Milimetric set the point value for this task to 0.
Ottomata updated the task description. (Show Details)May 3 2018, 6:22 PM
MMiller_WMF updated the task description. (Show Details)May 3 2018, 6:40 PM
Tbayer added a subscriber: Tbayer.May 30 2018, 8:10 PM
phuedx added a subscriber: phuedx.Jun 5 2018, 6:43 PM
Ottomata added a comment.EditedJun 29 2018, 7:35 PM

Ok! Q4 is over, and we've completed the interview process. Marshall, Dan and I spoke with analysts, product managers, and product and tech department engineers. My messy notes are here. I'll try and summarize some of the important and interesting bits that we should make sure we consider as we work on this program over the next year.

I'll categorize these distilled considerations into two groups: Analytics and Platform.

Analytics considerations

  • Intake should be backwards compatible. mw.log() should continue to work from client side.
  • Should this program include generic non-MW libraries (JavaScript, PHP, Python, etc.) to emit events, with a client-side JavaScript one being the most important?
  • It is currently difficult to debug EventLogging events. Analysts want a browser developer plugin to view the events they emit.
  • Introspectability/Visibility: I have sent an event, I want to know if it made it through, or if not, where did it fail?
  • Want: Schema usage (aka topic) meta data should be queryable, configurable, and visible in a GUI. Product can work on UI, but where to store metadata?
    • Schema -> topic mapping (and topic configs? see platform considerations)
    • Per topic Privacy whitelisting (e.g. this)
    • event data governance (see below)
    • PII inclusion configuration (see below)
    • Sampling rate (see below)
    • Advanced types of targeting: by country, edit count, etc.
  • Data governance: who owns what event data? When should events be turned off? Should this be automated with an expiration date?
  • For (analytics) schema design conventions: How would we build a funnel? Breadcrumbs?
  • Need a configurable (privacy-focused) way of associating disparate events. session_id, unique_user_id, request_id, etc. are all very, very useful, but might have privacy concerns. We should support as much as we can.
  • Would be good to configurably not collect PII (IP, etc.) if we don't need it. The Community Liaison team would like to be able to say 'we are not collecting PII at all for event type X'.
  • Sampling. (There were many comments about sampling headaches.)
    • Sampling rate should be easily changeable, without code deployment.
    • It should be easy to know the sampling rate of event data at a given time. Perhaps sampling rate should automatically be included in event meta when event is produced?
    • Sampling parameters should be configurable, not just random on page load (a minimal sketch follows this list).
      • Consistent hash field sampling, e.g. hash IP + userAgent and mod 1000.
      • Random hash field sampling, e.g. hash IP + userAgent and choose randomly.
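A minimal sketch of the consistent hash-field sampling idea above; the hash function and field choices are illustrative only:

// Small non-cryptographic string hash (FNV-1a); any stable hash would do.
function fnv1a( str ) {
    var hash = 0x811c9dc5, i;
    for ( i = 0; i < str.length; i++ ) {
        hash ^= str.charCodeAt( i );
        hash = Math.imul( hash, 0x01000193 ) >>> 0;
    }
    return hash;
}

// Consistent sampling: keep roughly 1-in-rate clients, and always the same ones.
function inSample( key, rate ) {
    return fnv1a( key ) % rate === 0;
}

// e.g. sample 1 in 1000 clients by IP + user agent:
// if ( inSample( ip + userAgent, 1000 ) ) { logEvent( event ); }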

Platform considerations

  • Do we need auth for new public event publish APIs?
  • Let’s at least consider if we should build a low scale generic event system for small 3rd party Mediawiki installs.
  • Will we need fine grained auth for consuming events from Kafka internally?
  • The schema repo / Event Data Manager should have topic configuration (# partitions, schema mapping, etc.) and be queryable.
  • Really need a reliable way for MediaWiki to emit events. EventBus might not be good enough (see also T120242: Reliable (atomic) MediaWiki event production). MySQL really is the canonical store for MW data. What else?
    • MySQL binlog?
    • Write events to a MySQL table, then poll it and emit events from it, or use the binlog. (A sketch of the polling approach follows this list.)
  • Consider volunteer developer happiness too.
    • Want Kafka in Cloud VPS with production (MW) events (like LabsDB but streams). T187225
    • Want stream processing in Cloud VPS (like Quarry but for streams).
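To make the table-polling idea above concrete, here is a rough sketch. The library choices (mysql2, kafkajs) and the table, column, and topic names are assumptions for illustration, not a design:

// Poll an "outbox" table that MediaWiki writes events to in the same MySQL
// transaction as the change they describe, and emit those events to Kafka.
const mysql = require( 'mysql2/promise' );
const { Kafka } = require( 'kafkajs' );

async function pollOutboxOnce( db, producer ) {
    const [ rows ] = await db.query(
        'SELECT id, topic, payload FROM event_outbox ORDER BY id LIMIT 100'
    );
    for ( const row of rows ) {
        await producer.send( { topic: row.topic, messages: [ { value: row.payload } ] } );
        // Only delete after Kafka has acknowledged the event.
        await db.query( 'DELETE FROM event_outbox WHERE id = ?', [ row.id ] );
    }
}

async function main() {
    const db = await mysql.createConnection( { host: 'localhost', user: 'wiki', database: 'wiki' } );
    const producer = new Kafka( { clientId: 'mw-outbox-poller', brokers: [ 'localhost:9092' ] } ).producer();
    await producer.connect();
    setInterval( function () { pollOutboxOnce( db, producer ); }, 1000 );
}

main();

This gives at-least-once delivery (a crash between send and delete can produce duplicates), which is usually an acceptable tradeoff for change events.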

Thoughts

I have two big takeaways about components for this program that I hadn't considered before.

The first is the need for a really comprehensive event data management service/site. This includes creating and editing schemas, but also event metadata around the usages of events. Here, a 'usage' would be a specific topic in Kafka (with a mapped schema), and 'event metadata' would mean things like sampling rates, data governance, etc. It isn't yet clear to me if an Event Data Manager is the same service as the schema repository. But they do have some overlapping requirements, so it might be good to combine them. Of course we wouldn't need all of these features in a first version, and many of them might be built as part of the program this FY. We should explore to see if there are any existing open source services like this. (There is LinkedIn's WhereHows, but that might be too complex and analytics-only focused for this program.)

The second is the need for event producing client libraries for this platform. The client libraries would need to know how to do things like query the schema repo / Event Data Manager, as well as how to construct and produce events.

Ottomata updated the task description. (Show Details)Jun 29 2018, 8:10 PM

Thanks for the great summary, @Ottomata!

  • It is currently difficult to debug EventLogging events. Analysts want a browser developer plugin to view the events they emit.

I would just add that another difficulty of debugging can be forcing yourself into the sample bin (particularly if the sample rate is very low).

Seconded! Thanks, @Ottomata!

It is currently difficult to debug EventLogging events. Analysts want a browser developer plugin to view the events they emit.

Such a thing exists but likely needs a little attention. It was introduced by Erik Bernhardson in order to "[show] how an eventlogging schema works to other team members such as analysts that will be reviewing the data."

Here's an example:

Notes

  • It's enabled by setting the hidden eventlogging-display-web user option to 1. This introduces some caveats:
  • Because it's a user option, it can only be enabled for logged in users
  • Because it's a hidden user option it can only be enabled via the MediaWiki API…
  • The user option is per-user per-wiki, i.e. setting the option on enwiki doesn't set it on another wiki, e.g. dewiki
  • In order to enable it you'll need to run the following snippet in your browser's console:
mw.loader.using('mediawiki.api.options')
    .then(() => new mw.Api().saveOption('eventlogging-display-web', '1'));

T188640: Make it easier to enable EventLogging's debug mode covers making it easier to enable.

Ottomata updated the task description. (Show Details)Jul 2 2018, 2:28 PM
mobrovac updated the task description. (Show Details)Jul 5 2018, 10:45 AM
Ottomata added a comment.EditedJul 9 2018, 2:45 PM

T198906: EventLogging in Hive data loss due to Camus and Kafka timestamp.type=CreateTime change just made me think of another need: auditing. At the very least, we should have something that simply counts the number of messages per topic (or event-topic grouping) per hour, and emits them to another topic. This would make it easy to write a verification/monitoring job to alert if events don't show up in expected places.

T198906: EventLogging in Hive data loss due to Camus and Kafka timestamp.type=CreateTime change Just made me think of another need: auditing. At the very least, we should have a stream processing job that simply counts the number of messages per topic (or event-topic grouping) per hour, and emits them to another topic. This would make it easy to write a verification/monitoring job to alert if events don't show up in expected places.

At least for EventBus and JobQueue currently we achieve these alerts via grafana alerts.

Ottomata renamed this task from Modern Event Platform (with EventLogging of the Future (EoF)) to Modern Event Platform.Aug 2 2018, 6:38 PM
CCicalese_WMF updated the task description. (Show Details)Aug 2 2018, 9:33 PM
CCicalese_WMF renamed this task from Modern Event Platform to Modern Event Platform (TEC2).Aug 21 2018, 4:26 PM
Ottomata added a comment.EditedSep 24 2018, 5:52 PM

At the Analytics Engineering offsite last week, we were talking about how the current naming of the various Modern Event Platform components is confusing. There is no unifying name or purpose; they are all just descriptive names.

One of the descriptive and confusing concepts I mention is a 'schema topic usage'. EventLogging users are used to referring to 'schemas' and streams of events that use those schemas as the same thing. I need to discourage this habit, as they are not the same thing. Often for analytics events, the schema and the schema usage will map one to one; i.e. there will only be a single usage of a single schema. Additionally, a schema usage itself doesn't even necessarily map to a single Kafka topic. E.g. there are multiple per-datacenter topics for mediawiki.revision-create. In the public EventStreams service, we refer to the composite topics that make up a single semantic set of events as a stream. I think we should do the same for Modern Event Platform. A 'schema topic usage' is a particular stream of events. (Note that a single stream may be made up of multiple Kafka topics, as is the case with the EventBus datacenter-prefixed topics.)

'Schema metadata service' is another confusing name. This refers to the service that the Product team wants to use to configure streams. This includes configuring topic-to-schema mappings, client-side sampling rates, etc. This component was originally going to be part of the schema repository/registry. Since gathering use cases, it has become much more complicated, and will be a separate component. It is still ill defined, but ultimately what Product wants is a way to configure various parts of the production of streams of events. A better name for this service would be a Stream Configuration Service.

I had originally drafted this Program under the name 'Stream Data Platform', and I'd like to bring back the usage of the term 'Stream' in the technical documents. A stream is a contiguous sequence of schemaed events. A stream maps to one or more Kafka topics, and its events all conform to a single schema.

I'm renaming the components as follows:

  • Stream Intake (previously Scalable Event Intake)
  • Stream Connectors
  • Stream Processing
  • Stream Configuration
  • Event Schema Repository
  • Event Schema Guidelines

Note that the Event Schema Repository does not refer to the stream concept, as it applies at the event level. Streams will have events that all conform to a schema, but the event schemas do not need to know anything about stream configurations.

I'm going to edit this parent task, and also the wording in existing sub tasks.

Ottomata updated the task description. (Show Details)Sep 24 2018, 5:58 PM
Ottomata updated the task description. (Show Details)Sep 24 2018, 6:05 PM
Ottomata updated the task description. (Show Details)Sep 24 2018, 6:30 PM

One of the descriptive and confusing concepts I mention is a 'schema topic usage'. EventLogging users are used to refering to 'schemas' and streams of events that use those schemas as the same thing. I need to discourage this habit, as they are not the same thing.

👏 👏 👏

I've been working on T202437 recently and saying things like "the visual editor doesn't log these events to the Edit schema". It's been in the back of my mind that this doesn't really make sense, since a schema is a data model, not a log or database, but I wasn't confident about what terminology would be better. So I'm excited to have some best practices like these!

So let me see if I understand your proposal correctly: I might write a schema describing the data I want to capture, which an engineer would implement by writing code that emits events. As these events flow through EventLogging, Kafka, and the rest of the Analytics Event Empire (technical term), they make up an event stream (which may be channeled through multiple Kafka topics). Eventually, they are written to a Hadoop table, at which point they make up an event log. Am I on the right track here?

Yes, on the right track for sure! I'll add that we will be using event 'stream' to very technically refer to any semantically grouped set of topics in Kafka. Once the data is written out to a static resting place, like Hive or MySQL or Cassandra or a log file or whatever it may be, I'd just call it event data. But even more so, I like Confluent and Flink's position: a static event dataset is really just a time-bounded stream. So while we may not usually refer to the static files in Hadoop as streams, they can be philosophically thought of as streams too.

Anyway I don't think I would use the term 'event log' explicitly to refer to the data in Hadoop or elsewhere. Perhaps event data(set) would be more appropriate there.

Ottomata updated the task description. (Show Details)Sep 25 2018, 4:24 PM
mpopov added a subscriber: mpopov.Sep 25 2018, 7:41 PM
Tgr updated the task description. (Show Details)Sep 29 2018, 2:17 AM
Tgr added a subscriber: Tgr.Sep 29 2018, 2:22 AM

Added Javascript errors to the use cases, per the RFC IRC discussion. Specifics:

  • Largish events (due to stack traces) - not huge but well over the current GET size limit.
  • Event volume is impossible to predict or control. Normally very low, if something goes wrong then one or more event per pageview.
  • Events need to go to a non-standard location - Logstash or maybe Sentry if it exists.
  • Schema verification nice to have but not really needed.
  • Some kind of deduplication in case of high volume (generate an error hash on the client side and try to discard errors whose hash was seen already, as opposed to sampling, which would record the most frequent error and not record the rest at all). A minimal sketch follows this list.
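A minimal sketch of the client-side deduplication idea; the hash and the reporting call are placeholders:

// Report each distinct JavaScript error at most once per page load.
var seenErrors = new Set();

function hashString( str ) {
    var h = 0, i;
    for ( i = 0; i < str.length; i++ ) {
        h = ( Math.imul( h, 31 ) + str.charCodeAt( i ) ) | 0;
    }
    return h;
}

window.addEventListener( 'error', function ( e ) {
    var key = hashString( String( e.message ) + String( e.error && e.error.stack ) );
    if ( seenErrors.has( key ) ) {
        return; // this error was already reported from this page
    }
    seenErrors.add( key );
    // reportError( e ) would produce the error event to the intake service
} );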
phuedx added a comment.Oct 1 2018, 1:04 PM
  • Largish events (due to stack traces) - not huge but well over the current GET size limit.
  • Event volume is impossible to predict or control. Normally very low, if something goes wrong then one or more event per pageview.

Thanks to whoever brought those points up.

If the answer to 1 is to use POST requests to submit _certain_ events via the /topics endpoint (taken from the diagram in the description), then it follows that we could batch-send several per-page events in one request.

AIUI this falls out of the

As an engineer, I want to batch produce many events at once so mobile apps can produce events after an offline period.

story in T201068: Modern Event Platform: Stream Intake Service.

Ottomata updated the task description. (Show Details)Oct 3 2018, 7:35 PM
Ottomata updated the task description. (Show Details)Wed, Dec 5, 4:55 PM
Ottomata updated the task description. (Show Details)Wed, Dec 5, 6:22 PM
Ottomata updated the task description. (Show Details)
Ottomata updated the task description. (Show Details)
Ottomata moved this task from Backlog to Parent Tasks on the EventBus board.Wed, Dec 5, 10:06 PM