
Modern Event Platform (TEC2)
Open · Medium · Public · 0 Estimated Story Points

Description

This is a parent task for the work to be done for the Modern Event Platform Program.

EventLogging is home-grown and was not designed for purposes other than low-volume analytics in MySQL databases. However, the ideas it was based on are solid and have convergently become an industry standard, often called a Stream Data Platform. In the last two years, we have been developing the EventBus sub-system with the aim of standardizing events, to be used both internally for propagating changes that update dependent artifacts and for exposing them to clients. While this has been a success, integrating these events with different systems requires a lot of custom, cumbersome glue code. Open source technologies exist for integrating and processing streams of events.

Engineering teams should be able to quickly develop features that are easy to instrument and measure, and those features should be able to react to events from other systems.

As a way to begin the process of understanding existing challenges with EventLogging, we have created the following document: https://docs.google.com/spreadsheets/d/1M1A4YEdlF0T79KgQO7g4_jpzNSe-XCn3lO0_TzhO6yQ/edit?ts=5ae7bc8a#gid=0. This document is meant to list out all the steps to instrumenting and analyzing with EventLogging, indicate which ones are the most time-consuming and error-prone, identify which teams participate, and be specific about the challenges in each step.

This program also overlaps with the Better Use of Data program. See also https://docs.google.com/spreadsheets/d/16cALJVeql2euSad3GgXJjDCOVYsBRC64ietw8oRzsbI/edit#gid=0.

For some historical context see the slides at Event Infrastructure at WMF (2018).

Background Reading

Components

Each of the components described below is a unit of technical output of this program. They are either deployed services/tools, or documentation and policy guidelines.

Let's first define a couple of terms before the individual technical components are detailed below.

  • Event - A strongly typed and schemaed piece of data, usually representing something happening at a definite time. E.g. revision-create, user-button-click, page-load, etc.
  • Stream - A contiguous (often unending) collection of events (loosely) ordered by time.
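As a concrete (and purely hypothetical) illustration of these two terms, an event is a schemaed JSON document with a timestamp, and a stream is a time-ordered collection of them:

```javascript
// A hypothetical schemaed event. The field names ($schema, meta.stream,
// meta.dt) follow the conventions discussed in this task, but this is an
// illustrative example, not a real production schema.
const event = {
  $schema: '/test/event/1.0.0',   // the schema this event conforms to
  meta: {
    stream: 'test.event',         // the stream this event belongs to
    dt: '2019-01-01T00:00:00Z'    // when it happened
  },
  action: 'user-button-click'
};

// A stream is then a (loosely) time-ordered collection of such events.
const stream = [event /* , ...more events over time */];
```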
Stream Intake Service

A service for accepting events from internal and external clients (browsers & apps). EventLogging + EventBus do some of this already, but are limited in scope and scale. This is EventGate.
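For a sense of what intake looks like in practice, here is a sketch of a client producing a batch of events to an EventGate-style HTTP endpoint. The URL, stream name, and schema URI are placeholders; consult EventGate's own documentation for its actual API.

```javascript
// Sketch of producing events to an EventGate-style intake service.
// INTAKE_URL is a placeholder, not a real endpoint.
const INTAKE_URL = 'https://intake.example.org/v1/events';

// Wrap event data with the schema URI and stream metadata.
function makeEvent(stream, schemaUri, data) {
  return {
    $schema: schemaUri,
    meta: { stream, dt: new Date().toISOString() },
    ...data
  };
}

// POST an array of events in one request; accepting batches is what lets
// e.g. mobile apps flush queued events after an offline period.
async function produce(events) {
  const res = await fetch(INTAKE_URL, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(events)
  });
  return res.status;
}
```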

Event Schema Registry

This is comprised of several git repositories, all pulled together and easily accessible over a simple HTTP service / filebrowser. It may eventually also have a nice GUI.
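To illustrate the "simple HTTP service / filebrowser" idea: relative schema URIs could resolve to plain files under a registry base URL. This is a hypothetical sketch; the base URL and paths are placeholders.

```javascript
// Hypothetical: resolve a relative schema URI against a registry base URL.
// The base URL below is a placeholder.
const REGISTRY_BASE = 'https://schemas.example.org';

function schemaUrl(schemaUri) {
  // e.g. '/test/event/1.0.0' -> 'https://schemas.example.org/test/event/1.0.0'
  return REGISTRY_BASE + schemaUri;
}
```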

Event Schema Guidelines

Some exist already for analytics purposes, some exist for mediawiki/event-schemas. We should unify these.

Stream Connectors for ingestion to and from various state stores

(MySQL, Redis, Druid, Cassandra, HDFS, etc.) This will likely be Kafka Connect. We will need to adapt Kafka Connect to work with JSONSchemas and our Event Schema Repository.
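As a sketch of what a connector would look like: Kafka Connect sinks are configured declaratively via its REST API. The property names below follow Confluent's HDFS sink connector documentation, but the topic list, URL, and sizes are placeholder values.

```javascript
// Sketch of a Kafka Connect HDFS sink connector configuration, i.e. the
// JSON body POSTed to Connect's REST API. Values are placeholders.
const hdfsSinkConfig = {
  name: 'hdfs-sink-test-event',
  config: {
    'connector.class': 'io.confluent.connect.hdfs.HdfsSinkConnector',
    'tasks.max': '4',
    'topics': 'eqiad.test.event,codfw.test.event',
    'hdfs.url': 'hdfs://namenode.example.org:8020',
    'flush.size': '10000'
    // Adapting Connect to JSONSchemas would mean plugging in a custom
    // value.converter that looks schemas up in the Event Schema Registry.
  }
};
```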

Stream Configuration Service

Product needs the ability to have more dynamic control over how client side producers of events are configured. This includes things like sampling rate, time based event producing windows etc. (This component was originally conceived of as part of the Event Schema Repository component. It is complex and architecturally different enough to warrant its own component here.)
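For example (with a purely hypothetical config shape and field names), a client-side producer might consult stream configuration before emitting an event:

```javascript
// Hypothetical stream configuration: the shape and field names here are
// illustrative, not a real API.
const streamConfig = {
  'test.event': { sample_rate: 0.01 } // emit ~1% of these events
};

// Decide whether this client is in the sample for a given stream.
function inSample(streamName, random = Math.random()) {
  const cfg = streamConfig[streamName];
  if (!cfg) {
    return false; // unconfigured streams produce nothing
  }
  return random < cfg.sample_rate;
}
```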

Stream Processing system with dependency tracking system conceptual design

Engineers should have a standardized way to build, deploy, and maintain stream processing jobs, for both analytics and production purposes. A very common use of stream processing at WMF is change propagation, which, to do well, requires a dependency tracking mechanism, a very long-term goal. We want to choose stream processing technologies that work toward this goal.

This component is the lowest priority of the Modern Event Platform, and as such will have more thought and planning towards the end of the program.


Timeline

FY2017-2018
  • Q4: Interview product and technology stakeholders to collect desires, use cases, and requirements.

FY2018-2019
  • Q1: Survey and choose technologies and solutions with input from Services and Operations.
  • Q2: Begin implementation and deployment of some chosen techs.
  • Q3: Deployment of eventgate-analytics stream intake service - T206785
  • Q4: Deployment of eventgate-main stream intake service - T218346
  • Q4: Decommission Avro streams in favor of eventgate-analytics JSON based ones, T188136
  • Q4: (new) CI support for event schemas repo - T206814

FY2019-2020
Stream Intake Service - T201068

Migrate Mediawiki EventBus events to eventgate-main & deprecate eventlogging-service-eventbus

  • Q1: Continue migrating events to eventgate-main - T211248
  • Q2: Decommission eventlogging-service-eventbus (Done in Q1)
Event Schema Registry - T201063
  • Q1: Schema repository hooks to generate dereferenced canonical version - T206812
  • Q2: Support $ref in JSONSchemas - T206824
  • Q2/Q3: Set up public HTTP endpoint - T233630
  • Q2/Q3: Create a new 'analytics' schema repository
Stream Configuration Service - T205319
  • Q1: start planning with Audiences - Design Document
  • Q2: implementation prototype - T233634
  • Q3: Deployment and use by EventLogging and eventgate-analytics-external
Replace EventLogging Analytics

This is a long term project to be worked on in collaboration with Audiences engineers which includes work on the Event Schema Repositories and Event Stream Configuration Service components.

  • Q1: Begin planning this work with Audiences - Design Document
  • Q2: Coding work on all of these pieces (e.g. client side library to use Stream Config and POST to eventgate) - T228175
  • Q2-Q4: deployment of Stream Config Service and some usages of external eventgate
  • Q4: Begin migrating existent EventLogging streams to EventGate - T238230 and T238138

See also: T225237: Better Use of Data

Stream Connectors

NOTE: 2019-09: This work is stalled due to licensing issues with Confluent's HDFS Connector

  • Q1: Kafka Connect development work (Kubernetes? YARN? Standalone?) - T223626
  • Q2: Kafka Connect deployment
  • Q2-Q4: Replace usages of Camus HDFS with Kafka Connect HDFS - T223628
Stream Processing System & Dependency Tracking

NOTE: 2019-11: This work is stalled due to lack of owner for dependency tracking
Work for next year:

  • Collect basic requirements
  • Figure out if a streaming platform + graph db support basic requirements at scale

Use case collection

  • JADE for ORES
  • Fundraising banner impressions pipeline
  • WDQS state updates
  • Job Queue (implementation ongoing)
  • Frontend Cache (varnish) invalidation
  • Scalable EventLogging (with automatic visualization in tools (Pivot, etc.))
  • Realtime SQL queries and state store updates. Can be used to verify in real time that events are valid and contain what they should
  • Trending pageviews & edits
  • Mobile App Events
  • ElasticSearch index updates incorporating new revisions & ORES scores
  • Automatic Prometheus metric transformation and collection
  • Dependency tracking transport and stream processing
  • Stream of reference/citation events: https://etherpad.wikimedia.org/p/RefEvents
  • Client side error logging rate limiting and de-duping via Stream Processing - T217142

(...add more as collected!)

  • Stream processing: Filtering the edit text stream for specific keywords
  • Stream processing: diff stream
  • Stream processing: revision token stream, for ORES and for search.
  • Stream processing: realtime historical data endpoint T240387: MW REST API Historical Data Endpoint Needs

WIP Diagram Here: https://www.lucidchart.com/documents/view/ca3f0d6b-9b45-4524-aed7-299e38908d0f


Event Timeline


Thanks for the great summary, @Ottomata!

  • It is currently difficult to debug EventLogging events. Analysts want a browser developer plugin to view the events they emit.

I would just add that another difficulty of debugging can be forcing yourself into the sample bin (particularly if the sample rate is very low).

Seconded! Thanks, @Ottomata!

It is currently difficult to debug EventLogging events. Analysts want a browser developer plugin to view the events they emit.

Such a thing exists but likely needs a little attention. It was introduced by Erik Bernhardson in order to "[show] how an eventlogging schema works to other team members such as analysts that will be reviewing the data."


Notes

  • It's enabled by setting the hidden eventlogging-display-web user option to 1. This introduces some caveats:
      • Because it's a user option, it can only be enabled for logged-in users
      • Because it's a hidden user option, it can only be enabled via the MediaWiki API…
      • The user option is per-user, per-wiki, i.e. setting the option on enwiki doesn't set it on dewiki
  • In order to enable it, you'll need to run the following snippet in your browser's console:
mw.loader.using('mediawiki.api.options')
    .then(() => new mw.Api().saveOption('eventlogging-display-web', '1'));

T188640: Make it easier to enable EventLogging's debug mode covers making it easier to enable.

Ottomata added a comment. (Edited) Jul 9 2018, 2:45 PM

T198906: EventLogging in Hive data loss due to Camus and Kafka timestamp.type=CreateTime change just made me think of another need: auditing. At the very least, we should have a stream processing job that simply counts the number of messages per topic (or event-topic grouping) per hour, and emits them to another topic. This would make it easy to write a verification/monitoring job to alert if events don't show up in expected places.
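The aggregation described above is easy to sketch (framework aside): a real job would run in a stream processor, but the logic is just a keyed hourly count. Field names here are illustrative:

```javascript
// Sketch of the auditing idea: count messages per topic per hour.
// Each resulting entry would be emitted as an event to a counts topic.
function hourBucket(isoTimestamp) {
  return isoTimestamp.slice(0, 13); // '2019-01-01T05:10:00Z' -> '2019-01-01T05'
}

function countPerTopicPerHour(messages) {
  const counts = {};
  for (const { topic, dt } of messages) {
    const key = `${topic}/${hourBucket(dt)}`;
    counts[key] = (counts[key] || 0) + 1;
  }
  return counts;
}
```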

At least for EventBus and the JobQueue, we currently achieve these alerts via Grafana.

Ottomata renamed this task from Modern Event Platform (with EventLogging of the Future (EoF)) to Modern Event Platform. Aug 2 2018, 6:38 PM
CCicalese_WMF renamed this task from Modern Event Platform to Modern Event Platform (TEC2). Aug 21 2018, 4:26 PM
Ottomata added a comment. (Edited) Sep 24 2018, 5:52 PM

At the Analytics Engineering offsite last week, we were talking about how the current naming of the various Modern Event Platform components is confusing. There is no unifying name or purpose; they are all just descriptive names.

One of the descriptive and confusing concepts I mention is a 'schema topic usage'. EventLogging users are used to referring to 'schemas' and streams of events that use those schemas as the same thing. I need to discourage this habit, as they are not the same thing. Often for analytics events, the schema and the schema usage will map one to one; i.e. there will only be a single usage of a single schema. Additionally, a schema usage itself doesn't even necessarily map to a single Kafka topic. E.g. there are multiple per-datacenter topics for mediawiki.revision-create. In the public EventStreams service, we refer to the composite topics that make up a single semantic set of events as a stream. I think we should do the same for Modern Event Platform. A 'schema topic usage' is a particular stream of events. (Note that a single stream may be made up of multiple Kafka topics, as in the case with the EventBus datacenter-prefixed topics.)

'Schema metadata service' is another confusing name. This refers to the service that the Product team wants to use to configure streams. This includes configuring topic to schema mappings, client side sampling rates, etc. This component was originally going to be part of the schema repository/registry. Since gathering use cases, it has become much more complicated, and will be a separate component. It is still ill defined, but ultimately what Product wants is a way to configure various parts of the production of streams of events. A better name for this service would be a Stream Configuration Service.

I had originally drafted this Program under the name 'Stream Data Platform', and I'd like to bring back the usage of the term 'Stream' in the technical documents. A stream is a contiguous sequence of schemaed events. A stream is made up of one or more Kafka topics, and its events all conform to a single schema.
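To make the stream/topic relationship concrete, a single stream name expands to its datacenter-prefixed Kafka topics, as with the mediawiki.revision-create example above:

```javascript
// One stream, multiple datacenter-prefixed Kafka topics (as with the
// EventBus topics mentioned above).
function streamToTopics(streamName, datacenters) {
  return datacenters.map(dc => `${dc}.${streamName}`);
}
```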

I'm renaming the components as follows:

  • Stream Intake (previously Scalable Event Intake)
  • Stream Connectors
  • Stream Processing
  • Stream Configuration
  • Event Schema Repository
  • Event Schema Guidelines

Note that the Event Schema Repository does not refer to the stream concept, as it applies at the event level. Streams will have events that all conform to a schema, but the event schemas do not need to know anything about stream configurations.

I'm going to edit this parent task, and also the wording in existing sub tasks.


One of the descriptive and confusing concepts I mention is a 'schema topic usage'. EventLogging users are used to referring to 'schemas' and streams of events that use those schemas as the same thing. I need to discourage this habit, as they are not the same thing.

👏 👏 👏

I've been working on T202437 recently and saying things like "the visual editor doesn't log these events to the Edit schema". It's been in the back of my mind that this doesn't really make sense, since a schema is a data model, not a log or database, but I wasn't confident about what terminology would be better. So I'm excited to have some best practices like these!

So let me see if I understand your proposal correctly: I might write a schema describing the data I want to capture, which an engineer would implement by writing code that emits events. As these events flow through EventLogging, Kafka, and the rest of the Analytics Event Empire (technical term), they make up an event stream (which may be channeled through multiple Kafka topics). Eventually, they are written to a Hadoop table, at which point they make up an event log. Am I on the right track here?

Yes, the right track for sure! I'll add that we will be using event 'stream' to very technically refer to any semantically grouped set of topics in Kafka. Once the data is written out to a static resting place, like Hive or MySQL or Cassandra or a log file or whatever it may be, I'd just call it event data. But even more so, I like Confluent and Flink's position: a static event dataset is really just a time-bounded stream. So while we may not usually refer to the static files in Hadoop as streams, they can be philosophically thought of as streams too.

Anyway I don't think I would use the term 'event log' explicitly to refer to the data in Hadoop or elsewhere. Perhaps event data(set) would be more appropriate there.


Added JavaScript errors to the use cases, per the RFC IRC discussion. Specifics:

  • Largish events (due to stack traces) - not huge but well over the current GET size limit.
  • Event volume is impossible to predict or control. Normally very low; if something goes wrong, one or more events per pageview.
  • Events need to go to a non-standard location - Logstash or maybe Sentry if it exists.
  • Schema verification nice to have but not really needed.
  • Some kind of deduplication in case of high volume (generate an error hash on the client side, try to discard errors with a hash that was seen already, as opposed to sampling the most frequent error and not recording the rest at all).
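The deduplication idea in the last bullet might look like this sketch; the djb2-style string hash is a stand-in for whatever a real client would use to fingerprint a normalized stack trace:

```javascript
// Sketch: drop repeated client-side errors by hashing and remembering them.
// The hash below is a simple djb2-style stand-in, not a real fingerprint.
function errorHash(message, stack) {
  const s = message + '\n' + stack;
  let h = 5381;
  for (let i = 0; i < s.length; i++) {
    h = ((h * 33) ^ s.charCodeAt(i)) >>> 0;
  }
  return h.toString(16);
}

const seenErrors = new Set();

function shouldSendError(message, stack) {
  const h = errorHash(message, stack);
  if (seenErrors.has(h)) {
    return false; // this error was already reported; discard it
  }
  seenErrors.add(h);
  return true;
}
```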
phuedx added a comment. Oct 1 2018, 1:04 PM
  • Largish events (due to stack traces) - not huge but well over the current GET size limit.
  • Event volume is impossible to predict or control. Normally very low; if something goes wrong, one or more events per pageview.

Thanks to whoever brought those points up.

If the answer to 1 is to use POST requests to submit _certain_ events via the /topics endpoint (taken from the diagram in the description), then it follows that we could batch-send several per-page events in one request.

AIUI this falls out of the

As an engineer, I want to batch produce many events at once so mobile apps can produce events after an offline period.

story in T201068: Modern Event Platform: Stream Intake Service.
