Define edit related events for change propagation
Closed, ResolvedPublic

Description

Our (Services) primary focus this quarter is on enabling change propagation for edit-related events. We already track such events in a custom extension, which creates custom jobs, which in turn perform HTTP requests to RESTBase. Instead, we would like to cover this functionality with more general-purpose events using the event bus:

article creation
article deletion
article undeletion
article edit
article rename
revision deletion / suppression
file upload

@Halfak has already created a fairly detailed list of events covering this at https://meta.wikimedia.org/wiki/Research:MediaWiki_events:_a_generalized_public_event_datasource#Relevant_events. These should be a great starting point for the discussion.

Other use cases

Change propagation between content types
- edit triggers Parsoid re-parse, which triggers mobile app service & metadata updates
Wikidata changes
- use cases: invalidate pages using specific wikidata items; keeping the Wikidata-Query-Service up to date

Considerations / questions

Naming of articles / resources vs. topics vs. subscriptions: Generally use URLs / paths as discussed in T102476 (section "Addressing of components")?

Results from meeting 2015-10-22 & follow-up discussion

Framing, for all events

uri: string; path or url. Example: /en.wikipedia.org/v1/page/title/San_Francisco
id: v1 UUID; corresponding to the x-request-id header, or another primary event identifier. V1 UUIDs contain a high-resolution timestamp.
dt: ISO 8601 timestamp corresponding to the id attribute. This is redundant, but makes life easier for human readers, Hive and others.
domain: en.wikipedia.org, fr.wiktionary.org,...; No mobile variants.

Edit events

title: string
pageid: integer
revision: integer
savetime: iso 8601
Other metadata, like the user etc.
Generally, no overly sensitive information (like client IPs for authenticated edits) in primary events.
- Can be included in expanded message in separate topic, or stored separately based on reqid.
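To make the framing and edit-event fields above concrete, here is a minimal Python sketch of what such an event could look like. The helper names (uuid1_to_datetime, make_edit_event), the flat layout and the example values are assumptions for illustration only; they are not taken from the actual schema definitions.

```python
import datetime
import uuid

# Number of 100-ns intervals between the UUID epoch (1582-10-15) and the Unix epoch.
UUID_EPOCH_OFFSET = 0x01b21dd213814000
UNIX_EPOCH = datetime.datetime(1970, 1, 1, tzinfo=datetime.timezone.utc)


def uuid1_to_datetime(u):
    """Return the UTC datetime embedded in a v1 UUID (microsecond precision)."""
    micros = (u.time - UUID_EPOCH_OFFSET) // 10
    return UNIX_EPOCH + datetime.timedelta(microseconds=micros)


def make_edit_event(domain, title, pageid, revision, savetime):
    """Build a hypothetical edit event: framing fields first, then the body."""
    event_id = uuid.uuid1()
    return {
        # Framing, shared by all events
        "uri": "/%s/v1/page/title/%s" % (domain, title),
        "id": str(event_id),
        "dt": uuid1_to_datetime(event_id).isoformat(),  # derived from id
        "domain": domain,
        # Edit-specific fields
        "title": title,
        "pageid": pageid,
        "revision": revision,
        "savetime": savetime,
    }


# Placeholder values, not real page or revision ids.
event = make_edit_event("en.wikipedia.org", "San_Francisco", 1, 2,
                        "2015-10-22T17:00:00Z")
print(event["uri"], event["dt"])
```

Deriving dt from id in one place keeps the two values consistent, which is the property the dt description above relies on.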

Implementation

We are hoping to integrate event production directly into MediaWiki core, rather than using an extension and hooks. This functionality should be well integrated, tested & maintained.

Details

	Subject	Repo	Branch	Lines +/-
	Basic MediaWiki events	mediawiki/event-schemas	master	+361 -0


Related Objects

Status	Assigned	Task
		Restricted Task
Duplicate	None	T109331 Deleted files sometimes remain visible to non-privileged users if permanently linked
Duplicate	None	T133819 upload-lb.ulsfo.wikimedia.org still allow access to some deleted files
Duplicate	BBlack	T119038 Image cache issue when 'over-writing' an image on commons
Resolved	• ema	T133821 Make CDN purges reliable
Resolved	daniel	T102476 RFC: Requirements for change propagation
Resolved	• GWicke	T84923 Reliable publish / subscribe event bus
Resolved	Ottomata	T88459 Implementing the reliable event bus using Kafka
Invalid	Ottomata	T110748 Event Bus
Resolved	Ottomata	T110750 Investigate improving Confluent REST Proxy and Schema Registry for Event Bus
Resolved	Ottomata	T114443 EventBus MVP
Resolved	• mobrovac	T116247 Define edit related events for change propagation

Event Timeline

• GWicke created this task. Oct 21 2015, 11:46 PM

• GWicke raised the priority of this task from to Medium.

• GWicke raised the priority of this task from Medium to High.

• GWicke updated the task description.

• GWicke added projects: SRE, Event-Platform, Discovery-ARCHIVED, Epic, Analytics, Wikidata, MediaWiki-General, Services, Service-Architecture, Wikidata-Query-Service.

• GWicke set Security to None.

• GWicke added subscribers: Aklapper, Matanya, • Mattflaschen-WMF and 16 others.

• GWicke updated the task description. Oct 21 2015, 11:49 PM

• GWicke updated the task description.

• GWicke added a parent task: T102476: RFC: Requirements for change propagation. Oct 21 2015, 11:52 PM

• GWicke mentioned this in T84923: Reliable publish / subscribe event bus.

• GWicke mentioned this in T114443: EventBus MVP. Oct 21 2015, 11:55 PM

Smalyshev subscribed. Oct 22 2015, 1:16 AM

Lydia_Pintscher moved this task from incoming to monitoring on the Wikidata board. Oct 22 2015, 10:17 AM

COOL. As part of this discussion, I'd like us to think about not only fields that are relevant to edit events, but also those fields that might be useful for most, if not all, standardized WMF events to have. These might be required for all events to share. Things like timestamp (what format, what name?), hostname (from where the event originated), etc. See https://meta.wikimedia.org/wiki/Schema:EventCapsule for how EventLogging does it.

There might be 2 levels here: fields that all events have, and fields that most events coming from Mediawiki have. Things like project, username, and I don't know what else.

etherpad from today's meeting:

https://etherpad.wikimedia.org/p/eventbus-events

Some notes from the meeting:

Framing, for all events

uri: string; path or url. Example: /en.wikipedia.org/v1/page/title/San_Francisco
reqid: v1 UUID; corresponding to the x-request-id header, or another primary event identifier. V1 UUIDs contain a high-resolution timestamp.
domain: en.wikipedia.org, fr.wiktionary.org,...; No mobile variants.

Edit events

title: string
pageid: integer
revision: integer
savetime: iso 8601
Other metadata, like the user etc.
Generally, no overly sensitive information (like client IPs for authenticated edits) in primary events.
- Can be included in expanded message in separate topic, or stored separately based on reqid.

Implementation

We are hoping to integrate event production directly into MediaWiki core, rather than using an extension and hooks. This functionality should be well integrated, tested & maintained.

I'd like an actual timestamp to be part of the framing for all events too. I'm all for a reqid, (although I'd bikeshed about the name a bit), but having a standardized canonical timestamp in all events is very useful. Can we add:

dt: iso 8601 timestamp. This may be the time of the event creation, or it might be something else. It can be set by the producer.

Generally, no overly sensitive information (like client IPs for authenticated edits) in primary events.
Can be included in expanded message in separate topic, or stored separately based on reqid.

In the meeting we said that MW would generate two event streams directly, one that had more information, and another that had less, minus fields with privacy concerns.

In T116247#1747924, @Ottomata wrote:

I'd like an actual timestamp to be part of the framing for all events too. I'm all for a reqid, (although I'd bikeshed about the name a bit), but having a standardized canonical timestamp in all events is very useful. Can we add:

dt: iso 8601 timestamp. This may be the time of the event creation, or it might be something else. It can be set by the producer.

So the producer would store the same time stamp twice? UUID v1 already contains it.

Generally, no overly sensitive information (like client IPs for authenticated edits) in primary events.
Can be included in expanded message in separate topic, or stored separately based on reqid.

In the meeting we said that MW would generate two event streams directly, one that had more information, and another that had less, minus fields with privacy concerns.

Yes, if it has data to produce both of them right away, sure. To topics named something like mw-edit and mw-edit-private perhaps (where the latter contains this extra info).

So the producer would store the same time stamp twice? UUID v1 already contains it.

Could you provide an example of what this UUID would look like?

A reason for having a timestamp only field is so that applications can use it for time based logic without having to also know how to extract the timestamp out of an overloaded uuid.

Also, who is responsible for setting this reqid? In many cases, varnish, right? A producer may emit several events during a given request, and it should have the ability to set what it considers to be the real timestamp of each event.

topics named something like mw-edit and mw-edit-private perhaps (where the latter contains this extra info).

I'd prefer if we did this the other way around. The 'private' topic will have more data and be the main source of truth. The public one will contain a subset of this data, and thus is subordinate to the main one.
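A minimal sketch of that relationship, assuming the public event is produced by dropping a fixed set of sensitive fields from the full one. The field names in PRIVATE_FIELDS and the helper to_public_event are hypothetical; which fields count as sensitive is not settled here.

```python
# Hypothetical set of fields considered too sensitive for the public topic.
PRIVATE_FIELDS = frozenset(["client_ip", "user_agent"])


def to_public_event(private_event):
    """Return a copy of the event without the privacy-sensitive fields."""
    return {key: value for key, value in private_event.items()
            if key not in PRIVATE_FIELDS}


full = {"title": "San_Francisco", "revision": 2, "client_ip": "192.0.2.1"}
public = to_public_event(full)
print(sorted(public))
```

Because the public event is computed from the full one, the two streams cannot drift apart in the fields they share.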

If we offer public access to the public events of the past we need to rewrite them according to new events that hide previous public events. Can you make sure that events that hide any part of previous public events are also public? So that a public archive of events can be maintained based on the public events alone.

• GWicke added a subscriber: EBernhardson. Oct 23 2015, 5:22 PM

@Ottomata, UUIDs are described in https://en.wikipedia.org/wiki/Universally_unique_identifier. An example for a v1 UUID is b54adc00-67f9-11d9-9669-0800200c9a66. There are libraries to extract the high-resolution timestamp for most environments, as well as online services like https://www.famkruithof.net/uuid/uuidgen?typeReq=-1 for experimentation.

Regarding a separate timestamp in the framing information: Which time would this correspond to? The next version of Cassandra is likely going to track enqueue time itself & support efficient retrieval by timestamp, and enqueue time is something that should be handled in Kafka in any case. Other timestamps have event-specific semantics, like for example the MediaWiki save time, which is why I think it makes most sense to not include them in the framing information. All events should however have a unique identifier and timestamp that ties together all events triggered by the same original trigger, and can be used for per-topic de-duplication / idempotency. This is what the UUID in reqid would provide.
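As a rough illustration of per-topic de-duplication keyed on the event's UUID, a consumer could keep track of the ids it has already seen. This is only a sketch: the class name TopicDeduplicator, the "id" field name and the unbounded in-memory sets are assumptions, and a real consumer would need bounded or persistent state.

```python
class TopicDeduplicator:
    """Remembers which event ids were already seen, separately per topic."""

    def __init__(self):
        self._seen = {}

    def accept(self, topic, event):
        """Return True the first time an id is seen in a topic, False afterwards."""
        seen = self._seen.setdefault(topic, set())
        if event["id"] in seen:
            return False
        seen.add(event["id"])
        return True


dedup = TopicDeduplicator()
event = {"id": "b54adc00-67f9-11d9-9669-0800200c9a66"}
print(dedup.accept("mw-edit", event))  # first delivery
print(dedup.accept("mw-edit", event))  # redelivery of the same event
```

The same id is still accepted once in every other topic, which matches the idea that de-duplication is a per-topic concern.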

@JanZerebecki: Suppression information would indeed be needed for public access to older events. One option would be to key this on the event's UUID. We could also consider superseding the message using Kafka's deduplication (compaction) based on the same UUID.

As long as a separate public suppression event exists that refers to the old one it sounds fine.

In T116247#1748095, @Ottomata wrote:

So the producer would store the same time stamp twice? UUID v1 already contains it.

Could you provide an example of what this UUID would look like?

A reason for having a timestamp only field is so that applications can use it for time based logic without having to also know how to extract the timestamp out of an overloaded uuid.

Using Python as an example (and sticking strictly to what's in the standard lib):

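import datetime
from uuid import uuid1

u = uuid1()

# u.time counts 100-ns intervals since the UUID epoch; convert to Unix seconds.
print(datetime.datetime.fromtimestamp((u.time - 0x01b21dd213814000) * 100 / 1e9))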

The constant 0x01b21dd213814000 represents the number of 100-ns units between the epoch that UUIDs use (1582-10-15 00:00:00), and the standard unix epoch.

Right, but how would you do this in say, Hive? Or in bash?

Timestamp logic should be easy and immediate.

Regarding a separate timestamp in the framing information: Which time would this correspond to?

This is up to the producer, I think. If there are more timestamps needed for specific schema, that is fine, but I see a lot of value in having a canonical and easily readable timestamp. Camus uses this timestamp to auto partition files by hour when they are imported from Kafka into HDFS. ISO 8601 works, unix epoch seconds and milliseconds work. We'd have to add more code to make UUID timestamp work.

Maybe this is ok, but I'd much rather be able to use my eyes and easy tools to do time logic.

Right, but how would you do this in say, Hive? Or in bash? Timestamp logic should be easy and immediate.

Yeah, Hive's UUID support really looks pretty lacking. There seems to be some UDF code, but it's definitely not as convenient as it could be. I'm fine with additionally including the ISO 8601 timestamp corresponding to the timeuuid to help Hive & humans reading the JSON. The overhead is fairly small, and we can automate adding the timestamp if only the UUID was supplied by a producer.

I went ahead and updated the task description with the current framing / per-event schema. I renamed the reqid to just id, and added a ts field containing the same timestamp in ISO 8601 format.

I'm still a little confused about how this reqid/id will work? You are suggesting that it comes from the x-request-id that we want varnish to set, right? Won't this mean that multiple events (those produced during the same http request at varnish level) will have the same reqid?

To avoid possible conflicts, I'd suggest we call this not just id. How about uuid? That's what EventLogging capsule does: https://meta.wikimedia.org/wiki/Schema:EventCapsule

Also, this is just a personal preference, but I'd prefer if we had a convention differentiating integer/second based 'timestamps' and string/date based 'datetimes'. For webrequest data, the ISO8601 is called dt.

https://wikitech.wikimedia.org/wiki/Analytics/Data/Webrequest

Also, over at T88459#1694274, I commented:

If we adopt a convention of always storing schema name and/or revision in the schemas themselves, then we can do like EventLogging does and infer and validate the schema based on this value. This would especially be helpful in associating a message with an Avro Schema when serializing into binary.

It sounds helpful to tie a topic to a schema, but I think we should be able to know the schema of a given message in Kafka by something other than having to look in the schema repository config.

I just didn't want this thought to get lost. If you disagree (which I think you do), this isn't necessary for MVP, so we can revisit it later if/when it becomes relevant.

In T116247#1752974, @Ottomata wrote:

I'm still a little confused about how this reqid/id will work? You are suggesting that it comes from the x-request-id that we want varnish to set, right? Won't this mean that multiple events (those produced during the same http request at varnish level) will have the same reqid?

That's the idea, yes, so that different requests that fire off in the system can be tied to the same request ID.

In T116247#1752975, @Ottomata wrote:

To avoid possible conflicts, I'd suggest we call this not just id. How about uuid? That's what EventLogging capsule does: https://meta.wikimedia.org/wiki/Schema:EventCapsule

I don't see a conflicting problem with id (even though id is a JSONSchema keyword, but it relates to the schema, not its properties, so we're good there). uuid is not a good choice, IMHO, it's like naming a field string because its value is a string. The most accurate name would be reqid, since that's what it is.

In T116247#1752979, @Ottomata wrote:

Also, this is just a personal preference, but I'd prefer if we had a convention differentiating integer/second based 'timestamps' and string/date based 'datetimes'. For webrequest data, the ISO8601 is called dt.

That can be seen from the field's type, I guess: if it's an integer, it's a Unix timestamp, otherwise an ISO 8601 date. But I see your point. Will s/ts/dt/ in the description.

In T116247#1753027, @Ottomata wrote:

Also, over at T88459#1694274, I commented:

If we adopt a convention of always storing schema name and/or revision in the schemas themselves, then we can do like EventLogging does and infer and validate the schema based on this value. This would especially be helpful in associating a message with an Avro Schema when serializing into binary.

I'd be in favour of that as well. Two ideas:

Manual schema versions. We could increase the schema version every time we change something in the schema. Easy to achieve but it's also easy to forget to bump the version when something has been changed.
Use the git commit SHA1. Here the event bus would attach the current git commit SHA1 to the message. Also rather straightforward to achieve. The problem here might be that different messages for the same topic might have different SHA1's, but still point to the same version of the schema.

• mobrovac updated the task description. Oct 26 2015, 3:23 PM

I don't see a conflicting problem with id (even though id is a JSONSchema keyword, but it relates to the schema, not its properties, so we're good there). uuid is not a good choice, IMHO, it's like naming a field string because its value is a string. The most accurate name would be reqid, since that's what it is.

Ok cool, if that's the case, then reqid or even request_id (I like long names...what can I say?) sounds good. EventLogging gives every event a really unique uuid, based on the message itself, so that you can always uniquely ID any event. It mainly uses this for avoiding duplicates. Can we add this to the description too? See: https://github.com/wikimedia/mediawiki-extensions-EventLogging/blob/master/server/eventlogging/parse.py#L69

In T116247#1749452, @Ottomata wrote:

Right, but how would you do this in say, Hive? Or in bash?

In bash:

$ sudo apt-get install uuid
$ ID=$(uuid -v 1)
$ grep "content: time" <(uuid -d $ID)
    content: time:  2015-10-26 15:16:20.026434.0 UTC

In Java (applicable to Hive?):

import java.util.Date;
import java.util.UUID;

public class Time {
    public static void main(String...args) {
        UUID id = UUID.fromString(args[0]);
        double timestamp = (id.timestamp() - 0x01b21dd213814000L)*100/1e6;
        System.out.println(new Date((long)timestamp));
    }
}

Anyway, I don't object to including the redundant iso8601 timestamp, I just wanted to make sure it was clear that it's not at all difficult to extract a timestamp from a v1 UUID (and even less onerous when you figure that code like this would be tucked away in a helper somewhere).

Manual schema versions. We could increase the schema version every time we change something in the schema. Easy to achieve but it's also easy to forget to bump the version when something has been changed.

FWIW, this is how EventLogging does it, although the revisions are managed by Mediawiki revision IDs. I'm still not sure, but I think it would make sense to keep each schema iteration present in the HEAD of the repo, so it would be fairly easy to manually bump schema version numbers.

In T116247#1753398, @Ottomata wrote:

Ok cool, if that's the case, then reqid or even request_id (I like long names...what can I say?) sounds good.

request_id works for me. I also happen to like snake_case. Let's standardise on that?

EventLogging gives every event a really unique uuid, based on the message itself, so that you can always uniquely ID any event. It mainly uses this for avoiding duplicates. Can we add this to the description too? See: https://github.com/wikimedia/mediawiki-extensions-EventLogging/blob/master/server/eventlogging/parse.py#L69

Hm, I think duplicates should be detected based on the content of the message itself and the time stamp.

In T116247#1753399, @Eevans wrote:

I don't object to including the redundant iso8601 timestamp, I just wanted to make sure it was clear that it's not at all difficult to extract a timestamp from a v1 UUID (and even less onerous when you figure that code like this would be tucked away in a helper somewhere).

I'm not so sure actually that these will always be redundant. I think the request ID should be persisted to track the same event throughout the system. Imagine a user clicks on something which produces an event in the queue and that event triggers another one to be enqueued. Then, both of them should have the same request id, but different time stamps, shouldn't they?

Hm, I think duplicates should be detected based on the content of the message itself and the time stamp.

EventLogging explicitly uses the uuid in MySQL as a unique key for all tables. Having it standardized on a single field means that the unique index creation is standard for all events. This keeps duplicate events from ever being inserted into a single table.

MySQL unique indexes aren't really a consideration here, and I'd be fine if we could define a unique index across multiple keys, but I'd want those keys to be standardized in the framing of all events. A uuid like EventLogging does would be an easy way to do this.

If we adopt a convention of always storing schema name and/or revision in the schemas themselves, then we can do like EventLogging does and infer and validate the schema based on this value. This would especially be helpful in associating a message with an Avro Schema when serializing into binary.

The topic configuration will take precedence, so we wouldn't use client-supplied values for these fields, and would basically just write a part of the topic configuration into each event. We also decided that we will only evolve schemas in backwards-compatible ways. In practice, this means that we'll only add fields, and the latest schema will be able to validate both new and old data in each topic.

@Ottomata, which value do you see in recording the schema configured for a topic at enqueue time in each event?

I'm not so sure actually that these will always be redundant. I think the request ID should be persisted to track the same event throughout the system. Imagine a user clicks on something which produces an event in the queue and that event triggers another one to be enqueued. Then, both of them should have the same request id, but different time stamps, shouldn't they?

IMHO, the timestamps of the event ID and explicit timestamp (ts or dt) should always match. This makes it a lot simpler to automatically derive dt from id in the producer REST proxy. Other event-specific times (like the save time as recorded by MediaWiki) should imho go into the event body.

If we have a use case for emitting two secondary events *to the same topic* that were both triggered by the same primary event (user click / request id), then we can generate a new ID for at least one of those events, and record the parent event id in a separate field (ex: par_id). This way, we can get the right deduplication semantics for each of those events.
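A small sketch of that idea: each secondary event gets its own freshly generated v1 UUID and records the primary event's id in par_id. The helper name derive_event and the example body fields are made up for illustration.

```python
import uuid


def derive_event(parent, **body):
    """Build a secondary event with its own id and a pointer to its parent."""
    child = dict(body)
    child["id"] = str(uuid.uuid1())   # new id, so de-duplication treats it separately
    child["par_id"] = parent["id"]    # ties it back to the triggering event
    return child


primary = {"id": str(uuid.uuid1())}
first = derive_event(primary, action="reparse")
second = derive_event(primary, action="purge")
print(first["id"] != second["id"], first["par_id"] == second["par_id"])
```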

PR 5 proposes the schema definitions for the basic MW events: article edit / delete / undelete / move and revision visibility changes.

If we have a use case for emitting two secondary events *to the same topic* that were both triggered by the same primary event (user click / request id), then we can generate a new ID for at least one of those events, and record the parent event id in a separate field (ex: par_id). This way, we can get the right deduplication semantics for each of those events.

? What's the point of the request_id then? I thought we wanted X-Request-Id so that we can easily tie together events generated by the same http request.

Why not just have request_id and uuid as separate fields that always exist?

IMHO, the timestamps of the event ID and explicit timestamp (ts or dt) should always match. This makes it a lot simpler to automatically derive dt from id in the producer REST proxy. Other event-specific times (like the save time as recorded by MediaWiki) should imho go into the event body.

Why? I agree, that specific schemas can define additional timestamps, but what is the harm in having a standard one that is set and used semantically by the producer? What if I wanted to explicitly feed a topic with events dated in the past, perhaps for backfilling or recovery reasons?

What do y'all think about keeping these 'framing' fields in a nested object? I'm not sure if this is a good or bad idea. If later we decide we do want to use $ref to share common schema fields between different schemas, it'll be easier to do so if these are in a separate object.

In T116247#1754698, @Ottomata wrote:

If we have a use case for emitting two secondary events *to the same topic* that were both triggered by the same primary event (user click / request id), then we can generate a new ID for at least one of those events, and record the parent event id in a separate field (ex: par_id). This way, we can get the right deduplication semantics for each of those events.

? What's the point of the request_id then? I thought we wanted X-Request-Id so that we can easily tie together events generated by the same http request.

Why not just have request_id and uuid as separate fields that always exist?

Sure, optionally having a separate request ID (in addition to the event ID) sounds good to me as well. We should always require / auto-generate the event ID (and use it for event deduplication, derived event timestamp etc), while the reqid can be added to events that are indeed request-triggered.

IMHO, the timestamps of the event ID and explicit timestamp (ts or dt) should always match. This makes it a lot simpler to automatically derive dt from id in the producer REST proxy. Other event-specific times (like the save time as recorded by MediaWiki) should imho go into the event body.

Why? I agree, that specific schemas can define additional timestamps, but what is the harm in having a standard one that is set and used semantically by the producer? What if I wanted to explicitly feed a topic with events dated in the past, perhaps for backfilling or recovery reasons?

That's exactly what the event ID and dt should support well. MW edit timestamps are low resolution, and in a custom format, which imho makes them less than ideal for general event ids / timestamps.

Ok, cool, I'm cool with that, so:

request_id - UUID1 from Varnish, not necessarily unique for an individual event
event_id (or maybe just uuid? since that is what EL uses?) - Actual UUID for an event.
dt - ISO 8601 timestamp, usually derived from the timestamp in event_id

In T116247#1754709, @Ottomata wrote:

What do y'all think about keeping these 'framing' fields in a nested object? I'm not sure if this is a good or bad idea. If later we decide we do want to use $ref to share common schema fields between different schemas, it'll be easier to do so if these are in a separate object.

I've been thinking about it too. Ideally, we could leave these fields out of schema defs and simply reference them. But that doesn't seem to fit well with storing them in a git repo. What I see as a possible solution is to put these common fields into a separate file and let the producer proxy in front of Kafka stick it into each schema def. It's somewhat ugly though (as in, less transparent). PR 5 puts them directly in each schema for now.

In T116247#1757151, @Ottomata wrote:

Ok, cool, I'm cool with that, so:

request_id - UUID1 from Varnish, not necessarily unique for an individual event
event_id (or maybe just uuid? since that is what EL uses?) - Actual UUID for an event.
dt - ISO 8601 timestamp, usually derived from the timestamp in event_id

+1. Sounds reasonable. Will alter the PR to include event_id as well.

I've been thinking about it too. Ideally, we could leave these fields out of schema defs and simply reference them. But that doesn't seem to fit well with storing them in a git repo. What I see as a possible solution is to put these common fields into a separate file and let the producer proxy in front of Kafka stick it into each schema def.

The validator lets us register schemas corresponding to urls, which will then be used when those are referenced via $ref.

We could also use a nested object to remove some redundancy in naming:

{
  event: {
    id: '..v1 uuid..',
    ts: '2015-...',
    subject: '/some/uri',
    request_id: '...v1 uuid ..' // Optional in schema; could also move this to event specific data
  },
  // Event specific data
}

Eevans mentioned this in T116786: Integrate eventbus-based event production into MediaWiki. Oct 27 2015, 5:51 PM

• mobrovac added a parent task: T114443: EventBus MVP. Oct 27 2015, 6:24 PM

• GWicke mentioned this in T116840: Cached REST end point for imageinfo requests. Oct 27 2015, 9:39 PM

intracer subscribed. Oct 27 2015, 10:37 PM

If we adopt a convention of always storing schema name and/or revision in the schemas themselves, then we can do like EventLogging does and infer and validate the schema based on this value. This would especially be helpful in associating a message with an Avro Schema when serializing into binary.

The topic configuration will take precedence, so we wouldn't use client-supplied values for these fields, and would basically just write a part of the topic configuration into each event. We also decided that we will only evolve schemas in backwards-compatible ways. In practice, this means that we'll only add fields, and the latest schema will be able to validate both new and old data in each topic.

@Ottomata, which value do you see in recording the schema configured for a topic at enqueue time in each event?

Mainly for analytics purposes. For historical data and other analytics contexts, the data may be analyzed much farther down the line than from Kafka. In those contexts, the information about which topic the event came from will be lost. If we don't have the topic, we won't be able to know which schema the event was validated with.

Also, it will be cumbersome to always need to load the topic/schema config from the schema repository for analytics purposes.

@Ottomata, I think understanding the semantics of an event primarily requires knowledge of the topic. The topic in turn provides access to the schema, which describes the structure of the events. It is likely that we'll have multiple topics record similarly-structured events, which means that they might share the same schema, but describe different semantic events in each topic. For example, a basic timing event can be emitted for clicks of button A or button B, each tracked in a separate topic.

I could be convinced to include the topic name / URL in each event. One use case this could potentially help is streaming events from multiple topics. We could also handle this with a framing format, but this might force us to parse JSON on the consumer side, which wouldn't be great for performance.

Either way, given the topic name you should have no trouble accessing the schema. We can expose schemas for each topic URL in the REST API (ex: /{topic}?schema.json), which you could then store along with the event data in hadoop. Embedding an explicit schema url of the form described above might be a bit redundant, considering the simplicity of the construction.

I think understanding the semantics of an event primarily requires knowledge of the topic.

This is true if you are consuming from something that has a "topic", but what if you are downloading a historical dump of events? It seems to me that we should aim to have any event dump and live streamed event use the same schemas.

@Ottomata, I think understanding the semantics of an event primarily requires knowledge of the topic.

Hm, I don't think this is true. You will need some understanding of what a historical dataset is, but that's all. The historical datasets are going to be made up of many revisions of the same schema. January may have data validated with schema revision 1, and February with schema revision 2. If you don't know which exact schema-revision an event was validated with, you will have a hard time reasoning about what the data looks like. This will be especially true if we are attempting to do Avro conversion.

@Ottomata: Based on our backwards-compatibility rules, the latest schema will be a superset of previous schemas. This means that you will be able to understand both old and new data in a given topic using the latest schema.
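To illustrate what an "only add fields" rule could mean in practice, here is a deliberately strict sketch of a compatibility check between two JSON Schema revisions. It only looks at top-level properties and required, and the function name is_backwards_compatible is made up; the actual rules may well be looser or cover more of the schema.

```python
def is_backwards_compatible(old_schema, new_schema):
    """Check that new_schema only adds optional top-level fields to old_schema."""
    old_props = old_schema.get("properties", {})
    new_props = new_schema.get("properties", {})
    # Every existing field must still be present with an unchanged definition.
    for name, definition in old_props.items():
        if new_props.get(name) != definition:
            return False
    # No newly required fields, or old events would stop validating.
    old_required = set(old_schema.get("required", []))
    new_required = set(new_schema.get("required", []))
    return new_required <= old_required


v1 = {"properties": {"title": {"type": "string"}}, "required": ["title"]}
v2 = {"properties": {"title": {"type": "string"},
                     "name": {"type": "string", "default": "nonya"}},
      "required": ["title"]}
print(is_backwards_compatible(v1, v2))
```

Under a check like this, any event that validated against the older revision also validates against the newer one, which is what makes the latest schema usable for both old and new data.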

Have we decided that defaults will be filled in for missing fields?

@Ottomata, they will be filled in somewhere, but I think we haven't necessarily decided on filling them in at production time. To me it seems that filling in either at production or consumption time will work, as long as defaults don't change. It sounds like you have a concern in that area, though. Could you elaborate?

Producer A has schema version 1.
Producer B has schema version 2, which has added field "name" with default "nonya".

All of these events are being imported into Hadoop. An analyst looks at the latest schema and wants to do some analysis on "name". They write a job to query the data in Hadoop selecting the "name" field. The data produced by Producer A does not have a "name" field. The analyst gets key errors and their job fails.

@Ottomata, you are basically making the case for filling in the defaults at consumption time.

Or produce time.

But really, even if we fill in defaults during production or consumption, this will still be a problem for historical data. Data is only consumed into Hadoop once, and schema changes can happen after consumption time. If you have no way of associating a particular historical event with a schema, you won't be able to fill in defaults properly for missing keys. Avro solves this during binary deserialization, by filling in the defaults from the reader's schema.

We can't avoid this problem with JSON Schema, but we can make it easier to deal with by having a built-in mapping from an event to a schema.

@Ottomata: If you fill in the defaults at consumption time, then you have a choice of how you want to treat old events. You can either fill in the defaults from the latest schema (probably what you want in most cases), or choose to explicitly distinguish fields that were not yet defined at the time the event was produced.
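The two consumption-time options can be illustrated with a small sketch. It assumes events are flat JSON objects and that defaults live in the `default` keyword of each top-level property in the JSON Schema; the helper names `fill_defaults` and `fields_missing_at_production` are hypothetical and not part of EventLogging or any existing library.

```python
import copy


def fill_defaults(event, schema):
    """Option 1: return a copy of the event with defaults from the given
    (for example, latest) schema filled in for absent top-level fields."""
    filled = copy.deepcopy(event)
    for name, spec in schema.get("properties", {}).items():
        if name not in filled and "default" in spec:
            filled[name] = copy.deepcopy(spec["default"])
    return filled


def fields_missing_at_production(event, schema):
    """Option 2: list the schema fields the event did not carry, so a
    consumer can treat them as 'not yet defined' instead of defaulted."""
    return sorted(
        name for name in schema.get("properties", {}) if name not in event
    )


# Hypothetical latest schema with the "name" field from the example above.
latest_schema = {
    "properties": {
        "title": {"type": "string"},
        "name": {"type": "string", "default": "nonya"},
    }
}

# An event as Producer A (schema version 1) would have emitted it.
old_event = {"title": "San_Francisco"}

filled_event = fill_defaults(old_event, latest_schema)
undefined_fields = fields_missing_at_production(old_event, latest_schema)
```

Here `filled_event` carries `name` set to `"nonya"`, so a query selecting `name` no longer hits a missing key, while `undefined_fields` reports `["name"]` for consumers that prefer to keep the distinction. Neither helper mutates the stored event.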

Events will be consumed into Hadoop close to production time (usually within an hour). Schema changes made years after the fact cannot be reflected in years-old historical data unless it is reprocessed and rewritten.

Please take a look at the proposed event definitions and voice any concerns you might have. We'd like to settle on it in the next couple of days so that we can continue with our QGs.

Cool, added some comments.

• Mattflaschen-WMF unsubscribed. Nov 6 2015, 1:23 AM

Is it time to consider creating a standalone repo for these schemas? If so, then that means it is time for repo name bikeshed, woohoo!

No idea what to call this or where to put it. mediawiki/schemas?

In T116247#1799843, @Ottomata wrote:

Is it time to consider creating a standalone repo for these schemas?

In my opinion, schemas (and documentation) should always live in the same repo as the code, so it is easier to keep them in sync, and tag and release them together.

@daniel

This schema repo will be used by many codebases: EventLogging, MediaWiki, analytics refinery, etc. Anyone creating events will need these schemas.

There are various ways to share these schemas, but one idea is to use git submodules.

@Ottomata If we have good versioned dependencies between the modules, that should work too. My concern is making sure that code, specs and docs are in sync.

We have already run into many annoyances with trying to keep schemas in line across repositories. I'd be happy to be proved wrong, but I don't see any way, outside of a submoduled schema repository, to have versioned dependencies between Java and PHP.

FYI, the repo is here, waiting for some schemas! :)

https://gerrit.wikimedia.org/r/#/admin/projects/mediawiki/event-schemas

For Avro, @EBernhardson, you can go ahead and submit a patch there in an avro/ directory. We should probably maintain the usual Java directory hierarchy there for them.

For the incoming JSON Schemas, @mobrovac, can you push your schemas there in a jsonschema dir?

Change 254180 had a related patch set uploaded (by Mobrovac):
Basic MediaWiki events

https://gerrit.wikimedia.org/r/254180

gerritbot added a project: Patch-For-Review. Nov 19 2015, 7:39 PM

Change 254180 merged by Ottomata:
Basic MediaWiki events

https://gerrit.wikimedia.org/r/254180

@GWicke and I discussed the schema/revision in meta issue in IRC today. He had an idea that I quite like!

@GWicke suggested that instead of using (schema, revision) to uniquely ID a schema, we just use a URI. EventLogging does this already with schemas stored in meta.wikimedia.org, but the URI resolution is done behind the scenes. Explicitly setting meta.schema to a URI in each event allows us to easily look up a schema outside of any EventLogging/EventBus context. I believe it would be easy to support this in EventLogging code as long as extracting the schema name and revision from the URI is standardized. Whatever the URI is, its last two path elements should be name/revision, e.g. .../schemas/jsonschema/{title}/{rev}.

This would certainly solve the issues that @Nuria and I had about not including schema ids in the events.

Thoughts? I'll look into the implementation of this tomorrow to make sure there isn't something that would make this difficult.
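The "last two path elements are name/revision" convention could be implemented along these lines. This is a rough sketch rather than EventLogging's actual code: the function name `schema_name_and_revision` and the example URIs are hypothetical, and treating the revision as an integer is an assumption.

```python
from urllib.parse import urlparse


def schema_name_and_revision(schema_uri):
    """Extract (name, revision) from the last two path elements of a
    schema URI. Works for both relative paths and full URLs."""
    path = urlparse(schema_uri).path
    parts = [part for part in path.split("/") if part]
    if len(parts) < 2:
        raise ValueError(
            "schema URI must end in <name>/<revision>: %r" % schema_uri
        )
    name, revision = parts[-2], parts[-1]
    # Assumption: revisions are integers; int() raises ValueError otherwise.
    return name, int(revision)


# Hypothetical URIs, for illustration only.
relative = schema_name_and_revision("/schemas/jsonschema/page_edit/1")
absolute = schema_name_and_revision(
    "https://example.org/schemas/jsonschema/page_edit/2"
)
```

Because only the trailing two path elements are inspected, the same event can carry either a bare path or a fully resolvable URL in meta.schema without changing the parsing logic.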

In T116247#1839888, @Ottomata wrote:

@GWicke and I discussed the schema/revision in meta issue in IRC today. He had an idea that I quite like!

@GWicke suggested that instead of using (schema, revision) to uniquely ID a schema, we just use a URI. EventLogging does this already with schemas stored in meta.wikimedia.org, but the URI resolution is done behind the scenes. Explicitly setting meta.schema to a URI in each event allows us to easily look up a schema outside of any EventLogging/EventBus context. I believe it would be easy to support this in EventLogging code as long as extracting the schema name and revision from the URI is standardized. Whatever the URI is, its last two path elements should be name/revision, e.g. .../schemas/jsonschema/{title}/{rev}.

This would certainly solve the issues that @Nuria and I had about not including schema ids in the events.

Thoughts? I'll look into the implementation of this tomorrow to make sure there isn't something that would make this difficult.

I like it.

@Ottomata: I don't know all of the details, but I think the ID idea is a good one. One possible hitch: according to the W3C, URIs are supposed to be opaque. If you're using it as an identifier, then it's a layer violation to also require the server to provide data to the client via the URI.

I haven't studied your proposal closely enough to know if there's really a problem here, and if there is, if it's worth avoiding. There's no sense in overcomplicating things to satisfy some W3C orthodoxy here, but I know that there are some cases where URI parsing causes tighter coupling than necessary, so I figured I'd comment.

Hm, not sure I follow. We are proposing that a schema be ID-able via a URI, and also remotely locatable if that URI happens to be a full URL with scheme and domain information. Is the opaqueness issue the fact that it is <name>/<revision> that IDs the schema, instead of just a unique /<schema_id>?

I think that only means that a client that gets a URL ending in '<name>/<revision>' for an API should not assume it can extract name and revision from it without asking the API.

Hm, I think I see. We are coupling the URI to the ID, which according to the W3C should not be relied upon. Ok, noted.

From my POV, the URL is the ID in the sense that it uniquely identifies the schema and its version.

Milimetric moved this task from Incoming to Radar on the Analytics board. Dec 7 2015, 6:06 PM

• Deskana moved this task from Needs triage to Tracking on the Discovery-ARCHIVED board. Dec 29 2015, 8:55 PM

I believe we can close this task, ja? Got a few defined here: https://github.com/wikimedia/mediawiki-event-schemas/tree/master/jsonschema/mediawiki

In T116247#1927791, @Ottomata wrote:

I believe we can close this task, ja? Got a few defined here: https://github.com/wikimedia/mediawiki-event-schemas/tree/master/jsonschema/mediawiki

Indeed. We are done here.

• Phabricator_maintenance removed a subscriber: yuvipanda. Jun 7 2017, 6:46 PM

Aklapper edited projects, added Analytics-Radar; removed Analytics. Jun 10 2020, 6:44 AM
