Rewrite Avro schemas (ApiAction, CirrusSearchRequestSet) as JSONSchema and produce to EventGate
Open, Normal priority, Public · 8 Story Points

Description

A Modern Event Platform goal this quarter is to deploy an EventGate service and produce the Monolog+Avro based events to it. We'll design new JSONSchema-ed events that represent the same data as the existing Avro ones. We'll then modify the EventBus extension to produce these events.
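For illustration only, a JSONSchema-ed version of one of these events might look roughly like this (field names are made up here, not the final schema; see the Gerrit patches below for the real drafts):

```json
{
  "title": "mediawiki/api/request",
  "description": "A single request to the MediaWiki Action API (illustrative sketch)",
  "type": "object",
  "properties": {
    "$schema": {
      "type": "string",
      "description": "URI of the JSONSchema this event conforms to"
    },
    "meta": {
      "type": "object",
      "properties": {
        "id": { "type": "string" },
        "dt": { "type": "string", "format": "date-time" },
        "domain": { "type": "string" }
      },
      "required": [ "id", "dt" ]
    },
    "params": {
      "type": "object",
      "additionalProperties": { "type": "string" }
    }
  },
  "required": [ "$schema", "meta" ]
}
```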

https://github.com/wikimedia/mediawiki-event-schemas/tree/master/avro/mediawiki

This ticket does not encompass replacing the Hadoop based processing of this data, just producing the new events. We'll have another ticket for using the new data, and for decommissioning the Avro stuff.

Ottomata created this task. Jan 17 2019, 8:20 PM
Ottomata triaged this task as Normal priority.
Restricted Application added a subscriber: Aklapper. Jan 17 2019, 8:20 PM
Ottomata moved this task from Backlog to Next Up on the EventBus board. Jan 17 2019, 9:28 PM
Pchelolo moved this task from Backlog to watching on the Services board. Jan 17 2019, 9:41 PM
Pchelolo edited projects, added Services (watching); removed Services.

I had in mind that one of the reasons for using Avro originally was data size. Currently mediawiki_ApiAction takes ~1T per month and mediawiki_CirrusSearchRequestSet ~2T per month. A small test with Hive showed me that the growth factor from Avro to JSON would be ~4.5 for mediawiki_CirrusSearchRequestSet and ~3 for mediawiki_ApiAction (if the events don't change). This would lead to ~12T monthly (~3 × 1T + ~4.5 × 2T), which is largely acceptable for HDFS.
Last but not least: what about Kafka?

Ottomata added a comment (edited). Tue, Jan 22, 3:29 PM

These events (for now) will go to the jumbo-eqiad Kafka cluster. 12T is not nothing! This cluster is currently handling around 64T / month total. There's about 78T free on that cluster now, so we aren't going to have a problem storing the data (for a week at a time). We'll have to see about throughput: 12T / month works out to about 5MB / second in (12T over the ~2.6M seconds in a month ≈ 4.6MB/s), plus the same out. I'd imagine this will be fine, but we'll have to keep our eyes on it.

Change 485885 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[mediawiki/event-schemas@master] [WIP] Add mediawiki/search/requestset/0.0.1 schema

https://gerrit.wikimedia.org/r/485885

Change 485893 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[mediawiki/event-schemas@master] [WIP] Add mediawiki/api/request/0.0.1 schema

https://gerrit.wikimedia.org/r/485893

Change 487154 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[mediawiki/event-schemas@master] Add test/event/0.0.3 with test_map example

https://gerrit.wikimedia.org/r/487154

@bd808, let's discuss your Monolog idea from https://gerrit.wikimedia.org/r/487154 here.

I just brainstormed with @Pchelolo about this. If we were to use Monolog, we'd likely want to do it with the aim of converting all existing EventBus events to use it as well.

It gets a little tricky though. Monolog is a modular logging pipeline, but events !== log messages. One of the purposes of Monolog's abstractions seems to be that you can use any Channel with any Handler via configuration; e.g. ApiAction could be configured to be logged to files, or to Kafka with Avro, or to Event(Bus|Gate) via HTTP. However, this isn't quite true. In order to support Avro, you had to add extra custom configuration to map from channel names to Avro schemas, AND the Avro schemas themselves had to be shipped with the code. This means that you can't use just any Channel with the Kafka+Avro Handler/Formatter. You couldn't just configure e.g. query logs to go to Kafka+Avro; you needed a schema and extra configs. (There's even a comment in mediawiki-config about how the extra Avro schema configs don't belong there.)
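Roughly, the extra wiring looked something like this (config names and paths here are illustrative, not the actual production config):

```php
<?php
// Illustrative only: each Monolog channel producing Avro has to be mapped
// to a schema that ships with the code, so "any Channel with any Handler"
// does not actually hold for the Kafka+Avro Handler/Formatter.
$wgIllustrativeAvroSchemas = [
	'ApiAction' => [
		'schema' => file_get_contents( __DIR__ . '/avro/ApiAction.avsc' ),
		'revision' => 1,
	],
	'CirrusSearchRequestSet' => [
		'schema' => file_get_contents( __DIR__ . '/avro/CirrusSearchRequestSet.avsc' ),
		'revision' => 1,
	],
];
```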

The same applies to Event(Bus|Gate), except that since we are using JSON, we don't need the actual schemas to format the message. We do need the schema URI though (e.g. /mediawiki/api/request/0.0.1) so that EventGate will know which schema the event should be validated against.
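Concretely, an event POSTed to EventGate would carry its schema URI in the body, something like this (field values made up; only the $schema handling is the point here):

```php
<?php
// Illustrative event body: the producer only needs the schema URI;
// EventGate looks up the schema itself to validate the event.
$event = [
	'$schema' => '/mediawiki/api/request/0.0.1',
	'meta' => [
		'id' => 'b0caf18d-6c7f-4403-947d-2712bbe28610', // made-up UUID
		'dt' => '2019-02-06T21:43:00Z',
		'domain' => 'en.wikipedia.org',
	],
	'params' => [ 'action' => 'query', 'format' => 'json' ],
];
```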

So, I'm not sure it is such a great fit. I'd love it if there was a more generic way than our one-off EventBus extension to send events like this, but I'm not sure what it is. @Pchelolo might have more to say too. :)

> If we were to use Monolog, we'd likely want to do it with the aim of converting all existing EventBus events to use it as well.

It was not my impression that we wanted to generalize things like this. For 'main' events we would still be using hooks, unless sending events becomes integrated into core MediaWiki, much like sending jobs is integrated. In my definition, 'main' events are ones that some other non-analytics functionality depends on. The 'ApiAction' and 'CirrusSearch' events are, in that sense, not 'main': they're basically TRACE-level logging that we are sending to a different processing pipeline (Kafka) because our main pipeline (Logstash) cannot handle the load. In that sense, sending non-main events via Monolog makes sense.

In a perfect world, using Monolog, we'd be able to instrument MW code by just adding a structured log call to, for example, ApiQueryAction, and everything would work out of the box. But in our current approach, we'd need to have a mapping 'logChannel -> (kafkaTopic, eventbusInstance, schema)', and prior to enabling it we'd need to create a schema (for Hadoop refinery). I guess this could be an OK compromise, since unless we integrate event sending into MW core, the only alternative is creating a fairly nonsensical hook.
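Purely as an illustration, that mapping might look like this (none of these config names or values exist yet; the schema URIs are taken from the patches above):

```php
<?php
// Hypothetical 'logChannel -> (kafkaTopic, eventbusInstance, schema)' mapping
// that would have to be defined before a channel could be routed this way.
$wgHypotheticalEventChannelMap = [
	'ApiAction' => [
		'topic' => 'mediawiki.api-request',
		'service' => 'eventgate-analytics',
		'schema' => '/mediawiki/api/request/0.0.1',
	],
	'CirrusSearchRequestSet' => [
		'topic' => 'mediawiki.cirrussearch-request',
		'service' => 'eventgate-analytics',
		'schema' => '/mediawiki/search/requestset/0.0.1',
	],
];
```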

bd808 added a comment. Wed, Feb 6, 9:43 PM

> @bd808, let's discuss your Monolog idea from https://gerrit.wikimedia.org/r/487154 here.

> I just brainstormed with @Pchelolo about this. If we were to use Monolog, we'd likely want to do it with the aim of converting all existing EventBus events to use it as well.

I'm not quite sure where this conclusion comes from. I have not reviewed the entirety of the EventBus integration into MediaWiki core, but to me this would only be an obvious and necessary conclusion if all such structured events could reasonably be used by alternate event sinks. (More on this below.)

> It gets a little tricky though. Monolog is a modular logging pipeline, but events !== log messages.

I would personally argue that structured logging is exactly the same thing as events. I actually call them "log events" rather than "log messages" precisely because using structure is meant to make operational logging something that can be readily processed programmatically, rather than locking up the data and metadata in plain text files where herculean efforts with regular expressions are needed to make the logs usable for more than light afternoon reading.

> One of the purposes of Monolog's abstractions seems to be that you can use any Channel with any Handler via configuration; e.g. ApiAction could be configured to be logged to files, or to Kafka with Avro, or to Event(Bus|Gate) via HTTP. However, this isn't quite true. In order to support Avro, you had to add extra custom configuration to map from channel names to Avro schemas, AND the Avro schemas themselves had to be shipped with the code. This means that you can't use just any Channel with the Kafka+Avro Handler/Formatter. You couldn't just configure e.g. query logs to go to Kafka+Avro; you needed a schema and extra configs. (There's even a comment in mediawiki-config about how the extra Avro schema configs don't belong there.)

> The same applies to Event(Bus|Gate), except that since we are using JSON, we don't need the actual schemas to format the message. We do need the schema URI though (e.g. /mediawiki/api/request/0.0.1) so that EventGate will know which schema the event should be validated against.

My brain is boiling your argument down to a claim that if domain specific knowledge is needed for a storage destination then that precludes the use of generalized systems for delivery. I may very well be missing some subtlety in this distillation, but at the moment I'm going to assume that it is broadly correct.

The Monolog pipeline is designed to detach log event production from log event formatting and log event storage. It uses the pipeline pattern, which is a key component of Unix system design. The event source is the use of the PSR-3 logger interface, which is agnostic of the actual PSR-3 implementation in use. When that implementation is Monolog, processors can be used to augment events, and a combination of formatters and handlers can be used to route each event produced to a logger to 0-N event sinks. Many (most? all?) formatter+handler pairs do something transformative to the event to fit the expectations of the event sink or its consumers. To my mind this is the "right" way to treat log events that have a number of potential consumers. I would agree that it is likely over-designed for a bespoke system that will only ever be put to a narrow task, but as far as general purpose abstractions go it is much more robust, flexible, and detached from the core runtime business logic than (ab)use of MediaWiki's hook mechanism for forking the log event stream.
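As a concrete sketch with stock Monolog (the handler and formatter choices here are arbitrary, just to show the pipeline; assumes Monolog 1.x/2.x style array records):

```php
<?php
use Monolog\Logger;
use Monolog\Handler\StreamHandler;
use Monolog\Formatter\JsonFormatter;

// One channel; the producing code only ever sees the PSR-3 interface.
$logger = new Logger( 'ApiAction' );

// A processor augments every record, independent of where it ends up.
$logger->pushProcessor( function ( array $record ) {
	$record['extra']['host'] = gethostname();
	return $record;
} );

// A formatter + handler pair decides how and where records are written.
// Swapping this for a Kafka or HTTP handler requires no producer changes.
$handler = new StreamHandler( 'php://stdout', Logger::INFO );
$handler->setFormatter( new JsonFormatter() );
$logger->pushHandler( $handler );

$logger->info( 'API request', [ 'action' => 'query', 'format' => 'json' ] );
```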

Tgr added a comment. Wed, Feb 6, 11:12 PM

IMO Monolog, and event logging in general, is meant for collecting data. Much of EventBus is basically a webhook mechanism: it allows external services to react to MediaWiki state changes (e.g. purge data when a page is edited or deleted).

If data collection fails, that's bad but not critical. If (say) the page summary endpoint stops updating summaries, that's critical. So it's a good idea to treat logging and webhooks separately; it should be clear to programmers whether a call is an external integration point to MediaWiki or just data collection. Also, log schemas are fluid, while webhook schemas need to be handled as stable interfaces.

ApiAction (and from the sound of it CirrusSearchRequestSet, although I'm not familiar with that) is for data collection so using the generic data collection framework for it (and allowing third parties to collect the data in some other way) makes perfect sense IMO.

Nuria added a comment. Thu, Feb 7, 12:20 AM

+1 to @Tgr, especially to "log schemas are fluid while webhook schemas need to be handled as stable interfaces." This is a key difference between plain old "logging" and events that carry state.

Logging, as @bd808 mentioned earlier, benefits from 'structure' so that you can do more than free-text parsing. That structure, however, does not need to be versioned, fixed, and backwards compatible; the structure of a logging message can be much more fluid. There are fundamental differences in what the data in each system is used for: event consumers expect reliability of data when it comes to a schema, while logging consumers might not need to be so picky.

In this case the waters are a bit muddled, because whether we use Monolog as a transport or just a hook and EventBus, the data for API logging is going to abide by a schema. That has more to do with the fact that we are not going to build a custom pipeline for collecting this data; given its consumers, it really does not require a schema that is kept backwards compatible at all times.

> If we were to use Monolog, we'd likely want to do it with the aim of converting all existing EventBus events to use it as well.

Ok, I back away from this one! I think what I meant was that I'd like something better and more generic than EventBus in general, and I was hoping Monolog could be it.

> I would personally argue that structured logging is exactly the same thing as events.

The differentiation I make is that events are strongly typed (AKA schema-ed), whereas log messages are not. These log messages are structured, but there is nothing guaranteeing that a log message of a specific kind will always look the same to downstream consumers.

> My brain is boiling your argument down to a claim that if domain specific knowledge is needed for a storage destination then that precludes the use of generalized systems for delivery.

Hm, yes, that was my argument, but maybe I'm thinking about it wrong. I thought that formatters usually don't do much besides serializing the data in some specific way. If they are supposed to modify the $record['context'] given to them for the handler, then I think we could use formatters to format each event. Would we have to create a Formatter for each channel? If so, I guess this would be fine, as we basically have these 'formatters' already, except they live in the hook function code, e.g. https://github.com/wikimedia/mediawiki-extensions-EventBus/blob/master/includes/EventBusHooks.php#L102-L113. We could make the EventBus Monolog handler use the proper formatter and schema based on the Channel name, and keep the 'mapping' there instead of in mw-config.
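Something along these lines is what I'm imagining (completely hypothetical; this class doesn't exist, and it assumes Monolog 2's array records):

```php
<?php
use Monolog\Handler\AbstractProcessingHandler;

// Hypothetical handler that keeps the channel -> schema/topic mapping in the
// EventBus extension rather than in mw-config, and turns each log record's
// context into a schema-conformant event before POSTing it to EventGate.
class HypotheticalEventGateHandler extends AbstractProcessingHandler {
	private const CHANNEL_MAP = [
		'ApiAction' => [
			'schema' => '/mediawiki/api/request/0.0.1',
			'topic' => 'mediawiki.api-request',
		],
	];

	protected function write( array $record ): void {
		$map = self::CHANNEL_MAP[ $record['channel'] ] ?? null;
		if ( !$map ) {
			return; // unmapped channels are ignored (or could throw)
		}
		$event = $record['context'];
		$event['$schema'] = $map['schema'];
		// ... POST $event to the EventGate endpoint for $map['topic'] ...
	}
}
```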

I'm still not sure if Monolog is the right fit, but adapting it and continuing to use it might be better than creating two new hooks. @Tgr, do you have a preference?

Change 489949 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[mediawiki/event-schemas@master] Set maxLength on patternlike fields in test/event/0.0.2

https://gerrit.wikimedia.org/r/489949

Change 489949 merged by Ottomata:
[mediawiki/event-schemas@master] Set maxLength on patternlike fields in test/event/0.0.2

https://gerrit.wikimedia.org/r/489949