
Rewrite Avro schemas (ApiAction, CirrusSearchRequestSet) as JSONSchema and produce to EventGate
Open, High, Public, 8 Story Points

Description

A Modern Event Platform goal this quarter is to deploy an EventGate service and produce the Monolog+Avro based events to it. We'll design new JSONSchema-ed events that represent the same data as the existing Avro ones. We'll then modify the EventBus extension to produce these events.

https://github.com/wikimedia/mediawiki-event-schemas/tree/master/avro/mediawiki

This ticket does not encompass replacing the Hadoop-based processing of this data, just producing the new events. We'll have another ticket for using the new data and for decommissioning the Avro pipeline.

Event Timeline

Ottomata created this task. Jan 17 2019, 8:20 PM
Ottomata triaged this task as Normal priority.
Restricted Application added a subscriber: Aklapper. Jan 17 2019, 8:20 PM
Ottomata moved this task from Backlog to Next Up on the EventBus board. Jan 17 2019, 9:28 PM
Pchelolo moved this task from Backlog to watching on the Services board. Jan 17 2019, 9:41 PM
Pchelolo edited projects, added Services (watching); removed Services.

I had in mind that one of the reasons for using Avro originally was data size. Currently mediawiki_ApiAction takes ~1T per month and mediawiki_CirrusSearchRequestSet ~2T per month. A small test with Hive showed me that the growth factor from Avro to JSON would be ~4.5 for mediawiki_CirrusSearchRequestSet and ~3 for mediawiki_ApiAction (if the events don't change). This would lead to ~12T monthly, which is largely acceptable for HDFS.
Last but not least: what about Kafka?

Ottomata added a comment. Edited Jan 22 2019, 3:29 PM

These events (for now) will go to the jumbo-eqiad Kafka cluster. 12T is not nothing! This cluster is currently handling around 64T / month total. There's about 78T free on that cluster now, so we aren't going to have a problem storing the data (for a week at a time). We'll have to see about throughput. 12T / month will add about 5M / second in (+ out). I'd imagine this will be fine, but we'll have to keep our eyes on it.
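As a rough sanity check of that throughput figure, here is a back-of-the-envelope sketch using the approximate numbers above (estimates, not measurements):

<?php
// Back-of-the-envelope: ~12 TB of JSON events per month, spread evenly
// over a 30-day month (all figures are the rough estimates quoted above).
$bytesPerMonth   = 12e12;            // ~12 TB
$secondsPerMonth = 30 * 24 * 3600;   // 2,592,000 s
printf( "~%.1f MB/s inbound (and roughly the same again outbound)\n",
	$bytesPerMonth / $secondsPerMonth / 1e6 );
// => ~4.6 MB/s, i.e. the "about 5M / second" mentioned above.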

Change 485885 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[mediawiki/event-schemas@master] [WIP] Add mediawiki/search/requestset/0.0.1 schema

https://gerrit.wikimedia.org/r/485885

Change 485893 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[mediawiki/event-schemas@master] [WIP] Add mediawiki/api/request/0.0.1 schema

https://gerrit.wikimedia.org/r/485893

Change 487154 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[mediawiki/event-schemas@master] Add test/event/0.0.3 with test_map example

https://gerrit.wikimedia.org/r/487154

@bd808, let's discuss your Monolog idea from https://gerrit.wikimedia.org/r/487154 here.

I just brainstormed with @Pchelolo about this. If we were to use Monolog, we'd likely want to do it with the aim of converting all existing EventBus events to use it as well.

It gets a little tricky though. Monolog is a modular logging pipeline, but events !== log messages. It seems that one of the purposes of Monolog's abstractions is to be able to use any Channel with any Handler via configuration, e.g. ApiAction could be configured to be logged to files, or to Kafka with Avro, or to Event(Bus|Gate) via HTTP. However, this isn't quite true. In order to support Avro, you had to add extra custom configuration to map from channel names to Avro schemas, AND the Avro schemas themselves had to be shipped with the code. This means that you can't use just any Channel with the Kafka+Avro Handler/Formatter. You couldn't just configure e.g. query logs to go to Kafka+Avro; you needed a schema and extra configs. (There's even a comment in mediawiki-config about how the extra Avro schema configs don't belong there.)

The same applies to Event(Bus|Gate), except since we are using JSON, we don't need the actual schemas to format the message. We do need the schema URI though (e.g. /mediawiki/api/request/0.0.1) so that EventGate will know which schema the event should be validated against.

So, I'm not sure it is such a great fit. I'd love it if there was a more generic way than our one-off EventBus extension to send events like this, but I'm not sure what it is. @Pchelolo might have more to say too. :)

If we were to use Monolog, we'd likely want to do it with the aim of converting all existing EventBus events to use it as well.

It was not my impression that we wanted to generalize things like this. For 'main' events we would still be using hooks, unless sending events gets integrated into core MediaWiki, much like sending jobs is integrated. In my definition, 'main' events are ones that other, non-analytics functionality depends on. The 'ApiAction' and 'CirrusSearch' events are, in that sense, not 'main' - that's basically TRACE-level logging that we are sending to a different processing pipeline (Kafka) since our main pipeline (Logstash) cannot handle the load. In that sense, sending non-main events via Monolog makes sense.

In a perfect world, using Monolog, we'd be able to instrument MW code by just adding a structured log call to, for example, ApiQueryAction, and everything would work out of the box. But in our current approach we'd need a mapping 'logChannel -> (kafkaTopic, eventbusInstance, schema)', and prior to enabling it we'd need to create a schema (for the Hadoop refinery). I guess this could be an OK compromise, since unless we integrate event sending into MW core, the only alternative is creating a fairly nonsensical hook.
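For illustration only, such a mapping might look roughly like this in configuration. The variable name, channel names, and schema URIs below are hypothetical (partly borrowed from the patches on this task), not an actual mediawiki-config setting:

<?php
// Hypothetical sketch of a logChannel -> (stream, service, schema) mapping.
// It only illustrates the extra knowledge each channel would need before it
// could be routed to Event(Bus|Gate) via Monolog.
$eventChannelConfig = [
	'api-request' => [
		'stream'     => 'mediawiki.api-request',
		'service'    => 'eventgate-analytics',
		'schema_uri' => '/mediawiki/api/request/0.0.1',
	],
	'cirrussearch-request' => [
		'stream'     => 'mediawiki.cirrussearch-request',
		'service'    => 'eventgate-analytics',
		'schema_uri' => '/mediawiki/cirrussearch/request/0.0.1',
	],
];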

bd808 added a comment. Feb 6 2019, 9:43 PM

@bd808, let's discuss your Monolog idea from https://gerrit.wikimedia.org/r/487154 here.

I just brainstormed with @Pchelolo about this. If we were to use Monolog, we'd likely want to do it with the aim of converting all existing EventBus events to use it as well.

I'm not quite sure where this conclusion comes from. I have not reviewed the entirety of the EventBus integration into MediaWiki core, but to me this would only be an obvious and necessary conclusion if all such structured events could reasonably be used by alternate event sinks. (More on this below.)

It gets a little tricky though. Monolog is a modular logging pipeline, but events !== log messages.

I would personally argue that structured logging is exactly the same thing as events. I actually call them "log events" rather than "log messages" exactly because using structure is meant to make operational logging something that can be readily processed programmatically rather than locking up the data and metadata in plain text files where herculean efforts with regular expressions need to be made to make the logs usable for more than light afternoon reading.

It seems that one of the purposes of Monolog's abstractions is to be able to use any Channel with any Handler via configuration, e.g. ApiAction could be configured to be logged to files, or to Kafka with Avro, or to Event(Bus|Gate) via HTTP. However, this isn't quite true. In order to support Avro, you had to add extra custom configuration to map from channel names to Avro schemas, AND the Avro schemas themselves had to be shipped with the code. This means that you can't use just any Channel with the Kafka+Avro Handler/Formatter. You couldn't just configure e.g. query logs to go to Kafka+Avro; you needed a schema and extra configs. (There's even a comment in mediawiki-config about how the extra Avro schema configs don't belong there.)

The same applies to Event(Bus|Gate), except since we are using JSON, we don't need the actual schemas to format the message. We do need the schema URI though (e.g. /mediawiki/api/request/0.0.1) so that EventGate will know which schema the event should be validated against.

My brain is boiling your argument down to a claim that if domain specific knowledge is needed for a storage destination then that precludes the use of generalized systems for delivery. I may very well be missing some subtlety in this distillation, but at the moment I'm going to assume that it is broadly correct.

The Monolog pipeline is designed to detach log event production from log event formatting and log event storage. It uses the pipeline pattern which is a key component of Unix system design. The event source is the use of the PSR-3 logger interface which is agnostic of the actual PSR-3 implementation in use. When that implementation is Monolog, processors can be used to augment events, and a combination of formatters and handlers can be used to route each event produced to a logger to 0-N event sinks. Many (most? all?) formatter+handler pairs do something transformative to the event to fit the expectations of the event sink or its consumers. To my mind this is the "right" way to treat log events that have a number of potential consumers. I would agree that it is likely over-designed for a bespoke system that will only ever be put to a narrow task, but as far as general purpose abstractions go it is much more robust, flexible, and detached from the core runtime business logic than (ab)use of MediaWiki's hook mechanism for forking the log event stream.
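To make that pipeline shape concrete, here is a minimal Monolog wiring sketch. A file sink with Monolog's JsonFormatter stands in for a hypothetical Kafka/EventGate sink, and the channel and context fields are only illustrative:

<?php
use Monolog\Logger;
use Monolog\Processor\WebProcessor;
use Monolog\Handler\StreamHandler;
use Monolog\Formatter\JsonFormatter;

// The producing code only ever sees the PSR-3 interface of this channel.
$logger = new Logger( 'api-request' );

// Processors augment every record with extra context (here: request metadata).
$logger->pushProcessor( new WebProcessor() );

// A handler + formatter pair decides where records go and in what shape.
// A plain file sink stands in for a Kafka/EventGate sink in this sketch.
$handler = new StreamHandler( '/tmp/api-request.log' );
$handler->setFormatter( new JsonFormatter() );
$logger->pushHandler( $handler );

// The caller just emits a structured log event; routing is all configuration.
$logger->info( 'ApiAction', [ 'database' => 'enwiki', 'backend_time_ms' => 123 ] );

Swapping the file sink for an HTTP or Kafka handler would not change the producing code at all, which is the detachment being described above.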

Tgr added a comment. Feb 6 2019, 11:12 PM

IMO Monolog, and event logging in general, is meant for collecting data. Much of EventBus is basically a webhook mechanism: it allows external services to react to MediaWiki state changes (e.g. purge data when a page was edited or deleted).

If data collection fails, that's bad but not critical. If (say) the page summary endpoint stops updating the summary, that's critical. So it's a good idea to treat logging and webhooks separately; it should be clear to programmers whether a call is an external integration point to MediaWiki or just data collection. Also log schemas are fluid while webhook schemas need to be handled as stable interfaces.

ApiAction (and from the sound of it CirrusSearchRequestSet, although I'm not familiar with that) is for data collection so using the generic data collection framework for it (and allowing third parties to collect the data in some other way) makes perfect sense IMO.

Nuria added a comment. Feb 7 2019, 12:20 AM

+1 to @Tgr, especially to "log schemas are fluid while webhook schemas need to be handled as stable interfaces." This is a key difference between plain 'logging' and events that carry state.

Logging - as @bd808 mentioned earlier - benefits from 'structure' so you can do more than "free text parsing". The structure, however, does not need to be "versioned, fixed and backwards compatible"; the structure of a logging message can be much more fluid. There are fundamental differences in what the data from each system is used for: event consumers expect reliability of data when it comes to a schema, while logging consumers might not need to be so picky.

In this case the waters are a bit muddled because, no matter whether we use Monolog as a transport or just a hook and EventBus, the data for API logging is going to abide by a schema. That has more to do with the fact that we are not going to build a custom pipeline for collecting this data; given its consumers, it really does not require a schema that is kept backwards compatible at all times.

If we were to use Monolog, we'd likely want to do it with the aim of converting all existing EventBus events to use it as well.

Ok, I back away from this one! I think what I meant was that I'd like something better and more generic than EventBus in general, and I was hoping Monolog could be it.

I would personally argue that structured logging is exactly the same thing as events.

The differentiation I make is that events are strongly typed (AKA schema-ed), whereas log messages are not. These log messages are structured, but there is nothing guaranteeing that a log message of a specific kind will always look the same to downstream consumers.

My brain is boiling your argument down to a claim that if domain specific knowledge is needed for a storage destination then that precludes the use of generalized systems for delivery.

Hm, yes that was my argument, but maybe I'm thinking about it wrong. I thought that usually 'formatters' don't do much besides serializing the data in some specific way. If they are supposed to modify the $record['context'] given to them for the handler, then I think we could use formatters to format each event. Would we have to create a Formatter for each channel? If so, I guess this would be fine, as we basically have these 'formatters' already, except they are in the hook function code, e.g. https://github.com/wikimedia/mediawiki-extensions-EventBus/blob/master/includes/EventBusHooks.php#L102-L113. We could make the EventBus Monolog handler use the proper formatter and schema based on the channel name, and keep the 'mapping' there instead of in mw-config.
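For the sake of discussion, a channel-aware formatter along those lines might look something like the sketch below. The class name, the channel-to-schema mapping, and the envelope fields are assumptions for illustration; this is not the actual EventBus code:

<?php
use Monolog\Formatter\FormatterInterface;

/**
 * Hypothetical formatter: builds an EventGate-ready event from a Monolog
 * record by taking only $record['context'] and stamping on the $schema and
 * meta.stream fields configured for the record's channel.
 */
class EventGateFormatter implements FormatterInterface {
	/** @var array channel name => [ schema URI, stream name ] (illustrative) */
	private $channels = [
		'api-request' => [ '/mediawiki/api/request/0.0.1', 'mediawiki.api-request' ],
	];

	public function format( array $record ) {
		list( $schemaUri, $stream ) = $this->channels[ $record['channel'] ];
		$event = $record['context'];
		$event['$schema'] = $schemaUri;
		$event['meta']['stream'] = $stream;
		$event['meta']['dt'] = $record['datetime']->format( 'c' );
		return json_encode( $event );
	}

	public function formatBatch( array $records ) {
		return array_map( [ $this, 'format' ], $records );
	}
}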

I'm still not sure if Monolog is the right fit, but adapting it and continuing to use it might be better than creating two new hooks. @Tgr do you have a preference?

Change 489949 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[mediawiki/event-schemas@master] Set maxLength on patternlike fields in test/event/0.0.2

https://gerrit.wikimedia.org/r/489949

Change 489949 merged by Ottomata:
[mediawiki/event-schemas@master] Set maxLength on patternlike fields in test/event/0.0.2

https://gerrit.wikimedia.org/r/489949

Ottomata raised the priority of this task from Normal to High. Feb 21 2019, 5:53 PM

Change 485893 merged by jenkins-bot:
[mediawiki/event-schemas@master] Add mediawiki/api/request/0.0.1 schema

https://gerrit.wikimedia.org/r/485893

Change 492923 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[mediawiki/event-schemas@master] Add maxLength for meta.uri in api/request schema

https://gerrit.wikimedia.org/r/492923

Change 492923 merged by Ottomata:
[mediawiki/event-schemas@master] Add maxLength for meta.uri in api/request schema

https://gerrit.wikimedia.org/r/492923

Change 492925 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/deployment-charts@master] Use schemas from docker image, configure api-request stream

https://gerrit.wikimedia.org/r/492925

Change 492925 merged by Alexandros Kosiaris:
[operations/deployment-charts@master] Use schemas from docker image, configure api-request stream

https://gerrit.wikimedia.org/r/492925

Change 491887 merged by Ottomata:
[mediawiki/core@master] Emit new style API action logs into Monolog.

https://gerrit.wikimedia.org/r/491887

Change 494305 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[mediawiki/extensions/EventBus@master] Send monolog $record['context'] as event, not entire $record

https://gerrit.wikimedia.org/r/494305

Change 494308 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[mediawiki/core@master] Set api-request log http.request_headers properly

https://gerrit.wikimedia.org/r/494308

Change 494308 merged by Ottomata:
[mediawiki/core@master] Set api-request log http.request_headers properly

https://gerrit.wikimedia.org/r/494308

Change 494305 merged by Ottomata:
[mediawiki/extensions/EventBus@master] Send monolog $record['context'] as event, not entire $record

https://gerrit.wikimedia.org/r/494305

Change 494368 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[mediawiki/extensions/EventBus@wmf/1.33.0-wmf.19] Send monolog $record['context'] as event, not entire $record

https://gerrit.wikimedia.org/r/494368

Ottomata moved this task from Next Up to In Progress on the EventBus board. Mar 4 2019, 10:29 PM
Ottomata added a project: Analytics-Kanban.
Ottomata moved this task from Next Up to In Code Review on the Analytics-Kanban board.

@akosiaris, I deployed the new 0.0.7 eventgate-analytics chart to prod k8s and did an ab test there. I can currently push around 800 events per second on our single instance[1]. Let's be conservative and assume 500. By the end of the quarter we'll move both of the topics still used in the Kafka analytics cluster to EventGate + kafka jumbo.

https://grafana.wikimedia.org/d/000000524/kafka-by-topic-graphite?refresh=5m&orgId=1

So, ~10K / second. Is 20 eventgate-analytics pods too much to ask?

[1] From deploy1001.eqiad.wmnet:

ab -n4000 -c400 -T 'application/json' -p /home/otto/test_event_0.0.2.json http://eventgate-analytics.discovery.wmnet:31192/v1/events
...
Requests per second:    808.79 [#/sec] (mean)

So, ~10K / second. Is 20 eventgate-analytics pods too much to ask?

That's totally fine.

Awesome - with 20 replicas I can make a single ab run with -c1000 push ~6400 msgs/second, but at that point the single-node ab client was the limit to pushing further. :)

Change 494368 abandoned by Ottomata:
Send monolog $record['context'] as event, not entire $record

Reason:
This will go out with wmf.20 train.

https://gerrit.wikimedia.org/r/494368

Change 495144 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[mediawiki/extensions/CirrusSearch@master] [WIP] Log search/requestset event

https://gerrit.wikimedia.org/r/495144

Change 495399 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[mediawiki/core@master] ApiMain.php api/request logging event changes

https://gerrit.wikimedia.org/r/495399

Change 495399 merged by Ottomata:
[mediawiki/core@master] ApiMain.php api/request logging event changes

https://gerrit.wikimedia.org/r/495399

Change 496197 had a related patch set uploaded (by Addshore; owner: Ottomata):
[mediawiki/core@wmf/1.33.0-wmf.21] ApiMain.php api/request logging event changes

https://gerrit.wikimedia.org/r/496197

Change 496197 merged by Addshore:
[mediawiki/core@wmf/1.33.0-wmf.21] ApiMain.php api/request logging event changes

https://gerrit.wikimedia.org/r/496197

Mentioned in SAL (#wikimedia-operations) [2019-03-13T16:36:32Z] <addshore@deploy1001> Synchronized php-1.33.0-wmf.21/includes/api/ApiMain.php: SWAT: T214080 T212529 ApiMain.php api/request logging event changes [[gerrit:496197]] (duration: 00m 57s)

Change 496235 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/mediawiki-config@master] Enable api-request logging into kafka for group1

https://gerrit.wikimedia.org/r/496235

Change 496235 merged by Ottomata:
[operations/mediawiki-config@master] Enable api-request logging into kafka for group1

https://gerrit.wikimedia.org/r/496235

Change 498873 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/mediawiki-config@master] Re-enable eventgate-analytics api-request logging for group0 wikis

https://gerrit.wikimedia.org/r/498873

Change 498873 merged by jenkins-bot:
[operations/mediawiki-config@master] Re-enable eventgate-analytics api-request logging for group0 wikis

https://gerrit.wikimedia.org/r/498873

Change 502292 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/mediawiki-config@master] Enable eventgate-analytics api-request logging for group0 wikis

https://gerrit.wikimedia.org/r/502292

Change 502292 merged by Ottomata:
[operations/mediawiki-config@master] Enable eventgate-analytics api-request logging for group0 wikis

https://gerrit.wikimedia.org/r/502292

Mentioned in SAL (#wikimedia-operations) [2019-04-08T18:27:07Z] <otto@deploy1001> Synchronized wmf-config/InitialiseSettings.php: Enable eventgate-analytics api-request logging for group0 wikis - T214080 (duration: 00m 56s)

Change 485885 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[mediawiki/event-schemas@master] Add mediawiki/cirrussearch/request/0.0.1 schema

https://gerrit.wikimedia.org/r/485885

Ok, I think https://gerrit.wikimedia.org/r/#/c/mediawiki/event-schemas/+/485885/ and https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/CirrusSearch/+/495144/ are good to go.

We renamed search/requestset to cirrussearch/request, as that is more appropriate. I also renamed the elasticsearch sub-request fields appropriately.

Example mediawiki/cirrussearch/request event:

{
  "$schema": "/mediawiki/cirrussearch/request/0.0.1",
  "meta": {
    "request_id": "1e12a388360c781ac70aa631",
    "id": "bedfc021-51ae-11e9-98f7-0800278dc04d",
    "dt": "2019-03-28T23:10:58Z",
    "domain": "dev.wiki.local.wmftest.net",
    "stream": "mediawiki.cirrussearch-request"
  },
  "http": {
    "method": "GET",
    "client_ip": "10.0.2.2",
    "has_cookies": true,
    "request_headers": {
      "accept-language": "en-US,en;q=0.5",
      "referer": "http://dev.wiki.local.wmftest.net:8080/w/index.php?search=sdf&title=Special%3ASearch&go=Go",
      "user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:67.0) Gecko/20100101 Firefox/67.0"
    }
  },
  "database": "wiki",
  "mediawiki_host": "mws",
  "search_token": "80ewxp4pl6t7vev6m8x99j21c",
  "source": "web",
  "identity": "fe15eff974aa3371f119f00167c5cd09",
  "request_time_ms": 1114,
  "all_elasticsearch_requests_cached": false,
  "hits": [
    {
      "page_title": "Sandbox",
      "page_id": 6,
      "index": "wiki_content_first",
      "score": 0.036319096
    }
  ],
  "elasticsearch_requests": [
    {
      "query": "sdf",
      "query_type": "near_match",
      "indices": [
        "wiki_content"
      ],
      "request_time_ms": 39,
      "search_time_ms": 2,
      "limit": 50,
      "hits_total": 0,
      "hits_returned": 0,
      "hits_offset": 0,
      "suggestion_requested": false,
      "syntax": [
        "near_match"
      ],
      "cached": false
    },
    {
      "query": "sdf",
      "query_type": "full_text",
      "indices": [
        "wiki_content"
      ],
      "request_time_ms": 35,
      "search_time_ms": 16,
      "limit": 21,
      "hits_total": 1,
      "hits_returned": 1,
      "hits_offset": 0,
      "suggestion_requested": true,
      "max_score": 0.036319096,
      "syntax": [
        "full_text",
        "full_text_simple_match"
      ],
      "cached": false,
      "hits": [
        {
          "page_title": "Sandbox",
          "page_id": 6,
          "index": "wiki_content_first",
          "score": 0.036319096
        }
      ]
    }
  ]
}

@EBernhardson @Pchelolo lookin good ya? Any obvious changes needed?

Change 502838 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/mediawiki-config@master] Enable api-request EventGate logging for group1 wikis

https://gerrit.wikimedia.org/r/502838

Change 502838 merged by Ottomata:
[operations/mediawiki-config@master] Enable api-request EventGate logging for group1 wikis

https://gerrit.wikimedia.org/r/502838

Mentioned in SAL (#wikimedia-operations) [2019-04-10T19:48:49Z] <otto@deploy1001> Synchronized wmf-config/InitialiseSettings.php: Enabling api-request logging via eventgate-analytics for group1 wikis - T214080 (duration: 00m 59s)

Mentioned in SAL (#wikimedia-operations) [2019-04-10T20:27:43Z] <otto@deploy1001> Synchronized wmf-config/InitialiseSettings.php: Revert - Enabling api-request logging via eventgate-analytics for group1 wikis - T214080 (duration: 01m 00s)

We are good to go on the CirrusSearch php extension change. I'd like to merge both patches, test in beta, and then let this go out with the train next week. Once the code is out we can start slowly enabling it via config.

@Pchelolo ok to merge https://gerrit.wikimedia.org/r/#/c/mediawiki/event-schemas/+/485885/ and https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/CirrusSearch/+/495144/ ?