
Retire udp2log: onboard its producers and consumers to the logging pipeline
Open, MediumPublic

Description

We are deprecating udp2log in production, so its current users should be migrated to the logging pipeline instead.

List of candidates for migration:

  • mediawiki
  • scap

Scap is easy (low volume). For MediaWiki we'll have to do some thinking: the volume is significant (~18-20k msg/s) and the udp2log output is consumed by multiple people on mwlog hosts, so the move should be transparent to them (i.e. change the transport to Kafka but still write to files).

Event Timeline

herron triaged this task as Medium priority.Oct 2 2018, 5:24 PM
fgiunchedi renamed this task from Deprecate >= 50% of udp2log producers to Retire udp2log: onboard its producers and consumers to the logging pipeline.Jan 16 2019, 11:22 AM
fgiunchedi updated the task description.

This is the outline of the plan to move mediawiki logging off udp2log and onto the logging pipeline's Kafka (cc @bd808 @aaron @Ottomata)

Transport

A new rsyslog localhost udp endpoint is introduced on mediawiki hosts (e.g. named mwlog) that takes udp syslog and forwards it onto a set of kafka topics (separate from topics consumed by logstash). Messages are then consumed by mwlog hosts, via kafkatee or rsyslog, and written to per-channel files to keep compatibility with what we have now.
Using a separate endpoint and set of topics allows for some tuning/flexibility, since the udp2log stream is currently quite a bit bigger than the logstash stream: Kafka topic retention, rate limits, etc. would need to differ from the existing logstash topics.
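
A rough sketch of what the rsyslog side of this transport might look like; the port, broker host, topic name, and file path are assumptions for illustration, not the actual configuration:

```
# Hypothetical /etc/rsyslog.d/30-mwlog.conf -- all names and ports illustrative
module(load="imudp")
module(load="omkafka")

# Local-only UDP endpoint that MediaWiki emits to, bound to its own ruleset
input(type="imudp" address="127.0.0.1" port="10514" ruleset="mwlog")

ruleset(name="mwlog") {
    # Forward onto a dedicated Kafka topic, separate from the logstash topics
    action(type="omkafka"
           broker=["localhost:9092"]
           topic="mwlog"
           template="RSYSLOG_FileFormat")
}
```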

Formatting

Log entry formatting is changed from "line" to syslog + JSON by MediaWiki before emitting to localhost; JSON is kept as the Kafka message format as well. Upon consumption on mwlog hosts the JSON is then formatted back into "lines" using the existing formatting found there.
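
On the producer side, the change described above could look roughly like this; the port, PRI facility, and JSON field names are assumptions about the eventual schema, not the actual MediaWiki implementation:

```python
import json
import socket

def mwlog_datagram(channel, severity, message, extra=None):
    """Build a syslog-framed JSON payload (RFC 3164-style PRI header).

    Facility 16 (local0) and the field names are illustrative only.
    """
    facility = 16  # local0, assumed
    pri = facility * 8 + severity
    payload = {"channel": channel, "message": message}
    if extra:
        payload.update(extra)
    return "<%d>%s: %s" % (pri, channel, json.dumps(payload))

def send_mwlog(datagram, host="127.0.0.1", port=10514):
    """Fire-and-forget UDP send, mirroring how udp2log producers behave."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(datagram.encode("utf-8"), (host, port))
```

The key point is that framing stays syslog-compatible (so a stock rsyslog input can receive it) while the body carries structured JSON all the way to Kafka.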

Open questions

  • The plan has syslog + JSON as the formatting when transporting on Kafka, since that's what we use for logstash already and it preserves more information. Alternatively, we could write syslog + the current formatting to Kafka?
  • The idea would be to name this "mwlog" (e.g. configuration, topic names, etc.) to steer away from "udp2log"; does that seem sensible?

Implementation

The following patches are meant to implement the plan above, specifically:

> The plan has syslog + json as formatting, since that's what we use for logstash already and preserves more information. Although we could have syslog + current formatting?

People (including @tstarling?) have advocated in the past for keeping the logs on mwlog1001 in the more human readable format. It might be a reasonable compromise to build a shell-pipeline-compatible utility that can be used to reformat JSON log event records, for doing things like `grep foo some.log | humanlog` or `tail -f some.log | humanlog`. A utility like that could also pretty easily do simple extraction of particular elements of the JSON structure, which might be easier to use than some of the typical awk magic that folks use when data mining from the files. I can dig up some python scripts I have written in the past to kickstart something like this.
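
A humanlog-style filter like the one described above could start as small as this; the field names (`timestamp`, `host`, `channel`, `message`) are assumptions about the eventual JSON schema:

```python
import json
import sys

def humanize(line):
    """Render one JSON log record as a udp2log-style line.

    Falls back to the raw input when the line is not valid JSON,
    so the filter stays safe in mixed pipelines.
    """
    try:
        rec = json.loads(line)
    except ValueError:
        return line.rstrip("\n")
    # Field names below are assumed; adjust to the real schema.
    return "%s %s %s: %s" % (
        rec.get("timestamp", "-"),
        rec.get("host", "-"),
        rec.get("channel", "-"),
        rec.get("message", ""),
    )

if __name__ == "__main__":
    # Usage: grep foo some.log | humanlog, or tail -f some.log | humanlog
    for raw in sys.stdin:
        print(humanize(raw))
```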

> It might be a reasonable compromise to build a shell pipeline compatible utility that can be used to reformat JSON log event records

kafkatee can do this, and was in fact built for it:

https://github.com/wikimedia/analytics-kafkatee/blob/master/kafkatee.conf.example#L129-L207

Qs:

Are the logs sent using Monolog?

Is there just one topic 'mwlog', or multiple, one per channel? I'm asking just in case we should consider using this rsyslog feature to log to Kafka via Monolog, rather than our effort in T216163: Add monolog adapters for Eventbus.

>> The plan has syslog + json as formatting, since that's what we use for logstash already and preserves more information. Although we could have syslog + current formatting?
>
> People (including @tstarling?) have advocated in the past for keeping the logs on mwlog1001 in the more human readable format. It might be a reasonable compromise to build a shell-pipeline-compatible utility that can be used to reformat JSON log event records, for doing things like `grep foo some.log | humanlog` or `tail -f some.log | humanlog`. A utility like that could also pretty easily do simple extraction of particular elements of the JSON structure, which might be easier to use than some of the typical awk magic that folks use when data mining from the files. I can dig up some python scripts I have written in the past to kickstart something like this.

Thanks for the feedback!

I have edited my comment to clarify that the scope of JSON vs current formatting was limited to writing to Kafka; at least in the first phase we're focusing on making the transport more reliable and secure while keeping the format as-is on mwlog.

We would indeed need something to extract the right fields when reading from Kafka and writing to files, to keep the same human formatting as now, so we'd definitely appreciate any help/tools towards that!

re: the open question itself, I'm leaning towards having JSON on Kafka for multiple reasons: it makes Kafka messages uniform (mw logstash logging is already JSON), so consumers only have to handle one format, and we won't lose information/context compared to what mw knows about the log message.

> Qs:
>
> Are the logs sent using Monolog?
>
> Is there just one topic 'mwlog', or multiple, one per channel? I'm asking just in case we should consider using this rsyslog feature to log to Kafka via Monolog, rather than our effort in T216163: Add monolog adapters for Eventbus.

There will be one topic per syslog severity, similar to what's happening now for mediawiki logstash logging. We considered one topic per channel but ultimately decided against it due to potential topic flooding, and because it'd be a little fragile to reconstruct the channel name from the topic name (e.g. choosing a suitable separator/prefix that can't appear in a channel name).
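
The severity-to-topic scheme described above can be sketched as follows; the `mwlog-` prefix is an assumption, the point being that the channel name travels inside the message rather than in the topic name:

```python
# Sketch of the "one topic per syslog severity" routing described above.
# The "mwlog-" prefix is an assumption; what matters is that the channel
# rides along inside the payload, not in the topic name.
SEVERITIES = [
    "emerg", "alert", "crit", "err", "warning", "notice", "info", "debug",
]

def topic_for(severity_code):
    """Map a numeric syslog severity (0-7) to a Kafka topic name."""
    return "mwlog-" + SEVERITIES[severity_code]

def route(record):
    """Pick the topic for a log record; the channel stays in the payload."""
    return topic_for(record["severity"]), record
```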

> re: the open question itself I'm leaning towards having json on kafka

Yes please!

> There will be one topic per syslog severity [...]

Ok great. We are working on making Monolog able to send events via the EventBus extension, which will ultimately log to Kafka. The Monolog channels we are sending are really 'request logs' (which is a kind of event), but since it was logging I was wondering if we should consider using your rsyslog stuff to get this data to Kafka instead. If we can't easily control the topics, then we can rule out this option. Thanks!

>> re: the open question itself I'm leaning towards having json on kafka
>
> Yes please!
>
>> There will be one topic per syslog severity [...]
>
> Ok great. We are working on making Monolog able to send events via the EventBus extension, which will ultimately log to Kafka. The Monolog channels we are sending are really 'request logs' (which is a kind of event), but since it was logging I was wondering if we should consider using your rsyslog stuff to get this data to Kafka instead. If we can't easily control the topics, then we can rule out this option. Thanks!

ack, thanks for additional context!

Change 498106 merged by Filippo Giunchedi:
[mediawiki/core@master] monolog: add MwlogHandler

https://gerrit.wikimedia.org/r/498106

One year later, this class appears not to be used anywhere. Is it expected to become used or have plans changed?

> Change 498106 merged by Filippo Giunchedi:
> [mediawiki/core@master] monolog: add MwlogHandler
>
> https://gerrit.wikimedia.org/r/498106
>
> One year later, this class appears not to be used anywhere. Is it expected to become used or have plans changed?

The former; the plan is still to move away from udp2log and onto Kafka / the logging pipeline!

I'm confused. I thought we were already on the Kafka pipeline, with udp2log being legacy to phase out by building atop the same pipeline?

Is the intention to have MediaWiki format and dispatch every message twice, with two different rsyslog/kafka handlers?

Perhaps it would make sense to use only one. Things we don't want to ingest in Logstash can be dropped at intake. E.g. MediaWiki can add "udponly: 1" or "logstash: no" or some such to the packet as needed. The Monolog stack is fairly complex, so not having to instantiate two of them long-term would be nice. I suppose it might also make the pipeline easier for developers to reason about, in terms of consistency etc.
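
A minimal sketch of the drop-at-intake idea, using the hypothetical `logstash: no` tag from the comment above (the field name and semantics are not an agreed schema):

```python
import json

def wants_logstash(raw):
    """Decide at intake whether a record should be ingested by Logstash.

    MediaWiki would tag records it wants kept out of Logstash, e.g.
    {"logstash": "no", ...}; untagged records are ingested as before.
    """
    rec = json.loads(raw)
    return rec.get("logstash", "yes") != "no"

def intake(raw_records):
    """Split one Kafka stream into logstash-bound and file-only records."""
    to_logstash, file_only = [], []
    for raw in raw_records:
        (to_logstash if wants_logstash(raw) else file_only).append(raw)
    return to_logstash, file_only
```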

We are on the Kafka pipeline for MW logs that were previously sent to logstash over the network; udp2log is still in place due to the high volume of logs, but yes, eventually we'd like to deprecate udp2log too and move everything to Kafka.

In terms of processing I don't know ATM if logstash has enough capacity to ingest everything and drop unwanted messages. My initial thought was to write udp2log messages to a different set of Kafka topics and consume them from mwlog hosts with kafkacat.

re: having a single Monolog instance to handle all logs and tag e.g. logstash: no, where would that logic live in MW? IIRC the udp2log vs logstash switch is currently based on severity, e.g. all debug messages make it to udp2log.
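
For reference, the severity-based split recalled above (everything reaches udp2log, only sufficiently severe records also reach logstash) can be modeled as follows; the numeric threshold is an assumption for illustration, not the actual MediaWiki configuration:

```python
# Rough model of the current split: all records go to udp2log, while only
# records at or above a severity threshold also go to Logstash.
# The threshold value is assumed for illustration.
LOGSTASH_MIN_SEVERITY = 6  # syslog "info"; debug (7) stays udp2log-only

def destinations(severity_code):
    """Return the set of destinations for a record of this syslog severity."""
    dests = {"udp2log"}
    if severity_code <= LOGSTASH_MIN_SEVERITY:  # lower number = more severe
        dests.add("logstash")
    return dests
```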