
Retire udp2log: onboard its producers and consumers to the logging pipeline
Open, NormalPublic

Description

We are deprecating udp2log in production; its current users should be migrated to the logging pipeline instead.

List of candidates for migration:

  • mediawiki
  • scap
  • iegreview
  • scholarships

The last three are easy (low volume), whereas for MediaWiki we'll have to do some thinking: the volume is significant (~18-20k msg/s) and the udp2log output is consumed by multiple people on the mwlog hosts, so the move should be transparent to them (i.e. change the transport to Kafka but still write to files).

Event Timeline

fgiunchedi updated the task description. (Show Details) Oct 1 2018, 1:10 PM
herron triaged this task as Normal priority. Oct 2 2018, 5:24 PM
fgiunchedi moved this task from In Dev/Progress to Up next on the Wikimedia-Logstash board.
fgiunchedi renamed this task from Deprecate >= 50% of udp2log producers to Retire udp2log: onboard its producers and consumers to the logging pipeline. Jan 16 2019, 11:22 AM
fgiunchedi updated the task description. (Show Details)
fgiunchedi added subscribers: aaron, Ottomata, bd808. (Edited) Feb 15 2019, 4:08 PM

This is the outline of the plan to move MediaWiki logging off udp2log and onto the logging pipeline's Kafka (cc @bd808 @aaron @Ottomata)

Transport

A new rsyslog localhost UDP endpoint (e.g. named mwlog) is introduced on MediaWiki hosts; it takes UDP syslog and forwards it onto a set of Kafka topics (separate from the topics consumed by logstash). Messages are then consumed on the mwlog hosts, via kafkatee or rsyslog, and written to per-channel files to keep compatibility with what we have now.
Using a separate endpoint and set of topics allows for some tuning/flexibility, given that the udp2log stream is currently quite a bit bigger than the logstash stream: Kafka topic retention, rate limits, etc. would need to differ from those of the existing logstash topics.
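As a very rough sketch, the producer side of this transport could look like the following rsyslog fragment; the port, broker host, and topic name here are illustrative assumptions, not the actual configuration:

```
# Hypothetical /etc/rsyslog.d/30-mwlog.conf -- all names/ports illustrative
module(load="imudp")
module(load="omkafka")

# Local UDP endpoint that MediaWiki emits syslog to
input(type="imudp" address="127.0.0.1" port="10200" ruleset="mwlog")

ruleset(name="mwlog") {
    # Forward everything arriving on the endpoint to a dedicated Kafka topic
    action(type="omkafka"
           broker=["kafka-logging1001:9092"]
           topic="mwlog"
           template="RSYSLOG_SyslogProtocol23Format")
}
```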

Formatting

Log entry formatting is changed from "line" to syslog + JSON by MediaWiki before emitting to localhost; JSON is kept as the Kafka message format as well. Upon being consumed on the mwlog hosts, the JSON is then formatted back into "lines" using the existing formatting found there.
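For illustration only, the producer-side framing described above might look roughly like this; the field names, facility, and hostnames are assumptions, not MediaWiki's actual implementation:

```python
import json
import time

# Sketch: wrap a JSON-encoded log record in an RFC 5424-style syslog
# frame before sending it to the localhost UDP endpoint.
# Facility 16 (local0) and the field names are assumptions.
def syslog_json_frame(channel, severity, record,
                      host="mw1001", app="mediawiki"):
    pri = 16 * 8 + severity  # PRI = facility * 8 + severity
    ts = time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime())
    payload = json.dumps(dict(record, channel=channel))
    return "<%d>1 %s %s %s - - - %s" % (pri, ts, host, app, payload)
```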

Open questions

  • The plan has syslog + JSON as the formatting when transporting on Kafka, since that's what we use for logstash already and it preserves more information. Alternatively, could we have syslog + the current line formatting written to Kafka?
  • The idea would be to name this "mwlog" (e.g. configuration, topic names, etc.) to steer away from "udp2log"; does that seem sensible?

Implementation

The following patches are meant to implement the plan above, specifically:

CDanis added a subscriber: CDanis. Feb 15 2019, 4:18 PM

> The plan has syslog + json as formatting, since that's what we use for logstash already and preserves more information. Although we could have syslog + current formatting?

People (including @tstarling?) have advocated in the past for keeping the logs on mwlog1001 in the more human readable format. It might be a reasonable compromise to build a shell pipeline compatible utility that can be used to reformat JSON log event records for doing things like grep foo some.log | humanlog or tail -f some.log | humanlog. A utility like that could also pretty easily do simple extraction of particular elements of the json structure which might be easier to use than some of the typical awk magic that folks use when data mining from the files. I can dig up some python scripts I have written in the past to kickstart something like this.
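A minimal sketch of such a filter, assuming one JSON event per line and illustrative field names ("@timestamp", "channel", "message"):

```python
#!/usr/bin/env python3
"""Illustrative 'humanlog' filter: read JSON log events on stdin and
print human-readable lines, optionally extracting a single field, e.g.:
    grep foo some.log | humanlog
    tail -f some.log | humanlog reqId
"""
import json
import sys

def render(rec, field=None):
    # Extract one field if requested, else render a one-line summary
    if field is not None:
        return str(rec.get(field, ""))
    return "%s %s: %s" % (rec.get("@timestamp", "-"),
                          rec.get("channel", "-"),
                          rec.get("message", ""))

def main():
    field = sys.argv[1] if len(sys.argv) > 1 else None
    for line in sys.stdin:
        line = line.strip()
        if not line:
            continue
        try:
            rec = json.loads(line)
        except ValueError:
            print(line)  # pass non-JSON lines through unchanged
            continue
        print(render(rec, field))

if __name__ == "__main__" and not sys.stdin.isatty():
    main()  # only act as a filter when input is piped in
```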

> It might be a reasonable compromise to build a shell pipeline compatible utility that can be used to reformat JSON log event records

kafkatee can do this, and was in fact built for it:

https://github.com/wikimedia/analytics-kafkatee/blob/master/kafkatee.conf.example#L129-L207
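For instance, the consumer side on the mwlog hosts could be a kafkatee fragment along these lines; the syntax is modeled on the example config linked above, and the topic, command, and path names are made up for illustration:

```
# Hypothetical kafkatee.conf fragment -- topic/command/paths illustrative
input [encoding=string] kafka topic mwlog.err partition 0 from stored

# Feed each message through a line formatter, appending to the
# per-channel file to preserve the current human-readable output
output pipe 1 /usr/local/bin/humanlog >> /srv/mw-log/error.log
```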

Qs:

Are the logs sent using Monolog?

Is there just one topic 'mwlog', or multiple, one per channel? I'm asking just in case we should consider using this rsyslog feature to log to Kafka via Monolog, rather than our effort in T216163: Add monolog adapters for Eventbus.

> The plan has syslog + json as formatting, since that's what we use for logstash already and preserves more information. Although we could have syslog + current formatting?
>
> People (including @tstarling?) have advocated in the past for keeping the logs on mwlog1001 in the more human readable format. It might be a reasonable compromise to build a shell pipeline compatible utility that can be used to reformat JSON log event records for doing things like grep foo some.log | humanlog or tail -f some.log | humanlog. A utility like that could also pretty easily do simple extraction of particular elements of the json structure which might be easier to use than some of the typical awk magic that folks use when data mining from the files. I can dig up some python scripts I have written in the past to kickstart something like this.

Thanks for the feedback!

I have edited my comment to clarify that the scope of the JSON vs. current formatting question was limited to writing to Kafka; at least in the first phase we're focusing on making the transport more reliable and secure, while keeping the format as-is on mwlog.

We would indeed need something to extract the right fields when reading from Kafka and writing to files, to keep the same human formatting as now, so we'd definitely appreciate any help/tools towards that!

re: the open question itself, I'm leaning towards having JSON on Kafka, for multiple reasons: it makes Kafka messages uniform (MW logstash logging is already JSON), so consumers only have to deal with one format, and we won't lose information/context compared to what MW knows about the log message.

> Qs:
>
> Are the logs sent using Monolog?
>
> Is there just one topic 'mwlog', or multiple, one per channel? I'm asking just in case we should consider using this rsyslog feature to log to Kafka via Monolog, rather than our effort in T216163: Add monolog adapters for Eventbus.

There will be one topic per syslog severity, similar to what's happening now for MediaWiki logstash logging. We considered one topic per channel but ultimately decided against it, due to potential topic flooding, and because it'd be a little fragile to reconstruct the channel name from the topic name (e.g. choosing a suitable separator/prefix that can't occur in a channel name).
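The per-severity topic scheme could be sketched as follows; the "mwlog." prefix and exact names are assumptions for illustration, not the deployed configuration:

```python
# Map syslog severities to per-severity Kafka topics, mirroring the
# scheme described above; the "mwlog." prefix is an assumption.
SEVERITIES = ("emerg", "alert", "crit", "err",
              "warning", "notice", "info", "debug")

def topic_for(severity):
    if severity not in SEVERITIES:
        raise ValueError("unknown syslog severity: %r" % severity)
    return "mwlog." + severity
```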

> re: the open question itself I'm leaning towards having json on kafka

Yes please!

> There will be one topic per syslog severity [...]

Ok great. We are working on making Monolog able to send events via the EventBus extension, which will ultimately log to Kafka. The Monolog channels we are sending are really 'request logs' (which are a kind of event), but since it was logging I was wondering if we should consider using your rsyslog stuff to get this data to Kafka instead. If we don't easily have control over topics, then we can rule out this option. Thanks!

>> re: the open question itself I'm leaning towards having json on kafka
>
> Yes please!
>
>> There will be one topic per syslog severity [...]
>
> Ok great. We are working on making Monolog able to send events via the EventBus extension, which will ultimately log to Kafka. The Monolog channels we are sending are really 'request logs' (which are a kind of event), but since it was logging I was wondering if we should consider using your rsyslog stuff to get this data to Kafka instead. If we don't easily have control over topics, then we can rule out this option. Thanks!

ack, thanks for additional context!