Improve HA for logstash cluster
Closed, Resolved (Public)


Before we can really start to rely on Logstash, some work is needed to ensure that log events from the various input systems can reach the cluster via a reliable transport, and that multiple Logstash nodes can consume that input.

In the current udp2log relay setup we are really only using the logstash1001 instance to process all incoming logs. Any time this node is restarted, all log events are lost until it comes back up (2-3 minutes).

Version: wmf-deployment
Severity: major



Event Timeline

bzimport raised the priority of this task from to High.Nov 22 2014, 3:03 AM
bzimport set Reference to bz61785.
bzimport added a subscriber: Unknown Object (MLST).

Why not use Kafka as the messaging bus? That would solve all your reliability / durability concerns. It's operated by Ops for the Analytics team, so it builds on existing infrastructure, and there seem to be producer/consumer plugins for Logstash available (hahaha, more debianization fun).
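As a rough illustration of what the Kafka route might look like, here is a sketch of a Logstash pipeline consuming from a Kafka topic via the kafka input plugin. The broker address, topic name, and consumer group are assumptions, not anything settled in this task:

```
# Sketch only: consume log events from Kafka instead of raw UDP.
# Broker/topic/group names below are hypothetical.
input {
  kafka {
    bootstrap_servers => "kafka1001:9092"      # assumed broker address
    topics            => ["logstash"]          # assumed topic name
    group_id          => "logstash-consumers"  # shared consumer group
  }
}

output {
  elasticsearch { }
}
```

The HA-relevant detail is the shared `group_id`: multiple Logstash nodes in the same consumer group split the topic's partitions between them, and if one node is restarted the others pick up its partitions, so events are buffered in Kafka rather than lost.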

Hadoop uses logstash-gelf.jar, which supports Redis but not Kafka. Though maybe we could submit a patch...

Just as a historical point of interest related to this feature request, we briefly switched MediaWiki logging to ship to Logstash via Redis and found that it was slowing things down measurably from the MediaWiki side.

UDP seems to be an ok transport for debug log messages, but it would be nice to have a more robust way to distribute the traffic across multiple Logstash backends. MediaWiki has switched from udp2log to using syslog udp datagrams as the transport mechanism. It is also configured to randomly select one of the Logstash servers for all log events from a given request.
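To make the new transport concrete, here is a minimal Python sketch of what MediaWiki's approach amounts to: pick one Logstash backend at random per request (so all events from that request land on the same server), then ship each log line as a syslog-style UDP datagram. The host names and port are assumptions for illustration, not the production values:

```python
import random
import socket

# Hypothetical backend list; real server names/ports are assumptions.
LOGSTASH_HOSTS = ["logstash1001", "logstash1002", "logstash1003"]
SYSLOG_UDP_PORT = 10514

def pick_backend(hosts, seed=None):
    """Pick one backend per request, so every event from that
    request goes to the same Logstash server."""
    rng = random.Random(seed)
    return rng.choice(hosts)

def send_syslog_udp(message, host, port=SYSLOG_UDP_PORT,
                    facility=1, severity=6):
    """Send a single RFC 3164-style syslog datagram over UDP.
    PRI = facility * 8 + severity, prepended in angle brackets."""
    pri = facility * 8 + severity
    payload = "<{}>{}".format(pri, message).encode("utf-8")
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(payload, (host, port))
    finally:
        sock.close()
```

Because UDP is fire-and-forget, a restarted backend still drops the datagrams sent to it, but the random per-request selection at least spreads the loss across the pool instead of losing everything when a single node goes down.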

fgiunchedi claimed this task.