Allocate a few servers to logstash
Closed, Declined · Public

Assigned To

Authored By

	• demon
	Jan 16 2015, 4:50 PM

Description

See also T84958, where we want to figure out an ideal set of production hardware and setup for the logstash cluster. However, logstash is chronically unhealthy and needs some help now. As a temporary stopgap, we were thinking of throwing some of the now-decommissioned lsearchd (T85009) boxes at it. Yes, they're out of warranty, but this is just meant to hold us over in the meantime, and they have a good amount of RAM, which is what we need.

Related Objects

Status	Assigned	Task
Declined	RobH	T87031 Allocate a few servers to logstash
Resolved	RobH	T86149 reclaim lsearchd hosts
Resolved	• Cmjohnson	T92434 wipe search* and searchidx* hosts

Event Timeline

• demon created this task. Jan 16 2015, 4:50 PM

• demon raised the priority of this task from to Needs Triage.

• demon updated the task description.

• demon added projects: acl*sre-team, ops-core.

• demon added subscribers: • demon, bd808, • Gage.

Restricted Application added a subscriber: Aklapper. Jan 16 2015, 4:50 PM

• demon added a subtask: T86149: reclaim lsearchd hosts. Jan 16 2015, 4:52 PM

So recently we expanded Logstash disk capacity and then determined that was all needed for now. Why has this changed?

The current nodes have insufficient RAM and Elasticsearch keeps OOMing. (Details: T84958: eqiad: (3) servers for logstash service)

In T87031#982135, @mark wrote:

So recently we expanded Logstash disk capacity and then determined that was all needed for now. Why has this changed?

I broke it. I got all the changes merged that were needed to aggregate MediaWiki logs via direct communication between MW and Redis queues on the Logstash cluster. This has, somewhat unexpectedly, greatly increased the log traffic that is actually seen by the Logstash cluster. The two UDP hops that were used previously to get logs from MediaWiki to fluorine and then from fluorine to logstash apparently had a pretty high drop rate. Also, for ~2 days we had duplicate log traffic (from both the UDP relay and direct) for a large number of wikis. It is quite possible that, since @Gage merged the patch to drop the UDP relay traffic, we will settle back down to something less crushing for the boxes.

We really won't know what the new log volume looks like until 2015-01-18T00:00Z. 2015-01-17 will be the first day that we have all Redis MW traffic and no extra log2udp relay traffic. If we can limp along until then, we should have a better idea of what's up.

If the RAM from these is compatible with the logstash100[123] boxes we have, then maybe T87078: Upgrade RAM for logstash100[123] to 64G can be done by just moving some sticks from one to another?

How urgent is this? Well, when it says "chronically unhealthy", that sounds like "normal" is appropriate?

I think we should kill this request and just get the procurement ticket in for "real" hardware. @yuvipanda is working on that now, I think, but I don't know if the ticket exists yet. We got a little wild the week before last with dreaming up stopgap solutions.

This is now outdated, as stated, since task T84958 covers the hardware order.

RobH closed subtask T86149: reclaim lsearchd hosts as Resolved. Mar 16 2015, 9:02 PM

• Phabricator_maintenance removed a subscriber: yuvipanda. Jun 7 2017, 6:55 PM
