Allocate a few servers to logstash
Closed, Declined (Public)

Description

See also T84958, where we want to figure out an ideal set of production hardware and setup for the logstash cluster. However, logstash is chronically unhealthy and needs some help now. As a temporary stopgap, we were thinking of throwing some of the now-decommissioned lsearchd (T85009) boxes at it. Yes, they're out of warranty, but this is just meant to hold us over in the meantime, and they have a good amount of RAM, which is what we need.

Event Timeline

demon raised the priority of this task from to Needs Triage.
demon updated the task description.
demon added projects: acl*sre-team, ops-core.
demon added subscribers: demon, bd808, Gage.

So recently we expanded Logstash disk capacity and then determined that was all needed for now. Why has this changed?

The current nodes have insufficient RAM and Elasticsearch keeps OOMing. (Details: T84958: eqiad: (3) servers for logstash service)

In T87031#982135, @mark wrote:

So recently we expanded Logstash disk capacity and then determined that was all needed for now. Why has this changed?

I broke it. I got all the changes merged that were needed to aggregate MediaWiki logs via direct communication between MW and Redis queues on the Logstash cluster. This has, somewhat unexpectedly, greatly increased the log traffic that is actually seen by the Logstash cluster. The 2 UDP hops that were used previously to get logs from MediaWiki to fluorine and then from fluorine to logstash apparently had a pretty high drop rate. Also, for ~2 days we had duplicate log traffic (from both the UDP relay and the direct path) for a large number of wikis. It is quite possible that since @Gage merged the patch to drop the UDP relay traffic, we will settle back down to something less crushing for the boxes.

We really won't know what the new log volume looks like until 2015-01-18T00:00Z. 2015-01-17 will be the first day that we have all redis MW traffic and no extra log2udp relay traffic. If we can limp along until then we should have a better idea of what's up.

If the RAM from these is compatible with the logstash100[123] boxes we have, then maybe T87078: Upgrade RAM for logstash100[123] to 64G can be done by just moving some sticks from one to another?

Dzahn triaged this task as Medium priority. Jan 28 2015, 6:01 PM
Dzahn subscribed.

How urgent is this? Well, when it says "chronically unhealthy", that sounds like "normal" priority is appropriate?

I think we should kill this request and just get the procurement ticket in for "real" hardware. @yuvipanda is working on that now I think but I don't know if the ticket exists yet. We got a little wild the week before last with dreaming up stop-gap solutions.

RobH claimed this task.
RobH subscribed.

This is now outdated, as stated, since task T84958 covers the hardware order.