See also T84958, where we want to figure out an ideal set of production hardware and setup for the logstash cluster. However, logstash is chronically unhealthy and needs some help now. As a temporary stopgap, we were thinking of throwing some of the now-decommissioned lsearchd (T85009) boxes at it. Yes, they're out of warranty, but this is just meant to hold us over in the meantime, and they have a good amount of RAM, which is what we need.
Description
Status | Subtype | Assigned | Task
---|---|---|---
Declined | | RobH | T87031 Allocate a few servers to logstash
Resolved | | RobH | T86149 reclaim lsearchd hosts
Resolved | | Cmjohnson | T92434 wipe search* and searchidx* hosts
Event Timeline
So recently we expanded Logstash disk capacity and then determined that was all that was needed for now. Why has this changed?
The current nodes have insufficient RAM and Elasticsearch keeps OOMing. (Details: T84958: eqiad: (3) servers for logstash service)
I broke it. I got all the changes merged that were needed to aggregate MediaWiki logs via direct communication between MW and Redis queues on the Logstash cluster. This has, somewhat unexpectedly, greatly increased the log traffic that is actually seen by the Logstash cluster. The two UDP hops that were used previously to get logs from MediaWiki to fluorine and then from fluorine to logstash apparently had a pretty high drop rate. Also, for ~2 days we had duplicate log traffic (from both the UDP relay and direct) for a large number of wikis. It is quite possible that, since @Gage merged the patch to drop the UDP relay traffic, we will settle back down to something less crushing for the boxes.
We really won't know what the new log volume looks like until 2015-01-18T00:00Z. 2015-01-17 will be the first day that we have all redis MW traffic and no extra log2udp relay traffic. If we can limp along until then we should have a better idea of what's up.
If the RAM from these is compatible with the logstash100[123] boxes we have, then maybe T87078: Upgrade RAM for logstash100[123] to 64G can be done by just moving some sticks from one to another?
how urgent is this? well, when it says "chronically unhealthy" that sounds like "normal" is appropriate?
I think we should kill this request and just get the procurement ticket in for "real" hardware. @yuvipanda is working on that now I think but I don't know if the ticket exists yet. We got a little wild the week before last with dreaming up stop-gap solutions.