
Evaluate traffic flow between the Jobrunners and the Cirrus cluster
Closed, Resolved (Public)

Description

We need to collect stats on how much data flows between the JR and Cirrus clusters, to understand whether we can send the parsed data cross-datacenter directly to ES or whether we basically need to replicate the job in codfw.

Event Timeline

Joe raised the priority of this task to High.
Joe updated the task description. (Show Details)
Joe added a project: acl*sre-team.
Joe added subscribers: Gage, dcausse, Aklapper and 2 others.

TL;DR: the rough estimate is about 32 Mbit/s from the jobrunners to the elasticsearch nodes. Traffic is bursty, so I advise planning for a 50-60 Mbit/s ceiling.

Details:
Joe mentioned using ngrep, but I didn't see a way to get stats from it, so I used iftop instead. It measures traffic rates over 2-, 10-, and 40-second intervals. I first looked at Ganglia to determine load distribution within the clusters, then selected three nodes in each to check. Ganglia does not show a distinct daily traffic pattern for the jobrunners. I performed my testing around 12:00-13:00 PDT (18:00-19:00 UTC).

Jobrunner side:
iftop -f "port 9200" shows the traffic to search.svc.eqiad.wmnet:9200. The traffic from JR->ES is ~2Mbit/sec per node, while the reverse flow is quite small, around 100kbit. ~2Mbit multiplied by 16 nodes gives ~32Mbit.

Elasticsearch side:
This is a little trickier to measure because we receive traffic through the load balancer both from the MediaWiki nodes acting as jobrunners and from the main app servers proxying searches from users. Luckily the jobrunners use a contiguous IP range, 10.64.0.31-46, which I converted into CIDR netblocks for iftop's tcpdump-compatible filter syntax: iftop -f "net 10.64.0.31/32 or net 10.64.0.32/29 or net 10.64.0.40/30 or net 10.64.0.44/31 or net 10.64.0.46/32" shows traffic from all 16 jobrunners, with aggregate totals of 750-1250 kbit/s RX and ~55 kbit/s TX. ~1 Mbit/s multiplied by 31 elasticsearch nodes gives ~31 Mbit/s.
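For reference, the netblocks decompose the jobrunner range as shown below, and the same filter can be reused for a one-shot text-mode sample on an elasticsearch node (again a sketch; interface name assumed):

# The five netblocks cover exactly 10.64.0.31-10.64.0.46 (16 hosts):
#   10.64.0.31/32 -> .31       (1)
#   10.64.0.32/29 -> .32-.39   (8)
#   10.64.0.40/30 -> .40-.43   (4)
#   10.64.0.44/31 -> .44-.45   (2)
#   10.64.0.46/32 -> .46       (1)
iftop -t -s 30 -n -N -i eth0 \
  -f "net 10.64.0.31/32 or net 10.64.0.32/29 or net 10.64.0.40/30 or net 10.64.0.44/31 or net 10.64.0.46/32"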

So these rough measurements from each side match up.

If we want a higher degree of confidence, we could use statsd or some iptables accounting rules to give us averages over longer time scales.
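If we go the iptables route, something like this would give us byte counters we can sample over hours and divide by elapsed time (untested sketch; the chain name is made up):

# Accounting chain on an elasticsearch node; rules with no -j target only count.
iptables -N jobrunner_acct
# Route inbound port-9200 traffic through the chain (it falls back to INPUT afterwards).
iptables -I INPUT -p tcp --dport 9200 -j jobrunner_acct
# One counting rule per jobrunner netblock.
for net in 10.64.0.31/32 10.64.0.32/29 10.64.0.40/30 10.64.0.44/31 10.64.0.46/32; do
    iptables -A jobrunner_acct -s "$net"
done
# Read the byte counters later and divide by elapsed seconds for a long-term average.
iptables -nvxL jobrunner_acct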

32 Mbit/s doesn't seem like something insane to stream between the two datacenters, IMO. I'll wait for confirmation from @faidon or @mark, as I guess we should keep our bandwidth usage to a minimum by design whenever a data flow is potentially large.

That's noise, not a problem at all. :)

chasemp subscribed.

I don't think there is more to do on this task.