Finish conversion to multiple Cassandra instances per hardware node
Closed, Resolved (Public)
Authored By

	• GWicke
	Apr 7 2015, 1:11 AM

Description

As discussed in T93790, we should be able to drive our hardware harder while maintaining low latencies by switching to multiple Cassandra instances per hardware node.

Implementation options:

multiple init scripts or systemd instances
Firejail: lighter-weight than docker, and more focus on security.
Docker containers.

In all cases, we need a puppet-generated cassandra.yaml file per instance. Another consideration is that we normally don't want to start cassandra on boot. This is to make sure that a node that was down for a long time does not join the cluster before the config is updated.

Details

Subject	Repo	Branch	Lines +/-
enable restbase2006-b.codfw.wmnet	operations/puppet	production	+5 -5
cassandra: add restbase2006 instances	operations/puppet	production	+16 -1
cassandra: add restbase2005 instances	operations/puppet	production	+16 -1
cassandra: provision restbase1009 with 128 tokens	operations/puppet	production	+1 -0
cassandra: add restbase2001 instance	operations/puppet	production	+6 -1
xenon additional instances	operations/dns	master	+4 -0
WIP: xenon additional instances	operations/puppet	production	+14 -3
cassandra: add eqiad test cluster multiple instances	operations/dns	master	+12 -1
cassandra: WIP support for multiple instances	operations/puppet	production	+374 -157
codfw: add test cassandra instances	operations/dns	master	+11 -3
cassandra: add restbase-test2001 instances	operations/puppet	production	+14 -0
cassandra: enable multi-instance	operations/puppet	production	+35 -146
cassandra: add multi-instance support, disabled	operations/puppet	production	+302 -24
restbase-test2001 additional cassandra instances	operations/dns	master	+4 -0

Related Objects

Status	Assigned	Task
Invalid	None	T93751 RFC: Next steps for long-term revision storage -- space needs, storage hierarchies
Resolved	RobH	T93790 Expand RESTBase cluster capacity
Resolved	fgiunchedi	T108306 better cassandra process checks
Resolved	Eevans	T106619 investigate G1GC pause times
Resolved	fgiunchedi	T95253 Finish conversion to multiple Cassandra instances per hardware node
Resolved	fgiunchedi	T113733 column family cassandra metrics size
Resolved	fgiunchedi	T113939 assess impact of many cassandra seed nodes with multi instance
Resolved	Eevans	T117114 Ensure ansible-deploy can cope with multi-instance restarts
Resolved	Eevans	T121535 Perform cleanups to reclaim space from recent topology changes
Resolved	Eevans	T130540 Figure out if nodes in different DCs can be bootstrapped in parallel

Event Timeline


@GWicke 1007/1008/1009 were provisioned on purpose as proportionally smaller machines. We could add SSDs there, but we might just run into the same problem when decommissioning a bigger node whose disks are fuller. I suggest we also try culling revisions from RESTBase to get data sizes down.

The basic issue is that 1007-9 are of a smaller spec for historical reasons, yet have the same storage weights assigned to them. This makes them likely to run out of disk space. They also generally lag the others in performance. We should upgrade those nodes to match the other nodes in the cluster (and planned for this option when buying them), but this will take some time.

In the meantime, a way to move forward to multi instances with the existing hardware is to adjust the weights to be in line with the actual hardware capacities. Concrete steps:

Sequentially, convert 1007-9 to a multi-instance setup (decommission & re-bootstrap with num_tokens = 128 set in cassandra.yaml), with a single instance on each. In line with the num_tokens ratio vs. the default of 256, the new instances on 1007-9 will have about 1/2 the load of instances on 1001-6 assigned to them.

Convert other nodes to a multi-instance setup, using the regular num_tokens of 256. This will temporarily move 2/3 of the data to the bigger node in the same rack, and 1/3 to one of 1007-9.

Add more instances on all nodes, using 256 tokens per instance & different number of instances to balance load until hardware has been equalized.
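To make the ratios in the steps above concrete, here is a minimal sketch. It assumes load is roughly proportional to each instance's num_tokens, which is the premise of the plan; the instance names are hypothetical placeholders, not real hosts.

```python
from fractions import Fraction


def expected_shares(tokens_by_instance):
    """Return each instance's expected fraction of the data.

    Assumes load is proportional to num_tokens (an approximation).
    """
    total = sum(tokens_by_instance.values())
    return {name: Fraction(tokens, total)
            for name, tokens in tokens_by_instance.items()}


# Hypothetical rack: one 256-token instance on a bigger node and one
# 128-token instance on one of 1007-9.
shares = expected_shares({"bigger-node-a": 256, "smaller-node-a": 128})
for name, share in shares.items():
    print(f"{name}: {share} of the data")
```

With these inputs the split comes out at 2/3 versus 1/3, matching the figures quoted in the second step.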

I would propose to start this ASAP. If storage is still too tight then we can delete data, but let's first address the clear mismatch between current storage load assignments and actual hardware capacity.

@GWicke the plan makes sense I think. I can start right away, and there should be space on the rest to accommodate the decommissioning.

re: deleting data, when we need to do that how long would it take start to finish? e.g. if we run very tight (or out) of space

We have a script to selectively thin out old revisions. A full pass of this script took 1-2 days the last time we ran it. For the full storage savings to kick in after this, we'll then need to wait for compaction to free the data, which tends to take a few more days.

Per IRC conversation with @fgiunchedi and @Eevans I kicked off the decommission on restbase1009. It is streaming to 1005 and 1006 in the same rack, which are both relatively empty.

decommissioning still going

Sending 686 files, 616799776807 bytes total. Already sent 385 files, 263287846227 bytes total
Sending 702 files, 565071249693 bytes total. Already sent 123 files, 274783542790 bytes total
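For tracking progress, a small sketch that turns status lines like the two above into completion fractions. The helper name is made up for illustration, and it only handles the raw-byte form shown here, not the "GB"/"MB" form some other outputs use.

```python
import re

# Matches the raw-byte progress lines pasted above (an assumption about
# the format, based only on those two samples).
PROGRESS_RE = re.compile(
    r"(?:Sending|Receiving) (\d+) files, (\d+) bytes total\. "
    r"Already (?:sent|received) (\d+) files, (\d+) bytes total"
)


def stream_progress(line):
    """Return (file_fraction, byte_fraction) for one progress line."""
    m = PROGRESS_RE.search(line)
    if m is None:
        raise ValueError(f"unrecognized progress line: {line!r}")
    total_files, total_bytes, done_files, done_bytes = map(int, m.groups())
    return done_files / total_files, done_bytes / total_bytes


lines = [
    "Sending 686 files, 616799776807 bytes total. "
    "Already sent 385 files, 263287846227 bytes total",
    "Sending 702 files, 565071249693 bytes total. "
    "Already sent 123 files, 274783542790 bytes total",
]
for line in lines:
    files_done, bytes_done = stream_progress(line)
    print(f"files: {files_done:.1%}  bytes: {bytes_done:.1%}")
```

Bytes are the better indicator of the two, since file sizes vary widely.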

there's also an openjdk security update pending, I'm coordinating with @Muehlenhoff to not restart nodes in that rack until decommissioning is done

The decommission is now done.

It might be worth going with XFS for the reimage, as discussed in T120004.

Change 256690 had a related patch set uploaded (by Filippo Giunchedi):
cassandra: provision restbase1009 with 128 tokens

https://gerrit.wikimedia.org/r/256690

Change 256690 merged by Filippo Giunchedi:
cassandra: provision restbase1009 with 128 tokens

https://gerrit.wikimedia.org/r/256690

restbase1009 is bootstrapping, started at 09:36

finished Dec 05 10:41 -icinga-wm:#wikimedia-operations- RECOVERY - cassandra-a CQL 10.64.48.120:9042 on restbase1009 is OK: TCP OK - 0.000 second response time on port 9042

restbase1008-a bootstrapping

Receiving 818 files, 410.89 GB total. Already received 23 files, 608.11 MB total
Receiving 705 files, 304.53 GB total. Already received 25 files, 630.8 MB total

• GWicke mentioned this in T121293: Pre-generate mobile app content end points. Dec 12 2015, 12:23 AM

• GWicke mentioned this in T121575: Expand SSD space in Cassandra cluster. Dec 15 2015, 8:54 PM

• GWicke mentioned this in Unknown Object (Task). Dec 17 2015, 8:37 PM

1004 finished decommissioning yesterday.

Eevans closed subtask T121535: Perform cleanups to reclaim space from recent topology changes as Resolved. Dec 21 2015, 5:25 PM

Eevans added a parent task: T106619: investigate G1GC pause times. Dec 30 2015, 8:45 PM

• GWicke renamed this task from Test multiple Cassandra instances per hardware node to Finish conversion to Cassandra instances per hardware node. Jan 28 2016, 12:37 AM

• GWicke renamed this task from Finish conversion to Cassandra instances per hardware node to Finish conversion to multiple Cassandra instances per hardware node.

• GWicke added a project: Blocked-on-Operations. Feb 12 2016, 5:28 PM

Eevans mentioned this in T125906: Evaluate Brotli compression for Cassandra. Feb 15 2016, 5:03 PM

Eevans mentioned this in T126619: cassandra slow streaming during (de)commission. Feb 15 2016, 5:10 PM

Eevans mentioned this in T126221: Evaluate efficacy of DateTieredCompactionStrategy. Feb 15 2016, 5:48 PM

With recent hardware expansions in eqiad and codfw, we need to expand the RAID-0 on eight more nodes (2 in eqiad, 6 in codfw). Additionally, we need to replace six nodes in eqiad, for a total of 14 nodes to update.

The current process of decommissioning & re-bootstrapping nodes is relatively time-, labor- and system-resource-intensive, especially while disk space is tight. At a rate of about one node per week, it would take us at least until mid-May just to convert eqiad and codfw to a multi-instance setup with a single instance per node, followed by more time to increase the number of instances. While faster streaming can speed things up slightly, there are limits on how far we can push this without impacting latency.

It would be great if we could minimize the time we spend on this, finish updating codfw before the switch-over next month & unblock other goals. @fgiunchedi and @Eevans, could you investigate the available options and draw up a plan / timeline for the conversion process?

Eevans updated the task description. Feb 16 2016, 2:29 PM

Copying @Eevans' note from https://phabricator.wikimedia.org/T125842#2056268:

I've put together the following for a proposed sequence of tasks, and an estimation of the time required for each. Hopefully this will be helpful in composing an overall timeline with expected completion date.

Completing the expansion of restbase100[7-9]

task	est. duration	comments
bootstrap 1008-b	1.3d	on-going
decomm 1008-a (128 tokens)	0.7d
bootstrap 1008-a (256 tokens)	1.0d
bootstrap 1009-b	1.6d
decomm 1009-a (256 tokens)	0.9d
bootstrap 1009-a (256 tokens)	1.2d

Note: These times take into account the quantity of data to be moved, at a concurrency of 3 streams of 4.5MB/s.

Note: The timing of 1009 depends on the completion of the currently on-going RAID expansion.
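As a rough illustration of how such durations follow from data volume and stream rate, here is a sketch using the 3 streams at 4.5MB/s from the note above. The 1.5 TB input is a made-up example, not a measured figure, and MB is assumed to mean 10^6 bytes.

```python
SECONDS_PER_DAY = 86_400


def estimated_days(data_bytes, streams=3, mb_per_sec_per_stream=4.5):
    """Estimate transfer time in days at a fixed aggregate stream rate.

    Assumes 1 MB = 10**6 bytes and a constant rate for the whole
    transfer, which real bootstraps will only approximate.
    """
    bytes_per_sec = streams * mb_per_sec_per_stream * 1_000_000
    return data_bytes / bytes_per_sec / SECONDS_PER_DAY


# Hypothetical example: 1.5 TB to move at 3 x 4.5MB/s = 13.5MB/s.
days = estimated_days(1.5e12)
print(f"{days:.2f} days")
```

Doubling the number of concurrent streams halves the estimate, which is why higher concurrency matters so much for the overall timeline.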

Replacing restbase100[1-6]

For each in rack A, B, and D:

seq.	task	est. duration
1	bootstrap 10xx-a	.9d
2	bootstrap 10xx-b	.7d
3	bootstrap 10xx-a	.6d
4	bootstrap 10xx-b	.5d
5	decomm	.6d
6	decomm	.7d

The idea here is to work rack-by-rack, adding two new hardware nodes, and bootstrapping two instances each. Finally, the two existing nodes can be decommissioned and the hardware repurposed. The process then moves to the next rack.

Note: The timing here depends on the arrival, and racking of new hardware.

Note: These times take into account the quantity of data to be moved, at a concurrency of 3 streams of 4.5MB/s. However, as more nodes are added, higher stream concurrencies are possible; impact on production nodes permitting, we might be able to achieve even higher rates. The potential for this higher throughput is even greater for steps 1, 3, and to a lesser degree 5 and 6, as contention becomes less of a factor.

restbase200[1-6] / codfw datacenter

This is still something of a question mark, as it's not clear (to me at least) whether the plan is to add disks to the existing nodes, or to add 3 additional ones (though either way it should look like some combination of the above).

@Eevans, thanks for starting work on this. Could you work with @fgiunchedi to translate this into a rough timeline that would be useful for planning purposes?

Mentioned in SAL [2016-02-27T03:14:29Z] <urandom> bootstrap of restbase1008-a.eqiad.wmnet complete; begining `nodetool cleanup' of 1003, 1004-a, and 1008-b : T95253

Mentioned in SAL [2016-02-29T09:58:39Z] <godog> bootstrap restbase1009-b T95253

Eevans mentioned this in T126629: Cassandra 2.2.6. Feb 29 2016, 6:57 PM

On bootstrap timings:

The numbers presented in #2067276 are based on established stream throughput figures of 13.5MB/s (3 streams at 4.5MB/s). Higher concurrency will occur as a natural consequence of an increased instance count, and this will in turn increase the potential throughput. However, we need to keep the impact to cluster operations in mind, and throttle accordingly; it's very likely that throughputs that impact latency will be possible as instance counts increase.

Some observations now that the bootstrap of 1010-a has begun:

We'll need to be mindful of where the additional concurrency is coming from. For example, 1010-a has 4 incoming streams (highest we've seen yet), 1 each from 1001 and 1002 (each a physical node), and 1 each from 1007-a and 1007-b (which are on the same physical host). Hosts 1001 and 1002 are streaming out at 4.5MB/s, and 1007 at 9MB/s. These rates don't seem to be impacting latencies, but it's something to keep an eye on.
1010-a is bootstrapping at 18MB/s, but as it is the only instance on this machine, and since it is bootstrapping, any impact caused by that rate is moot. However, when we start bootstrapping 1010-b, we'll have a hypothetical rate of 22.5MB/s, 4.5MB/s of which will be coming from 1010-a (which will be online at that point). This could have some impact on 1010-a's performance.

I propose that we throttle 1001, 1002, 1007-a, 1007-b, and 1010-a to 30mbps before bootstrapping 1010-b (nodetool setstreamthroughput -- 30). 30mbps * 5 is 150mbps, or ~18MB/s (the rate 1010-a is bootstrapping at now). We can then gradually ramp that up while monitoring latencies.
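The arithmetic behind that proposal, as a quick sketch. It assumes "mbps" means megabits per second (the unit nodetool setstreamthroughput takes) and 8 bits per byte; the variable names are just for illustration.

```python
def mbps_to_mb_per_sec(megabits_per_sec):
    """Convert megabits per second to megabytes per second."""
    return megabits_per_sec / 8


# The five instances proposed for throttling, each capped at 30mbps.
sources = ["1001", "1002", "1007-a", "1007-b", "1010-a"]
per_source_mbps = 30

aggregate_mbps = per_source_mbps * len(sources)
aggregate_mb_per_sec = mbps_to_mb_per_sec(aggregate_mbps)
print(f"{aggregate_mbps}mbps aggregate = {aggregate_mb_per_sec}MB/s")
```

This gives 150mbps, or 18.75MB/s, which is the "~18MB/s" figure above and close to the 18MB/s observed for 1010-a.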

Mentioned in SAL [2016-03-02T16:28:14Z] <urandom> starting post-bootstrap (1009-b) cleanup on restbase100{5,6,9-a}.eqiad.wmnet : T95253

Mentioned in SAL [2016-03-03T15:12:49Z] <godog> cassandra throttle 1001, 1002, 1007-a, 1007-b, and 1010-a to 30mbps T95253

Mentioned in SAL [2016-03-03T17:58:04Z] <urandom> increasing stream throughput for restbase1010-b.eqiad.wmnet boostrap by 25mbps (5x5) : T128107 T95253

In T95253#2080114, @Eevans wrote:

On bootstrap timings:

The numbers presented in #2067276 are based on established stream throughput figures of 13.5MB/s (3 streams at 4.5MB/s). Higher concurrency will occur as a natural consequence of an increased instance count, and this will in turn increase the potential throughput. However, we need to keep the impact to cluster operations in mind, and throttle accordingly; it's very likely that throughputs that impact latency will be possible as instance counts increase.

Some observations now that the bootstrap of 1010-a has begun:

We'll need to be mindful of where the additional concurrency is coming from. For example, 1010-a has 4 incoming streams (highest we've seen yet), 1 each from 1001 and 1002 (each a physical node), and 1 each from 1007-a and 1007-b (which are on the same physical host). Hosts 1001 and 1002 are streaming out at 4.5MB/s, and 1007 at 9MB/s. These rates don't seem to be impacting latencies, but it's something to keep an eye on.

1010-a is bootstrapping at 18MB/s, but as it is the only instance on this machine, and since it is bootstrapping, any impact caused by that rate is moot. However, when we start bootstrapping 1010-b, we'll have a hypothetical rate of 22.5MB/s, 4.5MB/s of which will be coming from 1010-a (which will be online at that point). This could have some impact on 1010-a's performance.

I propose that we throttle 1001, 1002, 1007-a, 1007-b, and 1010-a to 30mbps before bootstrapping 1010-b (nodetool setstreamthroughput -- 30). 30mbps * 5 is 150mbps, or ~18MB/s (the rate 1010-a is bootstrapping at now). We can then gradually ramp that up while monitoring latencies.

The bootstrap of 1010-b is underway, and an outbound stream throughput of 30mbps on 1001, 1002, 1007-{a,b}, and 1010-a is in place. Based on columnfamily read latency, 1010-a is being impacted. You can also just see this starting to affect 99p latency in RESTBase.

Mentioned in SAL [2016-03-03T18:14:27Z] <urandom> lowering outbound stream throughput limit on restbase1010-a.eqiad.wmnet to 25mbps : T128107 T95253

Mentioned in SAL [2016-03-04T03:28:28Z] <urandom> starting decomission of restbase1009.eqiad.wmnet : T95253

Mentioned in SAL [2016-03-04T03:34:07Z] <urandom> Starting `nodetool cleanup' on restbase100{1,2,7-a,7-b}.eqiad.wmnet and restbase1010-a : T95253

• GWicke added a subtask: T130540: Figure out if nodes in different DCs can be bootstrapped in parallel. Mar 21 2016, 3:51 PM

On Sunday restbase2004 ran out of disk space while bootstrapping 2004-b:

12:45 -icinga-wm:#wikimedia-operations- PROBLEM - cassandra-a service on restbase2004 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed
12:45 -icinga-wm:#wikimedia-operations- PROBLEM - cassandra-b service on restbase2004 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed
12:45 -icinga-wm:#wikimedia-operations- PROBLEM - cassandra-a CQL 10.192.32.137:9042 on restbase2004 is CRITICAL: Connection refused
13:02 -icinga-wm:#wikimedia-operations- RECOVERY - Disk space on restbase2004 is OK: DISK OK
13:02 -icinga-wm:#wikimedia-operations- RECOVERY - cassandra-a service on restbase2004 is OK: OK - cassandra-a is active
13:03 -icinga-wm:#wikimedia-operations- RECOVERY - cassandra-b service on restbase2004 is OK: OK - cassandra-b is active
13:03 -icinga-wm:#wikimedia-operations- RECOVERY - cassandra-a CQL 10.192.32.137:9042 on restbase2004 is OK: TCP OK - 0.040 second response time on port 9042

Most likely, trying to bootstrap another instance while also running compactions on big sstables (e.g. local_group_wikipedia_T_parsoid_html is 1.5T on restbase2004-a) made it run out of disk space. We'll likely need to start rebootstrapping 2004-b.

Mentioned in SAL [2016-04-04T16:36:08Z] <urandom> Restarting bootstrap of restbase2004.codfw.wmnet : T95253

Mentioned in SAL [2016-04-15T14:06:26Z] <urandom> start decommission of restbase1009-a.eqiad.wmnet : T95253

Mentioned in SAL [2016-04-16T02:16:27Z] <urandom> Bootstraping restbase1009-a.eqiad.wmnet : T95253

Mentioned in SAL [2016-04-17T00:10:46Z] <urandom> Decommissioning restbase1006.eqiad.wmnet : T95253

Mentioned in SAL [2016-04-17T21:50:17Z] <urandom> `systemctl mask cassandra' on restbase1006.eqiad.wmnet (node is decommissioned) : T95253

Mentioned in SAL [2016-04-17T21:51:28Z] <urandom> Decommissioning restbase1005.eqiad.wmnet : T95253

Eevans added a project: Cassandra. Apr 29 2016, 8:20 PM

Eevans mentioned this in T134016: RESTBase Cassandra cluster: Increase instance count to 3. Apr 29 2016, 9:38 PM

Eevans closed subtask T130540: Figure out if nodes in different DCs can be bootstrapped in parallel as Resolved. Apr 29 2016, 9:44 PM

It would appear the bootstrap of restbase2008-b.codfw.wmnet has encountered an error while streaming from 2008-a:

$ grep -i error /var/log/cassandra/system-b.log
...
ERROR [STREAM-IN-/10.192.32.143] 2016-05-16 14:54:39,830 StreamSession.java:621 - [Stream #b8a97f50-1b48-11e6-83cd-71365c354c46] Remote peer 10.192.32.143 failed stream session.
ERROR [STREAM-OUT-/10.192.32.143] 2016-05-16 14:54:39,839 StreamSession.java:505 - [Stream #b8a97f50-1b48-11e6-83cd-71365c354c46] Streaming error occurred
ERROR [STREAM-OUT-/10.192.32.143] 2016-05-16 14:54:43,297 StreamSession.java:505 - [Stream #b8a97f50-1b48-11e6-83cd-71365c354c46] Streaming error occurred
$ grep -i error /var/log/cassandra/system-a.log
...
ERROR [STREAM-IN-/10.192.32.144] 2016-05-16 14:52:54,775 StreamSession.java:505 - [Stream #b8a97f50-1b48-11e6-83cd-71365c354c46] Streaming error occurred
ERROR [STREAM-OUT-/10.192.32.144] 2016-05-16 14:54:13,360 StreamSession.java:505 - [Stream #b8a97f50-1b48-11e6-83cd-71365c354c46] Streaming error occurred
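When several streams fail, grouping the ERROR lines by stream id and peer can make the pattern easier to see than a raw grep. A minimal sketch, assuming the log line layout shown above; the function name is made up for illustration, and the sample input is the three system-b.log lines from the grep output.

```python
import re
from collections import Counter

# Assumed layout: "ERROR [STREAM-IN-/<peer ip>] <timestamp> ... [Stream #<id>] ..."
STREAM_ERR_RE = re.compile(
    r"^ERROR \[(STREAM-(?:IN|OUT))-/([\d.]+)\] .*\[Stream #([0-9a-f-]+)\]"
)


def summarize_stream_errors(lines):
    """Count ERROR lines per (stream id, peer, direction)."""
    counts = Counter()
    for line in lines:
        m = STREAM_ERR_RE.match(line)
        if m:
            direction, peer, stream_id = m.groups()
            counts[(stream_id, peer, direction)] += 1
    return counts


sample = [
    "ERROR [STREAM-IN-/10.192.32.143] 2016-05-16 14:54:39,830 StreamSession.java:621 - [Stream #b8a97f50-1b48-11e6-83cd-71365c354c46] Remote peer 10.192.32.143 failed stream session.",
    "ERROR [STREAM-OUT-/10.192.32.143] 2016-05-16 14:54:39,839 StreamSession.java:505 - [Stream #b8a97f50-1b48-11e6-83cd-71365c354c46] Streaming error occurred",
    "ERROR [STREAM-OUT-/10.192.32.143] 2016-05-16 14:54:43,297 StreamSession.java:505 - [Stream #b8a97f50-1b48-11e6-83cd-71365c354c46] Streaming error occurred",
]
counts = summarize_stream_errors(sample)
for (stream_id, peer, direction), n in sorted(counts.items()):
    print(f"{stream_id} {peer} {direction}: {n}")
```

If every failing stream names the same peer, that points at one pairing rather than at the bootstrap as a whole.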

Eevans moved this task from Backlog to In-Progress on the Cassandra board. May 16 2016, 4:20 PM

Indeed, this seems to have happened again, also while talking from -a to -b; the rest seems unaffected.

system-a

ERROR [STREAM-IN-/10.192.32.144] 2016-05-17 04:26:01,723 StreamSession.java:505 - [Stream #e2aa8620-1b87-11e6-a7a3-71365c354c46] Streaming error occurred
java.net.SocketTimeoutException: null
        at sun.nio.ch.SocketAdaptor$SocketInputStream.read(SocketAdaptor.java:211) ~[na:1.8.0_91]
        at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:103) ~[na:1.8.0_91]
        at java.nio.channels.Channels$ReadableByteChannelImpl.read(Channels.java:385) ~[na:1.8.0_91]
        at org.apache.cassandra.streaming.messages.StreamMessage.deserialize(StreamMessage.java:51) ~[apache-cassandra-2.1.14.jar:2.1.14]
        at org.apache.cassandra.streaming.ConnectionHandler$IncomingMessageHandler.run(ConnectionHandler.java:257) ~[apache-cassandra-2.1.14.jar:2.1.14]
        at java.lang.Thread.run(Thread.java:745) [na:1.8.0_91]

system-b

ERROR [STREAM-IN-/10.192.32.143] 2016-05-17 04:30:49,520 StreamSession.java:621 - [Stream #e2aa8620-1b87-11e6-a7a3-71365c354c46] Remote peer 10.192.32.143 failed stream session.
ERROR [STREAM-OUT-/10.192.32.143] 2016-05-17 04:30:49,527 StreamSession.java:505 - [Stream #e2aa8620-1b87-11e6-a7a3-71365c354c46] Streaming error occurred
java.io.IOException: Broken pipe
        at sun.nio.ch.FileDispatcherImpl.write0(Native Method) ~[na:1.8.0_91]
        at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47) ~[na:1.8.0_91]
        at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93) ~[na:1.8.0_91]
        at sun.nio.ch.IOUtil.write(IOUtil.java:65) ~[na:1.8.0_91]
        at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:471) ~[na:1.8.0_91]
        at java.nio.channels.Channels.writeFullyImpl(Channels.java:78) ~[na:1.8.0_91]
        at java.nio.channels.Channels.writeFully(Channels.java:98) ~[na:1.8.0_91]
        at java.nio.channels.Channels.access$000(Channels.java:61) ~[na:1.8.0_91]
        at java.nio.channels.Channels$1.write(Channels.java:174) ~[na:1.8.0_91]
        at java.io.OutputStream.write(OutputStream.java:75) ~[na:1.8.0_91]
        at java.nio.channels.Channels$1.write(Channels.java:155) ~[na:1.8.0_91]
        at org.apache.cassandra.io.util.DataOutputStreamPlus.write(DataOutputStreamPlus.java:45) ~[apache-cassandra-2.1.14.jar:2.1.14]
        at org.apache.cassandra.io.util.AbstractDataOutput.writeLong(AbstractDataOutput.java:227) ~[apache-cassandra-2.1.14.jar:2.1.14]
        at org.apache.cassandra.utils.UUIDSerializer.serialize(UUIDSerializer.java:34) ~[apache-cassandra-2.1.14.jar:2.1.14]
        at org.apache.cassandra.streaming.messages.ReceivedMessage$1.serialize(ReceivedMessage.java:43) ~[apache-cassandra-2.1.14.jar:2.1.14]
        at org.apache.cassandra.streaming.messages.ReceivedMessage$1.serialize(ReceivedMessage.java:34) ~[apache-cassandra-2.1.14.jar:2.1.14]
        at org.apache.cassandra.streaming.messages.StreamMessage.serialize(StreamMessage.java:45) ~[apache-cassandra-2.1.14.jar:2.1.14]
        at org.apache.cassandra.streaming.ConnectionHandler$OutgoingMessageHandler.sendMessage(ConnectionHandler.java:358) [apache-cassandra-2.1.14.jar:2.1.14]
        at org.apache.cassandra.streaming.ConnectionHandler$OutgoingMessageHandler.run(ConnectionHandler.java:330) [apache-cassandra-2.1.14.jar:2.1.14]
        at java.lang.Thread.run(Thread.java:745) [na:1.8.0_91]
INFO  [STREAM-IN-/10.192.32.143] 2016-05-17 04:30:50,765 StreamResultFuture.java:180 - [Stream #e2aa8620-1b87-11e6-a7a3-71365c354c46] Session with /10.192.32.143 is complete
ERROR [STREAM-OUT-/10.192.32.143] 2016-05-17 04:30:50,766 StreamSession.java:505 - [Stream #e2aa8620-1b87-11e6-a7a3-71365c354c46] Streaming error occurred
java.io.IOException: Broken pipe
        at sun.nio.ch.FileDispatcherImpl.write0(Native Method) ~[na:1.8.0_91]
        at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47) ~[na:1.8.0_91]
        at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93) ~[na:1.8.0_91]
        at sun.nio.ch.IOUtil.write(IOUtil.java:65) ~[na:1.8.0_91]
        at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:471) ~[na:1.8.0_91]
        at org.apache.cassandra.io.util.DataOutputStreamAndChannel.write(DataOutputStreamAndChannel.java:48) ~[apache-cassandra-2.1.14.jar:2.1.14]
        at org.apache.cassandra.streaming.messages.StreamMessage.serialize(StreamMessage.java:44) ~[apache-cassandra-2.1.14.jar:2.1.14]
        at org.apache.cassandra.streaming.ConnectionHandler$OutgoingMessageHandler.sendMessage(ConnectionHandler.java:358) [apache-cassandra-2.1.14.jar:2.1.14]
        at org.apache.cassandra.streaming.ConnectionHandler$OutgoingMessageHandler.run(ConnectionHandler.java:338) [apache-cassandra-2.1.14.jar:2.1.14]
        at java.lang.Thread.run(Thread.java:745) [na:1.8.0_91]

Mentioned in SAL [2016-05-17T18:34:40Z] <urandom> Restart restbase2008-a.codfw.wmnet; Hail Mary pass for failed 2008-b bootstraps : T95253

Mentioned in SAL [2016-05-17T18:36:07Z] <urandom> Restarting (failed) bootstrap of restbase2008-b.codfw.wmnet : T95253

same error today after some hours on restbase2008-a

ERROR [STREAM-IN-/10.192.32.144] 2016-05-17 21:04:44,489 StreamSession.java:505 - [Stream #9a455f40-1c5e-11e6-ad1e-71365c354c46] Streaming error occurred
java.net.SocketTimeoutException: null
        at sun.nio.ch.SocketAdaptor$SocketInputStream.read(SocketAdaptor.java:211) ~[na:1.8.0_91]
        at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:103) ~[na:1.8.0_91]
        at java.nio.channels.Channels$ReadableByteChannelImpl.read(Channels.java:385) ~[na:1.8.0_91]
        at org.apache.cassandra.streaming.messages.StreamMessage.deserialize(StreamMessage.java:51) ~[apache-cassandra-2.1.14.jar:2.1.14]
        at org.apache.cassandra.streaming.ConnectionHandler$IncomingMessageHandler.run(ConnectionHandler.java:257) ~[apache-cassandra-2.1.14.jar:2.1.14]
        at java.lang.Thread.run(Thread.java:745) [na:1.8.0_91]

and restbase2008-b

ERROR [STREAM-IN-/10.192.32.143] 2016-05-17 21:28:24,018 StreamSession.java:621 - [Stream #9a455f40-1c5e-11e6-ad1e-71365c354c46] Remote peer 10.192.32.143 failed stream session.
ERROR [STREAM-OUT-/10.192.32.143] 2016-05-17 21:28:24,026 StreamSession.java:505 - [Stream #9a455f40-1c5e-11e6-ad1e-71365c354c46] Streaming error occurred
java.io.IOException: Broken pipe
        at sun.nio.ch.FileDispatcherImpl.write0(Native Method) ~[na:1.8.0_91]
        at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47) ~[na:1.8.0_91]
        at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93) ~[na:1.8.0_91]
        at sun.nio.ch.IOUtil.write(IOUtil.java:65) ~[na:1.8.0_91]
        at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:471) ~[na:1.8.0_91]
        at java.nio.channels.Channels.writeFullyImpl(Channels.java:78) ~[na:1.8.0_91]
        at java.nio.channels.Channels.writeFully(Channels.java:98) ~[na:1.8.0_91]
        at java.nio.channels.Channels.access$000(Channels.java:61) ~[na:1.8.0_91]
        at java.nio.channels.Channels$1.write(Channels.java:174) ~[na:1.8.0_91]
        at java.io.OutputStream.write(OutputStream.java:75) ~[na:1.8.0_91]
        at java.nio.channels.Channels$1.write(Channels.java:155) ~[na:1.8.0_91]
        at org.apache.cassandra.io.util.DataOutputStreamPlus.write(DataOutputStreamPlus.java:45) ~[apache-cassandra-2.1.14.jar:2.1.14]
        at org.apache.cassandra.io.util.AbstractDataOutput.writeLong(AbstractDataOutput.java:227) ~[apache-cassandra-2.1.14.jar:2.1.14]
        at org.apache.cassandra.utils.UUIDSerializer.serialize(UUIDSerializer.java:34) ~[apache-cassandra-2.1.14.jar:2.1.14]
        at org.apache.cassandra.streaming.messages.ReceivedMessage$1.serialize(ReceivedMessage.java:43) ~[apache-cassandra-2.1.14.jar:2.1.14]
        at org.apache.cassandra.streaming.messages.ReceivedMessage$1.serialize(ReceivedMessage.java:34) ~[apache-cassandra-2.1.14.jar:2.1.14]
        at org.apache.cassandra.streaming.messages.StreamMessage.serialize(StreamMessage.java:45) ~[apache-cassandra-2.1.14.jar:2.1.14]
        at org.apache.cassandra.streaming.ConnectionHandler$OutgoingMessageHandler.sendMessage(ConnectionHandler.java:358) [apache-cassandra-2.1.14.jar:2.1.14]
        at org.apache.cassandra.streaming.ConnectionHandler$OutgoingMessageHandler.run(ConnectionHandler.java:330) [apache-cassandra-2.1.14.jar:2.1.14]
        at java.lang.Thread.run(Thread.java:745) [na:1.8.0_91]
INFO  [NativePoolCleaner] 2016-05-17 21:28:24,216 ColumnFamilyStore.java:1211 - Flushing largest CFS(Keyspace='local_group_wikipedia_T_mobileapps_lead', ColumnFamily='data') to free up room. Used total: 0.04/0.33, live: 0.04/0.33, flushing: 0.00/0.00, this: 0.00/0.00
INFO  [NativePoolCleaner] 2016-05-17 21:28:24,217 ColumnFamilyStore.java:905 - Enqueuing flush of data: 2892772 (0%) on-heap, 97611360 (3%) off-heap
INFO  [MemtableFlushWriter:87] 2016-05-17 21:28:24,219 Memtable.java:347 - Writing Memtable-data@1982421767(83.903MiB serialized bytes, 44052 ops, 0%/3% of on/off-heap limit)
INFO  [MemtableFlushWriter:87] 2016-05-17 21:28:26,257 Memtable.java:382 - Completed flushing /srv/cassandra-b/data/local_group_wikipedia_T_mobileapps_lead/data-2b8478e08e0911e5ab9881ba0e170b9f/local_group_wikipedia_T_mobileapps_lead-data-tmp-ka-15-Data.db (9.862MiB) for commitlog position ReplayPosition(segmentId=1463510295970, position=4497148)
INFO  [STREAM-IN-/10.192.32.143] 2016-05-17 21:28:26,566 StreamResultFuture.java:180 - [Stream #9a455f40-1c5e-11e6-ad1e-71365c354c46] Session with /10.192.32.143 is complete
ERROR [STREAM-OUT-/10.192.32.143] 2016-05-17 21:28:26,566 StreamSession.java:505 - [Stream #9a455f40-1c5e-11e6-ad1e-71365c354c46] Streaming error occurred
java.io.IOException: Broken pipe
        at sun.nio.ch.FileDispatcherImpl.write0(Native Method) ~[na:1.8.0_91]
        at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47) ~[na:1.8.0_91]
        at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93) ~[na:1.8.0_91]
        at sun.nio.ch.IOUtil.write(IOUtil.java:65) ~[na:1.8.0_91]
        at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:471) ~[na:1.8.0_91]
        at org.apache.cassandra.io.util.DataOutputStreamAndChannel.write(DataOutputStreamAndChannel.java:48) ~[apache-cassandra-2.1.14.jar:2.1.14]
        at org.apache.cassandra.streaming.messages.StreamMessage.serialize(StreamMessage.java:44) ~[apache-cassandra-2.1.14.jar:2.1.14]
        at org.apache.cassandra.streaming.ConnectionHandler$OutgoingMessageHandler.sendMessage(ConnectionHandler.java:358) [apache-cassandra-2.1.14.jar:2.1.14]
        at org.apache.cassandra.streaming.ConnectionHandler$OutgoingMessageHandler.run(ConnectionHandler.java:338) [apache-cassandra-2.1.14.jar:2.1.14]
        at java.lang.Thread.run(Thread.java:745) [na:1.8.0_91]

also grepping for the stream id

restbase2008:/var/log/cassandra$ fgrep 9a455f40-1c5e-11e6-ad1e-7136 system-b.log
INFO  [main] 2016-05-17 18:45:42,189 StreamResultFuture.java:86 - [Stream #9a455f40-1c5e-11e6-ad1e-71365c354c46] Executing streaming plan for Bootstrap
INFO  [StreamConnectionEstablisher:1] 2016-05-17 18:45:42,190 StreamSession.java:220 - [Stream #9a455f40-1c5e-11e6-ad1e-71365c354c46] Starting streaming to /10.64.48.130
INFO  [StreamConnectionEstablisher:2] 2016-05-17 18:45:42,190 StreamSession.java:220 - [Stream #9a455f40-1c5e-11e6-ad1e-71365c354c46] Starting streaming to /10.64.32.195
INFO  [StreamConnectionEstablisher:3] 2016-05-17 18:45:42,190 StreamSession.java:220 - [Stream #9a455f40-1c5e-11e6-ad1e-71365c354c46] Starting streaming to /10.64.48.135
INFO  [StreamConnectionEstablisher:4] 2016-05-17 18:45:42,190 StreamSession.java:220 - [Stream #9a455f40-1c5e-11e6-ad1e-71365c354c46] Starting streaming to /10.64.48.136
INFO  [StreamConnectionEstablisher:5] 2016-05-17 18:45:42,191 StreamSession.java:220 - [Stream #9a455f40-1c5e-11e6-ad1e-71365c354c46] Starting streaming to /10.192.32.137
INFO  [StreamConnectionEstablisher:6] 2016-05-17 18:45:42,191 StreamSession.java:220 - [Stream #9a455f40-1c5e-11e6-ad1e-71365c354c46] Starting streaming to /10.64.32.202
INFO  [StreamConnectionEstablisher:7] 2016-05-17 18:45:42,191 StreamSession.java:220 - [Stream #9a455f40-1c5e-11e6-ad1e-71365c354c46] Starting streaming to /10.64.48.138
INFO  [StreamConnectionEstablisher:8] 2016-05-17 18:45:42,191 StreamSession.java:220 - [Stream #9a455f40-1c5e-11e6-ad1e-71365c354c46] Starting streaming to /10.64.32.203
INFO  [StreamConnectionEstablisher:9] 2016-05-17 18:45:42,191 StreamSession.java:220 - [Stream #9a455f40-1c5e-11e6-ad1e-71365c354c46] Starting streaming to /10.64.48.139
INFO  [StreamConnectionEstablisher:10] 2016-05-17 18:45:42,191 StreamSession.java:220 - [Stream #9a455f40-1c5e-11e6-ad1e-71365c354c46] Starting streaming to /10.64.32.205
INFO  [StreamConnectionEstablisher:11] 2016-05-17 18:45:42,191 StreamSession.java:220 - [Stream #9a455f40-1c5e-11e6-ad1e-71365c354c46] Starting streaming to /10.64.32.206
INFO  [StreamConnectionEstablisher:12] 2016-05-17 18:45:42,191 StreamSession.java:220 - [Stream #9a455f40-1c5e-11e6-ad1e-71365c354c46] Starting streaming to /10.192.32.143
INFO  [StreamConnectionEstablisher:13] 2016-05-17 18:45:42,191 StreamSession.java:220 - [Stream #9a455f40-1c5e-11e6-ad1e-71365c354c46] Starting streaming to /10.192.16.162
INFO  [StreamConnectionEstablisher:14] 2016-05-17 18:45:42,191 StreamSession.java:220 - [Stream #9a455f40-1c5e-11e6-ad1e-71365c354c46] Starting streaming to /10.192.16.163
INFO  [StreamConnectionEstablisher:15] 2016-05-17 18:45:42,191 StreamSession.java:220 - [Stream #9a455f40-1c5e-11e6-ad1e-71365c354c46] Starting streaming to /10.192.16.164
INFO  [StreamConnectionEstablisher:16] 2016-05-17 18:45:42,192 StreamSession.java:220 - [Stream #9a455f40-1c5e-11e6-ad1e-71365c354c46] Starting streaming to /10.192.48.37
INFO  [StreamConnectionEstablisher:17] 2016-05-17 18:45:42,192 StreamSession.java:220 - [Stream #9a455f40-1c5e-11e6-ad1e-71365c354c46] Starting streaming to /10.192.16.165
INFO  [StreamConnectionEstablisher:18] 2016-05-17 18:45:42,193 StreamSession.java:220 - [Stream #9a455f40-1c5e-11e6-ad1e-71365c354c46] Starting streaming to /10.192.48.38
INFO  [StreamConnectionEstablisher:19] 2016-05-17 18:45:42,193 StreamSession.java:220 - [Stream #9a455f40-1c5e-11e6-ad1e-71365c354c46] Starting streaming to /10.64.0.230
INFO  [StreamConnectionEstablisher:20] 2016-05-17 18:45:42,193 StreamSession.java:220 - [Stream #9a455f40-1c5e-11e6-ad1e-71365c354c46] Starting streaming to /10.192.16.166
INFO  [StreamConnectionEstablisher:21] 2016-05-17 18:45:42,193 StreamSession.java:220 - [Stream #9a455f40-1c5e-11e6-ad1e-71365c354c46] Starting streaming to /10.64.0.231
INFO  [StreamConnectionEstablisher:22] 2016-05-17 18:45:42,194 StreamSession.java:220 - [Stream #9a455f40-1c5e-11e6-ad1e-71365c354c46] Starting streaming to /10.192.16.167
INFO  [StreamConnectionEstablisher:23] 2016-05-17 18:45:42,194 StreamSession.java:220 - [Stream #9a455f40-1c5e-11e6-ad1e-71365c354c46] Starting streaming to /10.192.16.176
INFO  [StreamConnectionEstablisher:24] 2016-05-17 18:45:42,195 StreamSession.java:220 - [Stream #9a455f40-1c5e-11e6-ad1e-71365c354c46] Starting streaming to /10.192.16.177
INFO  [StreamConnectionEstablisher:25] 2016-05-17 18:45:42,195 StreamSession.java:220 - [Stream #9a455f40-1c5e-11e6-ad1e-71365c354c46] Starting streaming to /10.64.0.114
INFO  [StreamConnectionEstablisher:26] 2016-05-17 18:45:42,196 StreamSession.java:220 - [Stream #9a455f40-1c5e-11e6-ad1e-71365c354c46] Starting streaming to /10.64.0.115
INFO  [StreamConnectionEstablisher:27] 2016-05-17 18:45:42,196 StreamSession.java:220 - [Stream #9a455f40-1c5e-11e6-ad1e-71365c354c46] Starting streaming to /10.64.0.117
INFO  [StreamConnectionEstablisher:28] 2016-05-17 18:45:42,196 StreamSession.java:220 - [Stream #9a455f40-1c5e-11e6-ad1e-71365c354c46] Starting streaming to /10.192.48.54
INFO  [StreamConnectionEstablisher:29] 2016-05-17 18:45:42,197 StreamSession.java:220 - [Stream #9a455f40-1c5e-11e6-ad1e-71365c354c46] Starting streaming to /10.64.0.118
INFO  [StreamConnectionEstablisher:30] 2016-05-17 18:45:42,197 StreamSession.java:220 - [Stream #9a455f40-1c5e-11e6-ad1e-71365c354c46] Starting streaming to /10.192.48.55
INFO  [StreamConnectionEstablisher:31] 2016-05-17 18:45:42,201 StreamSession.java:220 - [Stream #9a455f40-1c5e-11e6-ad1e-71365c354c46] Starting streaming to /10.64.48.120
INFO  [StreamConnectionEstablisher:32] 2016-05-17 18:45:42,201 StreamSession.java:220 - [Stream #9a455f40-1c5e-11e6-ad1e-71365c354c46] Starting streaming to /10.64.32.187
INFO  [StreamConnectionEstablisher:24] 2016-05-17 18:45:42,204 StreamCoordinator.java:209 - [Stream #9a455f40-1c5e-11e6-ad1e-71365c354c46, ID#0] Beginning stream session with /10.192.16.177
INFO  [StreamConnectionEstablisher:18] 2016-05-17 18:45:42,204 StreamCoordinator.java:209 - [Stream #9a455f40-1c5e-11e6-ad1e-71365c354c46, ID#0] Beginning stream session with /10.192.48.38
INFO  [StreamConnectionEstablisher:22] 2016-05-17 18:45:42,204 StreamCoordinator.java:209 - [Stream #9a455f40-1c5e-11e6-ad1e-71365c354c46, ID#0] Beginning stream session with /10.192.16.167
INFO  [StreamConnectionEstablisher:28] 2016-05-17 18:45:42,204 StreamCoordinator.java:209 - [Stream #9a455f40-1c5e-11e6-ad1e-71365c354c46, ID#0] Beginning stream session with /10.192.48.54
INFO  [StreamConnectionEstablisher:15] 2016-05-17 18:45:42,204 StreamCoordinator.java:209 - [Stream #9a455f40-1c5e-11e6-ad1e-71365c354c46, ID#0] Beginning stream session with /10.192.16.164
INFO  [StreamConnectionEstablisher:30] 2016-05-17 18:45:42,205 StreamCoordinator.java:209 - [Stream #9a455f40-1c5e-11e6-ad1e-71365c354c46, ID#0] Beginning stream session with /10.192.48.55
INFO  [StreamConnectionEstablisher:23] 2016-05-17 18:45:42,204 StreamCoordinator.java:209 - [Stream #9a455f40-1c5e-11e6-ad1e-71365c354c46, ID#0] Beginning stream session with /10.192.16.176
INFO  [StreamConnectionEstablisher:20] 2016-05-17 18:45:42,204 StreamCoordinator.java:209 - [Stream #9a455f40-1c5e-11e6-ad1e-71365c354c46, ID#0] Beginning stream session with /10.192.16.166
INFO  [StreamConnectionEstablisher:14] 2016-05-17 18:45:42,205 StreamCoordinator.java:209 - [Stream #9a455f40-1c5e-11e6-ad1e-71365c354c46, ID#0] Beginning stream session with /10.192.16.163
INFO  [StreamConnectionEstablisher:17] 2016-05-17 18:45:42,205 StreamCoordinator.java:209 - [Stream #9a455f40-1c5e-11e6-ad1e-71365c354c46, ID#0] Beginning stream session with /10.192.16.165
INFO  [StreamConnectionEstablisher:16] 2016-05-17 18:45:42,205 StreamCoordinator.java:209 - [Stream #9a455f40-1c5e-11e6-ad1e-71365c354c46, ID#0] Beginning stream session with /10.192.48.37
INFO  [StreamConnectionEstablisher:12] 2016-05-17 18:45:42,205 StreamCoordinator.java:209 - [Stream #9a455f40-1c5e-11e6-ad1e-71365c354c46, ID#0] Beginning stream session with /10.192.32.143
INFO  [StreamConnectionEstablisher:5] 2016-05-17 18:45:42,205 StreamCoordinator.java:209 - [Stream #9a455f40-1c5e-11e6-ad1e-71365c354c46, ID#0] Beginning stream session with /10.192.32.137
INFO  [StreamConnectionEstablisher:24] 2016-05-17 18:45:42,205 StreamSession.java:220 - [Stream #9a455f40-1c5e-11e6-ad1e-71365c354c46] Starting streaming to /10.192.32.124
INFO  [StreamConnectionEstablisher:13] 2016-05-17 18:45:42,204 StreamCoordinator.java:209 - [Stream #9a455f40-1c5e-11e6-ad1e-71365c354c46, ID#0] Beginning stream session with /10.192.16.162
INFO  [StreamConnectionEstablisher:24] 2016-05-17 18:45:42,207 StreamCoordinator.java:209 - [Stream #9a455f40-1c5e-11e6-ad1e-71365c354c46, ID#0] Beginning stream session with /10.192.32.124
INFO  [STREAM-IN-/10.192.16.163] 2016-05-17 18:45:42,213 StreamResultFuture.java:180 - [Stream #9a455f40-1c5e-11e6-ad1e-71365c354c46] Session with /10.192.16.163 is complete
INFO  [STREAM-IN-/10.192.16.162] 2016-05-17 18:45:42,213 StreamResultFuture.java:180 - [Stream #9a455f40-1c5e-11e6-ad1e-71365c354c46] Session with /10.192.16.162 is complete
INFO  [STREAM-IN-/10.192.16.177] 2016-05-17 18:45:42,215 StreamResultFuture.java:180 - [Stream #9a455f40-1c5e-11e6-ad1e-71365c354c46] Session with /10.192.16.177 is complete
INFO  [STREAM-IN-/10.192.48.55] 2016-05-17 18:45:42,215 StreamResultFuture.java:180 - [Stream #9a455f40-1c5e-11e6-ad1e-71365c354c46] Session with /10.192.48.55 is complete
INFO  [STREAM-IN-/10.192.16.165] 2016-05-17 18:45:42,215 StreamResultFuture.java:180 - [Stream #9a455f40-1c5e-11e6-ad1e-71365c354c46] Session with /10.192.16.165 is complete
INFO  [STREAM-IN-/10.192.16.167] 2016-05-17 18:45:42,215 StreamResultFuture.java:180 - [Stream #9a455f40-1c5e-11e6-ad1e-71365c354c46] Session with /10.192.16.167 is complete
INFO  [STREAM-IN-/10.192.48.54] 2016-05-17 18:45:42,215 StreamResultFuture.java:180 - [Stream #9a455f40-1c5e-11e6-ad1e-71365c354c46] Session with /10.192.48.54 is complete
INFO  [STREAM-IN-/10.192.16.166] 2016-05-17 18:45:42,215 StreamResultFuture.java:180 - [Stream #9a455f40-1c5e-11e6-ad1e-71365c354c46] Session with /10.192.16.166 is complete
INFO  [STREAM-IN-/10.192.16.176] 2016-05-17 18:45:42,215 StreamResultFuture.java:180 - [Stream #9a455f40-1c5e-11e6-ad1e-71365c354c46] Session with /10.192.16.176 is complete
INFO  [STREAM-IN-/10.192.48.37] 2016-05-17 18:45:42,215 StreamResultFuture.java:180 - [Stream #9a455f40-1c5e-11e6-ad1e-71365c354c46] Session with /10.192.48.37 is complete
INFO  [STREAM-IN-/10.192.16.164] 2016-05-17 18:45:42,218 StreamResultFuture.java:180 - [Stream #9a455f40-1c5e-11e6-ad1e-71365c354c46] Session with /10.192.16.164 is complete
INFO  [STREAM-IN-/10.192.48.38] 2016-05-17 18:45:42,563 StreamResultFuture.java:180 - [Stream #9a455f40-1c5e-11e6-ad1e-71365c354c46] Session with /10.192.48.38 is complete
INFO  [StreamConnectionEstablisher:7] 2016-05-17 18:45:42,722 StreamCoordinator.java:209 - [Stream #9a455f40-1c5e-11e6-ad1e-71365c354c46, ID#0] Beginning stream session with /10.64.48.138
INFO  [StreamConnectionEstablisher:4] 2016-05-17 18:45:42,722 StreamCoordinator.java:209 - [Stream #9a455f40-1c5e-11e6-ad1e-71365c354c46, ID#0] Beginning stream session with /10.64.48.136
INFO  [StreamConnectionEstablisher:6] 2016-05-17 18:45:42,722 StreamCoordinator.java:209 - [Stream #9a455f40-1c5e-11e6-ad1e-71365c354c46, ID#0] Beginning stream session with /10.64.32.202
INFO  [StreamConnectionEstablisher:2] 2016-05-17 18:45:42,723 StreamCoordinator.java:209 - [Stream #9a455f40-1c5e-11e6-ad1e-71365c354c46, ID#0] Beginning stream session with /10.64.32.195
INFO  [StreamConnectionEstablisher:3] 2016-05-17 18:45:42,723 StreamCoordinator.java:209 - [Stream #9a455f40-1c5e-11e6-ad1e-71365c354c46, ID#0] Beginning stream session with /10.64.48.135
INFO  [StreamConnectionEstablisher:10] 2016-05-17 18:45:42,726 StreamCoordinator.java:209 - [Stream #9a455f40-1c5e-11e6-ad1e-71365c354c46, ID#0] Beginning stream session with /10.64.32.205
INFO  [StreamConnectionEstablisher:8] 2016-05-17 18:45:42,726 StreamCoordinator.java:209 - [Stream #9a455f40-1c5e-11e6-ad1e-71365c354c46, ID#0] Beginning stream session with /10.64.32.203
INFO  [StreamConnectionEstablisher:9] 2016-05-17 18:45:42,727 StreamCoordinator.java:209 - [Stream #9a455f40-1c5e-11e6-ad1e-71365c354c46, ID#0] Beginning stream session with /10.64.48.139
INFO  [StreamConnectionEstablisher:11] 2016-05-17 18:45:42,727 StreamCoordinator.java:209 - [Stream #9a455f40-1c5e-11e6-ad1e-71365c354c46, ID#0] Beginning stream session with /10.64.32.206
INFO  [StreamConnectionEstablisher:32] 2016-05-17 18:45:42,728 StreamCoordinator.java:209 - [Stream #9a455f40-1c5e-11e6-ad1e-71365c354c46, ID#0] Beginning stream session with /10.64.32.187
INFO  [StreamConnectionEstablisher:1] 2016-05-17 18:45:42,728 StreamCoordinator.java:209 - [Stream #9a455f40-1c5e-11e6-ad1e-71365c354c46, ID#0] Beginning stream session with /10.64.48.130
INFO  [StreamConnectionEstablisher:26] 2016-05-17 18:45:42,731 StreamCoordinator.java:209 - [Stream #9a455f40-1c5e-11e6-ad1e-71365c354c46, ID#0] Beginning stream session with /10.64.0.115
INFO  [StreamConnectionEstablisher:25] 2016-05-17 18:45:42,732 StreamCoordinator.java:209 - [Stream #9a455f40-1c5e-11e6-ad1e-71365c354c46, ID#0] Beginning stream session with /10.64.0.114
INFO  [StreamConnectionEstablisher:29] 2016-05-17 18:45:42,733 StreamCoordinator.java:209 - [Stream #9a455f40-1c5e-11e6-ad1e-71365c354c46, ID#0] Beginning stream session with /10.64.0.118
INFO  [StreamConnectionEstablisher:19] 2016-05-17 18:45:42,733 StreamCoordinator.java:209 - [Stream #9a455f40-1c5e-11e6-ad1e-71365c354c46, ID#0] Beginning stream session with /10.64.0.230
INFO  [StreamConnectionEstablisher:21] 2016-05-17 18:45:42,738 StreamCoordinator.java:209 - [Stream #9a455f40-1c5e-11e6-ad1e-71365c354c46, ID#0] Beginning stream session with /10.64.0.231
INFO  [StreamConnectionEstablisher:31] 2016-05-17 18:45:42,740 StreamCoordinator.java:209 - [Stream #9a455f40-1c5e-11e6-ad1e-71365c354c46, ID#0] Beginning stream session with /10.64.48.120
INFO  [StreamConnectionEstablisher:27] 2016-05-17 18:45:42,741 StreamCoordinator.java:209 - [Stream #9a455f40-1c5e-11e6-ad1e-71365c354c46, ID#0] Beginning stream session with /10.64.0.117
INFO  [STREAM-IN-/10.64.0.114] 2016-05-17 18:45:42,893 StreamResultFuture.java:180 - [Stream #9a455f40-1c5e-11e6-ad1e-71365c354c46] Session with /10.64.0.114 is complete
INFO  [STREAM-IN-/10.64.48.136] 2016-05-17 18:45:42,894 StreamResultFuture.java:180 - [Stream #9a455f40-1c5e-11e6-ad1e-71365c354c46] Session with /10.64.48.136 is complete
INFO  [STREAM-IN-/10.64.32.202] 2016-05-17 18:45:42,894 StreamResultFuture.java:180 - [Stream #9a455f40-1c5e-11e6-ad1e-71365c354c46] Session with /10.64.32.202 is complete
INFO  [STREAM-IN-/10.64.48.135] 2016-05-17 18:45:42,894 StreamResultFuture.java:180 - [Stream #9a455f40-1c5e-11e6-ad1e-71365c354c46] Session with /10.64.48.135 is complete
INFO  [STREAM-IN-/10.64.0.115] 2016-05-17 18:45:42,894 StreamResultFuture.java:180 - [Stream #9a455f40-1c5e-11e6-ad1e-71365c354c46] Session with /10.64.0.115 is complete
INFO  [STREAM-IN-/10.64.32.203] 2016-05-17 18:45:42,894 StreamResultFuture.java:180 - [Stream #9a455f40-1c5e-11e6-ad1e-71365c354c46] Session with /10.64.32.203 is complete
INFO  [STREAM-IN-/10.64.0.118] 2016-05-17 18:45:42,898 StreamResultFuture.java:180 - [Stream #9a455f40-1c5e-11e6-ad1e-71365c354c46] Session with /10.64.0.118 is complete
INFO  [STREAM-IN-/10.64.0.230] 2016-05-17 18:45:42,898 StreamResultFuture.java:180 - [Stream #9a455f40-1c5e-11e6-ad1e-71365c354c46] Session with /10.64.0.230 is complete
INFO  [STREAM-IN-/10.64.32.205] 2016-05-17 18:45:42,898 StreamResultFuture.java:180 - [Stream #9a455f40-1c5e-11e6-ad1e-71365c354c46] Session with /10.64.32.205 is complete
INFO  [STREAM-IN-/10.64.32.187] 2016-05-17 18:45:42,898 StreamResultFuture.java:180 - [Stream #9a455f40-1c5e-11e6-ad1e-71365c354c46] Session with /10.64.32.187 is complete
INFO  [STREAM-IN-/10.64.48.130] 2016-05-17 18:45:42,901 StreamResultFuture.java:180 - [Stream #9a455f40-1c5e-11e6-ad1e-71365c354c46] Session with /10.64.48.130 is complete
INFO  [STREAM-IN-/10.64.0.231] 2016-05-17 18:45:42,901 StreamResultFuture.java:180 - [Stream #9a455f40-1c5e-11e6-ad1e-71365c354c46] Session with /10.64.0.231 is complete
INFO  [STREAM-IN-/10.64.32.195] 2016-05-17 18:45:42,902 StreamResultFuture.java:180 - [Stream #9a455f40-1c5e-11e6-ad1e-71365c354c46] Session with /10.64.32.195 is complete
INFO  [STREAM-IN-/10.64.0.117] 2016-05-17 18:45:42,902 StreamResultFuture.java:180 - [Stream #9a455f40-1c5e-11e6-ad1e-71365c354c46] Session with /10.64.0.117 is complete
INFO  [STREAM-IN-/10.64.32.206] 2016-05-17 18:45:42,902 StreamResultFuture.java:180 - [Stream #9a455f40-1c5e-11e6-ad1e-71365c354c46] Session with /10.64.32.206 is complete
INFO  [STREAM-IN-/10.64.48.138] 2016-05-17 18:45:42,902 StreamResultFuture.java:180 - [Stream #9a455f40-1c5e-11e6-ad1e-71365c354c46] Session with /10.64.48.138 is complete
INFO  [STREAM-IN-/10.64.48.139] 2016-05-17 18:45:42,902 StreamResultFuture.java:180 - [Stream #9a455f40-1c5e-11e6-ad1e-71365c354c46] Session with /10.64.48.139 is complete
INFO  [STREAM-IN-/10.64.48.120] 2016-05-17 18:45:42,904 StreamResultFuture.java:180 - [Stream #9a455f40-1c5e-11e6-ad1e-71365c354c46] Session with /10.64.48.120 is complete
INFO  [STREAM-IN-/10.192.32.143] 2016-05-17 18:47:08,666 StreamResultFuture.java:166 - [Stream #9a455f40-1c5e-11e6-ad1e-71365c354c46 ID#0] Prepare completed. Receiving 2513 files(511403311497 bytes), sending 0 files(0 bytes)
INFO  [STREAM-IN-/10.192.32.124] 2016-05-17 18:47:11,968 StreamResultFuture.java:166 - [Stream #9a455f40-1c5e-11e6-ad1e-71365c354c46 ID#0] Prepare completed. Receiving 2576 files(512166422291 bytes), sending 0 files(0 bytes)
INFO  [STREAM-IN-/10.192.32.137] 2016-05-17 18:47:22,416 StreamResultFuture.java:166 - [Stream #9a455f40-1c5e-11e6-ad1e-71365c354c46 ID#0] Prepare completed. Receiving 2566 files(624444080419 bytes), sending 0 files(0 bytes)
ERROR [STREAM-IN-/10.192.32.143] 2016-05-17 21:28:24,018 StreamSession.java:621 - [Stream #9a455f40-1c5e-11e6-ad1e-71365c354c46] Remote peer 10.192.32.143 failed stream session.
ERROR [STREAM-OUT-/10.192.32.143] 2016-05-17 21:28:24,026 StreamSession.java:505 - [Stream #9a455f40-1c5e-11e6-ad1e-71365c354c46] Streaming error occurred
INFO  [STREAM-IN-/10.192.32.143] 2016-05-17 21:28:26,566 StreamResultFuture.java:180 - [Stream #9a455f40-1c5e-11e6-ad1e-71365c354c46] Session with /10.192.32.143 is complete
ERROR [STREAM-OUT-/10.192.32.143] 2016-05-17 21:28:26,566 StreamSession.java:505 - [Stream #9a455f40-1c5e-11e6-ad1e-71365c354c46] Streaming error occurred
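
For a sense of how much data this bootstrap was trying to pull, the three "Prepare completed" lines above can be tallied with a short script. This is only a sketch: the regular expression is written against the log format shown here, and the function name `incoming_totals` is made up for illustration.

```python
import re

# Matches the "Prepare completed" lines logged by StreamResultFuture,
# as they appear in the grep output above.
PREPARE_RE = re.compile(
    r"\[STREAM-IN-/(?P<peer>[\d.]+)\].*Prepare completed\. "
    r"Receiving (?P<files>\d+) files\((?P<bytes>\d+) bytes\)"
)


def incoming_totals(log_lines):
    """Return {peer: (files, bytes)} for every 'Prepare completed' line."""
    totals = {}
    for line in log_lines:
        m = PREPARE_RE.search(line)
        if m:
            totals[m.group("peer")] = (int(m.group("files")), int(m.group("bytes")))
    return totals


STREAM = "[Stream #9a455f40-1c5e-11e6-ad1e-71365c354c46 ID#0]"
SAMPLE = [
    f"INFO  [STREAM-IN-/10.192.32.143] 2016-05-17 18:47:08,666 StreamResultFuture.java:166 - {STREAM} "
    "Prepare completed. Receiving 2513 files(511403311497 bytes), sending 0 files(0 bytes)",
    f"INFO  [STREAM-IN-/10.192.32.124] 2016-05-17 18:47:11,968 StreamResultFuture.java:166 - {STREAM} "
    "Prepare completed. Receiving 2576 files(512166422291 bytes), sending 0 files(0 bytes)",
    f"INFO  [STREAM-IN-/10.192.32.137] 2016-05-17 18:47:22,416 StreamResultFuture.java:166 - {STREAM} "
    "Prepare completed. Receiving 2566 files(624444080419 bytes), sending 0 files(0 bytes)",
]

totals = incoming_totals(SAMPLE)
total_files = sum(files for files, _ in totals.values())
total_bytes = sum(size for _, size in totals.values())
print(f"{total_files} files, {total_bytes} bytes from {len(totals)} peers")
```

Across the three peers that comes to 7655 files and about 1.65 TB (roughly 1.5 TiB).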

Note that restbase200[789] are part of the new batch of hardware and run kernel 4.4 (vs. 3.19) and openjdk 8u91-b14 (vs. 8u72-b15), yet restbase200[79] bootstrapped both instances fine.

Same grep for the stream ID on restbase2008-a:

restbase2008:/var/log$ fgrep 9a455f40-1c5e-11e6-ad1e-7136 cassandra/system-a.log
INFO  [STREAM-INIT-/10.192.32.144:39146] 2016-05-17 18:45:42,203 StreamResultFuture.java:109 - [Stream #9a455f40-1c5e-11e6-ad1e-71365c354c46 ID#0] Creating new streaming plan for Bootstrap
INFO  [STREAM-INIT-/10.192.32.144:39146] 2016-05-17 18:45:42,216 StreamResultFuture.java:116 - [Stream #9a455f40-1c5e-11e6-ad1e-71365c354c46, ID#0] Received streaming plan for Bootstrap
INFO  [STREAM-INIT-/10.192.32.144:44086] 2016-05-17 18:45:42,217 StreamResultFuture.java:116 - [Stream #9a455f40-1c5e-11e6-ad1e-71365c354c46, ID#0] Received streaming plan for Bootstrap
INFO  [STREAM-IN-/10.192.32.144] 2016-05-17 18:47:08,657 StreamResultFuture.java:166 - [Stream #9a455f40-1c5e-11e6-ad1e-71365c354c46 ID#0] Prepare completed. Receiving 0 files(0 bytes), sending 2513 files(511403311497 bytes)
ERROR [STREAM-IN-/10.192.32.144] 2016-05-17 21:04:44,489 StreamSession.java:505 - [Stream #9a455f40-1c5e-11e6-ad1e-71365c354c46] Streaming error occurred
INFO  [STREAM-IN-/10.192.32.144] 2016-05-17 21:27:55,694 StreamResultFuture.java:180 - [Stream #9a455f40-1c5e-11e6-ad1e-71365c354c46] Session with /10.192.32.144 is complete
ERROR [STREAM-OUT-/10.192.32.144] 2016-05-17 21:27:55,697 StreamSession.java:505 - [Stream #9a455f40-1c5e-11e6-ad1e-71365c354c46] Streaming error occurred
WARN  [STREAM-IN-/10.192.32.144] 2016-05-17 21:27:55,699 StreamResultFuture.java:207 - [Stream #9a455f40-1c5e-11e6-ad1e-71365c354c46] Stream failed

Also, judging from the timestamps: at 2016-05-17 21:04:44,489, 2008-a logs a SocketTimeoutException while read()ing from -b, and about 24m later -b gets IOException: broken pipe from the kernel, which I'm assuming is tearing down the socket.
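
As a quick check of that gap, here is a minimal sketch that subtracts the two timestamps taken from the logs above: the first ERROR on 2008-a at 21:04:44,489 and the first ERROR on 2008-b at 21:28:24,018. The helper name `elapsed` is only illustrative.

```python
from datetime import datetime

# Timestamp layout used in the Cassandra system log lines above.
FMT = "%Y-%m-%d %H:%M:%S,%f"


def elapsed(start, end):
    """Return the timedelta between two log timestamps."""
    return datetime.strptime(end, FMT) - datetime.strptime(start, FMT)


timeout_on_a = "2016-05-17 21:04:44,489"  # first streaming error on 2008-a
error_on_b = "2016-05-17 21:28:24,018"    # remote peer failure seen on 2008-b

gap = elapsed(timeout_on_a, error_on_b)
print(gap)
```

The difference is 23 minutes and about 40 seconds, which matches the "24m later" estimate.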

kern.log also reported a failed command yesterday morning; there is no other indication of errors, though, and time-wise it doesn't seem to line up with the bootstraps:

kern.log:May 17 09:58:36 restbase2008 kernel: [1261402.715474] hpsa 0000:03:00.0: scsi 0:1:0:0 Aborting command ffff8813d97087c0Tag:0x00000000:00000120 CDBLen: 10 CDB: 0x2a00... SN: 0x0  BEING SENT
kern.log:May 17 09:58:36 restbase2008 kernel: [1261402.715480] hpsa 0000:03:00.0: scsi 0:1:0:0: Aborting command Direct-Access     HP       LOGICAL VOLUME   RAID-0 SSDSmartPathCap+ En+ Exp=1
kern.log:May 17 09:58:36 restbase2008 kernel: [1261402.715525] hpsa 0000:03:00.0: scsi 0:1:0:0 Aborting command ffff8813d97087c0Tag:0x00000000:00000120 CDBLen: 10 CDB: 0x2a00... SN: 0x0  SENT, FAILED
kern.log:May 17 09:58:36 restbase2008 kernel: [1261402.715530] hpsa 0000:03:00.0: scsi 0:1:0:0: FAILED to abort command Direct-Access     HP       LOGICAL VOLUME   RAID-0 SSDSmartPathCap+ En+ Exp=1
kern.log:May 17 09:58:37 restbase2008 kernel: [1261403.643381] hpsa 0000:03:00.0: scsi 0:1:0:0: resetting logical  Direct-Access     HP       LOGICAL VOLUME   RAID-0 SSDSmartPathCap+ En+ Exp=1
kern.log:May 17 09:59:08 restbase2008 kernel: [1261434.531489] hpsa 0000:03:00.0: scsi 0:1:0:0: reset logical  completed successfully Direct-Access     HP       LOGICAL VOLUME   RAID-0 SSDSmartPathCap+ En+ Exp=1

One difference with 2008 is that it is running Cassandra 2.1.14:

restbase1007.eqiad.wmnet:   Installed: 2.1.13
restbase1010.eqiad.wmnet:   Installed: 2.1.13
restbase1011.eqiad.wmnet:   Installed: 2.1.13
restbase1008.eqiad.wmnet:   Installed: 2.1.13
restbase1012.eqiad.wmnet:   Installed: 2.1.13
restbase1013.eqiad.wmnet:   Installed: 2.1.13
restbase1009.eqiad.wmnet:   Installed: 2.1.13
restbase1014.eqiad.wmnet:   Installed: 2.1.13
restbase1015.eqiad.wmnet:   Installed: 2.1.13
restbase2003.codfw.wmnet:   Installed: 2.1.13
restbase2004.codfw.wmnet:   Installed: 2.1.13
restbase2008.codfw.wmnet:   Installed: 2.1.14
restbase2001.codfw.wmnet:   Installed: 2.1.13
restbase2002.codfw.wmnet:   Installed: 2.1.13
restbase2007.codfw.wmnet:   Installed: 2.1.13
restbase2005.codfw.wmnet:   Installed: 2.1.13
restbase2006.codfw.wmnet:   Installed: 2.1.13
restbase2009.codfw.wmnet:   Installed: 2.1.13

This is no doubt a very easy mistake to make if, for example, the apt repo were to get updated between installs.

This isn't even the only current example (ping @elukey):

aqs1001.eqiad.wmnet:   Installed: 2.1.12
aqs1002.eqiad.wmnet:   Installed: 2.1.12
aqs1003.eqiad.wmnet:   Installed: 2.1.12
aqs1004.eqiad.wmnet:   Installed: 2.1.14
aqs1005.eqiad.wmnet:   Installed: 2.1.14
aqs1006.eqiad.wmnet:   Installed: 2.1.14
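
Version skew like this could be flagged automatically from the same "host: Installed: version" output. The following is a rough sketch, not an existing tool: the function names are hypothetical, and grouping hosts into clusters by stripping the trailing digits from the short hostname is an assumption that happens to fit the names above.

```python
import re
from collections import defaultdict

# One line of output, e.g. "restbase2008.codfw.wmnet:   Installed: 2.1.14"
LINE_RE = re.compile(r"^(?P<host>\S+):\s+Installed:\s+(?P<version>\S+)$")


def versions_by_cluster(lines):
    """Group hosts by (assumed) cluster name, then by installed version."""
    clusters = defaultdict(lambda: defaultdict(list))
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue
        host = m.group("host")
        # Assumption: cluster name = short hostname minus trailing digits.
        cluster = host.split(".")[0].rstrip("0123456789")
        clusters[cluster][m.group("version")].append(host)
    return clusters


def skewed(clusters):
    """Return only the clusters running more than one version."""
    return {name: dict(vers) for name, vers in clusters.items() if len(vers) > 1}


SAMPLE = """\
restbase2004.codfw.wmnet:   Installed: 2.1.13
restbase2008.codfw.wmnet:   Installed: 2.1.14
restbase2001.codfw.wmnet:   Installed: 2.1.13
aqs1003.eqiad.wmnet:   Installed: 2.1.12
aqs1004.eqiad.wmnet:   Installed: 2.1.14
""".splitlines()

skew = skewed(versions_by_cluster(SAMPLE))
for cluster, versions in skew.items():
    print(cluster, versions)
```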

We really need the ability to host multiple versions of the same package; having all Cassandra clusters move in lock-step isn't an option. In fact, for Cassandra clusters we might even go a step further and puppetize apt pinning, so that upgrades can only occur explicitly.
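
To make the pinning idea concrete, here is a sketch of the kind of apt preferences stanza a puppet template might render, one per package. The stanza fields (Package, Pin, Pin-Priority) are standard apt preferences syntax; the `render_pin` helper, the choice of priority 1001 (a priority above 1000 lets apt hold the version even when that means a downgrade), and how this would be wired into puppet are all assumptions, not an existing implementation.

```python
def render_pin(package, version, priority=1001):
    """Render one apt preferences stanza pinning `package` to `version`."""
    return (
        f"Package: {package}\n"
        f"Pin: version {version}\n"
        f"Pin-Priority: {priority}\n"
    )


# Both packages appear in the dpkg output on these hosts.
stanzas = [render_pin(pkg, "2.1.13") for pkg in ("cassandra", "cassandra-tools")]
preferences = "\n".join(stanzas)
print(preferences)
```

The resulting text would typically live in a file under /etc/apt/preferences.d/, and moving to a new version would then require an explicit change to the pinned version.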

I will follow up with any ticket(s) needed for this issue separately.

Mentioned in SAL [2016-05-18T18:31:27Z] <urandom> Stopping failed bootstrap of restbase2008-b.codfw.wmnet : T95253

Mentioned in SAL [2016-05-18T18:32:34Z] <urandom> Stopping restbase2008-a.codfw.wmnet and downgrading Cassandra to 2.1.13 : T95253

Mentioned in SAL [2016-05-18T18:35:02Z] <urandom> Starting restbase2008-a.codfw.wmnet : T95253

Mentioned in SAL [2016-05-18T18:40:35Z] <urandom> Starting bootstrap of restbase2008-b.codfw.wmnet : T95253

Cassandra has now been downgraded to 2.1.13 on restbase2008.codfw.wmnet, and the bootstrap of 2008-b has been restarted.

Eevans mentioned this in T115758: Debian repository supporting multiple package versions. May 18 2016, 7:12 PM

In T95253#2305919, @Eevans wrote:

[ ... ]
I will follow up with any ticket(s) needed for this issue separately.

See T135673: Downgrade Cassandra on apt.wikimedia.org to 2.1.13 and T115758: Debian repository supporting multiple package versions.

@Eevans I stole the 2.1.13 debs in your home dir on restbase2008 and downgraded aqs100[456] :)

elukey@neodymium:~$ sudo -i salt -t 120 aqs100[456]* cmd.run 'dpkg --list |  grep cassandra'
aqs1005.eqiad.wmnet:
    ii  cassandra                      2.1.13                     all          distributed storage system for structured data
    ii  cassandra-tools                2.1.13                     all          distributed storage system for structured data
aqs1006.eqiad.wmnet:
    ii  cassandra                      2.1.13                     all          distributed storage system for structured data
    ii  cassandra-tools                2.1.13                     all          distributed storage system for structured data
aqs1004.eqiad.wmnet:
    ii  cassandra                      2.1.13                     all          distributed storage system for structured data
    ii  cassandra-tools                2.1.13                     all          distributed storage system for structured data

In T95253#2308521, @elukey wrote:

@Eevans I stole the 2.1.13 debs in your home dir on restbase2008 and downgraded aqs100[456] :)

@elukey Cool! Remember to upgrade the existing machines too (at least prior to adding the new nodes to that cluster).

$ cdsh -c aqs -- "grep -i installed <(apt-cache policy cassandra)"
aqs1001.eqiad.wmnet:   Installed: 2.1.12
aqs1002.eqiad.wmnet:   Installed: 2.1.12
aqs1003.eqiad.wmnet:   Installed: 2.1.12
aqs1004.eqiad.wmnet:   Installed: 2.1.13
aqs1005.eqiad.wmnet:   Installed: 2.1.13
aqs1006.eqiad.wmnet:   Installed: 2.1.13

In T95253#2309216, @Eevans wrote:

@elukey Cool! Remember to upgrade the existing machines too (at least prior to adding the new nodes to that cluster).

@Eevans aqs100[123] upgraded to 2.1.13 today!

In T95253#2306337, @Eevans wrote:

Cassandra has now been downgraded to 2.1.13 on restbase2008.codfw.wmnet, and the bootstrap of 2008-b has been restarted.

The bootstrap completed just now; I'll follow up on T132976 with the remaining steps!

Change 290243 had a related patch set uploaded (by Filippo Giunchedi):
cassandra: add restbase2005 instances

https://gerrit.wikimedia.org/r/290243

Change 290244 had a related patch set uploaded (by Filippo Giunchedi):
cassandra: add restbase2006 instances

https://gerrit.wikimedia.org/r/290244

Change 290243 merged by Filippo Giunchedi:
cassandra: add restbase2005 instances

https://gerrit.wikimedia.org/r/290243

fgiunchedi mentioned this in rOPUP774873e507ca: cassandra: add restbase2005 instances. May 23 2016, 3:11 PM

Change 290244 merged by Filippo Giunchedi:
cassandra: add restbase2006 instances

https://gerrit.wikimedia.org/r/290244

fgiunchedi mentioned this in rOPUP09121e7e3397: cassandra: add restbase2006 instances. May 24 2016, 9:05 AM

All restbase machines are multi-instance now; what remains is adding the additional instances for restbase200[356].

Change 290505 had a related patch set uploaded (by Eevans):
enable restbase2006-b.codfw.wmnet

https://gerrit.wikimedia.org/r/290505

Change 290505 merged by Dzahn:
enable restbase2006-b.codfw.wmnet

https://gerrit.wikimedia.org/r/290505

Mentioned in SAL [2016-05-24T18:08:30Z] <urandom> Starting bootstrap of restbase2006-b.codfw.wmnet : T95253

Dzahn mentioned this in rOPUP5611d8e4c9ec: enable restbase2006-b.codfw.wmnet. May 24 2016, 6:11 PM

I think the conversion is technically complete; there are more instances to bootstrap, but we can probably consider them under the scope of T134016: RESTBase Cassandra cluster: Increase instance count to 3.

Should we close this issue as resolved?

I agree this is complete; let's follow up on T134016. Resolving!

Liuxinyu970226 unsubscribed. May 27 2016, 11:59 PM

Eevans mentioned this in rOPUP7e3c67ec344f: enable restbase2006-b.codfw.wmnet. Jun 17 2016, 6:08 PM

fgiunchedi mentioned this in rOPUP8ea93d6ad44c: cassandra: add restbase2005 instances. Jun 17 2016, 6:08 PM

fgiunchedi mentioned this in rOPUP202abfc4cdb6: cassandra: add restbase2006 instances.

• GWicke closed subtask T117114: Ensure ansible-deploy can cope with multi-instance restarts as Resolved. Oct 12 2016, 5:29 PM

fgiunchedi closed subtask T113733: column family cassandra metrics size as Resolved. Jul 12 2017, 12:05 PM