new restbase machines have base installed completed, pending:
- coordinated puppet signing
- ordinately bring up new cassandra nodes and join the cluster
- add new restbase machines to pybal once cassandra fully in service
new restbase machines have base installed completed, pending:
| Status | Subtype | Assigned | Task | ||
|---|---|---|---|---|---|
| Invalid | None | T93751 RFC: Next steps for long-term revision storage -- space needs, storage hierarchies | |||
| Resolved | RobH | T93790 Expand RESTBase cluster capacity | |||
| Resolved | MoritzMuehlenhoff | T100773 Switch to Linux 3.19 by default on jessie hosts | |||
| Resolved | MoritzMuehlenhoff | T102234 Update restbase100[1-6] to the 3.19 kernel | |||
| Resolved | fgiunchedi | T102015 put new restbase servers in service | |||
| Resolved | • Cmjohnson | T102557 investigate new restbase machine disks timeouts | |||
| Resolved | fgiunchedi | T104208 alternative Cassandra metrics reporting |
Change 217678 had a related patch set uploaded (by Filippo Giunchedi):
restbase: add restbase100[789] to manifest and hieradata
Change 217678 merged by Filippo Giunchedi:
restbase: add restbase100[789] to manifest and hieradata
ran through the initial bootstrapping of restbase1007 today with @GWicke
The cassandra version on our apt repo is 2.1.5 which got there on April 29th, likely an unintended side effect of reprepro update which in turns pulls updates from all sources. cassandra 2.1.5 could not successfully stream data from other nodes resulting in NPE, to work around that the cassandra version has been downgraded on restbase1007 to 2.1.3 and that allows for a successful bootstrap and we'll need to do the same for other new nodes (testing 2.1.6 is T101745)
1007 ran into disk trouble:
352515.128962] sd 1:0:0:0: [sdb] Unhandled error code [352515.134401] sd 1:0:0:0: [sdb] [352515.138002] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK [352515.144801] sd 1:0:0:0: [sdb] CDB: [352515.148789] Read(10): 28 00 42 cd 5e 60 00 00 08 00 [352515.154449] end_request: I/O error, dev sdb, sector 1120755296
currently looking into that and attempting bootstrap of restbase1008 with @Eevans, we've seen this error coming up on nodes when trying to bootstrap (logs of peer node, already in production)
ERROR [STREAM-IN-/10.64.32.178] 2015-06-15 20:20:43,539 StreamSession.java:477 - [Stream #f5e89ff0-139b-11e5-9881-3fcfb96209d9] Streaming error occurred
java.lang.NullPointerException: null
at org.apache.cassandra.io.sstable.SSTableReader.getPositionsForRanges(SSTableReader.java:1277) ~[apache-cassandra-2.1.3.jar:2.1.3]
at org.apache.cassandra.streaming.StreamSession.getSSTableSectionsForRanges(StreamSession.java:313) ~[apache-cassandra-2.1.3.jar:2.1.3]
at org.apache.cassandra.streaming.StreamSession.addTransferRanges(StreamSession.java:266) ~[apache-cassandra-2.1.3.jar:2.1.3]
at org.apache.cassandra.streaming.StreamSession.prepare(StreamSession.java:493) ~[apache-cassandra-2.1.3.jar:2.1.3]
at org.apache.cassandra.streaming.StreamSession.messageReceived(StreamSession.java:425) ~[apache-cassandra-2.1.3.jar:2.1.3]
at org.apache.cassandra.streaming.ConnectionHandler$IncomingMessageHandler.run(ConnectionHandler.java:251) ~[apache-cassandra-2.1.3.jar:2.1.3]
at java.lang.Thread.run(Thread.java:745) [na:1.7.0_79]
INFO [STREAM-IN-/10.64.32.178] 2015-06-15 20:20:43,546 StreamResultFuture.java:180 - [Stream #f5e89ff0-139b-11e5-9881-3fcfb96209d9] Session with /10.64.32.178 is complete
WARN [STREAM-IN-/10.64.32.178] 2015-06-15 20:20:43,547 StreamResultFuture.java:207 - [Stream #f5e89ff0-139b-11e5-9881-3fcfb96209d9] Stream failed
INFO [GossipStage:1] 2015-06-15 20:20:45,554 Gossiper.java:930 - InetAddress /10.64.32.178 is now DOWNand on the node in question
INFO [STREAM-IN-/10.64.32.159] 2015-06-15 20:20:43,539 StreamResultFuture.java:180 - [Stream #f5e89ff0-139b-11e5-9881-3fcfb96209d9] Session with /10.64.32.159 is complete
WARN [STREAM-IN-/10.64.32.159] 2015-06-15 20:20:43,542 StreamResultFuture.java:207 - [Stream #f5e89ff0-139b-11e5-9881-3fcfb96209d9] Stream failed
ERROR [main] 2015-06-15 20:20:43,550 CassandraDaemon.java:482 - Exception encountered during startup
java.lang.RuntimeException: Error during boostrap: Stream failed
at org.apache.cassandra.dht.BootStrapper.bootstrap(BootStrapper.java:86) ~[apache-cassandra-2.1.3.jar:2.1.3]
at org.apache.cassandra.service.StorageService.bootstrap(StorageService.java:1036) ~[apache-cassandra-2.1.3.jar:2.1.3]
at org.apache.cassandra.service.StorageService.joinTokenRing(StorageService.java:839) ~[apache-cassandra-2.1.3.jar:2.1.3]
at org.apache.cassandra.service.StorageService.initServer(StorageService.java:635) ~[apache-cassandra-2.1.3.jar:2.1.3]
at org.apache.cassandra.service.StorageService.initServer(StorageService.java:518) ~[apache-cassandra-2.1.3.jar:2.1.3]
at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:338) [apache-cassandra-2.1.3.jar:2.1.3]
at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:465) [apache-cassandra-2.1.3.jar:2.1.3]
at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:554) [apache-cassandra-2.1.3.jar:2.1.3]
Caused by: org.apache.cassandra.streaming.StreamException: Stream failed
at org.apache.cassandra.streaming.management.StreamEventJMXNotifier.onFailure(StreamEventJMXNotifier.java:85) ~[apache-cassandra-2.1.3.jar:2.1.3]
at com.google.common.util.concurrent.Futures$4.run(Futures.java:1172) ~[guava-16.0.jar:na]
at com.google.common.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:297) ~[guava-16.0.jar:na]
at com.google.common.util.concurrent.ExecutionList.executeListener(ExecutionList.java:156) ~[guava-16.0.jar:na]
at com.google.common.util.concurrent.ExecutionList.execute(ExecutionList.java:145) ~[guava-16.0.jar:na]
at com.google.common.util.concurrent.AbstractFuture.setException(AbstractFuture.java:202) ~[guava-16.0.jar:na]
at org.apache.cassandra.streaming.StreamResultFuture.maybeComplete(StreamResultFuture.java:208) ~[apache-cassandra-2.1.3.jar:2.1.3]
at org.apache.cassandra.streaming.StreamResultFuture.handleSessionComplete(StreamResultFuture.java:184) ~[apache-cassandra-2.1.3.jar:2.1.3]
at org.apache.cassandra.streaming.StreamSession.closeSession(StreamSession.java:387) ~[apache-cassandra-2.1.3.jar:2.1.3]
at org.apache.cassandra.streaming.StreamSession.sessionFailed(StreamSession.java:592) ~[apache-cassandra-2.1.3.jar:2.1.3]
at org.apache.cassandra.streaming.StreamSession.messageReceived(StreamSession.java:447) ~[apache-cassandra-2.1.3.jar:2.1.3]
at org.apache.cassandra.streaming.ConnectionHandler$IncomingMessageHandler.run(ConnectionHandler.java:251) ~[apache-cassandra-2.1.3.jar:2.1.3]
at java.lang.Thread.run(Thread.java:745) ~[na:1.7.0_79]
INFO [StorageServiceShutdownHook] 2015-06-15 20:20:43,553 Gossiper.java:1341 - Announcing shutdown
INFO [StorageServiceShutdownHook] 2015-06-15 20:20:45,553 MessagingService.java:707 - Waiting for messaging service to quiesce
INFO [ACCEPT-/10.64.32.178] 2015-06-15 20:20:45,555 MessagingService.java:955 - MessagingService has terminated the accept() threadThe bootstrap of 1008 that was started yesterday just failed with (the usual):
INFO [StreamReceiveTask:50] 2015-06-16 14:07:43,508 StreamResultFuture.java:180 - [Stream #dfb718e0-139d-11e5-844b-3fcfb96209d9] Session with /10.64.32.159 is complete
WARN [StreamReceiveTask:50] 2015-06-16 14:07:43,510 StreamResultFuture.java:207 - [Stream #dfb718e0-139d-11e5-844b-3fcfb96209d9] Stream failed
ERROR [main] 2015-06-16 14:07:43,511 CassandraDaemon.java:482 - Exception encountered during startup
java.lang.RuntimeException: Error during boostrap: Stream failed
at org.apache.cassandra.dht.BootStrapper.bootstrap(BootStrapper.java:86) ~[apache-cassandra-2.1.3.jar:2.1.3]
at org.apache.cassandra.service.StorageService.bootstrap(StorageService.java:1036) ~[apache-cassandra-2.1.3.jar:2.1.3]
at org.apache.cassandra.service.StorageService.joinTokenRing(StorageService.java:839) ~[apache-cassandra-2.1.3.jar:2.1.3]
at org.apache.cassandra.service.StorageService.initServer(StorageService.java:635) ~[apache-cassandra-2.1.3.jar:2.1.3]
at org.apache.cassandra.service.StorageService.initServer(StorageService.java:518) ~[apache-cassandra-2.1.3.jar:2.1.3]
at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:338) [apache-cassandra-2.1.3.jar:2.1.3]
at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:465) [apache-cassandra-2.1.3.jar:2.1.3]
at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:554) [apache-cassandra-2.1.3.jar:2.1.3]
Caused by: org.apache.cassandra.streaming.StreamException: Stream failed
at org.apache.cassandra.streaming.management.StreamEventJMXNotifier.onFailure(StreamEventJMXNotifier.java:85) ~[apache-cassandra-2.1.3.jar:2.1.3]
at com.google.common.util.concurrent.Futures$4.run(Futures.java:1172) ~[guava-16.0.jar:na]
at com.google.common.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:297) ~[guava-16.0.jar:na]
at com.google.common.util.concurrent.ExecutionList.executeListener(ExecutionList.java:156) ~[guava-16.0.jar:na]
at com.google.common.util.concurrent.ExecutionList.execute(ExecutionList.java:145) ~[guava-16.0.jar:na]
at com.google.common.util.concurrent.AbstractFuture.setException(AbstractFuture.java:202) ~[guava-16.0.jar:na]
at org.apache.cassandra.streaming.StreamResultFuture.maybeComplete(StreamResultFuture.java:208) ~[apache-cassandra-2.1.3.jar:2.1.3]
at org.apache.cassandra.streaming.StreamResultFuture.handleSessionComplete(StreamResultFuture.java:184) ~[apache-cassandra-2.1.3.jar:2.1.3]
at org.apache.cassandra.streaming.StreamSession.closeSession(StreamSession.java:387) ~[apache-cassandra-2.1.3.jar:2.1.3]
at org.apache.cassandra.streaming.StreamSession.maybeCompleted(StreamSession.java:660) ~[apache-cassandra-2.1.3.jar:2.1.3]
at org.apache.cassandra.streaming.StreamSession.taskCompleted(StreamSession.java:623) ~[apache-cassandra-2.1.3.jar:2.1.3]
at org.apache.cassandra.streaming.StreamReceiveTask$OnCompletionRunnable.run(StreamReceiveTask.java:139) ~[apache-cassandra-2.1.3.jar:2.1.3]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) ~[na:1.7.0_79]
at java.util.concurrent.FutureTask.run(FutureTask.java:262) ~[na:1.7.0_79]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) ~[na:1.7.0_79]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) ~[na:1.7.0_79]
at java.lang.Thread.run(Thread.java:745) ~[na:1.7.0_79]What is unusual, is that I cannot find the typical NullPointerException in any of the other 6 nodes' logs.
None of these exceptions are very helpful: I suggest that before going any further (regardless the course), that we move the logging level to DEBUG, (while configuring the rotation to be less aggressive). The signal-to-noise ratio will be very bad, but on the off chance there is any increase in "signal", it'll be worth it. It might even be worth taking one more stab at a bootstrap with DEBUG enabled, before trying other things.
I see a few possible courses of action going forward:
The advantage to (a) is understanding what is broken here. And assuming that we make small, targetted, non-normative changes, it should be the least risky strategy as well. The disadvantage is that it may take longer in the end.
The premise behind (b) is that 2.1.6 should be Less Buggy than 2.1.3, but without knowing what this bug is, there is really no reason to think it is fixed in 2.1.6. And, there is always the possibility it will introduce new regressions. https://issues.apache.org/jira/browse/CASSANDRA-9549 in particular may be a problem for us (more research needed), and at the very least stands as evidence that 2.1.6 is not immune from serious bugs. On the other hand, if it does work, it's the quickest/easiest solution.
Waiting for 2.1.7 would at least mean that CASSANDRA-9549 has been fixed.
I recommend, out of an abundance of caution, that we go with (a).
Good question. I don't know, we'd have to (carefully) try it.
One option would be to use JMX to enable it on a class-by-class and/or package-by-package basis. Either as a test of the impact, or as a strategy to minimize any impact if that proves to be a problem.
For posterity sake:
INFO [STREAM-IN-/10.64.32.159] 2015-06-15 20:34:39,278 StreamResultFuture.java:166 - [Stream #dfb718e0-139d-11e5-844b-3fcfb96209d9 ID#0] Prepare completed. Receiving 2847 files(304524197467 bytes), sending 0 files(0 bytes) ... WARN [StreamReceiveTask:50] 2015-06-16 14:07:43,510 StreamResultFuture.java:207 - [Stream #dfb718e0-139d-11e5-844b-3fcfb96209d9] Stream failed
Each iteration/attempt takes a minimum of ~17.5 hours from bootstrap-start to stream failure.
myself and @Eevans have played yesterday a bit with debug settings for org.apache.cassandra.io.sstable org.apache.cassandra.dht org.apache.cassandra.streaming org.apache.cassandra.service on restbase1003 and tried bootstrapping restbase1008 again, that yielded the same error but no useful debugging. Current plan is to go for (b) and upgrade the fleet to 2.1.6 first.
root@restbase1001:~# nodetool snapshot
Requested creating snapshot(s) for [all keyspaces] with snapshot name [1434558594941]
Snapshot directory: 1434558594941
root@restbase1001:~# service cassandra stop
root@restbase1001:~# apt-get install cassandra
Reading package lists... Done
Building dependency tree
Reading state information... Done
Suggested packages:
cassandra-tools
The following packages will be upgraded:
cassandra
1 upgraded, 0 newly installed, 0 to remove and 157 not upgraded.
Need to get 19.9 MB of archives.
After this operation, 1,475 kB of additional disk space will be used.
Get:1 http://apt.wikimedia.org/wikimedia/ jessie-wikimedia/thirdparty cassandra all 2.1.6 [19.9 MB]
Fetched 19.9 MB in 0s (32.4 MB/s)
Reading changelogs... Done
(Reading database ... 43952 files and directories currently installed.)
Preparing to unpack .../cassandra_2.1.6_all.deb ...
Unpacking cassandra (2.1.6) over (2.1.3) ...
Processing triggers for python-support (1.0.15) ...
Setting up cassandra (2.1.6) ...
Installing new version of config file /etc/cassandra/commitlog_archiving.properties ...
Configuration file '/etc/cassandra/cassandra.yaml'
==> Modified (by you or by a script) since installation.
==> Package distributor has shipped an updated version.
What would you like to do about it ? Your options are:
Y or I : install the package maintainer's version
N or O : keep your currently-installed version
D : show the differences between the versions
Z : start a shell to examine the situation
The default action is to keep your current version.
*** cassandra.yaml (Y/I/N/O/D/Z) [default=N] ?
Configuration file '/etc/cassandra/cassandra-env.sh'
==> Modified (by you or by a script) since installation.
==> Package distributor has shipped an updated version.
What would you like to do about it ? Your options are:
Y or I : install the package maintainer's version
N or O : keep your currently-installed version
D : show the differences between the versions
Z : start a shell to examine the situation
The default action is to keep your current version.
*** cassandra-env.sh (Y/I/N/O/D/Z) [default=N] ?
sysctl: cannot open "/etc/sysctl.d/cassandra.conf": No such file or directory
Warning: unable to set vm.max_map_count; is this an OpenVZ
instance? If so, it is highly recommended that you set
vm.max_map_count to 1048575 in the host.
Deleting the local sysctl.d/cassandra.conf.
update-rc.d: warning: start and stop actions are no longer supported; falling back to defaults
Processing triggers for python-support (1.0.15) ...
root@restbase1001:~#current status, restarting cassandra works however it seems that metrics are being pushed to graphite only for (only?) one period (60s) and then stop. a thread dump shows some metrics threads waiting
"metrics-graphite-reporter-thread-1" daemon prio=10 tid=0x00007f85c469b000 nid=0x7279 waiting on condition [0x00007f7dc5882000]
java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x0000000684e4a0c0> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:1079)
at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:807)
at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1068)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
"metrics-meter-tick-thread-2" daemon prio=10 tid=0x00007f85c4862800 nid=0x6fb5 waiting on condition [0x00007f85a4bc0000]
java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x0000000675a38228> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:1085)
at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:807)
at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1068)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
"metrics-meter-tick-thread-1" daemon prio=10 tid=0x00007f85c485d000 nid=0x6fb4 waiting on condition [0x00007f85a4c01000]
java.lang.Thread.State: TIMED_WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x0000000675a38228> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:226)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2082)
at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:1090)
at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:807)
at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1068)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)which in turn might indicate a deadlock? note that reporting metrics works fine on the test cluster
scratch the deadlock part, threads are in the same state even on a working cassandra 2.1.3 instance
Change 219236 had a related patch set uploaded (by Filippo Giunchedi):
cassandra: add restbase1008 to cassandra::seeds
Change 219236 merged by Filippo Giunchedi:
cassandra: add restbase1008 to cassandra::seeds
The latest bootstrap attempt of 1008 has failed as well. The exact reason is unclear because a hardware issue prevents reliably getting at the log data.
... [832404.241192] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK [832404.241194] sd 1:0:0:0: [sdb] CDB: [832404.241195] Read(10): 28 00 2c e1 fc 00 00 00 08 00 [832404.241201] end_request: I/O error, dev sdb, sector 753007616 [832409.938536] sd 1:0:0:0: [sdb] Unhandled error code [832409.938541] sd 1:0:0:0: [sdb] [832409.938543] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK [832409.938544] sd 1:0:0:0: [sdb] CDB: [832409.938546] Read(10): 28 00 2c e1 fc 00 00 00 08 00 [832409.938553] end_request: I/O error, dev sdb, sector 753007616 [832409.945133] sd 1:0:0:0: [sdb] Unhandled error code [832409.945135] sd 1:0:0:0: [sdb] [832409.945137] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK [832409.945138] sd 1:0:0:0: [sdb] CDB: [832409.945139] Read(10): 28 00 2c e1 fc 00 00 00 08 00 [832409.945146] end_request: I/O error, dev sdb, sector 753007616 [832411.116554] sd 1:0:0:0: [sdb] Unhandled error code [832411.116558] sd 1:0:0:0: [sdb] [832411.116560] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK [832411.116562] sd 1:0:0:0: [sdb] CDB: [832411.116563] Read(10): 28 00 2c e1 fc 00 00 00 08 00 [832411.116571] end_request: I/O error, dev sdb, sector 753007616 [832411.123150] sd 1:0:0:0: [sdb] Unhandled error code [832411.123152] sd 1:0:0:0: [sdb] [832411.123153] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK [832411.123155] sd 1:0:0:0: [sdb] CDB: [832411.123156] Read(10): 28 00 2c e1 fc 00 00 00 08 00 [832411.123162] end_request: I/O error, dev sdb, sector 753007616 [832449.398075] sd 1:0:0:0: [sdb] Unhandled error code [832449.398081] sd 1:0:0:0: [sdb] [832449.398084] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK [832449.398086] sd 1:0:0:0: [sdb] CDB: [832449.398088] Read(10): 28 00 2c e1 fc 00 00 00 08 00 [832449.398099] end_request: I/O error, dev sdb, sector 753007616 [832449.404855] sd 1:0:0:0: [sdb] Unhandled error code [832449.404858] sd 1:0:0:0: [sdb] [832449.404860] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK [832449.404862] sd 1:0:0:0: [sdb] CDB: [832449.404863] Read(10): 28 00 2c e1 fc 00 00 00 08 00 [832449.404872] end_request: I/O error, dev sdb, sector 753007616
That makes this two out of three new nodes having sdb hw issues.
First errors in dmesg:
[831652.740509] ata2.00: exception Emask 0x0 SAct 0x7fffffff SErr 0x0 action 0x6 frozen
[831652.749174] ata2.00: failed command: READ FPDMA QUEUED
[831652.755063] ata2.00: cmd 60/00:00:00:a8:d8/04:00:50:00:00/40 tag 0 ncq 524288 in
res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[831652.771830] ata2.00: status: { DRDY }
[831652.776049] ata2.00: failed command: READ FPDMA QUEUED
[831652.781891] ata2.00: cmd 60/00:08:00:0c:ab/04:00:50:00:00/40 tag 1 ncq 524288 in
res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[831652.798586] ata2.00: status: { DRDY }Worth checking if dmesg on 1009 shows similar issues.
we've seen more sdb errors on restbase1008 too, but nothing on sda so far. To rule out further things like driver issues/bugs and so on I've upgraded 1007/8/9 to 3.19 which we have to do anyway and launched smart extended tests on all disks, not that it will be conclusive but at least the disk self-scans itself
Regarding the two non-hardware blockers, (the bootstrap/streaming failures, and metrics reporting):
https://issues.apache.org/jira/browse/CASSANDRA-9624
https://issues.apache.org/jira/browse/CASSANDRA-9625
Change 219496 had a related patch set uploaded (by Filippo Giunchedi):
cassandra: add restbase1009 to cassandra::seeds
Change 219496 merged by Filippo Giunchedi:
cassandra: add restbase1009 to cassandra::seeds
With 2.1.7 on both 1006 and 1009 the first ~400G stream succeeds, but for some reason some other part of the stream fails right after:
INFO [StreamReceiveTask:55] 2015-06-20 23:48:37,574 StreamResultFuture.java:180 - [Stream #b9e17e80-16db-11e5-8044-b35b54affa2e] Session with /10.64.48.100 is complete
INFO [CompactionExecutor:482] 2015-06-20 23:48:37,575 ColumnFamilyStore.java:906 - Enqueuing flush of compactions_in_progress: 404 (0%) on-heap, 348 (0%) off-heap
INFO [MemtableFlushWriter:286] 2015-06-20 23:48:37,576 Memtable.java:346 - Writing Memtable-compactions_in_progress@999280685(270 serialized bytes, 11 ops, 0%/0% of on/off-heap limit)
WARN [StreamReceiveTask:55] 2015-06-20 23:48:37,577 StreamResultFuture.java:207 - [Stream #b9e17e80-16db-11e5-8044-b35b54affa2e] Stream failed
ERROR [main] 2015-06-20 23:48:37,578 CassandraDaemon.java:541 - Exception encountered during startup
java.lang.RuntimeException: Error during boostrap: Stream failed
at org.apache.cassandra.dht.BootStrapper.bootstrap(BootStrapper.java:86) ~[apache-cassandra-2.1.7.jar:2.1.7]
at org.apache.cassandra.service.StorageService.bootstrap(StorageService.java:1137) ~[apache-cassandra-2.1.7.jar:2.1.7]
at org.apache.cassandra.service.StorageService.joinTokenRing(StorageService.java:927) ~[apache-cassandra-2.1.7.jar:2.1.7]
at org.apache.cassandra.service.StorageService.initServer(StorageService.java:723) ~[apache-cassandra-2.1.7.jar:2.1.7]
at org.apache.cassandra.service.StorageService.initServer(StorageService.java:605) ~[apache-cassandra-2.1.7.jar:2.1.7]
at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:378) [apache-cassandra-2.1.7.jar:2.1.7]
at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:524) [apache-cassandra-2.1.7.jar:2.1.7]
at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:613) [apache-cassandra-2.1.7.jar:2.1.7]
Caused by: org.apache.cassandra.streaming.StreamException: Stream failed
at org.apache.cassandra.streaming.management.StreamEventJMXNotifier.onFailure(StreamEventJMXNotifier.java:85) ~[apache-cassandra-2.1.7.jar:2.1.7]
at com.google.common.util.concurrent.Futures$4.run(Futures.java:1172) ~[guava-16.0.jar:na]
at com.google.common.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:297) ~[guava-16.0.jar:na]
at com.google.common.util.concurrent.ExecutionList.executeListener(ExecutionList.java:156) ~[guava-16.0.jar:na]
at com.google.common.util.concurrent.ExecutionList.execute(ExecutionList.java:145) ~[guava-16.0.jar:na]
at com.google.common.util.concurrent.AbstractFuture.setException(AbstractFuture.java:202) ~[guava-16.0.jar:na]
at org.apache.cassandra.streaming.StreamResultFuture.maybeComplete(StreamResultFuture.java:208) ~[apache-cassandra-2.1.7.jar:2.1.7]
at org.apache.cassandra.streaming.StreamResultFuture.handleSessionComplete(StreamResultFuture.java:184) ~[apache-cassandra-2.1.7.jar:2.1.7]
at org.apache.cassandra.streaming.StreamSession.closeSession(StreamSession.java:409) ~[apache-cassandra-2.1.7.jar:2.1.7]
at org.apache.cassandra.streaming.StreamSession.maybeCompleted(StreamSession.java:682) ~[apache-cassandra-2.1.7.jar:2.1.7]
at org.apache.cassandra.streaming.StreamSession.taskCompleted(StreamSession.java:645) ~[apache-cassandra-2.1.7.jar:2.1.7]
at org.apache.cassandra.streaming.StreamReceiveTask$OnCompletionRunnable.run(StreamReceiveTask.java:139) ~[apache-cassandra-2.1.7.jar:2.1.7]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) ~[na:1.7.0_79]
at java.util.concurrent.FutureTask.run(FutureTask.java:262) ~[na:1.7.0_79]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) ~[na:1.7.0_79]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) ~[na:1.7.0_79]
at java.lang.Thread.run(Thread.java:745) ~[na:1.7.0_79]During the second bootstrap attempt restbase1009 now suffered (almost) the same disk failure fate as its brethren; only difference is that this time sda is failing:
[186963.425817] blk_update_request: I/O error, dev sda, sector 1317209424 [186963.433194] sd 0:0:0:0: [sda] FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK [186963.433197] sd 0:0:0:0: [sda] CDB: [186963.433199] Read(10): 28 00 4e 83 05 50 00 00 08 00
To me this is starting to look a lot like a bad batch of SSDs.
At this point, 2.1.7 is overall looking like a regression (still can't bootstrap, metrics don't work); I suggest we downgrade to 2.1.3, and regroup after the new hardware has been sorted.
Now that restbase1008 is fitted with different disks, I propose the following course of action:
Obviously, in an ideal world, we'd never go near production with 3a or 3b, but without the ability to reproduce this problem in the test environment, we're left with little option.
I went ahead and started cassandra 2.1.3 on restbase1008, the bootstrap/streaming failure is there from 1004
restbase1004
ERROR [STREAM-IN-/10.64.32.178] 2015-06-24 21:10:37,750 StreamSession.java:477 - [Stream #6858a1f0-1ab5-11e5-9171-3fcfb96209d9] Streaming error occurred
java.lang.NullPointerException: null
at org.apache.cassandra.io.sstable.SSTableReader.getPositionsForRanges(SSTableReader.java:1277) ~[apache-cassandra-2.1.3.jar:2.1.3]
at org.apache.cassandra.streaming.StreamSession.getSSTableSectionsForRanges(StreamSession.java:313) ~[apache-cassandra-2.1.3.jar:2.1.3]
at org.apache.cassandra.streaming.StreamSession.addTransferRanges(StreamSession.java:266) ~[apache-cassandra-2.1.3.jar:2.1.3]
at org.apache.cassandra.streaming.StreamSession.prepare(StreamSession.java:493) ~[apache-cassandra-2.1.3.jar:2.1.3]
at org.apache.cassandra.streaming.StreamSession.messageReceived(StreamSession.java:425) ~[apache-cassandra-2.1.3.jar:2.1.3]
at org.apache.cassandra.streaming.ConnectionHandler$IncomingMessageHandler.run(ConnectionHandler.java:251) ~[apache-cassandra-2.1.3.jar:2.1.3]
at java.lang.Thread.run(Thread.java:745) [na:1.7.0_79]
INFO [STREAM-IN-/10.64.32.178] 2015-06-24 21:10:37,755 StreamResultFuture.java:180 - [Stream #6858a1f0-1ab5-11e5-9171-3fcfb96209d9] Session with /10.64.32.178 is complete
WARN [STREAM-IN-/10.64.32.178] 2015-06-24 21:10:37,755 StreamResultFuture.java:207 - [Stream #6858a1f0-1ab5-11e5-9171-3fcfb96209d9] Stream failed
INFO [GossipStage:1] 2015-06-24 21:10:39,764 Gossiper.java:930 - InetAddress /10.64.32.178 is now DOWNand restbase1008
INFO [STREAM-IN-/10.64.32.160] 2015-06-24 21:10:37,750 StreamResultFuture.java:180 - [Stream #6858a1f0-1ab5-11e5-9171-3fcfb96209d9] Session with /10.64.32.160 is complete
WARN [STREAM-IN-/10.64.32.160] 2015-06-24 21:10:37,752 StreamResultFuture.java:207 - [Stream #6858a1f0-1ab5-11e5-9171-3fcfb96209d9] Stream failed
ERROR [main] 2015-06-24 21:10:37,760 CassandraDaemon.java:482 - Exception encountered during startup
java.lang.RuntimeException: Error during boostrap: Stream failed
at org.apache.cassandra.dht.BootStrapper.bootstrap(BootStrapper.java:86) ~[apache-cassandra-2.1.3.jar:2.1.3]
at org.apache.cassandra.service.StorageService.bootstrap(StorageService.java:1036) ~[apache-cassandra-2.1.3.jar:2.1.3]
at org.apache.cassandra.service.StorageService.joinTokenRing(StorageService.java:839) ~[apache-cassandra-2.1.3.jar:2.1.3]
at org.apache.cassandra.service.StorageService.initServer(StorageService.java:635) ~[apache-cassandra-2.1.3.jar:2.1.3]
at org.apache.cassandra.service.StorageService.initServer(StorageService.java:518) ~[apache-cassandra-2.1.3.jar:2.1.3]
at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:338) [apache-cassandra-2.1.3.jar:2.1.3]
at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:465) [apache-cassandra-2.1.3.jar:2.1.3]
at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:554) [apache-cassandra-2.1.3.jar:2.1.3]
Caused by: org.apache.cassandra.streaming.StreamException: Stream failed
at org.apache.cassandra.streaming.management.StreamEventJMXNotifier.onFailure(StreamEventJMXNotifier.java:85) ~[apache-cassandra-2.1.3.jar:2.1.3]
at com.google.common.util.concurrent.Futures$4.run(Futures.java:1172) ~[guava-16.0.jar:na]
at com.google.common.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:297) ~[guava-16.0.jar:na]
at com.google.common.util.concurrent.ExecutionList.executeListener(ExecutionList.java:156) ~[guava-16.0.jar:na]
at com.google.common.util.concurrent.ExecutionList.execute(ExecutionList.java:145) ~[guava-16.0.jar:na]
at com.google.common.util.concurrent.AbstractFuture.setException(AbstractFuture.java:202) ~[guava-16.0.jar:na]
at org.apache.cassandra.streaming.StreamResultFuture.maybeComplete(StreamResultFuture.java:208) ~[apache-cassandra-2.1.3.jar:2.1.3]
at org.apache.cassandra.streaming.StreamResultFuture.handleSessionComplete(StreamResultFuture.java:184) ~[apache-cassandra-2.1.3.jar:2.1.3]
at org.apache.cassandra.streaming.StreamSession.closeSession(StreamSession.java:387) ~[apache-cassandra-2.1.3.jar:2.1.3]
at org.apache.cassandra.streaming.StreamSession.sessionFailed(StreamSession.java:592) ~[apache-cassandra-2.1.3.jar:2.1.3]
at org.apache.cassandra.streaming.StreamSession.messageReceived(StreamSession.java:447) ~[apache-cassandra-2.1.3.jar:2.1.3]
at org.apache.cassandra.streaming.ConnectionHandler$IncomingMessageHandler.run(ConnectionHandler.java:251) ~[apache-cassandra-2.1.3.jar:2.1.3]
at java.lang.Thread.run(Thread.java:745) ~[na:1.7.0_79]
INFO [StorageServiceShutdownHook] 2015-06-24 21:10:37,763 Gossiper.java:1341 - Announcing shutdown
INFO [StorageServiceShutdownHook] 2015-06-24 21:10:39,763 MessagingService.java:707 - Waiting for messaging service to quiesce
INFO [ACCEPT-/10.64.32.178] 2015-06-24 21:10:39,765 MessagingService.java:955 - MessagingService has terminated the accept() threadThe most recent attempt at bootstrapping 1008 failed early this morning, after/during the first streaming session with 1004. Both nodes were running Cassandra 2.1.7.
From 1008:
INFO [StreamReceiveTask:57] 2015-06-26 06:48:53,061 StreamResultFuture.java:180 - [Stream #8bdeb1b0-1ad2-11e5-abd8-3fcfb96209d9] Session with /10.64.32.160 is complete
WARN [StreamReceiveTask:57] 2015-06-26 06:48:53,062 StreamResultFuture.java:207 - [Stream #8bdeb1b0-1ad2-11e5-abd8-3fcfb96209d9] Stream failed
INFO [CompactionExecutor:2885] 2015-06-26 06:48:53,062 ColumnFamilyStore.java:906 - Enqueuing flush of compactions_in_progress: 428 (0%) on-heap, 379 (0%) off-heap
INFO [MemtableFlushWriter:959] 2015-06-26 06:48:53,063 Memtable.java:346 - Writing Memtable-compactions_in_progress@1203013482(294 serialized bytes, 12 ops, 0%/0% of on/off-heap limit)
ERROR [main] 2015-06-26 06:48:53,063 CassandraDaemon.java:541 - Exception encountered during startup
java.lang.RuntimeException: Error during boostrap: Stream failed
at org.apache.cassandra.dht.BootStrapper.bootstrap(BootStrapper.java:86) ~[apache-cassandra-2.1.7.jar:2.1.7]
at org.apache.cassandra.service.StorageService.bootstrap(StorageService.java:1137) ~[apache-cassandra-2.1.7.jar:2.1.7]
at org.apache.cassandra.service.StorageService.joinTokenRing(StorageService.java:927) ~[apache-cassandra-2.1.7.jar:2.1.7]
at org.apache.cassandra.service.StorageService.initServer(StorageService.java:723) ~[apache-cassandra-2.1.7.jar:2.1.7]
at org.apache.cassandra.service.StorageService.initServer(StorageService.java:605) ~[apache-cassandra-2.1.7.jar:2.1.7]
at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:378) [apache-cassandra-2.1.7.jar:2.1.7]
at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:524) [apache-cassandra-2.1.7.jar:2.1.7]
at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:613) [apache-cassandra-2.1.7.jar:2.1.7]
Caused by: org.apache.cassandra.streaming.StreamException: Stream failed
at org.apache.cassandra.streaming.management.StreamEventJMXNotifier.onFailure(StreamEventJMXNotifier.java:85) ~[apache-cassandra-2.1.7.jar:2.1.7]
at com.google.common.util.concurrent.Futures$4.run(Futures.java:1172) ~[guava-16.0.jar:na]
at com.google.common.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:297) ~[guava-16.0.jar:na]
at com.google.common.util.concurrent.ExecutionList.executeListener(ExecutionList.java:156) ~[guava-16.0.jar:na]
at com.google.common.util.concurrent.ExecutionList.execute(ExecutionList.java:145) ~[guava-16.0.jar:na]
at com.google.common.util.concurrent.AbstractFuture.setException(AbstractFuture.java:202) ~[guava-16.0.jar:na]
at org.apache.cassandra.streaming.StreamResultFuture.maybeComplete(StreamResultFuture.java:208) ~[apache-cassandra-2.1.7.jar:2.1.7]
at org.apache.cassandra.streaming.StreamResultFuture.handleSessionComplete(StreamResultFuture.java:184) ~[apache-cassandra-2.1.7.jar:2.1.7]
at org.apache.cassandra.streaming.StreamSession.closeSession(StreamSession.java:409) ~[apache-cassandra-2.1.7.jar:2.1.7]
at org.apache.cassandra.streaming.StreamSession.maybeCompleted(StreamSession.java:682) ~[apache-cassandra-2.1.7.jar:2.1.7]
at org.apache.cassandra.streaming.StreamSession.taskCompleted(StreamSession.java:645) ~[apache-cassandra-2.1.7.jar:2.1.7]
at org.apache.cassandra.streaming.StreamReceiveTask$OnCompletionRunnable.run(StreamReceiveTask.java:139) ~[apache-cassandra-2.1.7.jar:2.1.7]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) ~[na:1.7.0_79]
at java.util.concurrent.FutureTask.run(FutureTask.java:262) ~[na:1.7.0_79]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) ~[na:1.7.0_79]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) ~[na:1.7.0_79]
at java.lang.Thread.run(Thread.java:745) ~[na:1.7.0_79]From 1004:
INFO [STREAM-IN-/10.64.32.178] 2015-06-26 06:48:53,060 StreamResultFuture.java:180 - [Stream #8bdeb1b0-1ad2-11e5-abd8-3fcfb96209d9] Session with /10.64.32.178 is complete INFO [STREAM-IN-/10.64.32.178] 2015-06-26 06:48:53,064 StreamResultFuture.java:212 - [Stream #8bdeb1b0-1ad2-11e5-abd8-3fcfb96209d9] All sessions completed
This mirrors the past 2.1.7-to-2.1.7 attempt.
Neither machine has any disk errors that I can see.
OK, so catching up with where we are according to The Plan:
Having failed to bootstrap with 2.1.3 and 2.1.7, we are now at step #3.
Originally I proposed that we troubleshoot the bootstrapping issue using 2.1.3, on the basis that we could defer tackling the metrics regression. I'm amending that recommendation now to 2.1.7. My rationale:
This of course means that we are going to need an alternative to the graphite metrics reporter.
Change 221763 had a related patch set uploaded (by Filippo Giunchedi):
cassandra: testing a higher a reporting interval
Change 221763 merged by Filippo Giunchedi:
cassandra: testing a higher a reporting interval
One reoccurring issue seems to manifest as sharply increasing pending tasks for memtable flush and mutation stages:
Eventually this results in increasing GC activity, followed by death.
This would seem to be IO related, but I'd like to get a better idea what is happening, and would request that anyone encountering a node in such a condition attempt to obtain stack and heap dumps, prior to restarting.
JAVA_VERSION=8 sudo -u cassandra /usr/lib/jvm/java-$JAVA_VERSION-openjdk-amd64/bin/jstack -J-d64 -l -m `pgrep -f CassandraDaemon` > /path/to/thread-dump-`date +%s`.log sudo -u cassandra /usr/lib/jvm/java-$JAVA_VERSION-openjdk-amd64/bin/jmap -dump:format=b,file=/path/to/heap-`date +%s`.bin `pgrep -f CassandraDaemon`
It would appear that with gc_grace_seconds set to 0, and deletes happening at consistency ONE, that some deletes aren't sticking.
$ zcat wikipedia_T_parsoid_html.log.1435873391.gz | tail -n 25 -- deleting: ["sh.wikipedia.org","San_Juan_del_Valle,_Chilchotla",13830580] Fri May 08 2015 12:23:55 GMT+0000 (UTC) -- deleting: ["sh.wikipedia.org","San_Juan_del_Valle,_Chilchotla",13790428] Fri May 08 2015 04:48:32 GMT+0000 (UTC) -- deleting: ["sh.wikipedia.org","San_Juan_del_Valle,_Chilchotla",13620095] Wed May 06 2015 19:07:52 GMT+0000 (UTC) -- deleting: ["sh.wikipedia.org","San_Juan_del_Valle,_Chilchotla",13532196] Wed May 06 2015 01:10:53 GMT+0000 (UTC) -- deleting: ["sh.wikipedia.org","San_Juan_del_Valle,_Chilchotla",13434471] Tue May 05 2015 04:29:43 GMT+0000 (UTC) -- deleting: ["sh.wikipedia.org","San_Juan_del_Valle,_Chilchotla",13274729] Sun May 03 2015 21:01:47 GMT+0000 (UTC) -- deleting: ["en.wikipedia.org","1953_Iranian_coup_d'état",657303897] Tue May 12 2015 00:12:05 GMT+0000 (UTC) -- deleting: ["en.wikipedia.org","Vagus_nerve",654533006] Tue May 05 2015 09:50:03 GMT+0000 (UTC) ...
cassandra@cqlsh:local_group_wikipedia_T_parsoid_html> select key,rev,dateOf(tid),tid from data where "_dmain" = 'sh.wikipedia.org' and key = 'San_Juan_del_Valle,_Chilchotla' and rev = 13790428; key | rev | dateOf(tid) | tid --------------------------------+----------+--------------------------+-------------------------------------- San_Juan_del_Valle,_Chilchotla | 13790428 | 2015-05-08 04:48:32+0000 | 7a40b404-f53d-11e4-bb02-e72a65a01436 (1 rows) cassandra@cqlsh:local_group_wikipedia_T_parsoid_html> CONSISTENCY ALL Consistency level set to ALL. cassandra@cqlsh:local_group_wikipedia_T_parsoid_html> select key,rev,dateOf(tid),tid from data where "_dmain" = 'sh.wikipedia.org' and key = 'San_Juan_del_Valle,_Chilchotla' and rev = 13790428 and tid = 7a40b04-f53d-11e4-bb02-e72a65a01436; _domain | key | rev | tid | _del | content-location | content-sha256 | content-type | latestTid | tags | value ---------+-----+-----+-----+------+------------------+----------------+--------------+-----------+------+------- (0 rows)
Of the dozen or so I've checked so far, at least half were like this.
A lot of the issues we encountered were caused by T105509. Since deploying the RESTBase hotfix to that issue, compaction has started to catch up, latencies are recovering, timeouts are rare again and storage space is starting to shrink.
We still need to reduce the instance size going forward to make sure that we can bootstrap & can actually drive the hardware fully without hitting GC limits. However, we have also improved the setup a bit (switched to G1GC, hardened concurrency limits against overload), which might have possibly pushed up the absolute limit very slightly for our use case.
Change 227475 had a related patch set uploaded (by Filippo Giunchedi):
cassandra: add restbase1007
bootstrapping restbase1007 was successful!
15:47 -icinga-wm:#wikimedia-operations- RECOVERY - Cassanda CQL query interface on restbase1007 is OK: TCP OK - 0.005 second response time on port 9042 restbase1007:~$ nodetool status Datacenter: eqiad ================= Status=Up/Down |/ State=Normal/Leaving/Joining/Moving -- Address Load Tokens Owns Host ID Rack UN 10.64.32.160 822.2 GB 256 ? 798ff758-8c91-46e0-b85e-dad356c46f20 b UN 10.64.48.99 757.26 GB 256 ? 325e01e8-debe-45f0-a8c2-93b3baa58968 d UN 10.64.48.100 714.7 GB 256 ? 2abf437d-a16d-406b-a6de-8d28b7dda808 d UN 10.64.0.220 725.13 GB 256 ? c021a198-b7f1-4dc2-94d7-9cb8b8a8df28 a UN 10.64.0.221 795.92 GB 256 ? fc041cc8-cd28-4030-b29a-05b9a632cafc a UN 10.64.32.159 673.18 GB 256 ? 88d9ef9f-d81b-466e-babf-6a283b13f648 b UN 10.64.0.223 437.24 GB 256 ? c1b5a012-4840-4096-9a71-ce4d3afb0029 a
/me high-fives @fgiunchedi and @Eevans!
It took a bit longer than planned to add that node, but it's all good in the end ;)
Change 227994 had a related patch set uploaded (by Filippo Giunchedi):
cassandra: add restbase1008
1008 has now joined the cluster as well:
Datacenter: eqiad ================= Status=Up/Down |/ State=Normal/Leaving/Joining/Moving -- Address Load Tokens Owns Host ID Rack UN restbase1005.eqiad.wmnet 763.29 GB 256 ? 325e01e8-debe-45f0-a8c2-93b3baa58968 d UN restbase1003.eqiad.wmnet 672.61 GB 256 ? 88d9ef9f-d81b-466e-babf-6a283b13f648 b UN restbase1007.eqiad.wmnet 390.78 GB 256 ? c1b5a012-4840-4096-9a71-ce4d3afb0029 a UN restbase1002.eqiad.wmnet 796.21 GB 256 ? fc041cc8-cd28-4030-b29a-05b9a632cafc a UN restbase1006.eqiad.wmnet 711.87 GB 256 ? 2abf437d-a16d-406b-a6de-8d28b7dda808 d UN restbase1001.eqiad.wmnet 725.37 GB 256 ? c021a198-b7f1-4dc2-94d7-9cb8b8a8df28 a UN restbase1008.eqiad.wmnet 500.7 GB 256 ? e9ab408a-309e-4e93-b145-9ac5f2365523 b UN restbase1004.eqiad.wmnet 812.06 GB 256 ? 798ff758-8c91-46e0-b85e-dad356c46f20 b
Change 229095 had a related patch set uploaded (by Filippo Giunchedi):
cassandra: add restbase1009
indeed!
restbase1009 is bootstrapping
root@restbase1009:~# nodetool netstats | grep -v 100%
Mode: JOINING
Bootstrap 6b6ac0f0-3a93-11e5-8b42-63ed2935445e
/10.64.0.221
/10.64.32.178
/10.64.48.100
Receiving 2662 files, 266574138067 bytes total. Already received 10 files, 414282277 bytes total
/var/lib/cassandra/data/local_group_wiktionary_T_parsoid_dataW4ULtxs1oMq/data-8198e0d0fe3111e492d527cac5e63599/local_group_wiktionary_T_parsoid_dataW4ULtxs1oMq-data-tmp-ka-10-Data.db 8844856/12859749 bytes(68%) received from idx:0/10.64.48.100
/10.64.0.220
/10.64.0.223
/10.64.32.159
/10.64.32.160
/10.64.48.99
Receiving 2899 files, 296637105677 bytes total. Already received 6 files, 435622852 bytes total
/var/lib/cassandra/data/local_group_wiktionary_T_parsoid_dataW4ULtxs1oMq/data-8198e0d0fe3111e492d527cac5e63599/local_group_wiktionary_T_parsoid_dataW4ULtxs1oMq-data-tmp-ka-7-Data.db 83140935/137981086 bytes(60%) received from idx:0/10.64.48.99
Read Repair Statistics:
Attempted: 0
Mismatch (Blocking): 0
Mismatch (Background): 0
Pool Name Active Pending Completed
Commands n/a 0 9
Responses n/a 0 9011
root@restbase1009:~#..aand we are complete:
Datacenter: eqiad ================= Status=Up/Down |/ State=Normal/Leaving/Joining/Moving -- Address Load Tokens Owns Host ID Rack UN restbase1004.eqiad.wmnet 803.24 GB 256 ? 798ff758-8c91-46e0-b85e-dad356c46f20 b UN restbase1008.eqiad.wmnet 478.13 GB 256 ? e9ab408a-309e-4e93-b145-9ac5f2365523 b UN restbase1005.eqiad.wmnet 771.36 GB 256 ? 325e01e8-debe-45f0-a8c2-93b3baa58968 d UN restbase1006.eqiad.wmnet 709.21 GB 256 ? 2abf437d-a16d-406b-a6de-8d28b7dda808 d UN restbase1001.eqiad.wmnet 721.87 GB 256 ? c021a198-b7f1-4dc2-94d7-9cb8b8a8df28 a UN restbase1002.eqiad.wmnet 789.45 GB 256 ? fc041cc8-cd28-4030-b29a-05b9a632cafc a UN restbase1009.eqiad.wmnet 523.07 GB 256 ? 3ee62592-ef41-445a-863f-b05be9ae5ac8 d UN restbase1003.eqiad.wmnet 666.9 GB 256 ? 88d9ef9f-d81b-466e-babf-6a283b13f648 b UN restbase1007.eqiad.wmnet 414.1 GB 256 ? c1b5a012-4840-4096-9a71-ce4d3afb0029 a
/me high-fives @fgiunchedi and @Eevans
I'm really glad that we finally made it.
indeed, I've pooled restbase100[789] in pybal for restbase too so they are fully in service now