We are finally ready to release the new AQS cluster (aqs100[456]), but something weird happened today and I'd like to double-check with you before proceeding.
aqs1004 is taking AQS live traffic together with aqs100[123], but the Cassandra clusters are of course separate and not talking to each other (they have been loaded with the same data). I restarted Cassandra on aqs1004 (both the a and b instances) for T130861, and the disk load metrics behaved in a very weird way.
I checked the metrics on aqs1004 and the new values seem consistent:
elukey@neodymium:~$ sudo -i salt aqs100[456]* cmd.run 'du -hs /srv/cassandra*'
aqs1004.eqiad.wmnet:
639G /srv/cassandra-a
657G /srv/cassandra-b
aqs1006.eqiad.wmnet:
678G /srv/cassandra-a
623G /srv/cassandra-b
aqs1005.eqiad.wmnet:
658G /srv/cassandra-a
663G /srv/cassandra-b
elukey@aqs1004:~$ nodetool-a status
Datacenter: eqiad
=================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address       Load       Tokens  Owns (effective)  Host ID                               Rack
UN  10.64.48.148  3.84 TB    256     51.9%             ec437eff-af17-4863-b6ff-42f87ea86557  rack3
UN  10.64.48.149  2.41 TB    256     48.1%             4d24db1d-fc2a-4ec9-9d43-3952d480ff7e  rack3
UN  10.64.32.189  3.23 TB    256     50.2%             de1f9797-9ee0-472f-9713-e9bc3c8a1949  rack2
UN  10.64.32.190  3.06 TB    256     49.8%             38b46448-a547-4a4f-9e96-35a0e28ee796  rack2
UN  10.64.0.126   638.41 GB  256     50.0%             a6c7480a-7f94-4488-a925-0cff98c5841a  rack1
UN  10.64.0.127   655 GB     256     50.0%             ed33d9e1-a654-4ca6-a232-bf97f32206ba  rack1
elukey@aqs1004:~$ nodetool-a cfstats | grep "Space used (total)" | grep -v ": 0"
Space used (total): 4474407412
Space used (total): 605565974
Space used (total): 549343
Space used (total): 9770
Space used (total): 20457
Space used (total): 65249
Space used (total): 4763
Space used (total): 16552
Space used (total): 60427
Space used (total): 4763
Space used (total): 10551
Space used (total): 4763
Space used (total): 4763
Space used (total): 624670
Space used (total): 1016536
Space used (total): 1571796
Space used (total): 680402483107
Space used (total): 9855
Space used (total): 4888

But on aqs1005 (not yet restarted) I see:
Keyspace: local_group_default_T_pageviews_per_article_flat
Read Count: 672810
Read Latency: 4.88949817779165 ms.
Write Count: 13687087432
Write Latency: 0.053604304066448956 ms.
Pending Flushes: 0
Table: data
SSTable count: 4038
SSTables in each level: [0, 10, 108/100, 1059/1000, 2861, 0, 0, 0, 0]
Space used (live): 3551215773006
Space used (total): 3557234299432

Is this normal behavior? I am 99% sure that this is not a big deal, but I'd like to be sure before proceeding :)
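
To cross-check the aqs1004 numbers, here is a minimal sketch that sums the "Space used (total)" values from cfstats output and converts the result to GiB. It assumes the values are plain byte counts; the helper names (total_space_used_bytes, to_gib) are hypothetical and not part of nodetool, and the sample holds only the three largest values from the aqs1004 output above.

```python
import re

# Matches lines such as "Space used (total): 680402483107" in cfstats output.
# "Space used (live)" lines are deliberately not matched.
SPACE_USED_TOTAL = re.compile(r"Space used \(total\):\s*(\d+)")


def total_space_used_bytes(cfstats_output: str) -> int:
    """Sum every 'Space used (total)' value found in the text (assumed bytes)."""
    return sum(int(value) for value in SPACE_USED_TOTAL.findall(cfstats_output))


def to_gib(num_bytes: int) -> float:
    """Convert a byte count to GiB (2**30 bytes)."""
    return num_bytes / 2**30


# Hypothetical sample: the three largest values pasted from the aqs1004 output.
sample = """\
Space used (total): 4474407412
Space used (total): 605565974
Space used (total): 680402483107
"""

print(f"{to_gib(total_space_used_bytes(sample)):.2f} GiB")
```

Feeding it all 19 non-zero values from the aqs1004-a output gives 685486435639 bytes, about 638.41 GiB, which lines up with the 638.41 GB Load that nodetool status reports for 10.64.0.126 and with the 639G from du on /srv/cassandra-a.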
Thanks!
Luca
