
Inconsistent Cassandra disk load shown in metrics and nodetool status
Open, Needs Triage, Public

Description

We are finally ready to release the new AQS cluster (aqs100[456]), but something weird happened today and I'd like to double-check with you before proceeding.

aqs1004 is taking AQS live traffic together with aqs100[123], but the Cassandra clusters are of course separate and not talking to each other (they have been loaded with the same data). I restarted Cassandra on aqs1004 (the a and b instances) for T130861, and the disk load metrics acted in a very weird way:
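
For reference, the restart itself was roughly the following, one instance at a time (a minimal sketch; the cassandra-a/cassandra-b systemd unit names and the nodetool-b wrapper are assumptions about the multi-instance setup, adjust to the host's actual layout):

sudo systemctl restart cassandra-a
nodetool-a status    # wait for the instance to come back as UN before touching the next one
sudo systemctl restart cassandra-b
nodetool-b status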

I checked the metrics on aqs1004 and the new values seem consistent:

elukey@neodymium:~$ sudo -i salt aqs100[456]* cmd.run 'du -hs /srv/cassandra*'
aqs1004.eqiad.wmnet:
    639G       	/srv/cassandra-a
    657G       	/srv/cassandra-b
aqs1006.eqiad.wmnet:
    678G       	/srv/cassandra-a
    623G       	/srv/cassandra-b
aqs1005.eqiad.wmnet:
    658G       	/srv/cassandra-a
    663G       	/srv/cassandra-b

elukey@aqs1004:~$ nodetool-a status
Datacenter: eqiad
=================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address       Load       Tokens       Owns (effective)  Host ID                               Rack
UN  10.64.48.148  3.84 TB    256          51.9%             ec437eff-af17-4863-b6ff-42f87ea86557  rack3
UN  10.64.48.149  2.41 TB    256          48.1%             4d24db1d-fc2a-4ec9-9d43-3952d480ff7e  rack3
UN  10.64.32.189  3.23 TB    256          50.2%             de1f9797-9ee0-472f-9713-e9bc3c8a1949  rack2
UN  10.64.32.190  3.06 TB    256          49.8%             38b46448-a547-4a4f-9e96-35a0e28ee796  rack2
UN  10.64.0.126   638.41 GB  256          50.0%             a6c7480a-7f94-4488-a925-0cff98c5841a  rack1
UN  10.64.0.127   655 GB     256          50.0%             ed33d9e1-a654-4ca6-a232-bf97f32206ba  rack1

elukey@aqs1004:~$ nodetool-a cfstats | grep "Space used (total)" | grep -v ": 0"
       		Space used (total): 4474407412
       		Space used (total): 605565974
       		Space used (total): 549343
       		Space used (total): 9770
       		Space used (total): 20457
       		Space used (total): 65249
       		Space used (total): 4763
       		Space used (total): 16552
       		Space used (total): 60427
       		Space used (total): 4763
       		Space used (total): 10551
       		Space used (total): 4763
       		Space used (total): 4763
       		Space used (total): 624670
       		Space used (total): 1016536
       		Space used (total): 1571796
       		Space used (total): 680402483107
       		Space used (total): 9855
       		Space used (total): 4888
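
Side note: the grep above strips the keyspace/table names, so it is hard to tell which value belongs to which table. Keeping the context lines works too; the exact label spelling can differ between Cassandra versions, so treat the pattern as an assumption:

nodetool-a cfstats | grep -E 'Keyspace:|Table:|Space used \(total\)' | grep -v ': 0$'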

But on aqs1005 (still to be restarted) I can see:

Keyspace: local_group_default_T_pageviews_per_article_flat
       	Read Count: 672810
       	Read Latency: 4.88949817779165 ms.
       	Write Count: 13687087432
       	Write Latency: 0.053604304066448956 ms.
       	Pending Flushes: 0
       		Table: data
       		SSTable count: 4038
       		SSTables in each level: [0, 10, 108/100, 1059/1000, 2861, 0, 0, 0, 0]
       		Space used (live): 3551215773006
       		Space used (total): 3557234299432
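
To tell stale accounting apart from real on-disk growth, the cfstats value can be cross-checked against that table's data directory on disk (a hedged sketch; the /srv/cassandra-<instance>/data/<keyspace>/<table>-<uuid> layout is an assumption based on our data_file_directories setting):

du -sh /srv/cassandra-a/data/local_group_default_T_pageviews_per_article_flat/data-*
du -sh /srv/cassandra-b/data/local_group_default_T_pageviews_per_article_flat/data-*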

Is this normal behavior? I am 99% sure that this is not a big deal, but I'd like to be sure before proceeding :)

Thanks!

Luca

Event Timeline

elukey created this task. · Sep 20 2016, 9:11 AM
Restricted Application added a subscriber: Aklapper. · Sep 20 2016, 9:11 AM
elukey renamed this task from "Incosinstent Cassandra disk load shown in metrics and nodetool status" to "Inconsistent Cassandra disk load shown in metrics and nodetool status". · Sep 20 2016, 9:12 AM

Mentioned in SAL (#wikimedia-operations) [2016-09-20T16:32:28Z] <elukey> restarting cassandra on aqs100[56] (started the work earlier on today, stopped due to T146130)

All Cassandra instances have been restarted, and the behavior outlined in the task's description recurred. As agreed with @Eevans, we are going to keep monitoring it to see if it happens again.
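
For the monitoring, the quick manual check is simply to compare the two views of disk load from time to time (a sketch; the nodetool-b wrapper is an assumption):

du -sh /srv/cassandra-a /srv/cassandra-b
nodetool-a info | grep Load
nodetool-b info | grep Load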

Nuria moved this task from Incoming to Blocked on the Analytics board. · Sep 26 2016, 3:40 PM
Nuria moved this task from Blocked to Radar on the Analytics board. · Apr 25 2017, 7:50 PM

@elukey, is this still a thing?