
Inconsistent Cassandra disk load shown in metrics and nodetool status
Closed, Resolved · Public

Description

We are finally ready to release the new AQS cluster (aqs100[456]), but something weird happened today, and before proceeding I'd like to double-check with you.

aqs1004 is taking AQS live traffic alongside aqs100[123], but the two Cassandra clusters are of course separate and not talking to each other (they have been loaded with the same data). I restarted Cassandra on aqs1004 (instances a and b) for T130861, and the disk load metrics behaved in a very weird way:

Screen Shot 2016-09-20 at 10.57.06 AM.png (648×1 px, 60 KB)

I checked metrics on aqs1004 and the new values seem consistent:

elukey@neodymium:~$ sudo -i salt aqs100[456]* cmd.run 'du -hs /srv/cassandra*'
aqs1004.eqiad.wmnet:
    639G       	/srv/cassandra-a
    657G       	/srv/cassandra-b
aqs1006.eqiad.wmnet:
    678G       	/srv/cassandra-a
    623G       	/srv/cassandra-b
aqs1005.eqiad.wmnet:
    658G       	/srv/cassandra-a
    663G       	/srv/cassandra-b

elukey@aqs1004:~$ nodetool-a status
Datacenter: eqiad
=================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address       Load       Tokens       Owns (effective)  Host ID                               Rack
UN  10.64.48.148  3.84 TB    256          51.9%             ec437eff-af17-4863-b6ff-42f87ea86557  rack3
UN  10.64.48.149  2.41 TB    256          48.1%             4d24db1d-fc2a-4ec9-9d43-3952d480ff7e  rack3
UN  10.64.32.189  3.23 TB    256          50.2%             de1f9797-9ee0-472f-9713-e9bc3c8a1949  rack2
UN  10.64.32.190  3.06 TB    256          49.8%             38b46448-a547-4a4f-9e96-35a0e28ee796  rack2
UN  10.64.0.126   638.41 GB  256          50.0%             a6c7480a-7f94-4488-a925-0cff98c5841a  rack1
UN  10.64.0.127   655 GB     256          50.0%             ed33d9e1-a654-4ca6-a232-bf97f32206ba  rack1
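To quantify the mismatch: four of the six nodes report a TB-scale Load while du shows only ~600-680G per instance on every host. A quick sketch comparing the largest reported Load against the largest du figure (assuming both are binary, 1024-based units — an assumption, since the unit handling here is exactly what is in question):

```shell
# Compare the largest gossip-reported Load (3.84 TB, from nodetool status)
# with the largest on-disk figure (678G, from du), assuming binary units.
awk 'BEGIN {
    load = 3.84 * 1024^4      # reported Load, in bytes
    disk =  678 * 1024^3      # du figure, in bytes
    printf "reported Load is %.1fx the on-disk size\n", load / disk
}'
# -> reported Load is 5.8x the on-disk size
```

So the status output claims roughly six times more data than is actually on disk for the non-restarted nodes.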

elukey@aqs1004:~$ nodetool-a cfstats | grep "Space used (total)" | grep -v ": 0"
       		Space used (total): 4474407412
       		Space used (total): 605565974
       		Space used (total): 549343
       		Space used (total): 9770
       		Space used (total): 20457
       		Space used (total): 65249
       		Space used (total): 4763
       		Space used (total): 16552
       		Space used (total): 60427
       		Space used (total): 4763
       		Space used (total): 10551
       		Space used (total): 4763
       		Space used (total): 4763
       		Space used (total): 624670
       		Space used (total): 1016536
       		Space used (total): 1571796
       		Space used (total): 680402483107
       		Space used (total): 9855
       		Space used (total): 4888
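As a sanity check, summing every "Space used (total)" value above and converting to GiB reproduces the 638.41 GB Load shown for 10.64.0.126 in the status output, so the restarted instance's numbers are at least internally consistent. A minimal sketch (in practice this would be piped straight from `nodetool-a cfstats`; the byte counts are inlined here for illustration):

```shell
# Sum the per-table "Space used (total)" byte counts and convert to GiB.
awk -F': ' '/Space used \(total\)/ { sum += $2 }
            END { printf "%.2f GiB\n", sum / (1024*1024*1024) }' <<'EOF'
Space used (total): 4474407412
Space used (total): 605565974
Space used (total): 549343
Space used (total): 9770
Space used (total): 20457
Space used (total): 65249
Space used (total): 4763
Space used (total): 16552
Space used (total): 60427
Space used (total): 4763
Space used (total): 10551
Space used (total): 4763
Space used (total): 4763
Space used (total): 624670
Space used (total): 1016536
Space used (total): 1571796
Space used (total): 680402483107
Space used (total): 9855
Space used (total): 4888
EOF
# -> 638.41 GiB
```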

But on aqs1005 (not yet restarted) I can see:

Keyspace: local_group_default_T_pageviews_per_article_flat
       	Read Count: 672810
       	Read Latency: 4.88949817779165 ms.
       	Write Count: 13687087432
       	Write Latency: 0.053604304066448956 ms.
       	Pending Flushes: 0
       		Table: data
       		SSTable count: 4038
       		SSTables in each level: [0, 10, 108/100, 1059/1000, 2861, 0, 0, 0, 0]
       		Space used (live): 3551215773006
       		Space used (total): 3557234299432

Is this normal behavior? I am 99% sure it's not a big deal, but I'd like to be certain before proceeding :)

Thanks!

Luca

Event Timeline

elukey renamed this task from Incosinstent Cassandra disk load shown in metrics and nodetool status to Inconsistent Cassandra disk load shown in metrics and nodetool status.Sep 20 2016, 9:12 AM

Mentioned in SAL (#wikimedia-operations) [2016-09-20T16:32:28Z] <elukey> restarting cassandra on aqs100[56] (started the work earlier on today, stopped due to T146130)

All Cassandra instances have been restarted, and the behavior outlined in the task's description recurred. As agreed with @Eevans, we are going to keep monitoring to see if it happens again.

Eevans claimed this task.

@elukey, is this still a thing?

I'm going to assume it isn't, and boldly close; please reopen if necessary.