
Test cassandra compactions on new AQS nodes
Closed, Resolved · Public · 21 Story Points

Description

With SSDs, Cassandra supports leveled compaction, which provides the best read performance, so we will go with that.
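
As a rough illustration of what choosing leveled compaction means in practice, the sketch below builds the CQL statement that switches a table to `LeveledCompactionStrategy`. The keyspace and table names are hypothetical placeholders, not the ones used on the AQS cluster, and the 160 MB SSTable size is just Cassandra's default; the thread does not say which values were actually applied.

```python
def leveled_compaction_cql(keyspace: str, table: str, sstable_size_mb: int = 160) -> str:
    """Build a CQL statement that switches a table to LeveledCompactionStrategy.

    keyspace/table are placeholders chosen by the caller; sstable_size_mb
    defaults to 160, which is Cassandra's default for this option.
    """
    return (
        f'ALTER TABLE "{keyspace}"."{table}" '
        "WITH compaction = {"
        "'class': 'LeveledCompactionStrategy', "
        f"'sstable_size_in_mb': {sstable_size_mb}"
        "};"
    )


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    print(leveled_compaction_cql("example_keyspace", "example_table"))
```

The generated statement could then be run through `cqlsh` or a driver; it is only text until something executes it.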

Event Timeline

Restricted Application added subscribers: Zppix, Aklapper. · May 12 2016, 4:35 PM
Nuria added a subscriber: Nuria. · May 12 2016, 5:25 PM (edited)
  • Basic puppet configuration / deploy
  • Load a subset of the data into Cassandra (3 months). Note that the problem occurs when volumes get bigger.
  • Test queries (via RESTBase, hopefully)

What are our success criteria?

  • Compaction (when loading). The more tables (SSTables) you have to read to grab one row of data, the more work the system has to do. The best compaction strategy for read times makes the system look in as few tables as possible when reading a result.
  • Read requests response times

How are we going to get the perf data?

Will they be published to Grafana? (We need to verify this first of all.)

Nuria added a comment. · May 12 2016, 5:41 PM

Puppet is almost done. This was tasked assuming that metrics show up in Grafana; otherwise we need more work.

mforns set the point value for this task to 21. · May 12 2016, 5:47 PM
elukey moved this task from Next Up to In Progress on the Analytics-Kanban board. · May 16 2016, 3:51 PM

Change 289224 had a related patch set uploaded (by Elukey):
Add aqs100[456] to the list of production hosts.

https://gerrit.wikimedia.org/r/289224

Change 289224 merged by Elukey:
Add aqs100[456] to the list of production hosts.

https://gerrit.wikimedia.org/r/289224

Change 289424 had a related patch set uploaded (by Elukey):
Fix typo in IP address settings for aqs1005.eqiad.wmnet

https://gerrit.wikimedia.org/r/289424

Change 289424 merged by Elukey:
Fix typo in IP address settings for aqs1005.eqiad.wmnet

https://gerrit.wikimedia.org/r/289424

Cluster up and running!

elukey@aqs1006:~$ nodetool-a status
Datacenter: eqiad
=================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address       Load       Tokens  Owns (effective)  Host ID                               Rack
UN  10.64.48.148  190.18 KB  256     35.2%             da129795-421f-439b-bd29-6a4cd9f18813  rack1
UN  10.64.48.149  87.18 KB   256     33.6%             e28f73cd-93c6-47e6-b046-8bbf801389f6  rack1
UN  10.64.32.189  140.09 KB  256     32.5%             f05db2ca-61c4-4324-8f9a-d11d3cf66e95  rack1
UN  10.64.0.126   189.93 KB  256     32.8%             af353a9f-0dd4-41f1-8a08-b1c7e57b2c68  rack1
UN  10.64.32.190  54.64 KB   256     32.8%             571af44e-23c3-4140-b59c-66fbdc16af6a  rack1
UN  10.64.0.127   78.15 KB   256     33.0%             06dc704b-b39b-4d2a-8d9e-81368163221f  rack1
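
A check like the one above can be scripted. The following is a minimal sketch, assuming the column layout shown in this output (status, address, load value and unit, tokens, ownership, host ID, rack); the function names are hypothetical, and rows whose load is reported differently (for example for a down node) may not match and are simply skipped.

```python
def parse_nodetool_status(output: str) -> list:
    """Extract node rows from `nodetool status` text into dictionaries.

    Assumes each node row has 8 whitespace-separated fields, as in the
    output pasted in this task. Other lines (headers, legends) are ignored.
    """
    nodes = []
    for line in output.splitlines():
        fields = line.split()
        # Node rows start with a two-letter code: U/D (up/down) followed by
        # N/L/J/M (normal/leaving/joining/moving).
        if (
            len(fields) == 8
            and len(fields[0]) == 2
            and fields[0][0] in "UD"
            and fields[0][1] in "NLJM"
        ):
            nodes.append(
                {
                    "status": fields[0],
                    "address": fields[1],
                    "load": f"{fields[2]} {fields[3]}",
                    "tokens": int(fields[4]),
                    "owns": fields[5],
                    "host_id": fields[6],
                    "rack": fields[7],
                }
            )
    return nodes


def all_up_and_normal(nodes: list) -> bool:
    """True when at least one node was parsed and every node reports UN."""
    return bool(nodes) and all(node["status"] == "UN" for node in nodes)


SAMPLE = """\
Datacenter: eqiad
=================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address       Load       Tokens  Owns (effective)  Host ID                               Rack
UN  10.64.48.148  190.18 KB  256     35.2%             da129795-421f-439b-bd29-6a4cd9f18813  rack1
UN  10.64.48.149  87.18 KB   256     33.6%             e28f73cd-93c6-47e6-b046-8bbf801389f6  rack1
"""

if __name__ == "__main__":
    parsed = parse_nodetool_status(SAMPLE)
    print(len(parsed), all_up_and_normal(parsed))
```

The sample embeds only the first two node rows from the output above; feeding the full output should yield six entries.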

Metrics are also flowing into Ganglia:

https://ganglia.wikimedia.org/latest/?c=Analytics%20Query%20Service%20eqiad&m=mem_report&r=week&s=by%20name&hc=4&mc=2

elukey added a subscriber: Eevans. · May 18 2016, 3:39 PM

Verified that we have Ganglia/Graphite metrics for the new hosts (we also have metrics for every cassandra instance).

@MoritzMuehlenhoff created https://gerrit.wikimedia.org/r/#/c/289830/ to fix the ferm rules, since @JAllemandou was unable to load data into Cassandra from Hadoop (example: https://gist.github.com/anonymous/41781b19ebcba233d029bcab5e76c304).

We'll see next week how to proceed; the next step is to add data to the cluster.

Change 292568 had a related patch set uploaded (by Elukey):
Remove old and redundant AQS specific alarms.

https://gerrit.wikimedia.org/r/292568

Change 292568 merged by Elukey:
Remove old and redundant AQS specific alarms.

https://gerrit.wikimedia.org/r/292568

JAllemandou updated the task description.
Nuria closed this task as Resolved. · Jul 4 2016, 4:27 PM