Better response times and throughput for the Pageview API (scaling)
Description
Details
Project | Branch | Lines +/- | Subject
---|---|---|---
operations/puppet | production | +111 -0 | Add a new AQS testing environment to play with Cassandra settings before production. |
Event Timeline
I briefly looked at the Cassandra cluster, and the typical column family read latency (this is just the time Cassandra records for local storage-system reads) is quite high.
At least some component of this is the number of SSTables per read. This column family (apparently typical) hits 10 SSTables to satisfy reads at the 50th percentile, and a whopping 24 at the 99th. That's quite a lot, rotational disks notwithstanding.
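For reference, a quick way to pull these numbers is to scrape `nodetool cfhistograms`; below is a minimal Python sketch that does so, assuming the usual percentile-table layout. The keyspace/table names are placeholders, not the actual AQS schema.

```python
import subprocess

# Hypothetical names; substitute the real AQS keyspace/table.
KEYSPACE = "local_group_default_T_pageviews_per_article_flat"
TABLE = "data"

def sstables_per_read(keyspace, table):
    """Return {percentile: SSTables-per-read} parsed from nodetool output."""
    out = subprocess.check_output(
        ["nodetool", "cfhistograms", keyspace, table]).decode()
    stats = {}
    for line in out.splitlines():
        fields = line.split()
        # Percentile rows look like: "50%  10.00  73.46  1629.72  1597  42"
        if fields and fields[0].endswith("%"):
            stats[fields[0]] = float(fields[1])  # second column is SSTables
    return stats

if __name__ == "__main__":
    for pct, sstables in sorted(sstables_per_read(KEYSPACE, TABLE).items()):
        print("%s of reads touch <= %.0f SSTables" % (pct, sstables))
```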
I did an analysis of DateTieredCompactionStrategy in the RESTBase cluster over here, and found that out-of-order writes are flattening the date-based structure, leaving us with a single, all-encompassing tier. Since DTCS does STCS within tiers, we are effectively running STCS. The situation is similar for AQS.
It would seem that upgrades to SSDs are in the works, which should definitely help.
Other options to improve Cassandra read performance:
- TimeWindowCompactionStrategy (CASSANDRA-9666)
This compaction strategy is currently out-of-tree (though that will likely change); we'd need to build and deploy the jars ourselves. That said, it has seen surprisingly wide adoption, likely because its algorithm is much closer to what people expect of DTCS, and it is easier to reason about. For us, it should solve the problem of out-of-order writes.
- Lower node density (more Cassandra nodes / multi-instance)
This is likely going to be necessary at some point anyway. These nodes have 10TB of raw storage, and running single Cassandra instances this large is going to get painful (think repair, bootstrap, decommission, etc.). Adding more nodes decreases the per-node share of data, resulting in fewer (smaller) SSTables. Particularly if the compaction algorithm isn't behaving ideally, this will do a lot to reduce read latency.
With new nodes on the way, now is really the time to think about this. We could evaluate (1) by putting one of the new nodes in write-sampling mode with a local TWCS override. If it passes muster, then bootstrapping the new nodes with TWCS enabled would save a recompaction later. And moving to multi-instance (2) would be straightforward when adding the new nodes, far less so down the road.
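To make the override concrete, here is a minimal sketch of what a per-table TWCS switch could look like, applied through the Python cassandra-driver. The fully qualified class name (from the out-of-tree jar), the window settings, the keyspace/table, and the contact host are all assumptions for illustration, not the values we would deploy.

```python
from cassandra.cluster import Cluster

# All names below (class, window size, keyspace, host) are assumptions.
ALTER_TWCS = """
ALTER TABLE "local_group_default_T_pageviews_per_article_flat".data
WITH compaction = {
  'class': 'com.jeffjirsa.cassandra.db.compaction.TimeWindowCompactionStrategy',
  'compaction_window_unit': 'DAYS',
  'compaction_window_size': '1'
}
"""

cluster = Cluster(["aqs-test1001.eqiad.wmnet"])  # hypothetical test node
session = cluster.connect()
session.execute(ALTER_TWCS)  # only affects this table's compaction settings
cluster.shutdown()
```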
Happy to help with all this; Let me know!
Removing the Pageview-API tag to prevent automatic Analytics tagging (this belongs to Analytics-Kanban now).
@Nemo_bis: there are many issues; the two main ones are: 1) we need new hardware, and 2) we need to bound queries to a date range that we can return without running into timeouts. https://phabricator.wikimedia.org/T134524
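As an illustration of point 2, a minimal sketch of the kind of date-range bound meant here (the helper name, the maximum span, and the error type are assumptions, not the actual AQS implementation):

```python
from datetime import datetime, timedelta

MAX_SPAN = timedelta(days=365)  # assumed cap; the real bound is still being discussed

class QueryTooWideError(ValueError):
    """Raised when a request would scan more data than we can return in time."""

def bounded_range(start_str, end_str, fmt="%Y%m%d"):
    """Parse and validate a date range before it reaches Cassandra."""
    start = datetime.strptime(start_str, fmt)
    end = datetime.strptime(end_str, fmt)
    if end < start:
        raise ValueError("end date precedes start date")
    if end - start > MAX_SPAN:
        # Fail fast with a controlled error instead of letting Cassandra time out.
        raise QueryTooWideError(
            "requested %d days, maximum is %d" % ((end - start).days, MAX_SPAN.days))
    return start, end

# bounded_range("20150701", "20160501") is accepted;
# bounded_range("20100101", "20160501") raises QueryTooWideError.
```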
Change 288373 had a related patch set uploaded (by Elukey):
Add a new AQS testing environment to play with Cassandra settings before production.
I wanted to speak up and say that I get the throttling notice. I need to use tools to make article traffic queries. It is not a problem for me to wait 15 seconds but when I see an alert like this, it makes me a little afraid that the service is not watched and could go down entirely.
He's speaking of the notice at https://tools.wmflabs.org/massviews (or Langviews); both make batch requests, the former up to 500.
Don't worry @Bluerasberry, it's not going to go down. The throttling I added should ensure you get all the data you need, and the notice is there just to keep users informed of the issue, and of the fact that the kind folks on the Analytics team are working on it :)
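For anyone curious what the client-side throttling amounts to, here is a rough Python sketch in the same spirit: space out the batch of per-article requests and back off when the API signals overload. The endpoint template, delays, and retry policy are illustrative assumptions, not Massviews' actual code.

```python
import time
import requests

# Public per-article pageviews endpoint (daily granularity, all-access/user).
API = ("https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/"
       "{project}/all-access/user/{article}/daily/{start}/{end}")

def fetch_batch(project, articles, start, end, delay=0.1, max_retries=3):
    """Fetch pageviews for many articles, pausing between requests."""
    results = {}
    for article in articles:
        url = API.format(project=project, article=article, start=start, end=end)
        for attempt in range(max_retries):
            resp = requests.get(url)
            if resp.status_code in (429, 503):
                # Back off and retry instead of hammering an overloaded cluster.
                time.sleep(delay * (2 ** attempt))
                continue
            resp.raise_for_status()
            results[article] = resp.json()
            break
        time.sleep(delay)  # throttle between articles
    return results
```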
@Nemo_bis: the system sends a 500 when it doesn't work, not when there is a controlled error condition.
Change 288373 merged by Elukey:
Add a new AQS testing environment to play with Cassandra settings before production.
> the system sends a 500 when it doesn't work, not when there is a controlled error condition
Exceeding capacity seems a standard condition to take into account, and certainly one that happens frequently for this service (and any service that queries so much data, I guess).
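To illustrate the distinction being argued here (this is not the actual AQS/RESTBase code, which is a Node.js service): a controlled capacity error can be mapped to a 503 with Retry-After instead of surfacing as a bare 500. A hypothetical Python sketch using the cassandra-driver exceptions:

```python
from cassandra import OperationTimedOut, ReadTimeout

def pageviews_response(session, statement, params, retry_after=15):
    """Return (status, headers, body) for a pageview query."""
    try:
        rows = session.execute(statement, params)
        return 200, {}, [row._asdict() for row in rows]
    except (ReadTimeout, OperationTimedOut):
        # The cluster is overloaded or the query scans too much data: a known,
        # expected condition, so answer 503 + Retry-After rather than a bare 500.
        return 503, {"Retry-After": str(retry_after)}, {
            "title": "Service temporarily overloaded",
            "detail": "Narrow the date range or retry later.",
        }
```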
Summary of the last months of work:
- The new AQS cluster has been built and is up and running. All the technical details have been collected in https://wikitech.wikimedia.org/wiki/User:Elukey/Ops/AQS_Settings (still to be completed, but almost done). All the puppet work is completed.
- The new cluster has been load tested by Joseph in https://etherpad.wikimedia.org/p/analytics-aqs-cassandra and we obtained good performance compared to the current cluster. We are striving for at least one order of magnitude more req/s on the new cluster, and for the moment it seems that we are going to meet that expectation.
- We decided to go for the safest migration path, namely creating and loading data to the new AQS cluster without leveraging Cassandra cluster expansion features. This will allow us to have both the new and old AQS clusters up and running, completely independent from each other, at migration time, in order to ensure a clean and smooth rollback via confd/pybal/LVS in case an unexpected regression appears.
- One major problem that we have been working on (Joseph leading the efforts again) is bulk loading the new cluster with one year of pageview data. Cassandra offers a new experimental feature to create SSTables in Hadoop and stream them directly to the cluster. Tests revealed that pure loading time is cut by a factor of 5, but SSTable compaction takes more time than expected. Joseph is currently working on it to figure out whether it is possible to improve the compaction step.
Is this an apples-to-apples comparison using the same compaction strategy? In other words (as I understand it), you are using leveled compaction now; does the sstableloader method create greater compaction load than the previous method did with leveled compaction?
@Eevans: It is indeed using the same compaction strategy (Leveled). I assume the difference is so big because of pre-sorting while loading, or something like that. I have not managed to find a parameter tuning the size of the SSTables generated by the loaders, and I have the feeling they are too small (~10MB) and therefore incur the extra compaction, but I could be wrong.
After a week of almost no progress compacting a month of bulk-loaded data, we wiped the cluster.
We are eager to use the new cluster in production, and loading takes time, so we decided to go with CQL loading even if it's more expensive at load time (it seems much cheaper at compaction time, though).
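For context, a minimal sketch of what the CQL loading path looks like with the Python cassandra-driver (prepared statement plus concurrent execution). The contact point, keyspace, table, and column names are assumptions for illustration, not the actual AQS schema or loader:

```python
from cassandra.cluster import Cluster
from cassandra.concurrent import execute_concurrent_with_args

cluster = Cluster(["aqs1004-a.eqiad.wmnet"])  # hypothetical contact point
session = cluster.connect()

# Illustrative schema: the real AQS table has different columns.
insert = session.prepare(
    'INSERT INTO "local_group_default_T_pageviews_per_article_flat".data '
    "(project, article, timestamp, views) VALUES (?, ?, ?, ?)")

def load(rows, concurrency=100):
    """Write (project, article, timestamp, views) tuples through plain CQL.

    Prepared statement + concurrent execution keeps the per-row cost down while
    still going through the normal write path (memtables, regular compaction)
    instead of streaming pre-built SSTables.
    """
    results = execute_concurrent_with_args(
        session, insert, rows, concurrency=concurrency,
        raise_on_first_error=False)
    return sum(1 for ok, _ in results if ok)  # number of successful writes
```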
Status update:
We loaded four months of data to the cluster and fixed some misconfigurations (like SSTable compression), but after a review of the disk configuration with the Operations team, we decided to move the Cassandra instances' disk partitions to RAID10 arrays (rather than RAID0). This choice will allow us to be more resilient to disk/host failures in the longer term, but of course it will delay the cluster's data backfill by some days.
We have also reached out to the Analytics community to learn which use cases would require more than the past 18 months of data. This will allow us to make a better prediction of how much disk space we'll need during the next months to satisfy our users' needs.
The plan is to re-image one host at a time (so at most two Cassandra instances down at the same time) and then wait for Cassandra recovery/bootstrapping to bring the new instances up to speed again. From our calculations this work should be finished early next week, but I'll keep this task updated.
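A small sketch of how one might wait for the re-imaged instances to rejoin before moving to the next host: poll `nodetool status` until every node reports Up/Normal. The parsing assumes the usual status-table layout; it is an illustration, not the actual procedure we use.

```python
import subprocess
import time

def all_up_normal(host="localhost"):
    """True when every node in `nodetool status` reports state UN (Up/Normal)."""
    out = subprocess.check_output(["nodetool", "-h", host, "status"]).decode()
    states = []
    for line in out.splitlines():
        fields = line.split()
        # Node rows start with a two-letter state: U/D (up/down) + N/L/J/M.
        if fields and len(fields[0]) == 2 and fields[0][0] in "UD" and fields[0][1] in "NLJM":
            states.append(fields[0])
    return bool(states) and all(state == "UN" for state in states)

while not all_up_normal():
    print("cluster still recovering, checking again in 5 minutes")
    time.sleep(300)
print("all Cassandra instances are Up/Normal")
```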
Status update:
We have re-imaged all the Cassandra hosts (aqs100[456]) with the new RAID settings and started the data load. We are currently very close to finishing the load of 14 months of data, optimistically by the end of this week. We have also improved user management, moving away from the default cassandra user for everything (T142073), and investigated whether anything could be done to improve the performance of the current cluster while waiting for the new one (T143873).
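As an aside on the T142073 work, a minimal sketch of the kind of role setup involved (role name, password placeholder, host, and grants are all assumptions, not the actual configuration):

```python
from cassandra.auth import PlainTextAuthProvider
from cassandra.cluster import Cluster

# Hypothetical host and credentials.
auth = PlainTextAuthProvider(username="cassandra", password="REDACTED")
cluster = Cluster(["aqs1004-a.eqiad.wmnet"], auth_provider=auth)
session = cluster.connect()

# Dedicated role for the AQS service instead of the default superuser.
session.execute(
    "CREATE ROLE IF NOT EXISTS aqs WITH PASSWORD = 'REDACTED' AND LOGIN = true")
session.execute(
    'GRANT ALL PERMISSIONS ON KEYSPACE '
    '"local_group_default_T_pageviews_per_article_flat" TO aqs')
# Once the services use the new role, the default 'cassandra' superuser
# can be disabled (LOGIN = false) from another superuser account.
cluster.shutdown()
```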
Next steps:
- complete the data load (optimistically end of this week)
- consistency checks on the loaded data and the Cassandra/RESTBase settings (will require a couple of days max)
- final load tests to make sure that the cluster will be able to correctly sustain the current traffic (will require 3-4 days max); see the sketch after this list
- Cluster switch: we'll add the new nodes behind the AQS LVS load balancer, leaving the current ones running, and then we'll decommission the old hosts if nothing major appears during the following days.
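A rough sketch of the kind of check meant by the "final load tests" bullet above: replay a set of pageview API queries concurrently and report throughput and latency percentiles. Concurrency, timeouts, and the URL list are assumptions for illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor

import requests

def measure(urls, concurrency=50):
    """Replay a list of pageview API URLs and report throughput and latency."""
    latencies = []

    def hit(url):
        t0 = time.time()
        try:
            requests.get(url, timeout=10)
        finally:
            latencies.append(time.time() - t0)

    start = time.time()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        list(pool.map(hit, urls))
    elapsed = time.time() - start

    latencies.sort()
    return {
        "req_per_s": len(urls) / elapsed,
        "p50_ms": 1000 * latencies[len(latencies) // 2],
        "p99_ms": 1000 * latencies[int(len(latencies) * 0.99)],
    }
```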
Given the above timeline we should be able to make the switch before the end of the quarter.
Change 310831 had a related patch set uploaded (by Elukey):
Add the new aqs nodes to conftool-data