
Better response times on AQS (Pageview API mostly) {melc}
Closed, Resolved · Public · 0 Estimated Story Points

Description

Better response times for the Pageview API, better throughput, scaling.

Related Objects

Event Timeline


I briefly looked at the Cassandra cluster, and the typical column family read latency (this is just the time Cassandra records for local storage-system reads) is quite high.

screenshot-grafana wikimedia org 2016-04-13 10-20-26.png (262×1 px, 101 KB)

screenshot-grafana wikimedia org 2016-04-13 10-24-10.png (262×1 px, 110 KB)

At least some component of this is the number of SSTables per read. This column family (apparently a typical one) hits 10 SSTables to satisfy a read at the 50th percentile, and a whopping 24 at the 99th. That's quite a lot, rotational disks notwithstanding.

eevans@aqs1001:~$ nodetool cfhistograms -- local_group_default_T_pageviews_per_article_flat data
local_group_default_T_pageviews_per_article_flat/data histograms
Percentile  SSTables   Write Latency   Read Latency   Partition Size   Cell Count
                            (micros)       (micros)          (bytes)
50%            10.00           29.00      379022.00             1916           35
75%            14.00           35.00     1131752.00             3973           72
95%            20.00           60.00     2816159.00            20501          310
98%            20.00           72.00     3379391.00            29521          535
99%            24.00          103.00     4055269.00            51012          924
Min             0.00            6.00          51.00              925           15
Max            60.00     10090808.00    12108970.00           263210         4768

I did an analysis of DateTieredCompactionStrategy in the RESTBase cluster over here, and found that out-of-order writes are flattening the date-based structure, leaving us with a single, all-encompassing tier. Since DTCS does STCS within tiers, we are effectively running STCS. The situation is similar for AQS.

eevans@aqs1001:~/compaction$ sudo -u cassandra ./print_sstables_info /var/lib/cassandra/data/local_group_default_T_pageviews_per_article_flat/data-f478740078fa11e5bfce55ce467c43aa
ID | Timespan | Max timestamp | Size
60804 | 172 days, 8:19:10.689000 | 2016-04-13 04:07:00.759000 | 1 GB
60550 | 172 days, 8:16:09.957000 | 2016-04-13 04:04:00.260000 | 6 GB
59996 | 171 days, 16:41:20.289000 | 2016-04-12 12:29:10.365000 | 22 GB
60147 | 171 days, 16:41:20.277000 | 2016-04-12 12:29:10.342000 | 12 GB
57935 | 168 days, 8:36:59.525001 | 2016-04-09 04:24:49.628001 | 63 GB
60849 | 167 days, 2:27:47.254000 | 2016-04-13 04:07:00.619000 | 73 MB
60803 | 153 days, 14:06:03.332001 | 2016-04-11 16:26:51.567001 | 2 GB
59246 | 153 days, 12:13:26.965999 | 2016-04-11 14:34:23.545000 | 10 GB
60818 | 149 days, 16:55:34.091000 | 2016-04-12 04:24:47.236000 | 3 MB
60834 | 149 days, 9:59:46.798000 | 2016-04-07 12:23:23.148001 | 131 MB
51854 | 143 days, 0:50:37.059999 | 2016-04-01 03:11:24.549000 | 100 GB
47637 | 126 days, 8:04:51.378001 | 2016-02-27 03:52:41.429001 | 581 GB
60812 | 126 days, 8:02:47.450000 | 2016-02-27 03:50:37.841000 | 567 MB
59975 | 104 days, 10:56:37.075000 | 2016-04-07 12:23:23.161001 | 11 GB
60799 | 61 days, 1:44:23.249001 | 2016-04-09 04:24:44.724001 | 1 GB
59976 | 61 days, 1:44:20.290999 | 2016-04-09 04:24:41.768000 | 14 GB
60847 | 45 days, 0:43:07.506001 | 2016-02-08 02:09:53.618001 | 1 GB
60844 | 45 days, 0:12:23.379000 | 2016-03-24 02:52:49.561000 | 222 MB
47083 | 41 days, 4:19:15.816000 | 2016-02-04 05:46:02.372000 | 140 GB
57257 | 12 days, 1:47:08.213001 | 2016-04-06 03:12:45.585001 | 94 GB
49631 | 10 days, 22:35:49.693000 | 2016-03-13 02:40:40.144000 | 79 GB
48462 | 10 days, 10:21:30.744998 | 2016-03-01 04:42:45.998000 | 45 GB
60758 | 10 days, 1:41:37.277000 | 2016-04-04 03:07:07.367000 | 5 GB
60843 | 10 days, 1:41:37.221999 | 2016-04-04 03:07:07.311000 | 376 MB
50715 | 10 days, 1:37:43.264000 | 2016-03-24 02:53:39.220000 | 48 GB
47081 | 9 days, 3:09:41.723000 | 2016-02-18 05:07:06.076000 | 27 GB
60842 | 7 days, 2:52:35.191999 | 2016-04-12 04:24:47.281000 | 189 MB
60537 | 6 days, 14:48:08.520999 | 2016-04-11 16:20:20.585000 | 1 GB
60538 | 4 days, 8:29:04.523000 | 2016-04-12 04:56:54.663001 | 1 GB
60841 | 4 days, 8:29:04.446001 | 2016-04-12 04:56:54.617001 | 201 MB
57254 | 3 days, 9:38:35.952999 | 2016-04-08 11:10:48.017000 | 39 GB
59105 | 2 days, 9:46:12.666000 | 2016-04-10 06:14:02.811000 | 41 GB
60793 | 1 day, 10:36:04.743001 | 2016-04-12 12:27:48.934001 | 276 MB
60790 | 1 day, 0:04:10.201999 | 2016-04-13 02:49:48.036000 | 119 MB
60530 | 19:19:25.116001 | 2016-04-12 04:56:54.657001 | 895 MB
60839 | 17:08:08.778000 | 2016-04-12 02:45:38.320001 | 67 MB
60151 | 9:34:01.491000 | 2016-04-12 12:19:39.828000 | 3 GB
60814 | 9:16:11.294000 | 2016-04-12 12:01:49.723000 | 4 MB
59103 | 7:43:56.605999 | 2016-04-11 09:37:33.908000 | 8 GB
60840 | 7:43:51.495000 | 2016-04-11 09:37:29.255001 | 10 MB
59629 | 6:19:17.435000 | 2016-04-11 15:56:51.341000 | 15 GB
60787 | 2:07:30.813000 | 2016-04-13 04:03:45.125000 | 127 MB
60838 | 1:52:07.249000 | 2016-04-13 04:02:02.163000 | 19 MB
60516 | 0:33:48.603999 | 2016-04-13 02:51:28.072000 | 657 MB
60837 | 0:16:08.755001 | 2016-04-13 03:30:11.231001 | 1 MB
60836 | 0:15:44.803000 | 2016-04-13 03:45:56.153000 | 353 KB
60782 | 0:10:35.988001 | 2016-04-13 03:56:32.141001 | 4 MB
60275 | 0:08:32.889000 | 2016-04-13 03:25:22.525001 | 146 MB
60417 | 0:08:06.173000 | 2016-04-13 03:38:17.665000 | 277 MB
60781 | 0:04:57.418000 | 2016-04-13 04:06:59.784000 | 164 KB
60808 | 0:01:53.391001 | 2016-04-13 01:59:54.363001 | 152 KB
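
The timespans above come from a local helper script (print_sstables_info, not included here); a rough equivalent can be pieced together with the stock sstablemetadata tool, which prints each SSTable's minimum and maximum cell timestamps in microseconds since the epoch. A minimal sketch, assuming the data file layout and paths shown above:

# Rough equivalent of the Timespan column above, using the stock sstablemetadata
# tool (the Data.db glob below is an assumption based on the directory layout shown).
for f in /var/lib/cassandra/data/local_group_default_T_pageviews_per_article_flat/data-*/*-Data.db; do
  echo "== $f"
  sudo -u cassandra sstablemetadata "$f" | grep -E 'Minimum timestamp|Maximum timestamp'
done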

It would seem that upgrades to SSDs are in the works, which should definitely help.

Other options to improve Cassandra read performance:

  1. TimeWindowCompactionStrategy (CASSANDRA-9666)

    This compaction strategy is currently out-of-tree (though that will likely change); we'd need to build and deploy the jars ourselves. That said, it has seen surprisingly wide adoption, likely because its algorithm is much closer to what people expect of DTCS, and it is easier to reason about. For us, it should solve the problem of out-of-order writes (see the configuration sketch after this list).
  2. Lower node density (more Cassandra nodes / multi-instance)

    This is likely going to be necessary at some point anyway. These nodes have 10T of raw storage, and running single Cassandra instances this large is going to get painful (think repairs, bootstraps, decommissions, etc.). Adding more nodes decreases the per-node share of data, resulting in fewer (smaller) SSTables. Particularly if the compaction algorithm isn't behaving ideally, this will do a lot to reduce read latency.
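
For what it's worth, a minimal sketch of what option 1 could look like once a TWCS jar is on the Cassandra classpath. The fully-qualified class name assumes the out-of-tree build, and the one-day window is only an assumption that would need tuning to the ingestion pattern:

# Sketch only: point the pageview table at TWCS and bucket SSTables into
# one-day windows (window unit/size are assumptions, not a recommendation).
cqlsh -e "
  ALTER TABLE \"local_group_default_T_pageviews_per_article_flat\".data
  WITH compaction = {
    'class': 'com.jeffjirsa.cassandra.db.compaction.TimeWindowCompactionStrategy',
    'compaction_window_unit': 'DAYS',
    'compaction_window_size': '1'
  };"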

With new nodes on the way, now is really the time to think about this. We could evaluate (1) using one of the new nodes in write-sampling mode with a local TWCS override. If it passes muster, then bootstrapping the new nodes with TWCS enabled would save a recompaction later. And moving to multi-instance (2) would be straightforward when adding the new nodes, far less so down the road.

Happy to help with all this; let me know!

JAllemandou edited projects, added Analytics-Kanban; removed Analytics.
JAllemandou moved this task from Next Up to In Progress on the Analytics-Kanban board.
Nuria renamed this task from Better response times Pageview API to Better response times on AQS (Pageview API mostly). (Apr 19 2016, 5:44 PM)
Nuria renamed this task from Better response times on AQS (Pageview API mostly) to Better response times on AQS (Pageview API mostly) {slug}. (Apr 19 2016, 5:47 PM)
Nuria renamed this task from Better response times on AQS (Pageview API mostly) {slug} to Better response times on AQS (Pageview API mostly) {melc}.
Nuria moved this task from In Progress to Parent Tasks on the Analytics-Kanban board.
Nuria set the point value for this task to 0.

Removing pageview-API tag to prevent having automatic Analytics tagging (this belongs to Analytics-Kanban now)

@Nemo_bis : there are many issues, among them two main ones: 1) we need new hardware and 2) we need to bound queries to a date range that we can return w/o running into timeouts. https://phabricator.wikimedia.org/T134524
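
To illustrate point 2, a date-bounded per-article query looks like the following; the article and date range here are arbitrary examples:

# Example of a per-article query bounded to one month of daily data
# (article and dates are arbitrary examples).
curl -s 'https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/en.wikipedia/all-access/all-agents/Cassandra/daily/20160401/20160430'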

Change 288373 had a related patch set uploaded (by Elukey):
Add a new AQS testing environment to play with Cassandra settings before production.

https://gerrit.wikimedia.org/r/288373

I wanted to speak up and say that I get the throttling notice. I need to use tools to make article traffic queries. It is not a problem for me to wait 15 seconds but when I see an alert like this, it makes me a little afraid that the service is not watched and could go down entirely.

I wanted to speak up and say that I get the throttling notice. I need to use tools to make article traffic queries. It is not a problem for me to wait 15 seconds but when I see an alert like this, it makes me a little afraid that the service is not watched and could go down entirely.

He's speaking of the notice at https://tools.wmflabs.org/massviews (or Langviews); both make batch requests, the former up to 500.

Don't worry @Bluerasberry, it's not going to go down. I think the throttling I added ensures you get all the data you need, and the notice is there just to keep users informed of the issue and aware that the kind folks on the Analytics team are working on it :)

@Nemo_bis : there are many issues, among them two main ones: 1) we need new hardware and 2) we need to bound queries to a date range that we can return w/o running into timeouts.

Not really. The first thing is to not lie to clients. Error 500 is very unhelpful.

@Nemo_bis: the system sends a 500 when it doesn't work, not when there is a controlled error condition

Change 288373 merged by Elukey:
Add a new AQS testing environment to play with Cassandra settings before production.

https://gerrit.wikimedia.org/r/288373

the system sends a 500 when it doesn't work, not when there is a controlled error condition

Exceeding capacity seems a standard condition to take into account, and certainly one that happens frequently for this service (and any service that queries so much data, I guess).

Summary of the last months of work:

  • The new cluster has been load tested by Joseph in https://etherpad.wikimedia.org/p/analytics-aqs-cassandra and we obtained good performance compared to the current cluster. We are aiming for at least one order of magnitude more req/s on the new cluster, and for the moment it seems we are going to meet that expectation.
  • We decided to go for the safest migration path, namely creating and loading data into the new AQS cluster without leveraging Cassandra cluster expansion features. This will allow us to have both the new and old AQS clusters up and running, completely independent from each other at migration time, ensuring a clean and smooth rollback via confd/pybal/LVS in case an unexpected regression appears.
  • One major problem that we have been working on (Joseph leading the efforts again) is bulk loading the new cluster with one year of pageview data. Cassandra offers a new experimental feature to create SSTables in Hadoop and stream them directly to the cluster (see the sketch after this list). Tests revealed that pure loading time is cut by a factor of 5, but SSTable compaction takes more time than expected. Joseph is currently working on it to figure out whether the compaction step can be improved.
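
For reference, a minimal sketch of the streaming side of that experiment; the host name and staging path below are placeholders, and the SSTables themselves would be generated in Hadoop (e.g. via CQLSSTableWriter):

# Stream Hadoop-generated SSTables into the new cluster with the stock sstableloader.
# The directory must follow the <staging>/<keyspace>/<table>/ layout; host and path
# below are placeholders, not the real ones.
sstableloader -d aqs1004-a.eqiad.wmnet \
  /srv/bulkload/local_group_default_T_pageviews_per_article_flat/data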

Is this an apples-to-apples comparison using the same compaction strategy? In other words (as I understand it) you are using leveled compaction now; does the sstableloader method create greater compaction load than the previous method did with leveled compaction?

@Eevans : It is indeed using the same compaction strategy (Leveled). I assume the difference is so big because of some while-loading pre-sorting or something like that. I have not managed to find a parameter for tuning the size of the SSTables generated by the loader, and I have the feeling they are too small (~10 MB) and therefore incur extra compaction, but I could be wrong.

After a week of almost no progress on a month of bulk-loaded data to compact, we wiped the cluster.
We are eager to use the new cluster in prod, and loading takes time, so we decided to go with CQL loading even if it's more expensive at load time (it seems really cheaper at compaction time though).

Status update:

We loaded four months of data into the cluster and fixed some misconfigurations (like SSTable compression), but after a review of the disk configuration with the Operations team, we decided to move the Cassandra instances' disk partitions to RAID10 arrays (rather than RAID0). This choice will make us more resilient to disk/host failures in the longer term, but of course it will delay the cluster's data backfill by some days.
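
For illustration only (device names, member count and mount point below are made up, not the actual provisioning recipe), the RAID10 layout amounts to something like:

# Illustration only: software RAID10 for a Cassandra data partition
# (devices, member count and mount point are assumptions, not the production recipe).
mdadm --create /dev/md2 --level=10 --raid-devices=4 /dev/sdc1 /dev/sdd1 /dev/sde1 /dev/sdf1
mkfs.ext4 /dev/md2
mount /dev/md2 /var/lib/cassandra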

We have also reached out to the Analytics community to learn which use cases would require more than the past 18 months of data. This will allow us to better predict how much disk space we'll need over the coming months to satisfy our users' needs.

The plan is to re-image one host at a time (so at most two Cassandra instances down at the same time) and then wait for Cassandra recovery/bootstrapping to bring the new instances up to speed again. From our calculations this work should be finished early next week, but I'll keep this task updated.
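
A quick way to confirm that a re-imaged host's instances have rejoined and finished streaming before moving on to the next host; with multi-instance, nodetool has to be pointed at each instance's address (the placeholder below):

# All instances should report UN (Up/Normal) before the next host is re-imaged;
# netstats shows whether bootstrap/recovery streams are still running.
# <instance-ip> is a placeholder for each local Cassandra instance.
nodetool -h <instance-ip> status local_group_default_T_pageviews_per_article_flat
nodetool -h <instance-ip> netstats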

Status update:

We have re-imaged all the Cassandra hosts (aqs100[456]) with the new RAID settings and started the data load. We are currently very close to finishing the load of 14 months of data, optimistically by the end of this week. We have also improved user management, moving away from using the default cassandra user for everything (T142073), and investigated whether anything could be done to improve the performance of the current cluster while waiting for the new one (T143873).

Next steps:

  1. complete the data load (optimistically by the end of this week)
  2. consistency checks on the loaded data and on the Cassandra/RESTBase settings (will require a couple of days max)
  3. final load tests to make sure that the cluster will be able to sustain the current traffic correctly (will require 3-4 days max)
  4. cluster switch - we'll add the new nodes behind the AQS LVS load balancer, leaving the current ones running, and then we'll deprecate the old hosts if nothing major appears during the following days.

Given the above timeline we should be able to make the switch before the end of the quarter.
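
A hypothetical sketch of the step 4 switch via conftool; selector fields and host names depend on the conftool-data layout added in the patch below:

# Hypothetical: pool a new AQS host behind LVS, and depool an old one later.
# Selector fields and host names depend on the actual conftool-data layout.
confctl select 'dc=eqiad,cluster=aqs,name=aqs1004.eqiad.wmnet' set/pooled=yes
confctl select 'dc=eqiad,cluster=aqs,name=aqs1001.eqiad.wmnet' set/pooled=no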

Change 310831 had a related patch set uploaded (by Elukey):
Add the new aqs nodes to conftool-data

https://gerrit.wikimedia.org/r/310831

it's really fast now! :)