
Understand (and if possible, improve) cluster performance under the new storage strategy
Closed, Resolved · Public

Description

The RESTBase storage strategy has been extensively redesigned, and is now running in production on Cassandra 3.11.0. Production experience with both this new strategy and Cassandra 3.11 is limited, so expectations are largely uninformed. In general, the performance of end-user-facing queries seems quite good, and the distribution of latencies has narrowed immensely.

Screenshot-2018-3-6 Grafana - RESTBase.png (782×1 px, 111 KB)

But p99 Cassandra-local latencies can apparently be quite high and erratic.

Screenshot-2018-3-6 Grafana - Cassandra Tables.png (773×1 px, 283 KB)

Obviously, the important metrics, the ones we're ultimately judged by, are the ones users are exposed to; but it would benefit us to understand this better and, if possible, identify optimizations to improve latency and overall capacity.
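As a starting point for understanding the Cassandra-local numbers, per-table latency distributions can also be spot-checked directly on a node. A minimal sketch (the keyspace/table below are illustrative, and on our multi-instance hosts nodetool has to be pointed at the right instance):

  # Coordinator-level (client-facing) latency percentiles:
  nodetool proxyhistograms
  # Replica-local read/write latency percentiles for one table:
  nodetool tablehistograms wikipedia_T_parsoid__ng_html data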

This ticket will track efforts to understand the current performance of the system, and identify and implement improvements.


Some Recommendations

Re-visiting Hardware

As part of the work done in T178177: Investigate aberrant Cassandra columnfamily read latency of restbase101{0,2,4}, we discovered that HP machines configured as a JBOD of single-disk RAID-0s performed worse than when configured with regular HBA-attached disks. There may, however, be more to it.

To summarize, we currently have examples of:

  1. Dells w/ Samsung SSD 850 devices
  2. Dells w/ Intel SSDSC2BX01 devices
  3. HPs w/ Samsung SSD 850 devices configured as single-disk RAID-0s
  4. HPs w/ Samsung SSD 850 devices (directly attached disks)
  5. Exactly 1 HP with 4 direct attached HP LK1600GEYMV devices

To complicate comparisons, the cluster is composed of two data-centers, each of which has a distinctly different workload, and not every combination listed above exists in both data-centers. However, see the following graphs:

eqiad

Screenshot-2018-3-9 Grafana - Cassandra System.png (485×1 px, 285 KB)

The above graph covers restbase1007, 1010, 1011, and 1016, which represent configurations #1, #3, #4, and #2 respectively. Here you can see that the Dells (1007 and 1016) perform better than any of the HPs, but it is also clear that the Intel-equipped machine has significantly lower iowait than the Samsung-equipped one.

codfw

Screenshot-2018-3-9 Grafana - Cassandra System(1).png (483×1 px, 286 KB)

The above graph covers restbase2001, 2008, 2009, and 2010, which represent configurations #3, #4, #5, and #2 respectively. What is interesting here is that the Samsung-equipped HPs (both single-disk RAID-0 and HBA-configured) are characteristically bad, but the lone HP with the HP devices has iowait comparable to the Intel-equipped Dell.

We do not have any HPs equipped with the Intel disks, and it would be interesting to see how that combination compares; the restbase2009 configuration would seem to indicate that it may be a mistake to attribute the poor(er) performance to the SmartArray-equipped HP machines themselves.
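For a more direct device-level comparison it may also be worth sampling per-device service times on each host, rather than relying only on the aggregate iowait graphs. A rough sketch (exact column names vary with the sysstat version):

  # Extended per-device statistics in 5-second samples; compare await / %util
  # across the Dell/HP and Samsung/Intel/HP-device combinations listed above.
  iostat -x -d 5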


Event Timeline

Mentioned in SAL (#wikimedia-operations) [2018-03-07T14:01:49Z] <urandom> setting trace probability to 0.0, restbase eqiad cassandra cluster - T189057
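(For reference, trace sampling is toggled with nodetool; something along these lines, where 0.001 samples roughly 0.1% of requests and 0 turns tracing back off:)

  nodetool settraceprobability 0.001
  nodetool settraceprobability 0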

Mentioned in SAL (#wikimedia-operations) [2018-03-07T14:17:33Z] <urandom> reducing compression chunk length to 32kb on "wikipedia_T_page__summary".data - T189057
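(The chunk length changes are ordinary schema alterations, roughly as follows; this sketch assumes LZ4 remains the compressor in use, and the host/table named are just illustrative examples:)

  cqlsh --cqlshrc=/etc/cassandra-a/cqlshrc restbase1007-a.eqiad.wmnet -e \
    "ALTER TABLE \"wikipedia_T_page__summary\".data WITH compression = {'class': 'LZ4Compressor', 'chunk_length_in_kb': '32'};"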

Some preliminary information from query trace examination:

There don't appear to be any particular patterns in the queries themselves when looking at the slowest traces. Even the "heartbeat" issued by the NodeJS driver (SELECT key from system.local) occasionally appears in this list (given our query rate and number of open connections, the rate of these heartbeats is high, even though each one is quite cheap).

eevans@restbase1007:~$ sudo ./traces --cqlshrc=/etc/cassandra-a/cqlshrc --hostname restbase1007-a.eqiad.wmnet sessions --min-duration 100000
fa6d0170-219d-11e8-a28a-1d583f325411 | 3067970 | 2018-03-07 00:25:02.983 +0000 UTC | SELECT key from system.local
ef3a6f30-214e-11e8-baae-33e0e0064f9e | 944219 | 2018-03-06 14:59:13.955 +0000 UTC | SELECT key from system.local
45e329e0-216c-11e8-b786-237a199d4b45 | 607700 | 2018-03-06 18:29:14.75 +0000 UTC | select "key","tid","headers","value","_domain" from "globaldomain_T_mathoid__ng_check"."data" where "key" = ? AND "_domain" = ?
da1f0550-214b-11e8-b786-237a199d4b45 | 327023 | 2018-03-06 14:37:10.053 +0000 UTC | select "key","rev","tid","value","content-location","content-type","tags","_domain" from "wikipedia_T_mobile__ng_lead"."data" where "key" = ? AND "_domain" = ?
146dee60-214c-11e8-acda-1996b96227c0 | 237093 | 2018-03-06 14:38:47.878 +0000 UTC | select "key","tid","headers","value","_domain" from "enwiki_T_page__summary"."data" where "key" = ? AND "_domain" = ?
9899ec10-2198-11e8-b3f3-416aaf7799e3 | 222255 | 2018-03-06 23:46:31.377 +0000 UTC | select "key","rev","tid","value","content-location","content-type","tags","_domain" from "wikipedia_T_parsoid__ng_html"."data" where "key" = ? AND "_domain" = ?
f28ca470-216f-11e8-bb39-ff6135eb5e37 | 187005 | 2018-03-06 18:55:32.919 +0000 UTC | select "key","rev","tid","value","content-location","content-type","tags","_domain" from "wikipedia_T_mobile__ng_lead"."data" where "key" = ? AND "_domain" = ?
8c875e30-2206-11e8-baae-33e0e0064f9e | 178413 | 2018-03-07 12:53:35.763 +0000 UTC | SELECT key from system.local
efee7650-21e5-11e8-b3f3-416aaf7799e3 | 168022 | 2018-03-07 09:00:09.141 +0000 UTC | select "key","rev","tid","value","content-location","content-type","tags","_domain" from "others_T_parsoid__ng_html"."data" where "key" = ? AND "rev" = ? AND "_domain" = ?
839d3ee0-21c3-11e8-a28a-1d583f325411 | 154539 | 2018-03-07 04:53:44.526 +0000 UTC | select "key","rev","tid","value","content-location","content-type","tags","_domain" from "enwiki_T_parsoid__ng_html"."data" where "key" = ? AND "_domain" = ?
b4d984c0-214e-11e8-b433-817606c1cc5f | 153646 | 2018-03-06 14:57:36.012 +0000 UTC | select "key","rev","tid","value","content-location","content-type","tags","_domain" from "wikipedia_T_parsoid__ng_html"."data" where "key" = ? AND "_domain" = ?
4c5f4de0-2198-11e8-bb39-ff6135eb5e37 | 143108 | 2018-03-06 23:44:23.486 +0000 UTC | select "key","rev","tid","value","content-location","content-type","tags","_domain" from "wikipedia_T_parsoid__ng_html"."data" where "key" = ? AND "rev" = ? AND "_domain" = ?
e682d620-214e-11e8-853c-bdd6580297de | 139927 | 2018-03-06 14:58:59.33 +0000 UTC | select "key","rev","tid","value","content-location","content-type","tags","_domain" from "wikipedia_T_parsoid__ng_html"."data" where "key" = ? AND "_domain" = ?
99c2e970-219d-11e8-a4ec-01f40346f0c7 | 137831 | 2018-03-07 00:22:20.807 +0000 UTC | select "key","rev","tid","value","content-location","content-type","tags","_domain" from "wikipedia_T_parsoid__ng_html"."data" where "key" = ? AND "rev" = ? AND "_domain" = ?
77f32120-21c0-11e8-a9bc-6f6f0f99c632 | 137234 | 2018-03-07 04:31:56.466 +0000 UTC | select "key","rev","tid","value","content-location","content-type","tags","_domain" from "wikipedia_T_parsoid__ng_html"."data" where "key" = ? AND "rev" = ? AND "_domain" = ?
9bbe2100-214c-11e8-b30a-a9cd75a51cff | 123483 | 2018-03-06 14:42:34.896 +0000 UTC | select "key","rev","tid","value","content-location","content-type","tags","_domain" from "wikipedia_T_parsoid__ng_html"."data" where "key" = ? AND "_domain" = ?
fd6fe0d0-2153-11e8-853c-bdd6580297de | 115766 | 2018-03-06 15:35:25.277 +0000 UTC | SELECT key from system.local
2d039cd0-214d-11e8-b074-4dc23d7d6be6 | 114673 | 2018-03-06 14:46:38.621 +0000 UTC | select "key","tid","headers","value","_domain" from "globaldomain_T_mathoid__ng_svg"."data" where "key" = ? AND "_domain" = ?
a8a715d0-219b-11e8-be63-15cfc32ca372 | 114189 | 2018-03-07 00:08:26.797 +0000 UTC | select "key","rev","tid","value","content-location","content-type","tags","_domain" from "wikipedia_T_parsoid__ng_html"."data" where "key" = ? AND "rev" = ? AND "_domain" = ?
c73d0d40-21e3-11e8-acda-1996b96227c0 | 110410 | 2018-03-07 08:44:41.876 +0000 UTC | select "key","rev","tid","value","content-location","content-type","tags","_domain" from "wikipedia_T_parsoid__ng_html"."data" where "key" = ? AND "_domain" = ?
453fe600-2161-11e8-b074-4dc23d7d6be6 | 106511 | 2018-03-06 17:10:29.216 +0000 UTC | select "title","page_id","rev","tid","namespace","restrictions","tags","user_id","user_text","timestamp","comment","redirect","page_deleted","_domain" from "enwiki_T_title__revisions3WsaB42Wia1E_eq_KmoYTHe"."data" where "title" = ? AND "_domain" = ?
8493c450-21f9-11e8-acda-1996b96227c0 | 103618 | 2018-03-07 11:20:18.965 +0000 UTC | select "key","rev","tid","value","content-location","content-type","tags","_domain" from "wikipedia_T_parsoid__ng_html"."data" where "key" = ? AND "_domain" = ?

22 matching results (299702 total).
eevans@restbase1007:~$
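(The ./traces tool is a convenience wrapper; the same data should be retrievable directly from Cassandra's system_traces tables, roughly as follows. Durations are in microseconds, and server-side filtering on duration would require ALLOW FILTERING:)

  cqlsh --cqlshrc=/etc/cassandra-a/cqlshrc restbase1007-a.eqiad.wmnet -e \
    "SELECT session_id, duration, started_at, request FROM system_traces.sessions LIMIT 25;"
  cqlsh --cqlshrc=/etc/cassandra-a/cqlshrc restbase1007-a.eqiad.wmnet -e \
    "SELECT source, source_elapsed, thread, activity FROM system_traces.events WHERE session_id = fa6d0170-219d-11e8-a28a-1d583f325411;"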

A look at that query with a 3-second(!) latency above (session ID fa6d0170-219d-11e8-a28a-1d583f325411):

eevans@restbase1007:~$ sudo ./traces --cqlshrc=/etc/cassandra-a/cqlshrc --hostname restbase1007-a.eqiad.wmnet events --id fa6d0170-219d-11e8-a28a-1d583f325411
 1 |2018-03-07 00:25:02.983 +0000 UTC | restbase1012-a.eqiad.wmnet | 94 | Native-Transport-Requests-10 | Parsing SELECT key from system.local
 2 |2018-03-07 00:25:02.984 +0000 UTC | restbase1012-a.eqiad.wmnet | 160 | Native-Transport-Requests-10 | Preparing statement
 3 |2018-03-07 00:25:02.984 +0000 UTC | restbase1012-a.eqiad.wmnet | 225 | Native-Transport-Requests-10 | Computing ranges to query
 4 |2018-03-07 00:25:02.984 +0000 UTC | restbase1012-a.eqiad.wmnet | 268 | Native-Transport-Requests-10 | Submitting range requests on 1 ranges with a concurrency of 1 (10.8 rows per range expected)
 5 |2018-03-07 00:25:02.984 +0000 UTC | restbase1012-a.eqiad.wmnet | 302 | Native-Transport-Requests-10 | Submitted 1 concurrent range requests
 6 |2018-03-07 00:25:06.048 +0000 UTC | restbase1012-a.eqiad.wmnet | 3064796 | ReadStage-135 | Executing seq scan across 12 sstables for (min(-9223372036854775808), min(-9223372036854775808))
 7 |2018-03-07 00:25:06.049 +0000 UTC | restbase1012-a.eqiad.wmnet | 3065645 | ReadStage-135 | Read 1 live and 0 tombstone cells
eevans@restbase1007:~$

The jump between events 6 and 7 is where all the time is spent. This would appear to be after the range request was enqueued on the read stage, and before thread ReadStage-135 attempted the actual read. It's not (yet) clear to me what blocked here for 3 seconds.
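If requests really are sitting in the read stage queue, that should show up as pending/blocked tasks in the thread-pool stats while a spike is in progress; a quick check might look like this (on our multi-instance hosts, nodetool needs to target the affected instance):

  # Watch for non-zero Pending/Blocked on ReadStage and Native-Transport-Requests:
  watch -n 5 'nodetool tpstats | egrep "Pool Name|ReadStage|Native-Transport-Requests"'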

Another example:

 1 |2018-03-06 23:46:31.378 +0000 UTC | restbase1015-a.eqiad.wmnet | 8 | MessagingService-Incoming-/10.64.32.178 | READ message received from restbase1008-b.eqiad.wmnet.
 2 |2018-03-06 23:46:31.378 +0000 UTC | restbase1008-b.eqiad.wmnet | 193 | Native-Transport-Requests-4 | reading data from restbase1007-c.eqiad.wmnet.
 3 |2018-03-06 23:46:31.378 +0000 UTC | restbase1007-c.eqiad.wmnet | 8 | MessagingService-Incoming-/10.64.32.178 | READ message received from restbase1008-b.eqiad.wmnet.
 4 |2018-03-06 23:46:31.378 +0000 UTC | restbase1015-a.eqiad.wmnet | 97 | ReadStage-5 | Executing single-partition query on data
 5 |2018-03-06 23:46:31.378 +0000 UTC | restbase1008-b.eqiad.wmnet | 249 | Native-Transport-Requests-4 | reading digest from restbase1015-a.eqiad.wmnet.
 6 |2018-03-06 23:46:31.378 +0000 UTC | restbase1007-c.eqiad.wmnet | 139 | ReadStage-1 | Executing single-partition query on data
 7 |2018-03-06 23:46:31.378 +0000 UTC | restbase1015-a.eqiad.wmnet | 131 | ReadStage-5 | Acquiring sstable references
 8 |2018-03-06 23:46:31.378 +0000 UTC | restbase1008-b.eqiad.wmnet | 285 | MessagingService-Outgoing-restbase1007-c.eqiad.wmnet/10.64.0.232-Small | Sending READ message to restbase1007-c.eqiad.wmnet/10.64.0.232
 9 |2018-03-06 23:46:31.378 +0000 UTC | restbase1007-c.eqiad.wmnet | 175 | ReadStage-1 | Acquiring sstable references
10 |2018-03-06 23:46:31.378 +0000 UTC | restbase1015-a.eqiad.wmnet | 181 | ReadStage-5 | Bloom filter allows skipping sstable 14264
11 |2018-03-06 23:46:31.378 +0000 UTC | restbase1008-b.eqiad.wmnet | 311 | MessagingService-Outgoing-restbase1015-a.eqiad.wmnet/10.64.48.138-Small | Sending READ message to restbase1015-a.eqiad.wmnet/10.64.48.138
12 |2018-03-06 23:46:31.378 +0000 UTC | restbase1007-c.eqiad.wmnet | 224 | ReadStage-1 | Bloom filter allows skipping sstable 32152
13 |2018-03-06 23:46:31.378 +0000 UTC | restbase1015-a.eqiad.wmnet | 200 | ReadStage-5 | Bloom filter allows skipping sstable 14256
14 |2018-03-06 23:46:31.378 +0000 UTC | restbase1007-c.eqiad.wmnet | 238 | ReadStage-1 | Bloom filter allows skipping sstable 32146
15 |2018-03-06 23:46:31.378 +0000 UTC | restbase1007-c.eqiad.wmnet | 248 | ReadStage-1 | Bloom filter allows skipping sstable 32141
16 |2018-03-06 23:46:31.378 +0000 UTC | restbase1007-c.eqiad.wmnet | 258 | ReadStage-1 | Bloom filter allows skipping sstable 32094
17 |2018-03-06 23:46:31.378 +0000 UTC | restbase1007-c.eqiad.wmnet | 269 | ReadStage-1 | Bloom filter allows skipping sstable 31906
18 |2018-03-06 23:46:31.378 +0000 UTC | restbase1007-c.eqiad.wmnet | 279 | ReadStage-1 | Bloom filter allows skipping sstable 31816
19 |2018-03-06 23:46:31.378 +0000 UTC | restbase1007-c.eqiad.wmnet | 289 | ReadStage-1 | Bloom filter allows skipping sstable 31639
20 |2018-03-06 23:46:31.378 +0000 UTC | restbase1007-c.eqiad.wmnet | 298 | ReadStage-1 | Bloom filter allows skipping sstable 29499
21 |2018-03-06 23:46:31.378 +0000 UTC | restbase1007-c.eqiad.wmnet | 308 | ReadStage-1 | Bloom filter allows skipping sstable 27327
22 |2018-03-06 23:46:31.378 +0000 UTC | restbase1007-c.eqiad.wmnet | 326 | ReadStage-1 | Bloom filter allows skipping sstable 25489
23 |2018-03-06 23:46:31.378 +0000 UTC | restbase1007-c.eqiad.wmnet | 335 | ReadStage-1 | Bloom filter allows skipping sstable 21976
24 |2018-03-06 23:46:31.378 +0000 UTC | restbase1007-c.eqiad.wmnet | 344 | ReadStage-1 | Skipped 0/12 non-slice-intersecting sstables, included 0 due to tombstones
25 |2018-03-06 23:46:31.378 +0000 UTC | restbase1007-c.eqiad.wmnet | 415 | ReadStage-1 | Partition index with 0 entries found for sstable 21973
26 |2018-03-06 23:46:31.379 +0000 UTC | restbase1015-a.eqiad.wmnet | 221 | ReadStage-5 | Bloom filter allows skipping sstable 14251
27 |2018-03-06 23:46:31.379 +0000 UTC | restbase1008-b.eqiad.wmnet | 1717 | MessagingService-Incoming-/10.64.0.223 | REQUEST_RESPONSE message received from restbase1007-c.eqiad.wmnet.
28 |2018-03-06 23:46:31.379 +0000 UTC | restbase1007-c.eqiad.wmnet | 800 | ReadStage-1 | Merged data from memtables and 1 sstables
29 |2018-03-06 23:46:31.379 +0000 UTC | restbase1015-a.eqiad.wmnet | 240 | ReadStage-5 | Bloom filter allows skipping sstable 14200
30 |2018-03-06 23:46:31.379 +0000 UTC | restbase1008-b.eqiad.wmnet | 1769 | RequestResponseStage-2 | Processing response from restbase1007-c.eqiad.wmnet.
31 |2018-03-06 23:46:31.379 +0000 UTC | restbase1007-c.eqiad.wmnet | 872 | ReadStage-1 | Read 1 live and 0 tombstone cells
32 |2018-03-06 23:46:31.379 +0000 UTC | restbase1015-a.eqiad.wmnet | 265 | ReadStage-5 | Bloom filter allows skipping sstable 14102
33 |2018-03-06 23:46:31.379 +0000 UTC | restbase1007-c.eqiad.wmnet | 895 | ReadStage-1 | Enqueuing response to restbase1008-b.eqiad.wmnet.
34 |2018-03-06 23:46:31.379 +0000 UTC | restbase1015-a.eqiad.wmnet | 277 | ReadStage-5 | Bloom filter allows skipping sstable 13955
35 |2018-03-06 23:46:31.379 +0000 UTC | restbase1007-c.eqiad.wmnet | 961 | MessagingService-Outgoing-restbase1008-b.eqiad.wmnet/10.64.32.195-Small | Sending REQUEST_RESPONSE message to restbase1008-b.eqiad.wmnet/10.64.32.195
36 |2018-03-06 23:46:31.379 +0000 UTC | restbase1015-a.eqiad.wmnet | 288 | ReadStage-5 | Bloom filter allows skipping sstable 13806
37 |2018-03-06 23:46:31.379 +0000 UTC | restbase1015-a.eqiad.wmnet | 300 | ReadStage-5 | Bloom filter allows skipping sstable 13405
38 |2018-03-06 23:46:31.379 +0000 UTC | restbase1015-a.eqiad.wmnet | 311 | ReadStage-5 | Bloom filter allows skipping sstable 13209
39 |2018-03-06 23:46:31.379 +0000 UTC | restbase1015-a.eqiad.wmnet | 327 | ReadStage-5 | Bloom filter allows skipping sstable 11663
40 |2018-03-06 23:46:31.379 +0000 UTC | restbase1015-a.eqiad.wmnet | 338 | ReadStage-5 | Bloom filter allows skipping sstable 11583
41 |2018-03-06 23:46:31.379 +0000 UTC | restbase1015-a.eqiad.wmnet | 349 | ReadStage-5 | Bloom filter allows skipping sstable 9394
42 |2018-03-06 23:46:31.379 +0000 UTC | restbase1015-a.eqiad.wmnet | 361 | ReadStage-5 | Bloom filter allows skipping sstable 9317
43 |2018-03-06 23:46:31.379 +0000 UTC | restbase1015-a.eqiad.wmnet | 422 | ReadStage-5 | Partition index with 0 entries found for sstable 7451
44 |2018-03-06 23:46:31.599 +0000 UTC | restbase1015-a.eqiad.wmnet | 220430 | ReadStage-5 | Bloom filter allows skipping sstable 3260
45 |2018-03-06 23:46:31.599 +0000 UTC | restbase1008-b.eqiad.wmnet | 221888 | MessagingService-Incoming-/10.64.48.134 | REQUEST_RESPONSE message received from restbase1015-a.eqiad.wmnet.
46 |2018-03-06 23:46:31.599 +0000 UTC | restbase1015-a.eqiad.wmnet | 220499 | ReadStage-5 | Bloom filter allows skipping sstable 3222
47 |2018-03-06 23:46:31.599 +0000 UTC | restbase1008-b.eqiad.wmnet | 221973 | RequestResponseStage-2 | Processing response from restbase1015-a.eqiad.wmnet.
48 |2018-03-06 23:46:31.599 +0000 UTC | restbase1015-a.eqiad.wmnet | 220706 | ReadStage-5 | Skipped 0/16 non-slice-intersecting sstables, included 0 due to tombstones
49 |2018-03-06 23:46:31.599 +0000 UTC | restbase1015-a.eqiad.wmnet | 220829 | ReadStage-5 | Merged data from memtables and 1 sstables
50 |2018-03-06 23:46:31.599 +0000 UTC | restbase1015-a.eqiad.wmnet | 220874 | ReadStage-5 | Read 1 live and 0 tombstone cells
51 |2018-03-06 23:46:31.599 +0000 UTC | restbase1015-a.eqiad.wmnet | 220890 | ReadStage-5 | Read 1 live and 0 tombstone cells
52 |2018-03-06 23:46:31.599 +0000 UTC | restbase1015-a.eqiad.wmnet | 220905 | ReadStage-5 | Enqueuing response to restbase1008-b.eqiad.wmnet.
53 |2018-03-06 23:46:31.599 +0000 UTC | restbase1015-a.eqiad.wmnet | 220986 | MessagingService-Outgoing-restbase1008-b.eqiad.wmnet/10.64.32.195-Small | Sending REQUEST_RESPONSE message to restbase1008-b.eqiad.wmnet/10.64.32.195
eevans@restbase1007:~$

~200ms accumulates on restbase1015-a between events 43 & 44. The implication would be that the operation at event 44 was slow, but it's hard to imagine that being the case for a bloom filter lookup (these are simple, immutable structures kept on-heap). I've consulted the GC logs for both this and the previous example, but found no STW pauses.
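(For completeness, the GC log check was roughly as follows; this sketch assumes -XX:+PrintGCApplicationStoppedTime is enabled, that the log path shown is the right one for the instance, and that the pause duration is the 11th whitespace-separated field in our log format:)

  # Print any safepoint/STW records longer than 200ms:
  grep -h 'Total time for which application threads were stopped' /var/log/cassandra/gc.log* | awk '$11 > 0.2'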

Mentioned in SAL (#wikimedia-operations) [2018-03-08T20:34:46Z] <urandom> set compression chunk length to 32, page_summary tables - T189057

Mentioned in SAL (#wikimedia-operations) [2018-03-08T20:40:53Z] <urandom> set compression chunk length to 32, mobile tables - T189057

Mentioned in SAL (#wikimedia-operations) [2018-03-08T21:41:10Z] <urandom> set compression chunk length to 32, parsoid tables (group "others") - T189057

Mentioned in SAL (#wikimedia-operations) [2018-03-08T22:31:55Z] <urandom> set compression chunk length to 32, parsoid tables (group "commons") - T189057

Mentioned in SAL (#wikimedia-operations) [2018-03-08T23:10:13Z] <urandom> set compression chunk length to 32, parsoid tables (group "wikipedia") - T189057

Mentioned in SAL (#wikimedia-operations) [2018-03-09T00:03:00Z] <urandom> set compression chunk length to 32, parsoid tables (group "enwiki") - T189057

Eevans renamed this task from "Understand (and if possible, improve) performance of new storage strategy" to "Understand (and if possible, improve) cluster performance under the new storage strategy". May 10 2018, 2:37 PM
mobrovac subscribed.

In the meantime, we have received new machines without Samsung SSDs. We are also moving off the new storage strategy (cf. T220461: Simplify storage semantics), so this task is no longer relevant.