Understand (and if possible, improve) cluster performance under the new storage strategy
Closed, ResolvedPublic
Actions

Assigned To

Authored By

	Eevans
	Mar 6 2018, 10:49 PM

Description

The RESTBase storage strategy has been extensively redesigned, and is now running in production on Cassandra 3.11.0. Production experience with both this new strategy, and Cassandra 3.11 are limited, and expectations are uninformed. In general, the performance of queries that are end-user facing seem quite good, and the distribution of latencies has narrowed immensely.

Screenshot-2018-3-6 Grafana - RESTBase.png (782×1 px, 111 KB)

But 99p Cassandra-local latencies can be apparently quite high and erratic.

Screenshot-2018-3-6 Grafana - Cassandra Tables.png (773×1 px, 283 KB)

Obviously, the important metrics, the ones we're ultimately judged by, are the ones users are exposed to, but it would benefit us to understand this better, and if possible, identify optimizations to improve latency and overall capacity.

This ticket will track efforts to understand the current performance of the system, and identify and implement improvements.

Some Recommendations

T186562: Reimage JBO-RAID0 configured RESTBase HP machines
Consider disabling read_repair_chance
Considering tuning (or disabling) speculative retry
Consider decreasing compression chunk lengths

Re-visiting Hardware

As part of the work done in T178177: Investigate aberrant Cassandra columnfamily read latency of restbase101{0,2,4}, we discovered that HP machines configured as a JBOD of single-disk RAID-0s performed worse than if configured as regular HBA attached disks. There may however be more to it.

To summarize, we currently have examples of:

Dells w/ Samsung SSD 850 devices
Dells w/ Intel SSDSC2BX01 devices
HPs w/ Samsung SSD 850 devices configured as single-disk RAID-0s
HPs w/ Samsung SSD 850 devices (directly attached disks)
Exactly 1 HP with 4 direct attached HP LK1600GEYMV devices

To complicate comparisons, the cluster is composed of two data-centers, each of which has a distinctly different workload, and not every combination listed above exists in both data-centers. However, see the following graphs:

eqiad

Screenshot-2018-3-9 Grafana - Cassandra System.png (485×1 px, 285 KB)

The above graph is of 1007, 1010, 1011, 1016 which represents configurations #1, 3, 4, and 2 respectively. Here you can see that the Dells (1007 and 1016) perform better than any of the HPs, but it is also clear that the Intel-equipped machine has significantly lower iowait than the Samsung equipped one.

codfw

Screenshot-2018-3-9 Grafana - Cassandra System(1).png (483×1 px, 286 KB)

The above graph is of 2001, 2008, 2009, and 2010 which represents configurations #3, 4, 5, 2 respectively. What is interesting here is that the Samsung-equipped HPs (both single-disk RAID-0 and HBA configured) are characteristically bad, but the lone HP with the HP devices has comparable iowait to the Intel-equipped Dell.

We do not have any HPs equipped with the Intel disks, and it would be interesting to see how that combination compared; The restbase2009 configuration would seem to indicate that perhaps it's a mistake to correlate the poor(er) performance with the smartarray equipped HP machines.

Related Objects
Search...

Status	Assigned	Task
Stalled	None	T324931 Clean up open RESTBase related tickets
In Progress	None	T262315 <CORE TECHNOLOGY> API Migration & RESTbase Sunset
Resolved	DAlangi_WMF	T324678 Migrate proton (chromium-render) away from restbase
Open	None	T167603 Any Chinese Wiki's projects about "Download as PDF" can not auto change to Simplified Chinese or Traditional Chinese
Resolved	ovasileva	T147553 [EPIC] Page previews broken on many projects
Open	ovasileva	T244262 [Epic] Enable page previews and reference previews as a beta feature on all projects
Open	None	T111231 Page previews for Wikidata
Invalid	None	T148854 Use RESTBase for zhwiki
Resolved	• Pchelolo	T188164 Popups don‘t support language variant conversion and {{lang}} template
Resolved	• mobrovac	T190689 FY17/18 Q4 Program 7 Services Goal: Language variants support
Resolved	Eevans	T183745 FY17/18 Q3 Program 7 Services Goal: Full migration to Cassandra 3
Resolved	Eevans	T189057 Understand (and if possible, improve) cluster performance under the new storage strategy
Resolved	fgiunchedi	T186562 Reimage JBO-RAID0 configured RESTBase HP machines
Resolved	Eevans	T189822 Replace 5 Samsung SSD 850 devices w/ 4 1.6T Intel or HP SSDs
		Unknown Object (Task)

Event Timeline

Eevans created this task.Mar 6 2018, 10:49 PM

Restricted Application added a subscriber: Aklapper. · View Herald TranscriptMar 6 2018, 10:49 PM

Eevans triaged this task as Medium priority.Mar 6 2018, 10:51 PM

Eevans updated the task description. (Show Details)

Eevans added projects: Services, User-Eevans, Cassandra, RESTBase-Cassandra.

Eevans added a subtask: T186562: Reimage JBO-RAID0 configured RESTBase HP machines.

Eevans added a parent task: T183745: FY17/18 Q3 Program 7 Services Goal: Full migration to Cassandra 3.Mar 6 2018, 10:53 PM

Mentioned in SAL (#wikimedia-operations) [2018-03-07T14:01:49Z] <urandom> setting trace probability to 0.0, restbase eqiad cassandra cluster - T189057

• mobrovac edited projects, added Services (doing); removed Services.Mar 7 2018, 2:04 PM

Mentioned in SAL (#wikimedia-operations) [2018-03-07T14:17:33Z] <urandom> reducing compression chunk length to 32kb on "wikipedia_T_page__summary".data - T189057

Some preliminary information from query trace examination:

There doesn't appear to be any particular patterns with respect to the queries themselves, when looking at the slowest traces. Even the "heartbeat" made by the NodeJS driver (SELECT key from system.local) occasionally appears in this list (given our query rate and number of open connections, the rate of these is high, even if the cost is quite low).

P6809 sessions

1	eevans@restbase1007:~$ sudo ./traces --cqlshrc=/etc/cassandra-a/cqlshrc --hostname restbase1007-a.eqiad.wmnet sessions --min-duration 100000
2	fa6d0170-219d-11e8-a28a-1d583f325411 \| 3067970 \| 2018-03-07 00:25:02.983 +0000 UTC \| SELECT key from system.local
3	ef3a6f30-214e-11e8-baae-33e0e0064f9e \| 944219 \| 2018-03-06 14:59:13.955 +0000 UTC \| SELECT key from system.local
4	45e329e0-216c-11e8-b786-237a199d4b45 \| 607700 \| 2018-03-06 18:29:14.75 +0000 UTC \| select "key","tid","headers","value","_domain" from "globaldomain_T_mathoid__ng_check"."data" where "key" = ? AND "_domain" = ?
5	da1f0550-214b-11e8-b786-237a199d4b45 \| 327023 \| 2018-03-06 14:37:10.053 +0000 UTC \| select "key","rev","tid","value","content-location","content-type","tags","_domain" from "wikipedia_T_mobile__ng_lead"."data" where "key" = ? AND "_domain" = ?
6	146dee60-214c-11e8-acda-1996b96227c0 \| 237093 \| 2018-03-06 14:38:47.878 +0000 UTC \| select "key","tid","headers","value","_domain" from "enwiki_T_page__summary"."data" where "key" = ? AND "_domain" = ?
7	9899ec10-2198-11e8-b3f3-416aaf7799e3 \| 222255 \| 2018-03-06 23:46:31.377 +0000 UTC \| select "key","rev","tid","value","content-location","content-type","tags","_domain" from "wikipedia_T_parsoid__ng_html"."data" where "key" = ? AND "_domain" = ?
8	f28ca470-216f-11e8-bb39-ff6135eb5e37 \| 187005 \| 2018-03-06 18:55:32.919 +0000 UTC \| select "key","rev","tid","value","content-location","content-type","tags","_domain" from "wikipedia_T_mobile__ng_lead"."data" where "key" = ? AND "_domain" = ?
9	8c875e30-2206-11e8-baae-33e0e0064f9e \| 178413 \| 2018-03-07 12:53:35.763 +0000 UTC \| SELECT key from system.local
10	efee7650-21e5-11e8-b3f3-416aaf7799e3 \| 168022 \| 2018-03-07 09:00:09.141 +0000 UTC \| select "key","rev","tid","value","content-location","content-type","tags","_domain" from "others_T_parsoid__ng_html"."data" where "key" = ? AND "rev" = ? AND "_domain" = ?
11	839d3ee0-21c3-11e8-a28a-1d583f325411 \| 154539 \| 2018-03-07 04:53:44.526 +0000 UTC \| select "key","rev","tid","value","content-location","content-type","tags","_domain" from "enwiki_T_parsoid__ng_html"."data" where "key" = ? AND "_domain" = ?
12	b4d984c0-214e-11e8-b433-817606c1cc5f \| 153646 \| 2018-03-06 14:57:36.012 +0000 UTC \| select "key","rev","tid","value","content-location","content-type","tags","_domain" from "wikipedia_T_parsoid__ng_html"."data" where "key" = ? AND "_domain" = ?
13	4c5f4de0-2198-11e8-bb39-ff6135eb5e37 \| 143108 \| 2018-03-06 23:44:23.486 +0000 UTC \| select "key","rev","tid","value","content-location","content-type","tags","_domain" from "wikipedia_T_parsoid__ng_html"."data" where "key" = ? AND "rev" = ? AND "_domain" = ?
14	e682d620-214e-11e8-853c-bdd6580297de \| 139927 \| 2018-03-06 14:58:59.33 +0000 UTC \| select "key","rev","tid","value","content-location","content-type","tags","_domain" from "wikipedia_T_parsoid__ng_html"."data" where "key" = ? AND "_domain" = ?
15	99c2e970-219d-11e8-a4ec-01f40346f0c7 \| 137831 \| 2018-03-07 00:22:20.807 +0000 UTC \| select "key","rev","tid","value","content-location","content-type","tags","_domain" from "wikipedia_T_parsoid__ng_html"."data" where "key" = ? AND "rev" = ? AND "_domain" = ?
16	77f32120-21c0-11e8-a9bc-6f6f0f99c632 \| 137234 \| 2018-03-07 04:31:56.466 +0000 UTC \| select "key","rev","tid","value","content-location","content-type","tags","_domain" from "wikipedia_T_parsoid__ng_html"."data" where "key" = ? AND "rev" = ? AND "_domain" = ?
17	9bbe2100-214c-11e8-b30a-a9cd75a51cff \| 123483 \| 2018-03-06 14:42:34.896 +0000 UTC \| select "key","rev","tid","value","content-location","content-type","tags","_domain" from "wikipedia_T_parsoid__ng_html"."data" where "key" = ? AND "_domain" = ?
18	fd6fe0d0-2153-11e8-853c-bdd6580297de \| 115766 \| 2018-03-06 15:35:25.277 +0000 UTC \| SELECT key from system.local
19	2d039cd0-214d-11e8-b074-4dc23d7d6be6 \| 114673 \| 2018-03-06 14:46:38.621 +0000 UTC \| select "key","tid","headers","value","_domain" from "globaldomain_T_mathoid__ng_svg"."data" where "key" = ? AND "_domain" = ?
20	a8a715d0-219b-11e8-be63-15cfc32ca372 \| 114189 \| 2018-03-07 00:08:26.797 +0000 UTC \| select "key","rev","tid","value","content-location","content-type","tags","_domain" from "wikipedia_T_parsoid__ng_html"."data" where "key" = ? AND "rev" = ? AND "_domain" = ?
21	c73d0d40-21e3-11e8-acda-1996b96227c0 \| 110410 \| 2018-03-07 08:44:41.876 +0000 UTC \| select "key","rev","tid","value","content-location","content-type","tags","_domain" from "wikipedia_T_parsoid__ng_html"."data" where "key" = ? AND "_domain" = ?
22	453fe600-2161-11e8-b074-4dc23d7d6be6 \| 106511 \| 2018-03-06 17:10:29.216 +0000 UTC \| select "title","page_id","rev","tid","namespace","restrictions","tags","user_id","user_text","timestamp","comment","redirect","page_deleted","_domain" from "enwiki_T_title__revisions3WsaB42Wia1E_eq_KmoYTHe"."data" where "title" = ? AND "_domain" = ?
23	8493c450-21f9-11e8-acda-1996b96227c0 \| 103618 \| 2018-03-07 11:20:18.965 +0000 UTC \| select "key","rev","tid","value","content-location","content-type","tags","_domain" from "wikipedia_T_parsoid__ng_html"."data" where "key" = ? AND "_domain" = ?
24
25	22 matching results (299702 total).
26	eevans@restbase1007:~$

A look at that query with a 3 second(!) latency above (session ID fa6d0170-219d-11e8-a28a-1d583f325411)

P6810 fa6d0170-219d-11e8-a28a-1d583f325411

1	eevans@restbase1007:~$ sudo ./traces --cqlshrc=/etc/cassandra-a/cqlshrc --hostname restbase1007-a.eqiad.wmnet events --id fa6d0170-219d-11e8-a28a-1d583f325411
2	1 \|2018-03-07 00:25:02.983 +0000 UTC \| restbase1012-a.eqiad.wmnet \| 94 \| Native-Transport-Requests-10 \| Parsing SELECT key from system.local
3	2 \|2018-03-07 00:25:02.984 +0000 UTC \| restbase1012-a.eqiad.wmnet \| 160 \| Native-Transport-Requests-10 \| Preparing statement
4	3 \|2018-03-07 00:25:02.984 +0000 UTC \| restbase1012-a.eqiad.wmnet \| 225 \| Native-Transport-Requests-10 \| Computing ranges to query
5	4 \|2018-03-07 00:25:02.984 +0000 UTC \| restbase1012-a.eqiad.wmnet \| 268 \| Native-Transport-Requests-10 \| Submitting range requests on 1 ranges with a concurrency of 1 (10.8 rows per range expected)
6	5 \|2018-03-07 00:25:02.984 +0000 UTC \| restbase1012-a.eqiad.wmnet \| 302 \| Native-Transport-Requests-10 \| Submitted 1 concurrent range requests
7	6 \|2018-03-07 00:25:06.048 +0000 UTC \| restbase1012-a.eqiad.wmnet \| 3064796 \| ReadStage-135 \| Executing seq scan across 12 sstables for (min(-9223372036854775808), min(-9223372036854775808))
8	7 \|2018-03-07 00:25:06.049 +0000 UTC \| restbase1012-a.eqiad.wmnet \| 3065645 \| ReadStage-135 \| Read 1 live and 0 tombstone cells
9	eevans@restbase1007:~$

The jump between 6 and 7 is where all the time is spent. This would appear to be after the range request was enqueued on the READ stage, and before thread ReadStage-135 attempts the actual read. It's not (yet) clear to me what it was that blocked here for 3 seconds.

Another example:

P6817 session ID: 9899ec10-2198-11e8-b3f3-416aaf7799e3

2 2 |2018-03-06 23:46:31.378 +0000 UTC 3 3 |2018-03-06 23:46:31.378 +0000 UTC 4 4 |2018-03-06 23:46:31.378 +0000 UTC 5 5 |2018-03-06 23:46:31.378 +0000 UTC 6 6 |2018-03-06 23:46:31.378 +0000 UTC 7 7 |2018-03-06 23:46:31.378 +0000 UTC 8 8 |2018-03-06 23:46:31.378 +0000 UTC 9 9 |2018-03-06 23:46:31.378 +0000 UTC 1010 |2018-03-06 23:46:31.378 +0000 UTC 1111 |2018-03-06 23:46:31.378 +0000 UTC 1212 |2018-03-06 23:46:31.378 +0000 UTC 1313 |2018-03-06 23:46:31.378 +0000 UTC 1414 |2018-03-06 23:46:31.378 +0000 UTC 1515 |2018-03-06 23:46:31.378 +0000 UTC 1616 |2018-03-06 23:46:31.378 +0000 UTC 1717 |2018-03-06 23:46:31.378 +0000 UTC 1818 |2018-03-06 23:46:31.378 +0000 UTC 1919 |2018-03-06 23:46:31.378 +0000 UTC 2020 |2018-03-06 23:46:31.378 +0000 UTC 2121 |2018-03-06 23:46:31.378 +0000 UTC 2222 |2018-03-06 23:46:31.378 +0000 UTC 2323 |2018-03-06 23:46:31.378 +0000 UTC 2424 |2018-03-06 23:46:31.378 +0000 UTC 2525 |2018-03-06 23:46:31.378 +0000 UTC 2626 |2018-03-06 23:46:31.379 +0000 UTC 2727 |2018-03-06 23:46:31.379 +0000 UTC 2828 |2018-03-06 23:46:31.379 +0000 UTC 2929 |2018-03-06 23:46:31.379 +0000 UTC 3030 |2018-03-06 23:46:31.379 +0000 UTC 3131 |2018-03-06 23:46:31.379 +0000 UTC 3232 |2018-03-06 23:46:31.379 +0000 UTC 3333 |2018-03-06 23:46:31.379 +0000 UTC 3434 |2018-03-06 23:46:31.379 +0000 UTC 3535 |2018-03-06 23:46:31.379 +0000 UTC 3636 |2018-03-06 23:46:31.379 +0000 UTC 3737 |2018-03-06 23:46:31.379 +0000 UTC 3838 |2018-03-06 23:46:31.379 +0000 UTC 3939 |2018-03-06 23:46:31.379 +0000 UTC 4040 |2018-03-06 23:46:31.379 +0000 UTC 4141 |2018-03-06 23:46:31.379 +0000 UTC 4242 |2018-03-06 23:46:31.379 +0000 UTC 4343 |2018-03-06 23:46:31.379 +0000 UTC 4444 |2018-03-06 23:46:31.599 +0000 UTC 4545 |2018-03-06 23:46:31.599 +0000 UTC 4646 |2018-03-06 23:46:31.599 +0000 UTC 4747 |2018-03-06 23:46:31.599 +0000 UTC 4848 |2018-03-06 23:46:31.599 +0000 UTC 4949 |2018-03-06 23:46:31.599 +0000 UTC 5050 |2018-03-06 23:46:31.599 +0000 UTC 5151 |2018-03-06 23:46:31.599 +0000 UTC 5252 |2018-03-06 23:46:31.599 +0000 UTC 5353 |2018-03-06 23:46:31.599 +0000 UTC 54eevans@restbase1007:~$

~200ms accumulates on restbase1015-a between lines 43 & 44. The implication there would be that the operation at line 44 was slow, but it's hard to imagine this being the case with a bloom filter lookup (these are simple, immutable structures kept on-heap). I've consulted the GC logs for both this and the previous example, but find no STW pauses.

Mentioned in SAL (#wikimedia-operations) [2018-03-08T20:34:46Z] <urandom> set compression chunk length to 32, page_summary tables - T189057

Mentioned in SAL (#wikimedia-operations) [2018-03-08T20:40:53Z] <urandom> set compression chunk length to 32, mobile tables - T189057

Mentioned in SAL (#wikimedia-operations) [2018-03-08T21:41:10Z] <urandom> set compression chunk length to 32, parsoid tables (group "others") - T189057

Mentioned in SAL (#wikimedia-operations) [2018-03-08T22:31:55Z] <urandom> set compression chunk length to 32, parsoid tables (group "commons") - T189057

Mentioned in SAL (#wikimedia-operations) [2018-03-08T23:10:13Z] <urandom> set compression chunk length to 32, parsoid tables (group "wikipedia") - T189057

Mentioned in SAL (#wikimedia-operations) [2018-03-09T00:03:00Z] <urandom> set compression chunk length to 32, parsoid tables (group "enwiki") - T189057

Eevans updated the task description. (Show Details)Mar 9 2018, 9:56 PM

Eevans mentioned this in T186562: Reimage JBO-RAID0 configured RESTBase HP machines.Mar 13 2018, 6:54 PM

Eevans mentioned this in T189822: Replace 5 Samsung SSD 850 devices w/ 4 1.6T Intel or HP SSDs.Mar 15 2018, 9:52 PM

• mobrovac added a parent task: T190689: FY17/18 Q4 Program 7 Services Goal: Language variants support.Apr 23 2018, 12:25 PM

Eevans closed subtask T189822: Replace 5 Samsung SSD 850 devices w/ 4 1.6T Intel or HP SSDs as Resolved.May 10 2018, 2:34 PM

Eevans renamed this task from Understand (and if possible, improve) performance of new storage strategy to Understand (and if possible, improve) cluster performance under the new storage strategy.May 10 2018, 2:37 PM

• mobrovac added a project: Platform Team Workboards (Doing).Dec 20 2018, 12:43 PM

fgiunchedi closed subtask T186562: Reimage JBO-RAID0 configured RESTBase HP machines as Resolved.Jan 18 2019, 1:41 PM

In the meantime, we have gotten new machines w/o Samsung SSDs. We are also moving off of the new storage strategy (cf. T220461: Simplify storage semantics), so this task is no longer relevant.

1	eevans@restbase1007:~$ sudo ./traces --cqlshrc=/etc/cassandra-a/cqlshrc --hostname restbase1007-a.eqiad.wmnet sessions --min-duration 100000
2	fa6d0170-219d-11e8-a28a-1d583f325411 \| 3067970 \| 2018-03-07 00:25:02.983 +0000 UTC \| SELECT key from system.local
3	ef3a6f30-214e-11e8-baae-33e0e0064f9e \| 944219 \| 2018-03-06 14:59:13.955 +0000 UTC \| SELECT key from system.local
4	45e329e0-216c-11e8-b786-237a199d4b45 \| 607700 \| 2018-03-06 18:29:14.75 +0000 UTC \| select "key","tid","headers","value","_domain" from "globaldomain_T_mathoid__ng_check"."data" where "key" = ? AND "_domain" = ?
5	da1f0550-214b-11e8-b786-237a199d4b45 \| 327023 \| 2018-03-06 14:37:10.053 +0000 UTC \| select "key","rev","tid","value","content-location","content-type","tags","_domain" from "wikipedia_T_mobile__ng_lead"."data" where "key" = ? AND "_domain" = ?
6	146dee60-214c-11e8-acda-1996b96227c0 \| 237093 \| 2018-03-06 14:38:47.878 +0000 UTC \| select "key","tid","headers","value","_domain" from "enwiki_T_page__summary"."data" where "key" = ? AND "_domain" = ?
7	9899ec10-2198-11e8-b3f3-416aaf7799e3 \| 222255 \| 2018-03-06 23:46:31.377 +0000 UTC \| select "key","rev","tid","value","content-location","content-type","tags","_domain" from "wikipedia_T_parsoid__ng_html"."data" where "key" = ? AND "_domain" = ?
8	f28ca470-216f-11e8-bb39-ff6135eb5e37 \| 187005 \| 2018-03-06 18:55:32.919 +0000 UTC \| select "key","rev","tid","value","content-location","content-type","tags","_domain" from "wikipedia_T_mobile__ng_lead"."data" where "key" = ? AND "_domain" = ?
9	8c875e30-2206-11e8-baae-33e0e0064f9e \| 178413 \| 2018-03-07 12:53:35.763 +0000 UTC \| SELECT key from system.local
10	efee7650-21e5-11e8-b3f3-416aaf7799e3 \| 168022 \| 2018-03-07 09:00:09.141 +0000 UTC \| select "key","rev","tid","value","content-location","content-type","tags","_domain" from "others_T_parsoid__ng_html"."data" where "key" = ? AND "rev" = ? AND "_domain" = ?
11	839d3ee0-21c3-11e8-a28a-1d583f325411 \| 154539 \| 2018-03-07 04:53:44.526 +0000 UTC \| select "key","rev","tid","value","content-location","content-type","tags","_domain" from "enwiki_T_parsoid__ng_html"."data" where "key" = ? AND "_domain" = ?
12	b4d984c0-214e-11e8-b433-817606c1cc5f \| 153646 \| 2018-03-06 14:57:36.012 +0000 UTC \| select "key","rev","tid","value","content-location","content-type","tags","_domain" from "wikipedia_T_parsoid__ng_html"."data" where "key" = ? AND "_domain" = ?
13	4c5f4de0-2198-11e8-bb39-ff6135eb5e37 \| 143108 \| 2018-03-06 23:44:23.486 +0000 UTC \| select "key","rev","tid","value","content-location","content-type","tags","_domain" from "wikipedia_T_parsoid__ng_html"."data" where "key" = ? AND "rev" = ? AND "_domain" = ?
14	e682d620-214e-11e8-853c-bdd6580297de \| 139927 \| 2018-03-06 14:58:59.33 +0000 UTC \| select "key","rev","tid","value","content-location","content-type","tags","_domain" from "wikipedia_T_parsoid__ng_html"."data" where "key" = ? AND "_domain" = ?
15	99c2e970-219d-11e8-a4ec-01f40346f0c7 \| 137831 \| 2018-03-07 00:22:20.807 +0000 UTC \| select "key","rev","tid","value","content-location","content-type","tags","_domain" from "wikipedia_T_parsoid__ng_html"."data" where "key" = ? AND "rev" = ? AND "_domain" = ?
16	77f32120-21c0-11e8-a9bc-6f6f0f99c632 \| 137234 \| 2018-03-07 04:31:56.466 +0000 UTC \| select "key","rev","tid","value","content-location","content-type","tags","_domain" from "wikipedia_T_parsoid__ng_html"."data" where "key" = ? AND "rev" = ? AND "_domain" = ?
17	9bbe2100-214c-11e8-b30a-a9cd75a51cff \| 123483 \| 2018-03-06 14:42:34.896 +0000 UTC \| select "key","rev","tid","value","content-location","content-type","tags","_domain" from "wikipedia_T_parsoid__ng_html"."data" where "key" = ? AND "_domain" = ?
18	fd6fe0d0-2153-11e8-853c-bdd6580297de \| 115766 \| 2018-03-06 15:35:25.277 +0000 UTC \| SELECT key from system.local
19	2d039cd0-214d-11e8-b074-4dc23d7d6be6 \| 114673 \| 2018-03-06 14:46:38.621 +0000 UTC \| select "key","tid","headers","value","_domain" from "globaldomain_T_mathoid__ng_svg"."data" where "key" = ? AND "_domain" = ?
20	a8a715d0-219b-11e8-be63-15cfc32ca372 \| 114189 \| 2018-03-07 00:08:26.797 +0000 UTC \| select "key","rev","tid","value","content-location","content-type","tags","_domain" from "wikipedia_T_parsoid__ng_html"."data" where "key" = ? AND "rev" = ? AND "_domain" = ?
21	c73d0d40-21e3-11e8-acda-1996b96227c0 \| 110410 \| 2018-03-07 08:44:41.876 +0000 UTC \| select "key","rev","tid","value","content-location","content-type","tags","_domain" from "wikipedia_T_parsoid__ng_html"."data" where "key" = ? AND "_domain" = ?
22	453fe600-2161-11e8-b074-4dc23d7d6be6 \| 106511 \| 2018-03-06 17:10:29.216 +0000 UTC \| select "title","page_id","rev","tid","namespace","restrictions","tags","user_id","user_text","timestamp","comment","redirect","page_deleted","_domain" from "enwiki_T_title__revisions3WsaB42Wia1E_eq_KmoYTHe"."data" where "title" = ? AND "_domain" = ?
23	8493c450-21f9-11e8-acda-1996b96227c0 \| 103618 \| 2018-03-07 11:20:18.965 +0000 UTC \| select "key","rev","tid","value","content-location","content-type","tags","_domain" from "wikipedia_T_parsoid__ng_html"."data" where "key" = ? AND "_domain" = ?
24
25	22 matching results (299702 total).
26	eevans@restbase1007:~$

1	eevans@restbase1007:~$ sudo ./traces --cqlshrc=/etc/cassandra-a/cqlshrc --hostname restbase1007-a.eqiad.wmnet events --id fa6d0170-219d-11e8-a28a-1d583f325411
2	1 \|2018-03-07 00:25:02.983 +0000 UTC \| restbase1012-a.eqiad.wmnet \| 94 \| Native-Transport-Requests-10 \| Parsing SELECT key from system.local
3	2 \|2018-03-07 00:25:02.984 +0000 UTC \| restbase1012-a.eqiad.wmnet \| 160 \| Native-Transport-Requests-10 \| Preparing statement
4	3 \|2018-03-07 00:25:02.984 +0000 UTC \| restbase1012-a.eqiad.wmnet \| 225 \| Native-Transport-Requests-10 \| Computing ranges to query
5	4 \|2018-03-07 00:25:02.984 +0000 UTC \| restbase1012-a.eqiad.wmnet \| 268 \| Native-Transport-Requests-10 \| Submitting range requests on 1 ranges with a concurrency of 1 (10.8 rows per range expected)
6	5 \|2018-03-07 00:25:02.984 +0000 UTC \| restbase1012-a.eqiad.wmnet \| 302 \| Native-Transport-Requests-10 \| Submitted 1 concurrent range requests
7	6 \|2018-03-07 00:25:06.048 +0000 UTC \| restbase1012-a.eqiad.wmnet \| 3064796 \| ReadStage-135 \| Executing seq scan across 12 sstables for (min(-9223372036854775808), min(-9223372036854775808))
8	7 \|2018-03-07 00:25:06.049 +0000 UTC \| restbase1012-a.eqiad.wmnet \| 3065645 \| ReadStage-135 \| Read 1 live and 0 tombstone cells
9	eevans@restbase1007:~$

	F14871926: Screenshot-2018-3-9 Grafana - Cassandra System.png
	Mar 9 2018, 9:56 PM

	F14871940: Screenshot-2018-3-9 Grafana - Cassandra System(1).png
	Mar 9 2018, 9:56 PM

	F14552056: Screenshot-2018-3-6 Grafana - Cassandra Tables.png
	Mar 6 2018, 10:49 PM

	F14551632: Screenshot-2018-3-6 Grafana - RESTBase.png
	Mar 6 2018, 10:49 PM

Understand (and if possible, improve) cluster performance under the new storage strategyClosed, ResolvedPublicActions