
Investigate aberrant Cassandra columnfamily read latency of restbase101{0,2,4}
Closed, ResolvedPublic

Description

The requirements established during the recent storage redesign state that we should provide 99p latencies of no more than 100ms. If the Cassandra dashboards are to be believed, we are above our stated target. These latencies should be investigated, and corrected (or understood), before migrating any additional use-cases to the new strategy.
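
(As a sanity check against the dashboards, per-table read latency percentiles can also be sampled directly on an instance with nodetool; a rough sketch only — the keyspace/table names below are placeholders, and on Cassandra 3.x the sub-command is tablehistograms rather than cfhistograms.)

# Per-table latency percentiles (including 95%/99% read latency, in microseconds).
# "some_keyspace" and "some_table" are placeholders for a real keyspace/table.
$ nodetool cfhistograms some_keyspace some_table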

Event Timeline

Eevans triaged this task as High priority.Oct 13 2017, 4:25 PM

Over 30 days (it is getting worse):

Table read latency

Screenshot-2017-10-25 Grafana - Cassandra.png (932×2 px, 331 KB)

And while the client POV doesn't reflect the full latencies seen above, they are still above our SLA, and trending upward:

Client read latency

Screenshot-2017-10-25 Grafana - Cassandra.png (932×2 px, 414 KB)

Eevans renamed this task from Investigate higher than expected Cassandra columnfamiliy read latency to Investigate higher than expected Cassandra columnfamily read latency.Oct 25 2017, 6:19 PM
Eevans added a project: User-Eevans.
Eevans moved this task from Backlog to In-Progress on the User-Eevans board.

It was a goal of the new dashboard to be more compact and concise, so the legends were eliminated in favor of the all-series hover tooltip. Looking at table latency, it was clear the problem was not isolated to a single instance, but what the visualization obscured was that it is isolated to a single host.

TL;DR In eqiad, the unusually high latency is confined to the instances on a single host, restbase1010. The table latency visualizations have been updated to include a table legend, with min, max, and average values.

Screenshot-2017-10-26 Grafana - Cassandra.png (240×1 px, 21 KB)

NOTE: This is good news because a) it would seem to imply that the latency is not due to characteristics of the new strategy, and b) being confined to a single host should make it easier to identify.
Eevans renamed this task from Investigate higher than expected Cassandra columnfamily read latency to Investigate aberrant Cassandra columnfamily read latency of restbase1010.Nov 2 2017, 11:48 AM

Mentioned in SAL (#wikimedia-operations) [2017-11-07T17:38:55Z] <urandom> Restart Cassandra, restbase1010-{a,b,c}.eqiad.wmnet (T178177)

Cassandra-a on restbase1010 was restarted today because of high read latencies causing restbase workers to die on 1013/1015.

As discovered by @fgiunchedi, one difference between restbase1010 and the other eqiad nodes is that it has an HP Smart Array controller.

$ cdsh -c restbase-ng -d eqiad -- sudo lspci -d 103c:3239
restbase1010.eqiad.wmnet: 03:00.0 RAID bus controller: Hewlett-Packard Company Smart Array Gen9 Controllers (rev 01)
$ 

All of the -ng cluster machines in the codfw data center have this controller, which would make it difficult to make a similar comparison there:

$ cdsh -c restbase-ng -d codfw -- sudo lspci -d 103c:3239
restbase2003.codfw.wmnet: 03:00.0 RAID bus controller: Hewlett-Packard Company Smart Array Gen9 Controllers (rev 01)
restbase2001.codfw.wmnet: 03:00.0 RAID bus controller: Hewlett-Packard Company Smart Array Gen9 Controllers (rev 01)
restbase2005.codfw.wmnet: 03:00.0 RAID bus controller: Hewlett-Packard Company Smart Array Gen9 Controllers (rev 01)
$ 

Per @fgiunchedi, there is a firmware upgrade available, though from the release notes it's not clear it would address the issues we are seeing.
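
(If it helps, the firmware currently installed can be read with hpssacli; untested here, but a controller-level show normally includes a Firmware Version line.)

# Report the controller's current firmware version (slot 0 assumed, as elsewhere in this task).
$ sudo hpssacli controller slot=0 show | grep -i firmware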

Additionally, there are 11 machines in the legacy cluster that have this controller, including 5 of the 6 involved in the ongoing reshape (2002, 2004, 2006, 1012, and 1014):

$ cdsh -c restbase -- sudo lspci -d 103c:3239
restbase1011.eqiad.wmnet: 03:00.0 RAID bus controller: Hewlett-Packard Company Smart Array Gen9 Controllers (rev 01)
restbase1012.eqiad.wmnet: 03:00.0 RAID bus controller: Hewlett-Packard Company Smart Array Gen9 Controllers (rev 01)
restbase1013.eqiad.wmnet: 03:00.0 RAID bus controller: Hewlett-Packard Company Smart Array Gen9 Controllers (rev 01)
restbase1014.eqiad.wmnet: 03:00.0 RAID bus controller: Hewlett-Packard Company Smart Array Gen9 Controllers (rev 01)
restbase1015.eqiad.wmnet: 03:00.0 RAID bus controller: Hewlett-Packard Company Smart Array Gen9 Controllers (rev 01)
restbase2004.codfw.wmnet: 03:00.0 RAID bus controller: Hewlett-Packard Company Smart Array Gen9 Controllers (rev 01)
restbase2008.codfw.wmnet: 03:00.0 RAID bus controller: Hewlett-Packard Company Smart Array Gen9 Controllers (rev 01)
restbase2002.codfw.wmnet: 03:00.0 RAID bus controller: Hewlett-Packard Company Smart Array Gen9 Controllers (rev 01)
restbase2007.codfw.wmnet: 03:00.0 RAID bus controller: Hewlett-Packard Company Smart Array Gen9 Controllers (rev 01)
restbase2006.codfw.wmnet: 03:00.0 RAID bus controller: Hewlett-Packard Company Smart Array Gen9 Controllers (rev 01)
restbase2009.codfw.wmnet: 03:00.0 RAID bus controller: Hewlett-Packard Company Smart Array Gen9 Controllers (rev 01)
$ 

We are currently in the process of bootstrapping instances onto 2002, so we could do one of:

  1. Upgrade the next host (2004?) prior to bootstrapping its instances
  2. Stop after the current bootstraps, shut down 2002, and upgrade it before continuing
  3. Stop after the current bootstraps, shut down 1010, and upgrade it before continuing
  4. Wait until T179422: Reshape RESTBase Cassandra clusters is complete, and then revisit firmware upgrades

My vote would be for #1 or #2; @fgiunchedi, any thoughts?

Eevans changed the task status from Open to Stalled.Nov 27 2017, 9:10 PM

Mentioned in SAL (#wikimedia-operations) [2017-12-05T21:56:23Z] <urandom> draining cassandra instances, restbase1010 - T178177

Mentioned in SAL (#wikimedia-operations) [2017-12-05T21:58:21Z] <mutante> restbase1010 - upgraded HP firmware (Flashing Smart Array P440ar in Slot 0 [ 3.56 -> 6.06 ]) T141756 T178177

Mentioned in SAL (#wikimedia-operations) [2017-12-05T22:13:15Z] <mutante> restbase1010 failed at reboot with P6431 , after a cold start (power off, power on) it came back though :) (T178177 T141756)

Update: The firmware has been upgraded, but the aberrant latency remains; back to the drawing board.

Eevans changed the task status from Stalled to Open.Dec 6 2017, 4:22 PM
Eevans lowered the priority of this task from High to Medium.

On closer inspection, now that restbase1012 and restbase1014 have been bootstrapped in eqiad and the cluster has been given time to quiesce, we have:

restbase1007 & restbase1008 & restbase1009 with average 99p latencies in the range of 7-17ms (last 12h).
restbase1010 & restbase1012 & restbase1014 with average 99p latencies in the range of 33-61ms (last 12h).

Broken down like this, the former group (lower latencies) are all Dell, the latter (higher latencies) are all HP.

All of the machines currently making up the new cluster in eqiad have the Samsung 850s.
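
(For anyone wanting to double-check the drive models across the fleet, something like the following should work, reusing the cdsh pattern from above; untested here.)

# List the drive model per block device on each -ng host in eqiad.
$ cdsh -c restbase-ng -d eqiad -- lsblk -d -o NAME,MODEL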

From a conversation with @Volans on IRC, it might be worthwhile to disable Smart Path.

14:21 < volans> urandom: so from this few minutes of looking at it, if it was me and it's
    not too complex I would pick one of the new hosts, disable the smart path in teh raid 
    config and see how it goes
14:21 < volans> there are controversial benchmarks in real life workloads of this feature
14:22 < urandom> volans: smart path?
14:22 < volans> LD Acceleration Method: HPE SSD Smart Path
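
(For context, the acceleration method mentioned above can be read per logical drive with hpssacli; a sketch, untested here.)

# Show the LD Acceleration Method for every logical drive; expect "HPE SSD Smart Path" while enabled.
$ sudo hpssacli controller slot=0 logicaldrive all show detail | grep -i "acceleration method"
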
Eevans renamed this task from Investigate aberrant Cassandra columnfamily read latency of restbase1010 to Investigate aberrant Cassandra columnfamily read latency of restbase101{0,2,4}.Dec 6 2017, 9:24 PM

Disclaimer

My comments are under the assumption that the difference in behaviour is between the new hosts and the old hosts, as opposed to the first analysis, which pointed to a single host showing higher latencies.

Smart Path

From what I know, Smart Path should help read-intensive workloads at the expense of writes, although I've seen various benchmarks and blog posts that found the system more performant with this feature turned OFF, across various workloads.

Looking at the stats in the single-machine Grafana dashboards [1], it seems that we have roughly 62% reads and 38% writes in IOPS, a much more balanced r/w workload than what appears in the Cassandra dashboard [2], which I guess also includes reads served from RAM.

Given the above, I would suggest a test on one host: disable the HPE SSD Smart Path and see how it goes for a bit. It should be as simple as running the following (it cycles over all the "arrays", which in our JBOD configuration means one per disk):

# NOT TESTED; probably worth depooling the host to be on the safe side when doing it
# Disable HPE SSD Smart Path on each of the five single-disk arrays (a..e).
for i in {a..e}; do
    sudo hpssacli controller slot=0 array "${i}" modify ssdsmartpath=disable
done

HPE RAID controller configuration

There are other parameters that might be worth investigating a bit more from the output of:

sudo hpssacli controller slot=0 show config detail

Namely:

Cache Board Present: True
Cache Status: Not Configured
Cache Ratio: 100% Read / 0% Write
Read Cache Size: 0 MB
Write Cache Size: 0 MB
Drive Write Cache: Disabled
Total Cache Size: 2.0 GB
Total Cache Memory Available: 1.8 GB
SSD Caching Version: 2

Physical disk distribution

It seems that of the controller's 2 internal ports, one has 4 disks attached and the other has just one disk:

# Part of the output of sudo hpssacli controller slot=0 show config detail

   Internal Drive Cage at Port 1I, Box 1, OK
      Power Supply Status: Not Redundant
      Drive Bays: 4
      Port: 1I
      Box: 1
      Location: Internal

   Physical Drives
      physicaldrive 1I:1:1 (port 1I:box 1:bay 1, Solid State SATA, 1024.2 GB, OK)
      physicaldrive 1I:1:2 (port 1I:box 1:bay 2, Solid State SATA, 1024.2 GB, OK)
      physicaldrive 1I:1:3 (port 1I:box 1:bay 3, Solid State SATA, 1024.2 GB, OK)
      physicaldrive 1I:1:4 (port 1I:box 1:bay 4, Solid State SATA, 1024.2 GB, OK)


   Internal Drive Cage at Port 2I, Box 1, OK
      Power Supply Status: Not Redundant
      Drive Bays: 4
      Port: 2I
      Box: 1
      Location: Internal

   Physical Drives
      physicaldrive 2I:1:5 (port 2I:box 1:bay 5, Solid State SATA, 1024.2 GB, OK)

I'm not sure if this affects performance in general - I'm assuming the ports are sized to handle a fully populated configuration without degrading performance - but it might be another thing to look at, also from the point of view of redundancy and resiliency to failures.
I didn't check the old Dell hosts, but it seems that megacli is not installed there and we don't monitor the controller in Icinga either, while we do for these HP hosts.

Partitioning

I was a bit surprised when I found out the current partition scheme; what was the rationale that led to it?
AFAICS we do software RAID10 (md) across all 5 disks for 3 partitions: / (28G), /srv/cassandra/instance-data (46G), and swap (1G) (thanks to md, which allows RAID10 over an odd number of disks 😉).
I understand that this way all the disks are symmetric, at the cost of extra redundancy that is probably not needed and less available space, but it also means that writes to those partitions hit all the disks used by Cassandra too.
From the few hosts I've checked this seems to be the configuration on the old hosts too, so it's probably not the culprit here, but it's potentially worth investigating if we're trying to optimize for performance and reduce the chance that other kinds of activity (like the monthly mdadm crontab to check the arrays, although the last run this Sunday lasted less than 5 minutes) affect the data disks.
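
(For reference, a quick way to inspect the current md layout; the md device name below is an assumption, not verified against these hosts.)

# Summarize the md arrays and which partitions/disks they span.
$ cat /proc/mdstat
$ lsblk -o NAME,SIZE,TYPE,MOUNTPOINT
# Detail for a single array; replace md0 with the array backing / or /srv/cassandra/instance-data.
$ sudo mdadm --detail /dev/md0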

[1] https://grafana.wikimedia.org/dashboard/db/prometheus-machine-stats?orgId=1&panelId=6&fullscreen&var-server=restbase1010&var-datasource=eqiad%20prometheus%2Fops
[2] https://grafana.wikimedia.org/dashboard/db/cassandra?orgId=1

My comments are under the assumption that the difference in behaviour is between the new hosts and the old hosts, as opposed to the first analysis, which pointed to a single host showing higher latencies.

Right, the distinction seems to be between the HP hosts and the Dell hosts. The new vs. old distinction is a coincidence (that the first 3 were Dell, and the recently added ones are HP, is an accident).

From what I know, Smart Path should help read-intensive workloads at the expense of writes, although I've seen various benchmarks and blog posts that found the system more performant with this feature turned OFF, across various workloads.
[ ... ]
Given the above I would suggest a test on one host disabling the HPE SSD Smart Path and checking how it goes for a bit...

I'm game.

There are other parameters that might be worth investigating a bit more from the output of:

sudo hpssacli controller slot=0 show config detail

Namely:

Cache Board Present: True
Cache Status: Not Configured
Cache Ratio: 100% Read / 0% Write
Read Cache Size: 0 MB
Write Cache Size: 0 MB
Drive Write Cache: Disabled
Total Cache Size: 2.0 GB
Total Cache Memory Available: 1.8 GB
SSD Caching Version: 2

OK; any specific suggestions here?

I was a bit surprised when I found out the current partition scheme, what was the rationale that lead to it?

Oh boy. This is a good question. A great question, actually. There is a reason, believe it or not, but explaining it adequately requires a lot of background information. Let me try to write something up in a wiki page or something, and link back to it here.

Mentioned in SAL (#wikimedia-operations) [2017-12-08T16:15:13Z] <urandom> shutting down cassandra, restbase1010 - T178177

Mentioned in SAL (#wikimedia-operations) [2017-12-08T16:20:08Z] <urandom> disabling smart path, restbase1010, array 'a' (canary) - T178177

Mentioned in SAL (#wikimedia-operations) [2017-12-08T16:22:08Z] <urandom> disabling smart path, restbase1010, arrays 'b'...'e' - T178177

Mentioned in SAL (#wikimedia-operations) [2017-12-08T16:23:14Z] <urandom> starting cassandra, restbase1010 - T178177

Smart Path has been disabled.

eevans@restbase1010:~$ sudo hpssacli controller slot=0 show config detail | grep -A 6 -E "Array: [A-E]"
   Array: A
      Interface Type: Solid State SATA
      Unused Space: 0  MB (0.0%)
      Used Space: 953.8 GB (100.0%)
      Status: OK
      MultiDomain Status: OK
      Array Type: Data       HPE SSD Smart Path: disable
--
   Array: B
      Interface Type: Solid State SATA
      Unused Space: 0  MB (0.0%)
      Used Space: 953.8 GB (100.0%)
      Status: OK
      MultiDomain Status: OK
      Array Type: Data       HPE SSD Smart Path: disable
--
   Array: C
      Interface Type: Solid State SATA
      Unused Space: 0  MB (0.0%)
      Used Space: 953.8 GB (100.0%)
      Status: OK
      MultiDomain Status: OK
      Array Type: Data       HPE SSD Smart Path: disable
--
   Array: D
      Interface Type: Solid State SATA
      Unused Space: 0  MB (0.0%)
      Used Space: 953.8 GB (100.0%)
      Status: OK
      MultiDomain Status: OK
      Array Type: Data       HPE SSD Smart Path: disable
--
   Array: E
      Interface Type: Solid State SATA
      Unused Space: 0  MB (0.0%)
      Used Space: 953.8 GB (100.0%)
      Status: OK
      MultiDomain Status: OK
      Array Type: Data       HPE SSD Smart Path: disable
eevans@restbase1010:~$

Mentioned in SAL (#wikimedia-operations) [2017-12-11T19:43:48Z] <urandom> lower compaction throughput to 2 MB/s, restbase1010-{a,b,c} - T178177
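
(For the record, a runtime change like the one above is typically applied with nodetool; a sketch only — the JMX host/port is an assumption, and the value reverts to cassandra.yaml's compaction_throughput_mb_per_sec on restart.)

# Cap compaction throughput at 2 MB/s on one instance (JMX port 7199 assumed; adjust per instance).
$ nodetool -h localhost -p 7199 setcompactionthroughput 2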

We now have ~3 days of metrics since Smart Path was disabled. It almost looks like it could be a tiny bit less spike-y, but I wouldn't swear to it; I don't think this had any/much effect.

Instance latencies

Screenshot-2017-12-11 Grafana - Cassandra.png (936×2 px, 137 KB)

Screenshot-2017-12-11 Grafana - Cassandra(1).png (936×2 px, 157 KB)

Screenshot-2017-12-11 Grafana - Cassandra(2).png (936×2 px, 145 KB)

CPU iowait

Screenshot-2017-12-11 Grafana - Cassandra System.png (936×2 px, 147 KB)

NOTE: Icinga has this host in a warning state now, so we should consider switching this back.

It does seem like the extremes occur less often and are slightly lower, but yeah, I would agree that overall this doesn't seem to have been a net win :/

More data:

Screenshot-2017-12-11 render (PNG Image, 1024 × 250 pixels).png (500×2 px, 253 KB)

In the legacy cluster, restbase1011 is an HP (w/ smart controller) and restbase1016 a Dell (the former has Samsung SSDs, the latter Intel).

Screenshot-2017-12-11 render (PNG Image, 1024 × 250 pixels)(1).png (500×2 px, 390 KB)

In the legacy cluster (and in codfw, where the workload is quite different), restbase2009 and restbase2012 (HP and Dell respectively), both equipped with Intel SSDs. The absolute numbers are quite low, but the HP host still has iowait times that are double-ish.

I'm sorry the test didn't help.
Digging a bit more, it seems that the controller we have (Smart Array P440ar) supports HBA mode (Host Bus Adapter) which, according to the HP manual [1]:

In HBA mode, all physical drives are presented directly to the operating system and the hardware RAID engine is disabled.

This of course also bypasses any cache on the controller, but given the specific use case it could help.
The problem with testing this is that the data will most likely be lost when converting to HBA mode, as the manual [1] says:

A prompt appears warning you that entering HBA mode will disable any drives configured using a Smart Array until the configuration is cleared. A reboot is required for HBA mode to be enabled.

This seems to mean that the RAID configuration for those disks must be reset/cleared, which most likely will wipe them.

Although the manual describes only how to enable this mode via their UI tool, it seems, according to a real-life example [2], that it's also possible to set this mode via the hpssacli CLI.
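
Roughly, per [2], it would look something like this (NOT tested, destructive, and the exact syntax may differ by hpssacli/firmware version):

# NOT TESTED -- destructive: clearing the array configuration wipes the logical drives (and the data).
$ sudo hpssacli controller slot=0 array all delete forced
# Enable HBA (pass-through) mode; a reboot is then required for it to take effect.
$ sudo hpssacli controller slot=0 modify hbamode=on forced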

I've already verified with Rob that we don't have any spare system with similar hardware to make a separate test.

It's up to you to evaluate whether this test can be done, assuming we'll lose the local data, given the current cluster status and the WIP migration.

[1] https://support.hpe.com/hpsc/doc/public/display?docId=c03909334 (pp. 48-50)
[2] https://pcdmarc.wordpress.com/2015/12/05/vmware-vsan-with-hp-p440-hbas/

Mentioned in SAL (#wikimedia-operations) [2018-02-06T14:07:24Z] <urandom> re-enable smartpath on restbase1010 (revert experiment) - T178177

At this point, I think we've established that more reasonable performance is possible by configuring a JBOD in HBA mode, instead of as a collection of single-disk RAID0s. Many of the existing HP nodes in the RESTBase cluster have already been configured this way, and all that remains is to re-image the 9 nodes stood up prior to this conclusion. That work is being tracked in T186562: Reimage JBO-RAID0 configured RESTBase HP machines, so I'll close this issue. If anyone objects (for example, they're convinced that more can/should be done), then feel free to re-open.