
Replace 5 Samsung SSD 850 devices w/ 4 1.6T Intel or HP SSDs
Closed, Resolved (Public)

Description

T186562 requires that 9 machines in the RESTBase Cassandra cluster be re-imaged in order to configure the SmartArray controller in HBA mode. We would like to use the first of these as an opportunity to conduct an experiment whereby the 5 1T Samsung SSD 850 devices would be replaced with 4 of either the 1.6T Intel or HP SSDs used elsewhere in the cluster.

More detailed rationale for this can be found in T189057: Understand (and if possible, improve) cluster performance under the new storage strategy, but the short version: we have a broad matrix of host-disk combinations, each with differing IO performance, and the combination of HP hosts and Samsung disks seems problematic. We have been treating this as an issue of relatively poor performance of these HPs when compared to the Dells, but restbase2009, the one HP that does not have Samsungs, would seem to call this into question (it is populated with HP LK1600GEYMV devices).

Event Timeline

Eevans triaged this task as Medium priority. Mar 15 2018, 9:52 PM
Eevans created this task.
RobH subscribed.
This comment was removed by RobH.

I'm a bit confused about which system will get the replacement SSDs installed. I'm guessing it's restbase2007 or restbase2008, as those each have 5 Samsung 850 EVO SSDs.

I'll generate a sub-task with pricing, since it cannot be in a public task. On that sub-task, I'll request a quote from Dasher for the Intel SSDs, since the restbase200[78] systems are in warranty until 2019-04-22.

The SSDs we get these days from Intel's line are: https://www.intel.com/content/www/us/en/solid-state-drives/ssd-dc-s4500-s4600-brief.html

> I'm a bit confused about which system will get the replacement SSDs installed. I'm guessing it's restbase2007 or restbase2008, as those each have 5 Samsung 850 EVO SSDs.
>
> I'll generate a sub-task with pricing, since it cannot be in a public task. On that sub-task, I'll request a quote from Dasher for the Intel SSDs, since the restbase200[78] systems are in warranty until 2019-04-22.

It could be any of:

  • restbase1010
  • restbase1012
  • restbase1014
  • restbase2003
  • restbase2004
  • restbase2001
  • restbase2002
  • restbase2005
  • restbase2006

These are the machines that we're currently planning to take down, reconfigure, and reimage anyway as part of T186562: Reimage JBOD-RAID0 configured RESTBase HP machines. As an experiment, if replacing these devices does not support the conclusion that the Samsung SSDs are the root of our IO problems, then these will be the only ones we procure, and we'll proceed with T186562 as planned (but at least we won't have double-handled any hosts). If it does support the conclusion that the Samsungs are to blame, then (hopefully) after a difficult discussion, we'll replace all of the Samsung SSDs (including those in restbase200[78]).

Hrmm, it looks like there are no 1.6T disks in this line; options include 240GB, 480GB, 960GB, 2TB (1.92TB), and 4TB (3.84TB).

Our original capacity was based on 5 of the 1T Samsungs (5T). We later introduced some 1.6T Intels at 4 per host (6.4T), but we have to base cluster capacity off the lowest common denominator, in this case 5T. All of that said, now that we're only storing recent revisions, we're definitely over-provisioned on storage space, so it raises the question (especially if this is going to result in a mass replacement of Samsung SSDs): what do we need storage-wise?

One option: We order 5 @ 960G Intel and keep host storage at ~5T.

Another option: We order 2 @ 2T (provided 2 devices would provide enough IO), or 4 @ 960G, and make host storage ~4T. If we did this, it would be possible to reclaim 1 1.6T SSD from each of the hosts equipped with them, giving us enough to cover the SSDs for two machines (and it might be possible to do this online).
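
To make the trade-offs concrete, here is the rough capacity arithmetic behind these options (vendor/decimal TB, before filesystem overhead, so approximations only):

  current Samsung hosts:  5 x 1.00T = 5.00T   (today's lowest common denominator)
  current Intel hosts:    4 x 1.60T = 6.40T
  option 1:               5 x 0.96T = 4.80T   (~parity with the 5T baseline)
  option 2a:              2 x 1.92T = 3.84T   (~4T, but only 2 devices to spread IO over)
  option 2b:              4 x 0.96T = 3.84T   (~4T)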

> It could be any of:
>
>   • restbase1010
>   • restbase1012
>   • restbase1014

These are owned by us, so I'll use restbase1010 for the SSD swap. restbase1010 has the added benefit that it has non-HP-supported Samsung EVO disks from past restbase systems. Ordering these SSDs from HP will bring this system's SSDs into system warranty.

>   • restbase2003
>   • restbase2004
>   • restbase2001
>   • restbase2002
>   • restbase2005
>   • restbase2006

We cannot change SSDs in restbase200[1-6], as they are LEASED. Changing hardware in a lease is not recommended, and doing so really needs @faidon's sign-off since it is a tracking nightmare.

> These are the machines that we're currently planning to take down, reconfigure, and reimage anyway as part of T186562: Reimage JBOD-RAID0 configured RESTBase HP machines. As an experiment, if replacing these devices does not support the conclusion that the Samsung SSDs are the root of our IO problems, then these will be the only ones we procure, and we'll proceed with T186562 as planned (but at least we won't have double-handled any hosts). If it does support the conclusion that the Samsungs are to blame, then (hopefully) after a difficult discussion, we'll replace all of the Samsung SSDs (including those in restbase200[78]).

I'll update my quote request to reflect the options quoted below, so dual 2TB and quad 960GB. I'll simply request dual and quad options (since we'll have the per-SSD price, pricing a 5-SSD option is easily done).

> Hrmm, it looks like there are no 1.6T disks in this line; options include 240GB, 480GB, 960GB, 2TB (1.92TB), and 4TB (3.84TB).
>
> Our original capacity was based on 5 of the 1T Samsungs (5T). We later introduced some 1.6T Intels at 4 per host (6.4T), but we have to base cluster capacity off the lowest common denominator, in this case 5T. All of that said, now that we're only storing recent revisions, we're definitely over-provisioned on storage space, so it raises the question (especially if this is going to result in a mass replacement of Samsung SSDs): what do we need storage-wise?
>
> One option: We order 5 @ 960G Intel and keep host storage at ~5T.
>
> Another option: We order 2 @ 2T (provided 2 devices would provide enough IO), or 4 @ 960G, and make host storage ~4T. If we did this, it would be possible to reclaim 1 1.6T SSD from each of the hosts equipped with them, giving us enough to cover the SSDs for two machines (and it might be possible to do this online).

@RobH as a first step, could we ensure that we have no spare Intel disks lying around that we could plug in, even if they are out of warranty? It would be good to first confirm that swapping the Samsung SSDs for something else supports the current theory that the disks are to blame. So, let me rephrase the question: do we have any non-Samsung disks lying around that we could plug into restbase1010?

> do we have any non-Samsung disks lying around that we could plug into restbase1010?

Not enough to fill the data capacity requirements, no.

We have a bunch of old Intel 320 series drives; however, they are VERY old. So old that I'm not sure I'd trust them as a performance test for anything. They are also only 300GB, so it would take a lot of them. I do not suggest them for this test.

We don't keep new Intel SSDs as shelf spares, since we only order them as supported/in-warranty hardware for our Dell and HP systems. Since in-warranty hardware has a next-business-day replacement policy, it just wasn't required. The closest to new SSDs I have are 5 Intel DC S3500 Series SSDSC2BB300G401 2.5" 300GB, or 1 Intel DC S3700 SSDSC2BA400G3 400GB. Both of these are only a couple of years old in terms of Intel SSD models, but are still not as new as the Intel S3610 SSDs used in some restbase systems.

Ah I see, thanks for the info @RobH. I agree, these wouldn't make good candidates for what we need. OK, let's proceed with the quote for both types of disks and reconvene once we have them.

The order for this is escalated for placement. This should arrive sometime next week. (Just updating this public task, since the order task is private.)

@mobrovac The 5 SSDs arrived for restbase1010. Do you need to schedule downtime to replace them?

> @mobrovac The 5 SSDs arrived for restbase1010. Do you need to schedule downtime to replace them?

We'll need to decommission the 3 instances running there first. After the disks are swapped, we'll need someone to do the re-image before we can bootstrap. I'm guessing that will be @fgiunchedi, so we should probably confirm his availability before we begin.

> @mobrovac The 5 SSDs arrived for restbase1010. Do you need to schedule downtime to replace them?
>
> We'll need to decommission the 3 instances running there first. After the disks are swapped, we'll need someone to do the re-image before we can bootstrap. I'm guessing that will be @fgiunchedi, so we should probably confirm his availability before we begin.

From IRC:

12:26 < godog> yeah this week and the next I'm able to help
12:26 <+urandom> ok, would it be premature to start decommissioning now?
12:26 <+urandom> i guess it'd be done tomorrow, thursday tops
...
12:31 < godog> urandom: starting decom now WFM
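
For reference, the per-instance decommission step amounts to roughly the following. This is a sketch rather than the exact tooling used here, and it assumes each Cassandra instance on a multi-instance host exposes its own JMX port (JMX_PORT below is a placeholder):

  # stream this instance's data to the rest of the ring (blocks until complete)
  nodetool -h localhost -p "$JMX_PORT" decommission
  # from another shell: watch streaming progress, then confirm the instance has left the ring
  nodetool -h localhost -p "$JMX_PORT" netstats
  nodetool status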

Mentioned in SAL (#wikimedia-operations) [2018-04-17T16:34:53Z] <urandom> decommissioning Cassandra, restbase1010-a -- T189822

Mentioned in SAL (#wikimedia-operations) [2018-04-18T15:09:27Z] <urandom> decommissioning Cassandra, restbase1010-b -- T189822

Mentioned in SAL (#wikimedia-operations) [2018-04-19T03:18:44Z] <urandom> decommissioning Cassandra, restbase1010-c -- T189822

Update:

The decommission of restbase1010-c was discontinued after other instances in the rack began to fail (restbase1016-{a,b,c}, restbase1010-b & restbase1007-a). The failures in question all seem to be the result of the JVM itself running out of native memory (as opposed to an application-level OutOfMemoryError).

#
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (mmap) failed to map 12288 bytes for committing reserved memory.
# Possible reasons:
#   The system is out of physical RAM or swap space
#   In 32 bit mode, the process size limit was hit
# Possible solutions:
#   Reduce memory load on the system
#   Increase physical memory or swap space
#   Check if swap backing store is full
#   Use 64 bit Java on a 64 bit OS
#   Decrease Java heap size (-Xmx/-Xms)
#   Decrease number of Java threads
#   Decrease Java thread stack sizes (-Xss)
#   Set larger code cache with -XX:ReservedCodeCacheSize=
# This output file may be truncated or incomplete.
#
#  Out of Memory Error (os_linux.cpp:2643), pid=2899, tid=0x00007f1dcd4bb700
#
# JRE version: OpenJDK Runtime Environment (8.0_151-b12) (build 1.8.0_151-8u151-b12-1~bpo8+1-b12)
# Java VM: OpenJDK 64-Bit Server VM (25.151-b12 mixed mode linux-amd64 compressed oops)
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#

[ ... ]

One possibility is that vm.max_map_count may have been set too low (it defaults to 65530, and the typical recommendation for Cassandra is 1048575). This has never been an issue for us before, but these nodes were under considerable IO pressure (the newer storage strategy is more IO-intensive, combined with the (anti)compaction workload of the decommission). I have ephemerally set vm.max_map_count=1048575 on the rack A nodes for the time being.
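
For the record, the check and the ephemeral change amount to something like the following sketch; the sysctl.d file name is illustrative only, not what our puppetization would use:

  # how close are we to the mmap limit?
  cat /proc/sys/vm/max_map_count                               # 65530 by default
  wc -l /proc/"$(pgrep -f CassandraDaemon | head -n1)"/maps    # mappings currently held by one instance
  # ephemeral change, lost on reboot (what was applied to the rack A nodes)
  sudo sysctl -w vm.max_map_count=1048575
  # a persistent version would go in a sysctl.d snippet, e.g.:
  echo 'vm.max_map_count = 1048575' | sudo tee /etc/sysctl.d/60-cassandra.conf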

Once compaction is quiescent, we can kick off a round of cleanups (on restbase1007 and restbase1011 at least), both to exercise compaction more before reattempting the decommission, and to ensure we don't paint ourselves into a corner on free space (we're as high as 83% used on some devices).
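
A minimal sketch of that cleanup round (same per-instance JMX port assumption as above):

  # confirm compaction has quiesced before starting
  nodetool -h localhost -p "$JMX_PORT" compactionstats
  # drop data this instance no longer owns; repeat for each instance on each host
  nodetool -h localhost -p "$JMX_PORT" cleanup
  # keep an eye on free space on the data filesystems while it runs
  df -h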

Mentioned in SAL (#wikimedia-operations) [2018-04-19T20:48:12Z] <urandom> restarting cassandra to (temporarily) rollback prometheus jmx exporter -- T189822, T192456

Mentioned in SAL (#wikimedia-operations) [2018-04-19T20:48:24Z] <urandom> restarting cassandra to (temporarily) rollback prometheus jmx exporter, restbase1010-a -- T189822, T192456

Mentioned in SAL (#wikimedia-operations) [2018-04-19T21:11:56Z] <urandom> restarting cassandra to (temporarily) rollback prometheus jmx exporter, restbase1010-c -- T189822, T192456

Mentioned in SAL (#wikimedia-operations) [2018-04-19T21:15:54Z] <urandom> Start cleanup, restbase10{07,11,16}-a -- T189822

Mentioned in SAL (#wikimedia-operations) [2018-04-19T21:22:38Z] <urandom> Start cleanup, restbase10{07,11,16}-b -- T189822

Mentioned in SAL (#wikimedia-operations) [2018-04-19T21:41:45Z] <urandom> Start cleanup, restbase10{07,11,16}-c -- T189822

Mentioned in SAL (#wikimedia-operations) [2018-04-23T13:53:31Z] <urandom> decommissioning Cassandra, restbase1010-c -- T189822

All 3 instances are now decommissioned. @fgiunchedi we're ready when you are to have the disks swapped and the host re-imaged!

Mentioned in SAL (#wikimedia-operations) [2018-04-24T08:05:03Z] <godog> power off restbase1010 for ssd replacement - T189822

@Cmjohnson restbase1010 is powered down and ready to have all of its SSDs swapped.

Mentioned in SAL (#wikimedia-operations) [2018-04-24T15:59:13Z] <godog> reimage restbase1010 after ssd swap - T189822

I've gone ahead and reimaged restbase1010; all Cassandra instances are masked ATM, but the host is otherwise good to be tested again.
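
Bringing the instances back then goes one at a time; roughly the following, where unit names like cassandra-a are an assumption about how the multi-instance services are named:

  # unmask and start a single instance
  sudo systemctl unmask cassandra-a
  sudo systemctl start cassandra-a
  # watch streaming while it joins (UJ in status); wait for UN before starting the next instance
  nodetool -h localhost -p "$JMX_PORT" netstats
  nodetool status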

Mentioned in SAL (#wikimedia-operations) [2018-04-24T16:57:07Z] <urandom> starting Cassandra bootstrap, restbase1010-a -- T189822

Mentioned in SAL (#wikimedia-operations) [2018-04-24T20:27:17Z] <urandom> starting Cassandra bootstrap, restbase1010-b -- T189822

Mentioned in SAL (#wikimedia-operations) [2018-04-24T23:29:16Z] <urandom> starting Cassandra bootstrap, restbase1010-c -- T189822

All 3 instances of 1010 have been bootstrapped.

Mentioned in SAL (#wikimedia-operations) [2018-04-25T17:35:20Z] <urandom> starting cleanups on row 'a' Cassandra nodes -- T189822

With the replacements done, the matrix of machine/controller and SSD combinations is complete; this confirms that the Samsung SSDs are to blame for the poor IO performance we have been seeing.

Screenshot-2018-5-10 Grafana - Cassandra System.png (cpu iowait)
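
For anyone wanting to spot-check this outside of Grafana, extended per-device stats from iostat (await, %util) tell the same story; illustrative only, device names vary per host:

  # extended device statistics, 5-second intervals, 3 samples
  iostat -x -d 5 3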

T189057: Understand (and if possible, improve) cluster performance under the new storage strategy can be used for further follow-up; closing this issue as Resolved.

RobH closed subtask Unknown Object (Task) as Resolved.May 31 2018, 4:39 PM