
Enable TRIM for SSDs for Cassandra software raid
Closed, ResolvedPublic

Assigned To
Authored By
GWicke
Feb 15 2015, 8:11 AM
Referenced Files
F25868691: Screenshot from 2018-09-14 16-18-49.png
Sep 14 2018, 9:29 PM
F25830662: Screenshot from 2018-09-13 17-33-25.png
Sep 13 2018, 10:41 PM
F3366611: pasted_file
Feb 17 2016, 5:11 PM
F3366617: pasted_file
Feb 17 2016, 5:11 PM
F3366612: pasted_file
Feb 17 2016, 5:11 PM

Description

From what I see in puppet & on individual hosts, I get the impression that we don't routinely enable the TRIM command for SSDs. TRIM helps keep SSDs performing well for longer by letting the drive know which logical blocks are no longer needed.

There seem to be two main ways to do this in modern systems:

  1. discard option in fstab:
/dev/sda1  /       ext4   defaults,noatime,discard   0  1
  2. calling fstrim from cron (see the sketch below). http://blog.neutrino.es/2013/howto-properly-activate-trim-for-your-ssd-on-linux-fstrim-lvm-and-dmcrypt/ talks about the pros of this approach.
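For illustration, a minimal sketch of the cron approach (the path, schedule and logging are assumptions for the example, not something we run anywhere today) could look like:

#!/bin/sh
# Hypothetical /etc/cron.weekly/fstrim -- illustration only, not deployed anywhere.
# Trim all mounted filesystems that support discard and log what was trimmed.
/sbin/fstrim --all --verbose 2>&1 | /usr/bin/logger -t fstrim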

See also

Event Timeline

GWicke raised the priority of this task from to Needs Triage.
GWicke updated the task description. (Show Details)
GWicke added a project: acl*sre-team.
GWicke subscribed.

Which hosts are you talking about specifically? SSDs that participate in HW RAID (even single-disk RAID0 hacks on H710s) did not pass TRIM through to the underlying device, last time I checked.

GWicke renamed this task from Enable TRIM for SSDs to Enable TRIM for SSDs?.Feb 20 2015, 4:34 AM
GWicke edited subscribers, added: fgiunchedi, akosiaris, chasemp; removed: Aklapper.

@faidon, it seems that Linux MD RAID gained that capability in 3.7, so nowadays it might be worthwhile even for RAIDs.

Re hosts, I would think that either TRIM or over-provisioning is a good idea for all SSD installs, especially those using consumer / mid-level SSDs. The main benefit of TRIM is that it wastes less space.

I said HW RAID above, didn't I? :)

I said HW RAID above, didn't I? :)

Oh, sorry. I blame the lack of reading comprehension on me posting late at night ;)

So, that restricts this to SW RAIDs like the Cassandra boxes.

coren renamed this task from Enable TRIM for SSDs? to Enable TRIM for SSDs for Cassandra software raid.Mar 2 2015, 3:07 PM
coren triaged this task as Medium priority.
coren subscribed.

Changed title to reflect discussion

Also, assuming TRIM does reach the disk, you might want to do some perf testing and some research on how the drive is provisioned from the factory. I took an in-depth look at these issues on our cache SSDs recently, and basically what I found was that all of our cache machines' SSDs fell into one of three categories:

  1. No HW RAID, but the disk doesn't support TRIM (e.g. older Intel M160 drives) even with newer kernels - in these cases we're leaving ~15% of the disk unallocated at the end in an unused partition. So long as that area is never written to, the drive firmware can use the free blocks to help itself out. This is basically manual overprovisioning.
  2. HW RAID w/ H710 - same as above, probably the same disks underneath as well, but I haven't rebooted one to look.
  3. Intel S3700s w/o HW RAID - these do support TRIM, but they also have a good amount of internal excess capacity that's never presented to the user (overprovisioned from the factory), and others' performance tests have indicated that, because of this, manual overprovisioning doesn't really buy you any long-term benefit on these drives. You can enable discard, but it's just going to slow down disk I/O a bit all the time for little benefit. You could also do periodic fstrim, but you'll see I/O spikes from the fstrim, so it's only really applicable where there's a daily/weekly quiet period during which you don't care about the I/O spike.

So, in the end, I didn't end up using fstrim or discard in any of these cases.
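For anyone repeating this kind of survey, a quick way to check what a given drive and kernel actually advertise (the device names below are only examples) is something like:

# Does the drive itself advertise TRIM in its ATA identify data?
sudo hdparm -I /dev/sda | grep -i trim
# Does the kernel expose discard support for the device and its partitions?
# Non-zero DISC-GRAN / DISC-MAX values mean discards can be issued.
lsblk --discard /dev/sda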

<s>It turns out that the Samsung 8xx series are now blacklisted in the kernel as not handling TRIM properly. This has led to data loss for some users that had TRIM enabled.

So, let's not enable TRIM with the Samsung drives for now.</s>

Update: This was actually caused by a kernel bug in the Linux MD RAID-0 TRIM handling. Samsung has developed a patch.

Until a fix is merged and deployed to our servers, we should not be using TRIM with Linux MD RAID-0. RAID-1 is not affected.

The patch @GWicke linked to (for a data corruption bug of discard on top of MD linear/raid0/raid10) was reworked and finally merged upstream as f3f5da624e0a891c34d8cd513c57f1d9b0c7dadc, aka v4.2-rc4~11^2. Very recently we have been experimenting with 4.4 kernels for other workloads and as a candidate for our next WMF kernel (to replace 3.19), so this could be something we could do at some point for the RESTBase nodes.

That said, I just checked upstream Linux and the Samsung 8xx are still blacklisted for queued TRIM even in master. Based on the fact that this blacklisting is still there, the fact that Algolia's blog post explicitly says "TRIM on our drives is un-queued and the issue we have found is not related to the latest changes in the Linux Kernel to disable this feature" and the fact that upstream commit 9a9324d3969678d44b330e1230ad2c8ae67acf81 says:

The queued TRIM problems appear to be generic to Samsung's firmware and
not tied to a particular model. A recent update to the 840 EVO firmware
introduced the same issue as we saw on 850 Pro.

...I think we're talking about two separate issues: one with TRIM & MD in general (fixed in 4.2+) and one with Samsung 8xx SSDs & queued TRIM (not fixed, but blacklisted/protected from data corruption).
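As an aside, whether the running kernel has actually applied that blacklist to a given drive should be visible in the kernel log; assuming the libata warning text hasn't changed, something like this would confirm it:

# libata warns when it disables NCQ (queued) TRIM for a blacklisted drive
dmesg | grep -i 'queued trim'
# e.g. "ata1.00: disabling queued TRIM support" means the blacklist is in effect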

This is all pretty disappointing — the lack of TRIM in our current setup is probably a major performance bottleneck given our utilization of those disks. These SSDs were a poor choice from the beginning; we should be more careful when deviating from well-known professional hardware (such as Intel SSDs) in the future (e.g. by evaluating a smaller sample before ordering big quantities and putting it in production).

This is all pretty disappointing — the lack of TRIM in our current setup is probably a major performance bottleneck given our utilization of those disks.

TRIM does not seem to be as critical to SSD performance as it used to be in the early days. We have not seen significant changes in the ratio of I/O throughput to iowait over the lifetime of these disks; iowait is generally on the order of 1-2%.

Read throughput:

pasted_file (1×1 px, 313 KB)

Write throughput:

pasted_file (1×1 px, 275 KB)

iowait:

pasted_file (685×887 px, 109 KB)

Better firmware, in combination with a larger percentage of internal spare capacity (12.5% in the case of the 850 Pros), probably has a lot to do with this development.

These SSDs were a poor choice from the beginning; we should be more careful when deviating from well-known professional hardware (such as Intel SSDs) in the future (e.g. by evaluating a smaller sample before ordering big quantities and putting it in production).

The decision was made based on a cost / benefit analysis, and I believe this analysis is still correct.

GWicke lowered the priority of this task from Medium to Low.Feb 17 2016, 6:07 PM

This is all pretty disappointing — the lack of TRIM in our current setup is probably a major performance bottleneck given our utilization of those disks.

TRIM does not seem to be as critical to SSD performance as it used to be in the early days. We have not seen significant changes in the ratio of I/O throughput to iowait over the lifetime of these disks; iowait is generally on the order of 1-2%.

The ~30 MB/s average that these graphs show is ridiculously low, nowhere near enough to make a dent on any SSD — this is essentially spindle performance. So, yes, there is not much I/O wait on those servers, but that's because they're not doing much I/O (for the most part).

When they do though, this is what happens:

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda            7265.00  5407.20   51.00   32.00 33969.60 21504.00  1336.71   135.81 1508.60  604.67 2949.22  12.05 100.00
sdb            7265.20  5403.40   51.20   25.60 34208.00 17203.20  1338.83   147.55 1483.61  716.27 3018.31  13.02 100.00

(this is from just now, from restbase1008, which is md-syncing; 100% utilization with 30MB/s read/20MB/s write)
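(For context, that table is the extended per-device view from sysstat's iostat, presumably captured with something along the lines of:

# Extended per-device statistics, refreshed every 5 seconds
iostat -x 5
)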

These SSDs were a poor choice from the beginning; we should be more careful when deviating from well-known professional hardware (such as Intel SSDs) in the future (e.g. by evaluating a smaller sample before ordering big quantities and putting it in production).

The decision was made based on a cost / benefit analysis, and I believe this analysis is still correct.

Calling it an analysis is overstating what happened. There were both extra costs (that we've paid dearly since) and risks involved that were identified back then; neither was quantified, and both were subsequently ignored because of "cost reasons". The fact that no testing happened before procuring a bunch of those SSDs amplified those risks — and those concerns turned out to be well-founded and cost us extra, including the time you and I have spent on this year-old task investigating this issue.

It may well be that for cost/benefit tradeoffs we should pick cheaper products than what is considered best on the market — that's fine, and we do so regularly. What I'm saying is that when we do, we should at least test those products before we build a whole infrastructure on them (and of course, incorporate the cost of testing them into our analysis). Anything else is just cutting corners.

GWicke changed the task status from Open to Stalled.Oct 12 2016, 4:42 PM

As I mentioned above in my second-to-last update, they are blacklisted for queued TRIM, which is suboptimal of course. However, the data corruption issues with synchronous TRIM have long been resolved -- they already were back in 2016, and they certainly seem to be in the kernels we're running now.

I think synchronous TRIM would work and will probably make a huge difference. Mounting with discard will have that effect, but the problem with synchronous TRIM (AFAIK) is the potential for stalls on every delete, while you wait for the SSD to actually trim the data. Given our current state of performance, I doubt it can get worse, though :)

A perhaps better alternative would be a periodic (e.g. every hour via cron) invocation of fstrim --all.
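An alternative to cron would be the fstrim.timer that recent util-linux packages ship (weekly by default, though that is adjustable); assuming a systemd host, something like:

# Inspect what the packaged units would do (fstrim.service runs "fstrim --all")
systemctl cat fstrim.service fstrim.timer
# Enable and start the timer, then confirm the next scheduled run
sudo systemctl enable --now fstrim.timer
systemctl list-timers fstrim.timer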

As a next step, I'd recommend just running that (fstrim --all) on one of the systems once, and comparing performance before/after, either on a production system with real load or on a system being benchmarked. That very quick test would give us an indication of what kind of benefit we can expect and whether it makes sense to investigate this further.

As I mentioned above in my second-to-last update, they are blacklisted for queued TRIM, which is suboptimal of course. However, the data corruption issues with synchronous TRIM have long been resolved -- they already were back in 2016, and they certainly seem to be in the kernels we're running now.

I think synchronous TRIM would work and will probably make a huge difference. Mounting with discard will have that effect, but the problem with synchronous TRIM (AFAIK) is the potential for stalls on every delete, while you wait for the SSD to actually trim the data. Given our current state of performance, I doubt it can get worse, though :)

A perhaps better alternative would be a periodic (e.g. every hour via cron) invocation of fstrim --all.

As a next step, I'd recommend just running that (fstrim --all) on one of the systems once, and comparing performance before/after, either on a production system with real load or on a system being benchmarked. That very quick test would give us an indication of what kind of benefit we can expect and whether it makes sense to investigate this further.

The idea of testing this in production fills me with all sorts of dread, but perhaps you're right and a controlled test is worth the risk.

Mentioned in SAL (#wikimedia-operations) [2018-09-13T21:33:15Z] <urandom> running fstrim --all on restbase1007 -- T89584

Mentioned in SAL (#wikimedia-operations) [2018-09-13T21:35:09Z] <urandom> running fstrim --all on restbase1011 -- T89584

The first pass on restbase1007 running fstrim --all took several seconds to complete (a noteworthy delay, 10 seconds?) and exited 0. When run on restbase1011, the command completed almost instantaneously, which made me somewhat suspicious. I ran a second iteration on each host with --verbose added: restbase1007 reports bytes trimmed for each mount point, restbase1011 does not (it outputs nothing).

eevans@restbase1007:~$ sudo fstrim --verbose --all
/srv/cassandra/instance-data: 1.5 GiB (1577857024 bytes) trimmed
/srv/sda4: 1.8 GiB (1880838144 bytes) trimmed
/srv/sde4: 1.6 GiB (1749524480 bytes) trimmed
/srv/sdd4: 3.9 GiB (4202110976 bytes) trimmed
/srv/sdb4: 1.3 GiB (1385988096 bytes) trimmed
/srv/sdc4: 1.6 GiB (1671745536 bytes) trimmed
/: 161.1 MiB (168964096 bytes) trimmed
eevans@restbase1007:~$
NOTE: The values above (for bytes trimmed) probably aren't meaningful; this was a second iteration that followed the first by only a few minutes.

Mentioned in SAL (#wikimedia-operations) [2018-09-13T21:51:12Z] <urandom> running fstrim --all on restbase1008 -- T89584

Here is restbase1008 (first pass) using --verbose

eevans@restbase1008:~$ sudo fstrim --verbose --all
/srv/cassandra/instance-data: 24.6 GiB (26378080256 bytes) trimmed
/srv/sdb4: 553.2 GiB (593950797824 bytes) trimmed
/srv/sdd4: 507.6 GiB (545052655616 bytes) trimmed
/srv/sdc4: 559.7 GiB (600988450816 bytes) trimmed
/srv/sda4: 565.6 GiB (607246553088 bytes) trimmed
/srv/sde4: 525.4 GiB (564115984384 bytes) trimmed
/: 22.4 GiB (24090587136 bytes) trimmed
eevans@restbase1008:~$

Mentioned in SAL (#wikimedia-operations) [2018-09-13T21:55:57Z] <urandom> running fstrim --all on restbase1013 -- T89584

Here is 1007 after an hour (along with 1016 which is Intel-equipped, for comparison). It's not worse. :)

Screenshot from 2018-09-13 17-33-25.png (796×1 px, 84 KB)
fstrim --all @ 21:33

We should probably also try a couple of hosts in codfw.

Here it is again after a day; this is definitely something, though not enough to be a game-changer (granted, this is a pretty simplistic test).

Screenshot from 2018-09-14 16-18-49.png (794×1 px, 148 KB)
fstrim --all @ 2018-09-13T21:33

There is also the question of whether anything could be done for the HP machines with these devices; 17 out of 24 machines have the problematic SSDs, and 14 of those are HP (and AFAIK, all have the same controller).

Aklapper changed the task status from Stalled to Open.Nov 1 2020, 10:51 PM

The previous comments don't explain exactly who or what (another task?) this task is stalled on ("If a report is waiting for further input (e.g. from its reporter or a third party) and can currently not be acted on"). Hence resetting the task status, as tasks should not be stalled (and then potentially forgotten) for four years for unclear reasons.

(Smallprint, as general orientation for task management:
If you wanted to express that nobody is currently working on this task, then the assignee should be removed and/or priority could be lowered instead.
If work on this task is blocked by another task, then that other task should be added via Edit Related Tasks...Edit Subtasks.
If this task is stalled on an upstream project, then the Upstream tag should be added.
If this task requires info from the task reporter, then there should be instructions which info is needed.
If this task needs retesting, then the TestMe tag should be added.
If this task is out of scope and nobody should ever work on this, or nobody else managed to reproduce the situation described here, then it should have the "Declined" status.
If the task is valid but should not appear on some team's workboard, then the team project tag should be removed while the task has another active project tag.)

LSobanski claimed this task.
LSobanski subscribed.

Considering the age of this task, we're probably safe to close it. Please reopen if you think otherwise.