As a short-term solution to the current reliability and performance problems with the Labs NFS storage, we should migrate the underlying RAID volumes from RAID6 to RAID10.
Description
Status | Subtype | Assigned | Task
--- | --- | --- | ---
Resolved | | yuvipanda | T105720 Labs team reliability goal for Q1 2015/16
Resolved | | coren | T106479 Ensure that labstore machine is 'known good' hardware
Resolved | | chasemp | T98183 labstore1002 issues while trying to reboot
Resolved | | chasemp | T101741 Locate and assign some MD1200 shelves for proper testing of labstore1002
Resolved | | coren | T96063 Migrate Labs NFS storage from RAID6 to RAID10
Resolved | | coren | T101011 Rsync live labstore filesystem to local eqiad copy
Resolved | | mark | T101010 Make a block-level copy of the codfw mirror of labstore1001 to eqiad
Event Timeline
RAID6 is a performance bottleneck, but it gives us 66% more effective storage than RAID10 would in the current configuration. That doesn't mean moving away from RAID6 is inconceivable, but doing so has a significant cost (as well as being a complicated operation).
Setting priority to Low since this lives in the nice-to-have domain, and any investigation into phasing NFS out (which is already in the long-term plans) would imply reconsidering the backing block devices anyway.
As we've seen, RAID6 has serious performance issues under write-heavy load, and isn't especially great at recovering from unclean shutdowns (though we could address the latter to some degree with write-intent bitmaps and kernel updates) or from disk failure and rebuild scenarios.
RAID6 is what we're using in this case, but the same basic arguments apply to all of RAID[456], which are all variations on the same theme of parity. Parity recalculation is expensive in terms of physical disk traffic and CPU load, and tends not to scale well. The reason everyone uses RAID6 instead of RAID5 these days mostly boils down to the math, and that same math also indicates that even RAID6 is slowly becoming unacceptable. Note this ACM paper: http://queue.acm.org/detail.cfm?id=1670144 (quicker/easier ZDNet summary here: http://www.zdnet.com/article/why-raid-6-stops-working-in-2019/ ). It was written in 2009, so we're now 6 years into the roughly 10-year window of remaining RAID6 practicality the author describes before we have to move to something based on triple parity, because the RAID6 math on capacity vs MTBF vs rebuild time keeps getting worse over time.
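To make that math concrete, here's a back-of-the-envelope sketch with assumed numbers (a 1-per-1e14-bits URE rating and 2 TB drives, roughly the scenario the ZDNet piece uses; these are not our actual hardware specs):

```
# Illustrative only; drive size and URE rate are assumptions, not measurements.
disk_tb=2                 # assumed drive size
surviving=11              # drives read in full to rebuild a 12-disk single-parity array
# 1 URE per 1e14 bits is roughly one unreadable sector per 12.5 TB read
ure_interval_tb=$(echo "10^14 / 8 / 10^12" | bc -l)
read_tb=$(echo "$disk_tb * $surviving" | bc)
echo "expected UREs during one rebuild: $(echo "$read_tb / $ure_interval_tb" | bc -l)"
# ~1.8 for single parity, i.e. a rebuild is more likely than not to hit an
# unreadable sector; RAID6 can absorb one, but the margin keeps shrinking as
# drives grow faster than their error rates improve.
```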
Regardless of the above, there are always price:perf:space tradeoffs between RAID6 and RAID10, but there are also tradeoffs involved in giving any fixed amount of space to the users in the first place. It would be nice if we had twice the capacity available today. If there were a rational reason we don't, it would be "why waste the money? Nobody's demanded it that hard yet." But if you did add it, users would find ways to expand into it. People tend to size their demands to availability in cases like these (because in truth, their hard need for space is far lower than their apparent usage of it), and I think at today's storage prices the tradeoff virtually always favors avoiding RAID6 and its associated tradeoffs, all of which are negative except for the "more space" bit.
All of the above might have different answers if we were using hardware-based enterprise-grade external RAID arrays with dedicated high-performance controllers and caches. But we're not (and I don't think I'd recommend that, either!). We're using Linux software RAID.
Even in the RAID10 world, there are similar looming issues for mirroring based on disk capacity + failure rate math. The short answers there are: write-intent bitmaps help in that case (as they do in every such case) to speed rebuild times, and using a larger count of smaller disks rather than a smaller count of larger disks still helps with those tradeoffs, without significant caveat, in the RAID10 world as well. You can also start putting hot spare drives in place (to reduce the total loss-of-redundancy window by taking human reaction + physical logistics out of it), and then later on in the global timeline of these factors, you really have to start considering triple mirroring for large datasets where periodic backups and hot spares are insufficient protection.
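For reference, both of those mitigations are one-liners with mdadm. These are generic examples (array and device names are placeholders), not commands that have been run on labstore:

```
# Add an internal write-intent bitmap to an existing array so a resync after
# an unclean shutdown only touches dirty regions (placeholder array name).
mdadm --grow --bitmap=internal /dev/md/slice1

# Add a hot spare so a rebuild starts immediately on member failure,
# without waiting for a human to swap a drive (/dev/sdx is a placeholder).
mdadm --add /dev/md/slice1 /dev/sdx
```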
So right now, we have five shelves of disks, and
Filesystem             Size  Used Avail Use% Mounted on
/dev/mapper/store-now   40T   11T   30T  27% /srv/project
So about 72% free. What's preventing us from moving to RAID10 and taking the hit on lost space? I guess it'll be a slow operation involving a lot of juggling, but @BBlack makes a convincing case that this is the way forward (we've already had disk corruption from unclean shutdowns).
With the current stability and performance problems of NFS with RAID6, this is definitely not a "nice to have" but something that needs to be fixed ASAP. Storage capacity is far, FAR less important than stability at this point.
Given that the plans that necessitated the overprovisioning (a separate Labs in codfw) no longer exist, we do indeed have sufficient space to use RAID10 rather than RAID6 (60T suffices for primary storage and snapshots; the extra 30T was only needed if we had cross-DC replicated storage).
Migrating the logical volumes between RAID6 and RAID10 will be a several-step manoeuvre, as we need to successively evacuate the individual raid devices (which are 1:1 with physical shelves), rebuild each array as RAID10, and move the volume from pv subset to pv subset. This can be done while the filesystem is live, but it is both a delicate operation and very I/O intensive (on the plus side, writes to the new pv will not be amplified by the RAID6 setup).
Because the number of physical extents (pe) actually available will diminish after each successive conversion, the number of operations will likely need to increase (think Towers of Hanoi).
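For illustration, the per-shelf conversion described above boils down to something like the following. This is only a sketch: the VG is "store" as elsewhere in this task, but the member device list and options are placeholders, and each step depends on the free extents available at that point:

```
pvmove /dev/md/slice5                  # evacuate every allocated extent to the remaining PVs
vgreduce store /dev/md/slice5          # drop the now-empty PV from the volume group
pvremove /dev/md/slice5
mdadm --stop /dev/md/slice5            # tear down the old RAID6 array
# rebuild the drives as RAID10 (member list is a placeholder)
mdadm --create /dev/md/slice51 --level=10 --raid-devices=12 /dev/sd[a-l]
pvcreate /dev/md/slice51
vgextend store /dev/md/slice51         # the new PV becomes the target of the next pvmove
```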
@coren: Where can I see the mapping of raid array (md125 etc) to shelf? Is this documented?
@mark: It's in the slides (https://commons.wikimedia.org/wiki/File:WMF_Labs_storage_presentation.pdf) but also ridiculously straightforward: shelves are mapped 1:1 to /dev/md/slice[1-5].
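If you'd rather verify the mapping on the machine itself than in the slides, something like this should do it (inspection only):

```
mdadm --detail /dev/md/slice1      # lists the member drives of that slice
ls -l /dev/disk/by-path/           # ties each sdX back to its SAS enclosure and slot
```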
A note: while it will probably increase the amount of necessary juggling, the entire setup would be immensely improved with RAID10 if - rather than one shelf per array - we put 6+6 drives per array on /consecutive/ shelves (so 1-2, 2-3, 3-4, 4-5, 5-1). On average this will not generally improve disk bandwidth, but it will make the arrays impervious to briefly losing a shelf.
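A hedged sketch of what one such cross-shelf array could look like, assuming the drives show up under predictable by-path names (the paths below are placeholders): with the default near=2 layout, adjacent devices in the list form mirror pairs, so interleaving the two shelves is what lets the array survive losing either shelf.

```
# 6 drives from the end of shelf 5 interleaved with 6 from the start of shelf 1
mdadm --create /dev/md/slice51 --level=10 --raid-devices=12 \
    /dev/disk/by-path/SHELF5-slot7  /dev/disk/by-path/SHELF1-slot1 \
    /dev/disk/by-path/SHELF5-slot8  /dev/disk/by-path/SHELF1-slot2 \
    /dev/disk/by-path/SHELF5-slot9  /dev/disk/by-path/SHELF1-slot3 \
    /dev/disk/by-path/SHELF5-slot10 /dev/disk/by-path/SHELF1-slot4 \
    /dev/disk/by-path/SHELF5-slot11 /dev/disk/by-path/SHELF1-slot5 \
    /dev/disk/by-path/SHELF5-slot12 /dev/disk/by-path/SHELF1-slot6
```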
Can you work out a plan, and list all the individual steps (ideally with command line invocations) on this ticket?
The plan is to gradually evacuate the PVs on the individual raid arrays with pvmove, reconfigure the freed raid arrays as RAID10, and recreate PVs on the new arrays to be used as destinations for further moves. slice5 is currently mostly empty already, and slice1 is currently allocated only to the (now obsolete) pre-thin-volumes safety copy, store/project. So:
- drop the store/project volume
- remove slice1 pv
- pvmove the contents of slice5 away
- drop slice5 pv
- create the new 'slice51' array across the end of slice5/start of slice1
- pvmove what is left on slice2 away (most was the rest of store/project, the rest fits)
- drop slice2
- create 'slice12' (end of slice1, start of slice2)
At this point, we don't have quite enough room for removing another pv with the volume sizes as they are. We can safely shave approximately 12T off the thin pool without risking running out of space - freeing most of slice4
- pvmove slice4 away (slice3 is the busiest, keeping for last)
- drop slice4
- create 'slice45'
- pvmove slice3 away (this leaves us with roughly 4T to spare)
- drop slice3
- create slice23 and slice34
- restore the thin pool size
A further optimization may be possible at that point by moving store/journal to live on the least-used slice.
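Since every step above hinges on how many physical extents are free on each PV, the juggling can be sanity-checked before each move with something like:

```
# free vs. allocated extents per PV, and free extents across the whole VG
pvs -o pv_name,pv_size,pv_free,pv_pe_count,pv_pe_alloc_count
vgs -o vg_name,vg_extent_count,vg_free_count store
```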
An alternative plan, based on input from @mark, that front-loads the thin pool move so the performance improvement comes earlier, at the cost of a bit of extra juggling (because it's harder to free the second slice in that scenario):
- drop the store/project volume
- remove slice1 pv
- pvmove the contents of slice5 away (into the free room on slice2)
- drop slice5 pv
- create slice51
- reduce the store/space thin pool to 20T
- pvmove the tail end of store/space from slice4 to slice51
- drop slice4 pv
- create slice45
- pvmove tail end of slice3 to what is left of slice51
- pvmove rest of slice3 to slice45
- drop slice3 pv
- create slice34
- pvmove remainder of slice2 to slice34
- drop slice2 pv
- create slice12 and slice23
- restore thin pool size
This seems reasonable yes. Let's move ahead with this after the new backups have finished (codfw + eqiad).
Starting this, gingerly, for the parts which are not on (the currently overloaded) slice3
Process halted for now (mid step 3): the temporary backup snapshot made for T101011 lives on slice5 and, because it is a live snapshot, cannot be moved.
This can be resumed as soon as the snapshot is dropped at the conclusion of the rsync (this is the last lv on that pv)
The backup is complete, and the snapshot has been removed - freeing slice5 completely.
Resuming now at step 4.
Problem in step 6:
# lvreduce -L 22T store/space
  Thin pool volumes cannot be reduced in size yet.
Where "yet" means "ever" because CLOSED WONTFIX upstream,
Recalculating the necessary volume moves; this remains doable but may now require intermediate steps and a temporary pv.
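One thing worth checking while recalculating, since the pool can't be shrunk: how full the thin pool actually is. This is a read-only check using standard lvs fields:

```
# data and metadata usage of the thin pool and its thin volumes
lvs -o lv_name,lv_size,data_percent,metadata_percent store
```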
Now needs to be:
- pvmove journal, keys, tmeta and the tail end of space from slice2 to slice51 [1189581 free pe on slice51]
- remove slice2 pv
- create slice12 pv [4050187 pe available for 51+12]
- add dumps pv (from the internal controller) [2383838 extra pe]
- pvmove first 2383838 extents from slice3 to dumps
- pvmove remaining 2383839 extents from slice3 to slice12 [476767 free left]
- remove slice3 pv
- create slice23 pv
- pvmove slice4: 1189581->slice51, 476767->slice12, 2860606->slice23 [240723 left]
- create pv from free "half" of ex-slice5 [1430303 pe]
- pvmove 240723 remaining of slice4 to that temp pv
- remove slice4
- create slice34
- pvmove temp pv to slice34
- remove temp pv
- create slice45
- pvmove dumps pv to slice34, slice45
Phew.
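For the extent-counted moves above, pvmove accepts explicit physical-extent ranges on the source PV, so each step can be expressed roughly like this (the device name for the dumps PV is a placeholder):

```
# move the first 2383838 extents of slice3 onto the dumps PV (PE ranges are inclusive)
pvmove --verbose /dev/md/slice3:0-2383837 /dev/dumps_pv_placeholder
# then move whatever remains on slice3 to slice12
pvmove --verbose /dev/md/slice3 /dev/md/slice12
```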
journal, keys and tmeta migrated to slice51
After some wrangling with block I/O sizes (thin LVM volumes refuse block transactions bigger than the page size), the move of space from slice2 to slice51 is now in progress, at a glacial pace.
(examination of the lvm mirror code used to copy the data shows that it serializes reads and writes with barriers, so the per-block latency is very high)
There is a screen running on labstore1001 with the pvmove, and ongoing status.
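For anyone else watching, progress can also be followed outside the screen session via the copy percentage lvs reports for the temporary pvmove volume (read-only, safe to run anytime):

```
# the internal pvmoveN mirror LV shows up with -a; copy_percent is the progress
watch -n 60 "lvs -a -o lv_name,copy_percent store"
```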
I think this was largely because the underlying new RAID10 md device was also doing a resync, at a glacial pace of 1000 KB/s as well. I attempted pausing it, but it automatically restarted. It seems very suboptimal to have both running at once, so I've aborted the pvmove instead, using:
pvmove --abort
And the resync is now going much faster. After that completes, we can restart the pvmove using:
pvmove --verbose /dev/md124 /dev/md122
This did speed up the resync a lot, but it was still going to take 1-2 days. Therefore I've restarted the pvmove, which I think allows the entire migration to finish sooner and is better overall. But before we take any steps after this first pvmove, let's reevaluate whether we want to wait for the resync to finish. If the next steps don't affect this raid array, they can happen in parallel.
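For future reference when weighing resync vs. pvmove, md's resync throttles are tunable at runtime. These are the standard knobs (the md122 path assumes the new RAID10 is the same device as in the pvmove above):

```
cat /proc/mdstat                                           # current resync state and speed
sysctl dev.raid.speed_limit_min dev.raid.speed_limit_max   # global throttles, in KB/s
echo 200000 > /sys/block/md122/md/sync_speed_max           # per-array override (KB/s)
```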
The filesystem crash caused us to... improvise around this plan a great deal. All but one project has been switched to a restored backup on RAID10 - what remains to do now is to clean up the rest of the shelves, once all that can be salvaged has been, and convert them to RAID10.