Migrate Labs NFS storage from RAID6 to RAID10
Closed, Resolved (Public)

Description

To have a short-term solution to address the current reliability and performance problems with the Labs NFS storage, we should migrate the underlying RAID volumes from RAID6 to RAID10.

Event Timeline

yuvipanda raised the priority of this task to Needs Triage.
yuvipanda updated the task description.
yuvipanda added subscribers: yuvipanda, coren, BBlack.
coren triaged this task as Low priority. Apr 14 2015, 7:38 PM

Raid 6 is a performance bottleneck but gives us 66% more effective storage than raid10 would in the current configuration. It doesn't mean that moving away from raid6 is inconceivable, but that doing so has a significant cost (as well as being a complicated operation).

Setting priority to Low since this lives in the nice-to-have domain, and any investigation into phasing NFS out (which is already in the long-term plans) would imply reconsidering the backing block devices anyway.

As we've seen, RAID6 has serious performance issues under write-heavy load, and isn't especially great at recovering from unclean shutdowns or from disk failure->recovery scenarios (though we could address the unclean-shutdown case to some degree with write-intent bitmaps and kernel updates).
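
For reference, a write-intent bitmap can be added to an existing md array without rebuilding it; a minimal sketch, with /dev/md/slice1 as an illustrative device name:

# add an internal write-intent bitmap so a post-crash resync only rewrites dirty regions
mdadm --grow --bitmap=internal /dev/md/slice1
# confirm it took effect ("Intent Bitmap : Internal" in the output)
mdadm --detail /dev/md/slice1 | grep -i bitmap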

RAID6 is what we're using in this case, but the same basic arguments apply to all of RAID[456], which are all variations on the same parity theme. Parity recalculation is expensive in terms of physical disk traffic and CPU load, and tends not to scale well. The reason everyone uses RAID6 instead of RAID5 these days mostly boils down to the math, and that same math also indicates that even RAID6 is slowly becoming unacceptable. Note this ACM paper: http://queue.acm.org/detail.cfm?id=1670144 (quicker/easier zdnet summary here: http://www.zdnet.com/article/why-raid-6-stops-working-in-2019/ ). It was written in 2009, so we're now 6 years into the 10-year window of remaining RAID6 practicality the author describes, after which we'd have to move to something based on triple parity, because the RAID6 math on capacity vs. MTBF vs. rebuild time keeps getting worse over time.
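
To make the rebuild math concrete with rough, illustrative numbers (not our exact hardware): rebuilding a 12-disk RAID6 shelf of 2 TB drives after a double failure means reading the 10 surviving disks end to end, about 20 TB ≈ 1.6 × 10^14 bits. At a typical consumer-class unrecoverable read error rate of 1 per 10^14 bits, the expected number of read errors during that rebuild is about 1.6, so hitting at least one (with no remaining redundancy left to repair it) is more likely than not, and the exposure only grows as drives get bigger while error rates stay flat.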

Regardless of the above, there are always price:perf:space tradeoffs between RAID6 and RAID10, but there are also tradeoffs involved in giving any fixed amount of space to the users in the first place. It would be nice if we had twice the available capacity we do today; the only rational reason we don't is "why waste the money? Nobody's demanded it that hard yet". But if you did add it, users would find ways to expand into it. People tend to size their demands to availability in cases like these (because in truth, their hard need for space is way lower than their apparent usage of it), and I think at today's storage prices the tradeoff virtually always favors avoiding RAID6 and its associated tradeoffs, all of which are negative except for the "more space" bit.

All of the above might have different answers if we were using hardware-based enterprise-grade external RAID arrays with dedicated high-performance controllers and caches. But we're not (and I don't think I'd recommend that, either!). We're using Linux software RAID.

Even in the RAID10 world, there are similar looming issues for mirroring based on disk capacity + failure rate math. The short answers there are: write-intent bitmaps help in that case (as they do in every such case) to speed rebuild times, and using a larger count of smaller disks rather than a smaller count of larger disks still helps with those tradeoffs, without significant caveats, in the RAID10 world as well. You can also put hot spare drives in place (to reduce the total loss-of-redundancy window by taking human reaction time and physical swap time out of it), and later on in the global timeline of these factors you really have to start considering triple mirroring for large datasets where periodic backups and hot spares are insufficient protection.
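
Adding a hot spare to an md array is a one-liner; a minimal sketch, with /dev/md/slice1 and /dev/sdz as illustrative names:

# on a healthy array, --add attaches the device as a spare; rebuild onto it starts automatically on the next member failure
mdadm --add /dev/md/slice1 /dev/sdz
# the device should now show up with a "spare" role
mdadm --detail /dev/md/slice1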

yuvipanda raised the priority of this task from Low to Medium. May 5 2015, 11:00 PM

So right now, we have five shelves of disks, and

/dev/mapper/store-now    40T   11T   30T  27% /srv/project

So about 72% free. What's preventing us from moving to RAID10 and taking the hit on space lost? I guess it'll be a slow operation involving a lot of juggling, but @BBlack makes a convincing case that this is the way to move forward (we've had disk corruption with unclean shutdown already).

mark raised the priority of this task from Medium to High. May 7 2015, 9:33 AM

With the current stability and performance problems of NFS with RAID6, this is definitely not a "nice to have" but something that needs to be fixed ASAP. Storage capacity is far, FAR less important than stability at this point.

Given that the plans that necessitated the overprovisioning (a separate Labs in codfw) no longer exist, we do indeed have sufficient space to use raid10 rather than raid6 (60T suffices for primary storage and snapshots; the extra 30T was only needed if we had cross-DC replicated storage).

Migrating the logical volumes between raid6 and raid10 will be a several-step manoeuvre, as we need to successively evacuate the individual raid devices (which are 1:1 with physical shelves), rebuild each array as raid10, and move the volume from pv subset to pv subset. This can be done while the filesystem is live, but it is both a delicate operation and very I/O intensive (on the plus side, the writes to the new pv will not be amplified by the raid6 setup).

Because the number of actually available pe (physical extents) will diminish after each successive conversion, the number of operations will likely need to increase (think Towers of Hanoi).
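
In rough terms, each evacuate-and-rebuild cycle on a shelf would look something like this (the VG name 'store' matches the volumes above; the member-disk list is purely illustrative):

pvmove /dev/md/slice1                  # evacuate all allocated extents off the shelf's pv
vgreduce store /dev/md/slice1          # remove the now-empty pv from the volume group
pvremove /dev/md/slice1                # wipe the lvm label
mdadm --stop /dev/md/slice1            # tear down the raid6 array
mdadm --create /dev/md/slice1 --level=10 --raid-devices=12 /dev/sd[b-m]   # rebuild the same disks as raid10
pvcreate /dev/md/slice1
vgextend store /dev/md/slice1          # hand the raid10 pv back as a target for the next pvmove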

@coren: Where can I see the mapping of raid array (md125 etc) to shelf? Is this documented?

mark renamed this task from Investigate ways of getting off raid6 for labs store to Migrate Labs NFS storage from RAID6 to RAID10. May 8 2015, 8:34 AM
mark updated the task description.

@mark: It's in the slides (https://commons.wikimedia.org/wiki/File:WMF_Labs_storage_presentation.pdf) but also ridiculously straightforward: shelves are mapped 1:1 to /dev/md/slice[1-5].

A note: while it will probably increase the amount of necessary juggling, the entire setup would be immensely improved with raid10 if, rather than one shelf per array, we put 6+6 drives per array on /consecutive/ shelves (so 1-2, 2-3, 3-4, 4-5, 5-1). On average this won't improve disk bandwidth, but it will make the arrays impervious to briefly losing a shelf.
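
A sketch of what one such shelf-spanning array might look like, assuming (hypothetically) that sdb-sdg sit on one shelf and sdh-sdm on the next; with md raid10's default near=2 layout, consecutive devices in the list form mirror pairs, so interleaving the two shelves puts every mirror pair across both:

mdadm --create /dev/md/slice12 --level=10 --layout=n2 --raid-devices=12 \
    /dev/sdb /dev/sdh /dev/sdc /dev/sdi /dev/sdd /dev/sdj \
    /dev/sde /dev/sdk /dev/sdf /dev/sdl /dev/sdg /dev/sdm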

Can you work out a plan, and list all the individual steps (ideally with command line invocations) on this ticket?

The plan is to gradually evacuate the pv on individual raid arrays with pvmove, reconfigure the freed raid arrays with raid 10, and recreate pv on the new arrays to be used as destination for further moves. slice5 is currently mostly empty already, and slice1 is currently allocated only to the (now obsolete) safety copy (pre-thin volumes) store/project. So:

  1. drop the store/project volume
  2. remove slice1 pv
  3. pvmove the contents of slice5 away
  4. drop slice5 pv
  5. create the new 'slice51' array across the end of slice5/start of slice1
  6. pvmove what is left on slice2 away (most was the rest of store/project, the rest fits)
  7. drop slice2
  8. create 'slice12' (end of slice1, start of slice2)

At this point, we don't have quite enough room for removing another pv with the volume sizes as they are. We can safely shave approximately 12T off the thin pool without risking running out of space - freeing most of slice4

  1. pvmove slice4 away (slice3 is the busiest, keeping for last)
  2. drop slice4
  3. create 'slice45'
  4. pvmove slice3 away (this leaves us with roughly 4T to spare)
  5. drop slice3
  6. create slice23 and slice34
  7. restore the thin pool size

A further optimization may be possible at that point by moving store/journal to live on the least-used slice.

An alternative plan, based on input from @mark, that front-loads the thin pool move to deliver the performance improvement earlier. It needs a bit of extra juggling (because it's a bit harder to free the second slice in that scenario):

  1. drop the store/project volume
  2. remove slice1 pv
  3. pvmove the contents of slice5 away to free room on slice2
  4. drop slice5 pv
  5. create slice51
  6. reduce the store/space thin pool to 20T
  7. pvmove the tail end of store/space from slice4 to slice51 (see the PE-range sketch after this list)
  8. drop slice4 pv
  9. create slice45
  10. pvmove tail end of slice3 to what is left of slice51
  11. pvmove rest of slice3 to slice45
  12. drop slice3 pv
  13. create slice34
  14. pvmove remainder of slice2 to slice34
  15. drop slice2 pv
  16. create slice12 and slice23
  17. restore thin pool size
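
For the "tail end" moves (steps 7, 10 and 11), pvmove accepts a physical-extent range on the source pv, so only part of a slice needs to be relocated; a sketch with made-up extent numbers:

# move only extents 2000000-2383838 of slice4 to slice51, leaving everything else in place
pvmove /dev/md/slice4:2000000-2383838 /dev/md/slice51
# a whole-pv move needs no range
pvmove /dev/md/slice3 /dev/md/slice45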

This seems reasonable yes. Let's move ahead with this after the new backups have finished (codfw + eqiad).

mark moved this task from Doing to To Do on the Labs-Sprint-100 board.
coren moved this task from To Do to Doing on the Labs-Sprint-101 board.

Starting this, gingerly, for the parts which are not on (the currently overloaded) slice3

coren changed the task status from Open to Stalled. Jun 8 2015, 3:21 PM

Process halted for now (mid step 3): the temporary backup snapshot made for T101011 lives on slice5 and, because it is a live snapshot, cannot be moved.

This can be resumed as soon as the snapshot is dropped at the conclusion of the rsync (this is the last lv on that pv)

The backup is complete, and the snapshot has been removed - freeing slice5 completely.

Resuming now at step 4.

Problem in step 6:

# lvreduce -L 22T store/space
  Thin pool volumes cannot be reduced in size yet.

Where "yet" means "ever", because this was closed WONTFIX upstream.

Recalculating the necessary volume moves; this remains doable but may now require intermediate steps and a temporary pv.

Now needs to be (a way to sanity-check the PE counts is sketched after the list):

  1. pvmove journal, keys, tmeta and the tail end of space from slice2 to slice51 [1189581 free pe on slice51]
  2. remove slice2 pv
  3. create slice12 pv [4050187 pe available for 51+12]
  4. add dumps pv (from the internal controller) [2383838 extra pe]
  5. pvmove first 2383838 extents from slice3 to dumps
  6. pvmove remaining 2383839 extents from slice3 to slice12 [476767 free left]
  7. remove slice3 pv
  8. create slice23 pv
  9. pvmove slice4: 1189581->slice51, 476767->slice12, 2860606->slice23 [240723 left]
  10. create pv from free "half" of ex-slice5 [1430303 pe]
  11. pvmove 240723 remaining of slice4 to that temp pv
  12. remove slice4
  13. create slice34
  14. pvmove temp pv to slice34
  15. remove temp pv
  16. create slice45
  17. pvmove dumps pv to slice34, slice45

Phew.
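
The free/allocated pe numbers in brackets above can be sanity-checked from the lvm side; a minimal sketch:

# per-pv totals: extent count, allocated extents, free space
pvs -o pv_name,vg_name,pv_pe_count,pv_pe_alloc_count,pv_free
# which lv segments sit on a given pv (useful before deciding what to pvmove)
pvdisplay -m /dev/md/slice51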

journal, keys and tmeta migrated to slice51

After some wrangling with block IO sizes (thin lvm volumes refuse block transactions bigger than the page size), the move of space from slice2 to slice51 is now in progress, at a glacial pace.

(examination of the lvm mirror code used to copy the data shows that it serializes reads and writes with barriers, so the per-block latency is very high)

There is a screen running on labstore1001 with the pvmove, and ongoing status.
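
For anyone following along outside that screen session, progress can also be read from lvm itself; a sketch:

# the temporary pvmove mirror shows up as a hidden lv; Cpy%Sync is the copy progress
watch -n 60 'lvs -a -o lv_name,copy_percent,devices store'
# raw device-mapper view of the same mirror
dmsetup status | grep -i pvmove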

I think this was largely because the underlying new RAID10 md device was also doing a resync, at a glacial pace of 1000 KB/s as well. I attempted pausing it, but it automatically restarted. It seems very suboptimal to have both running at once, so I've aborted the pvmove instead, using:

pvmove --abort

And the resync is now going much faster. After that completes, we can restart the pvmove using:

pvmove --verbose /dev/md124 /dev/md122

This did speed up the resync a lot, but it was still going to take 1-2 days. Therefore I've restarted the pvmove, which I think lets the entire migration finish sooner and is better overall. But before we take any steps after this first pvmove, let's reevaluate whether we want to wait for the resync to finish or not. If the next steps don't touch this raid array, they can happen in parallel.
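
One way to let the two coexist without fighting is to throttle the md resync while the pvmove runs; a hedged sketch, with md122 as the new raid10 device per the command above and the limits purely illustrative:

# cap background resync bandwidth so pvmove I/O wins (values in KB/s)
echo 5000 > /sys/block/md122/md/sync_speed_max
echo 1000 > /sys/block/md122/md/sync_speed_min
# once the pvmove is done, fall back to the system-wide defaults
echo system > /sys/block/md122/md/sync_speed_max
echo system > /sys/block/md122/md/sync_speed_min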

The filesystem crash caused us to... improvise around this plan a great deal. All projects but one have been switched to a restored backup on raid10; what remains now is to clean up the rest of the shelves, once everything that can be salvaged has been, and rebuild them as raid10.