Expand total available capacity and redundancy of labs storage (cross-DC)
Description
Status | Assigned | Task
---|---|---
Resolved | yuvipanda | T105720 Labs team reliability goal for Q1 2015/16
Resolved | coren | T106479 Ensure that labstore machine is 'known good' hardware
Resolved | coren | T95293 Inspect and diagnose labstore1001's H800 controller
Declined | coren | T93589 Allow labstores to hot or warm swap in case of failure
Resolved | coren | T94609 Reinstall labstore1001 with Jessie
Declined | coren | T94607 Test labstore switchover
Resolved | None | T85604 Storage capacity & redundancy expansion (tracking)
Resolved | None | T85606 Replicate data between codfw and eqiad
Resolved | coren | T85605 Set storage service up in codfw
Resolved | coren | T93740 Upgrade labstore2001 to Jessie
Declined | yuvipanda | T85608 Process for user backups
Resolved | coren | T93792 Sync up the new labs NFS project filesystem with the live one
Resolved | coren | T85607 Increase storage available to labs NFS server
Resolved | coren | T91640 Upgrade labstore1002 to Jessie
Resolved | Cmjohnson | T91677 labstore1002 fails to enter PERC bios, hangs on detecting devices
Event Timeline
The new shelf has been added and configured. Actual expansion is pending on thin volumes, which in turn require a backport of a recent version of lvm2 (nearly complete); Precise has no working thin volume support.
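For context, a thin-provisioned LVM layout along these lines is what the lvm2 backport enables. This is only a sketch: the volume group, pool, and volume names here are invented, not the actual labstore configuration.

```shell
# Create a thin pool inside an existing volume group
# (names are illustrative; the real labstore VG/LV names differ).
lvcreate --type thin-pool -L 40T -n labpool labsvg

# Carve a thin volume out of the pool; blocks are only allocated
# as they are written, so the pool can be overprovisioned.
lvcreate --type thin -V 30T -n labsfs --thinpool labsvg/labpool

# Snapshots of thin volumes are cheap: metadata-only until the
# origin and the snapshot actually diverge.
lvcreate -s -n labsfs-snap labsvg/labsfs
```

These commands need root and real block devices, so they are shown for illustration rather than as a runnable recipe.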
(Upgrading labstore1001 [to Jessie] has been considered, but given the long downtime and the recent outages we decided against hitting the users again so soon)
Hardware is happily in place and visible to the OS.
After another discussion with @faidon that concluded:
<Coren> paravoid: So you think it's better to Jessie up 1002 and switch to that instead?
<paravoid> I think so, yes
That plan involves a brief downtime for the switchover (less than 10 minutes), but is future-proof.
Dumb question: IIRC there are two disk arrays each connected to two NFS servers? Switching between the NFS servers requires clients to remount everything (aka reboot)?
@scfc: No, the NFS fsids are the same and the actual service IP is floating, so no remount is required from the clients.
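To make the above concrete: because the exports carry explicit, matching `fsid=` values on both servers and the service IP floats between them, NFS file handles stay valid across a failover and clients never notice the switch. A hedged sketch of what such an export entry might look like (path, network, and fsid value are all illustrative):

```
# /etc/exports fragment -- identical on both labstore hosts.
# The fixed fsid means client file handles remain valid when the
# floating service IP moves to the standby server.
/srv/project  10.0.0.0/21(rw,sync,no_subtree_check,fsid=1234)
```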
The new filesystem is active and in place; the rsync is in progress (currently running at ionice idle class) but will take some time. I'll discuss giving it more bandwidth during the ops meeting.
So here is the current picture:
- The new filesystem on thin volumes is in place and contains a copy of the live filesystem, but rsync is unable to keep up with the rate of change, so some downtime is unavoidable for the actual switchover (a dry run at idle I/O priority takes ~20 hours!)
- The new filesystem properly does snapshotting; we have local backups available
- Replication of the latest snapshot to codfw works, but has performance issues until proper exclusions are put in place
- All of those processes are tested and fully working over the new filesystem (which is not the live one)
What's needed for this to be done:
- Upgrade labstore2001 to Jessie before we flip the switch (our last chance to do so)
- Schedule a downtime to do the final rsync between the old and new filesystems
  - Same as was planned in January: 24h during which the filesystem will be read-only
- Once the copy is done, swap out the mountpoints
- Flip the switch:
  - Puppet class to install the scripts and stuff them in crontabs
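The "scripts in crontabs" step might end up looking roughly like this on the puppetised host. The script names and schedule below are invented for illustration; the real ones are defined by the puppet class.

```
# /etc/cron.d/labstore-replication -- illustrative only.
# Take a local snapshot nightly, then ship the latest one to codfw.
0 2 * * * root /usr/local/sbin/storage-snapshot
0 4 * * * root /usr/local/sbin/storage-replicate labstore2001
```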
What we might want to do afterwards:
- Make the snapshots accessible to the end users? Not trivial; right now getting files off snapshots requires admin intervention. That is sufficient for now, but it won't scale.
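For a sense of why this currently needs an admin, recovering a file from a thin snapshot involves something like the following (purely a sketch; the names and mountpoint are hypothetical, and nothing like this is exposed to users today):

```shell
# Thin snapshots are created with activation skipped by default,
# so -K is needed to activate one explicitly.
lvchange -ay -K labsvg/labsfs-snap

# Mount it read-only somewhere an admin can copy files out of.
mount -o ro /dev/labsvg/labsfs-snap /mnt/snap-2015-03-30
```

Making this self-service would mean exposing such read-only mounts (or something equivalent) to users safely, which is the non-trivial part.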
Status update:
- labstore2001 upgraded
- copy done, mountpoint swap scheduled for today (Mar 30) 22h UTC
Todo:
- Finish review/tweaks of replication code and flip the switch
Is there documentation (with a procedure to follow) for the "cold spare" redundancy yet?
An extra 25%, approximately 18T of usable space. In addition, the cleanup required by the transition between filesystems managed to clean up another 5-6T of redundant backup leftovers from pmtpa.