
Storage capacity & redundancy expansion (tracking)
Closed, Resolved · Public

Description

Expand total available capacity and redundancy of labs storage (cross-DC)

Event Timeline

coren created this task. Dec 31 2014, 2:41 PM
coren raised the priority of this task to Needs Triage.
coren updated the task description. (Show Details)
coren added a project: Cloud-Services.
coren added a subscriber: coren.
coren moved this task from Triage to Stalled on the Cloud-Services board. Feb 10 2015, 9:22 PM
coren renamed this task from Storage capacity & redundancy expansion to Storage capacity & redundancy expansion (tracking). Feb 10 2015, 9:25 PM
coren triaged this task as Normal priority.
coren set Security to None.
coren moved this task from Stalled to Tracking on the Cloud-Services board.
coren added a comment. Mar 2 2015, 7:40 PM

The new shelf has been added and configured. The actual expansion is pending on thin volumes, which in turn requires a backport of a recent version of lvm2 (nearly complete); Precise has no working thin volume support.

(Upgrading labstore1001 [to Jessie] has been considered, but given the long downtime and the recent outages we decided against hitting the users again so soon)
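
As a point of reference, a minimal sketch of the kind of thin-volume layout being discussed, purely illustrative: the volume group, pool and volume names and sizes below are made up and are not the actual labstore configuration. Creating thin pools needs the backported lvm2 mentioned above; stock Precise lvm2 cannot do it.

  # Create a thin pool inside a (hypothetical) "labstore" volume group,
  # using space from the newly added shelf:
  lvcreate --type thin-pool --size 30T --name space labstore

  # Carve a thin volume out of the pool; its virtual size may exceed the
  # pool's physical size, since blocks are only allocated on write:
  lvcreate --thin --virtualsize 40T --name store labstore/space

  mkfs.ext4 /dev/labstore/store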

coren added a subscriber: faidon. Mar 5 2015, 2:20 PM

Hardware is happily in place and visible to the OS.

After another discussion with @faidon that concluded:
<Coren> paravoid: So you think it's better to Jessie up 1002 and switch to that instead?
<paravoid> I think so, yes

That plan involves a brief downtime for the switchover (less than 10 minutes), but is future-proof.

scfc added a subscriber: scfc. Mar 5 2015, 3:38 PM

Dumb question: IIRC there are two disk arrays each connected to two NFS servers? Switching between the NFS servers requires clients to remount everything (aka reboot)?

Change 194537 had a related patch set uploaded (by coren):
labstore1002 to Jessie

https://gerrit.wikimedia.org/r/194537

coren added a comment. Mar 5 2015, 3:56 PM

@scfc: No, the NFS fsids are the same and the actual service IP is floating, so no remount is required from the clients.
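
To make the "no remount needed" point concrete, a hedged sketch; the paths, network range, and hostname are placeholders, not the real labstore exports. Because both servers export the same filesystems with the same fixed fsid values, the NFS filehandles held by clients stay valid, and because clients mount via a floating service IP rather than a specific server, that IP simply moves during a switchover.

  # /etc/exports kept identical on both NFS servers (placeholder values):
  /srv/project  10.0.0.0/8(rw,no_subtree_check,fsid=1)
  /srv/tools    10.0.0.0/8(rw,no_subtree_check,fsid=2)

  # Clients mount against the floating service address, never a specific
  # server, e.g.:
  #   mount -t nfs nfs-service.example:/srv/tools /mnt/tools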

coren added a comment. Mar 9 2015, 1:11 PM

The new filesystem is active and in place; the rsync is in progress (currently running at ionice idle priority) but will take some time. I'll discuss giving it more bandwidth during the ops meeting.
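
For context, a low-priority copy of that sort would look roughly like the following; the source and destination paths are placeholders, not the actual mounts:

  # Run the bulk copy in the idle I/O scheduling class (-c 3) so it only
  # uses disk bandwidth the NFS clients are not using:
  ionice -c 3 nice -n 19 \
      rsync -aHAX --delete /srv/old-volume/ /srv/new-volume/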

So here is the current picture:

  • The new filesystem on thin volumes is in place and contains a copy of the live filesystem, but rsync is unable to keep up with the rate of change, so some downtime is unavoidable for the switchover itself (a dry run at idle ionice priority takes ~20 hours!)
  • The new filesystem supports snapshotting properly; we have local backups available
  • Replication of the latest snapshot to codfw works, but has performance issues until proper exclusions are put in place (see the sketch after this list)
  • All of those processes are tested and fully working on the new filesystem (which is not yet the live one)
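
A hedged sketch of what the snapshot-and-replicate cycle in the items above looks like; the volume, snapshot, mountpoint and destination names are invented for illustration:

  # Take a thin snapshot of the (hypothetical) live volume; thin snapshots
  # need no preallocated size and are skipped at activation by default:
  lvcreate --snapshot --name store-snap labstore/store
  lvchange -ay -K labstore/store-snap

  # Mount it read-only and ship it to the codfw peer, excluding
  # churn-heavy paths of the sort causing the performance issues:
  mkdir -p /mnt/snap
  mount -o ro /dev/labstore/store-snap /mnt/snap
  rsync -aHAX --delete \
      --exclude='*/tmp/' --exclude='*/.cache/' \
      /mnt/snap/ labstore2001:/srv/replica/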

What's needed for this to be done:

  • Upgrade labstore2001 to Jessie before we flip the switch (our last chance to do so)
  • Schedule a downtime to do the final rsync between the old and new filesystems
    • Same as was planned in January: 24h during which the filesystem will be read-only
    • Once the copy is done, swap out the mountpoints (see the sketch after this list)
  • Flip the switch:
    • Puppet class to install the scripts and stuff them into crontabs
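
A very rough, purely illustrative outline of that read-only switchover window, with placeholder paths and device names:

  # 1. Remount the old filesystem read-only so the final delta stops growing:
  mount -o remount,ro /srv/old

  # 2. Final catch-up rsync, much smaller than the initial full copy:
  rsync -aHAX --delete /srv/old/ /srv/new/

  # 3. Swap the mountpoints so the exported path now serves the new
  #    thin-volume filesystem (the fsid is preserved, so client filehandles
  #    stay valid), then re-export:
  exportfs -ua
  umount /srv/new /srv/old
  mount /dev/labstore/store /srv/old
  exportfs -ra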

What we might want to do afterwards:

  • Make the snapshots accessible to the end users? Not trivial; right now getting files off snapshots requires admin intervention. Sufficient for now, but won't scale.

coren added a comment. Mar 30 2015, 5:49 PM

Status update:

  • labstore2001 upgraded
  • copy done, mountpoint swap scheduled for today (Mar 30) 22h UTC

Todo:

  • Finish review/tweaks of replication code and flip the switch

mark added a subscriber: mark. Mar 31 2015, 3:52 PM

Is there documentation (with a procedure to follow) for the "cold spare" redundancy yet?

mark added a comment. Mar 31 2015, 3:55 PM

How much additional space (storage expansion) has been made available by this?

mark added a comment. Apr 7 2015, 10:42 AM

@coren: see questions above, thanks!

coren added a comment. Apr 7 2015, 1:12 PM

> How much additional space (storage expansion) has been made available by this?

An extra 25%, approximately 18T of usable space. In addition, the cleanup required by the transition between filesystems freed another 5-6T of redundant backup leftovers from pmtpa.
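
(For scale, assuming the 25% is measured against the pre-expansion pool: 18T / 0.25 ≈ 72T of usable space before, so roughly 90T after the expansion, before counting the 5-6T reclaimed from the pmtpa leftovers.)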

chasemp closed this task as Resolved. Mar 2 2016, 11:30 PM
chasemp added a subscriber: chasemp.

I am resolving this for now; it will be reviewed as part of T85604.