Expand total available capacity and redundancy of labs storage (cross-DC)
|Resolved||yuvipanda||T105720 Labs team reliability goal for Q1 2015/16|
|Resolved||coren||T106479 Ensure that labstore machine is 'known good' hardware|
|Resolved||coren||T95293 Inspect and diagnose labstore1001's H800 controler|
|Declined||coren||T93589 Allow labstores to hot or warm swap in case of failure|
|Resolved||coren||T94609 Reinstall labstore1001 with Jessie|
|Declined||coren||T94607 Test labstore switchover|
|Resolved||None||T85604 Storage capacity & redundancy expansion (tracking)|
|Resolved||None||T85606 Replicate data between codfw and eqiad|
|Resolved||coren||T85605 Set storage service up in codfw|
|Resolved||coren||T93740 Upgrade labstore2001 to Jessie|
|Declined||yuvipanda||T85608 Process for user backups|
|Resolved||coren||T93792 Sync up the new labs NFS project filesystem with the live one|
|Resolved||coren||T85607 Increase storage available to labs NFS server|
|Resolved||coren||T91640 Upgrade labstore1002 to Jessie|
|Resolved||Cmjohnson||T91677 labstore1002 fails to enter PERC bios, hangs on detecting devices|
The new shelf has been added and configured. Actual expansion is pending on thin volumes, which in turn require a backport of a recent version of lvm2 (nearly complete), since Precise has no working thin volume support.
(Upgrading labstore1001 [to Jessie] was considered, but given the long downtime and the recent outages, we decided against hitting the users again so soon.)
Hardware is happily in place and visible to the OS.
After another discussion with @faidon, which concluded:
<Coren> paravoid: So you think it's better to Jessie up 1002 and switch to that instead?
<paravoid> I think so, yes
That plan involves a brief downtime for the switchover (less than 10 minutes), but is future-proof.
So here is the current picture:
- The new filesystem on thin volumes is in place and contains a copy of the live filesystem, but rsync cannot keep up with the rate of change, so downtime is unavoidable for the actual switchover (a dry run under ionice takes ~20 hours!)
- The new filesystem properly does snapshotting; we have local backups available
- Replication of the latest snapshot to codfw works, but has performance issues until proper exclusions are put in place
- All of those processes are tested and fully working on the new filesystem (which is not the live one)
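For illustration, the rsync dry run and the exclusions mentioned above could be assembled roughly as follows. This is a minimal sketch, not the commands actually run on labstore: the paths and the exclusion pattern are hypothetical, and the idle I/O class (`ionice -c 3`) is an assumption, since the comment only says the dry run was done "at ionice".

```python
import shlex


def rsync_dry_run(src, dst, excludes=()):
    """Build argv for a low-priority rsync dry run from src to dst.

    Assumption: the idle I/O scheduling class (ionice -c 3) is used so the
    copy competes as little as possible with live NFS traffic.
    """
    argv = ["ionice", "-c", "3", "rsync", "--archive", "--delete", "--dry-run"]
    # Each exclusion keeps rsync from walking and comparing that subtree.
    argv += [f"--exclude={pattern}" for pattern in excludes]
    argv += [src, dst]
    return argv


# Hypothetical source and destination mountpoints and exclusion pattern.
cmd = rsync_dry_run("/srv/old-fs/", "/srv/new-fs/", excludes=["*.tmp"])
print(shlex.join(cmd))
```

Dropping `--dry-run` from the list would turn this into the real copy; the function only builds the argument list and never executes anything.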
What's needed for this to be done:
- Upgrade labstore2001 to Jessie before we flip the switch (our last chance to do so)
- Schedule a downtime to do the final rsync between the old and new filesystems
  - Same as was planned in January: 24h during which the filesystem will be read-only
  - Once the copy is done, swap out the mountpoints
- Flip the switch:
  - Puppet class to install the scripts and stuff them in crontabs
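As a rough picture of what "stuff them in crontabs" amounts to, the sketch below formats entries in the system crontab layout (minute, hour, day of month, month, day of week, user, command). The script paths and schedules are placeholders invented for the example; the actual scripts and timings are not given in this task.

```python
def cron_entry(minute, hour, command, user="root"):
    """Format one /etc/cron.d-style line that runs daily or hourly."""
    return f"{minute} {hour} * * * {user} {command}"


# Hypothetical script names and schedules, for illustration only.
entries = [
    cron_entry(0, "*", "/usr/local/sbin/example-snapshot"),
    cron_entry(30, 2, "/usr/local/sbin/example-replicate"),
]

for line in entries:
    print(line)
```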
What we might want to do afterwards:
- Make the snapshots accessible to end users? Not trivial, but right now getting files off snapshots requires admin intervention. That is sufficient for now, but won't scale.
An extra 25%, approximately 18T of usable space. In addition, the cleanup required by the transition between filesystems freed another 5-6T of redundant backup leftovers from pmtpa.
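As a back-of-the-envelope check on those figures, the sketch below assumes the 25% is measured against the previous usable capacity (the comment does not say so explicitly) and treats all numbers as approximate.

```python
def capacity_summary(added_tb, growth_fraction, reclaimed_tb):
    """Derive rough before/after capacity from the quoted figures.

    Assumption: growth_fraction is relative to the previous usable capacity.
    reclaimed_tb is a (low, high) range of space freed by the cleanup.
    """
    previous = added_tb / growth_fraction
    low, high = reclaimed_tb
    return {
        "previous_tb": previous,
        "new_total_tb": previous + added_tb,
        "net_gain_tb": (added_tb + low, added_tb + high),
    }


summary = capacity_summary(added_tb=18, growth_fraction=0.25, reclaimed_tb=(5, 6))
print(summary)
```

Under that assumption the previous usable capacity works out to roughly 72T, the new total to roughly 90T, and the net gain in free space to about 23-24T once the reclaimed backups are counted.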