Ensure that labstore machine is 'known good' hardware
Closed, Resolved (Public)

Assigned To

Authored By

	yuvipanda
	Jul 22 2015, 8:47 AM

Description

labstore1002 seemed to have flaky hardware. Ensure that either all the components that need replacing have been replaced, or that we have switched back to labstore1001.

Related Objects

Status	Assigned	Task
Resolved	yuvipanda	T105720 Labs team reliability goal for Q1 2015/16
Resolved	coren	T106479 Ensure that labstore machine is 'known good' hardware
Resolved	coren	T95293 Inspect and diagnose labstore1001's H800 controler
Declined	coren	T94607 Test labstore switchover
Resolved	None	T85604 Storage capacity & redundancy expansion (tracking)
Resolved	None	T85606 Replicate data between codfw and eqiad
Resolved	coren	T85605 Set storage service up in codfw
Resolved	coren	T93740 Upgrade labstore2001 to Jessie
Declined	yuvipanda	T85608 Process for user backups
Resolved	coren	T93792 Sync up the new labs NFS project filesystem with the live one
Resolved	coren	T85607 Increase storage available to labs NFS server
Resolved	coren	T91640 Upgrade labstore1002 to Jessie
Resolved	• Cmjohnson	T91677 labstore1002 fails to enter PERC bios, hangs on detecting devices
Resolved	• chasemp	T98183 labstore1002 issues while trying to reboot
Resolved	• chasemp	T101741 Locate and assign some MD1200 shelves for proper testing of labstore1002
Resolved	coren	T96063 Migrate Labs NFS storage from RAID6 to RAID10
Resolved	coren	T101011 Rsync live labstore filesystem to local eqiad copy
Resolved	mark	T101010 Make a block-level copy of the codfw mirror of labstore1001 to eqiad
Resolved	• Cmjohnson	T101743 Locate spare H800 PERC in case it is necessary to switch labstore1002's
Resolved	coren	T107038 Switch NFS server back to labstore1001
Resolved	coren	T107574 Reinstall labstore1001 and make sure everything is puppet-ready
Resolved	faidon	T107507 Investigate whether to use Debian's jessie-backports
Declined	faidon	T108941 Make certain that jessie-backports is disabled fleetwide.

Event Timeline

yuvipanda created this task. Jul 22 2015, 8:47 AM

yuvipanda raised the priority of this task from to Needs Triage.

yuvipanda updated the task description.

yuvipanda added a project: Cloud-Services.

yuvipanda added a subtask: T95293: Inspect and diagnose labstore1001's H800 controler.

yuvipanda added subscribers: coren, mark, Ricordisamoa and 2 others.

yuvipanda added a subtask: T98183: labstore1002 issues while trying to reboot. Jul 22 2015, 8:49 AM

Switching back to labstore1001 should be a high priority, but because it requires significant downtime it needs to be planned and a good time found. What needs to happen:

- Make certain labstore1001 can in fact serve files (easily tested without outage)
- Coordinate with @Springle, as switching requires moving cabling around

At the selected window:

1. Stop NFS cleanly on labstore1002
2. Flush and unmount all labstore1002 filesystems (easiest done with a halt/poweroff)
3. Switch cabling of the shelves from 1002 to 1001
4. [Optional: power 1002 back on, make sure it's available to take over quickly by powering through the BIOS issue if needed]
5. Verify that the shelves are visible from 1001 properly and that all is well
6. Start NFS service on 1001
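
The ordering above matters: nothing should touch the shelves from 1001 until 1002 has fully let go of them. As a rough illustration only, the checklist can be modelled as an ordered list that halts at the first required step that is not confirmed. The `Step` type, the `run` helper and the `confirm` callback are hypothetical names introduced for this sketch; no real commands are executed, and an operator would still perform and verify each step by hand.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Step:
    """One line of the switchover checklist (hypothetical model, not real tooling)."""
    target: str
    action: str
    optional: bool = False


# The six steps from the comment above, in the order they must happen.
SWITCHOVER: List[Step] = [
    Step("labstore1002", "Stop NFS cleanly"),
    Step("labstore1002", "Flush and unmount all filesystems (halt/poweroff)"),
    Step("shelves", "Switch cabling from labstore1002 to labstore1001"),
    Step("labstore1002", "Power back on as a standby", optional=True),
    Step("labstore1001", "Verify that the shelves are visible and all is well"),
    Step("labstore1001", "Start NFS service"),
]


def run(steps: List[Step], confirm: Callable[[Step], bool]) -> List[Step]:
    """Walk the steps in order and return the ones that were confirmed.

    An unconfirmed optional step is skipped; an unconfirmed required step
    stops the run so that later steps never happen out of order.
    """
    done: List[Step] = []
    for step in steps:
        if confirm(step):
            done.append(step)
        elif not step.optional:
            break
    return done


if __name__ == "__main__":
    # Dry run: pretend every step is confirmed and print the resulting order.
    for number, step in enumerate(run(SWITCHOVER, lambda s: True), start=1):
        print(f"{number}. [{step.target}] {step.action}")
```

If the run stops early, the list of completed steps shows how far the switch got, which is what decides whether rolling back means re-cabling or simply restarting NFS on 1002.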

Chris can confirm how long it takes to switch the wiring around, but I expect no less than 10-15 minutes of downtime during the switch, which we need to (at least) double for safety.

We can roll back by switching the wiring back to 1002 and rebooting it, possibly with some struggle with the H800 BIOS (as we had when we switched during the crash recovery). Chris being on-site means that, in a pinch, we can even replace hardware in 1002 before kicking it back up.

All told, we should have a maintenance window of no less than two hours; with luck we'll need only 15 minutes of it.
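
The timing argument above is simple arithmetic, sketched here for reference. The function name and the `safety_factor` parameter are made up for this example; the 10-15 minute figure is the estimate from the comment (stated there as a lower bound), and the two-hour window is the proposed minimum.

```python
from typing import Tuple


def padded_downtime(estimate_min: Tuple[int, int], safety_factor: int = 2) -> Tuple[int, int]:
    """Scale a (low, high) downtime estimate in minutes by a safety factor."""
    low, high = estimate_min
    return (low * safety_factor, high * safety_factor)


# Estimate from the comment: 10-15 minutes, doubled for safety.
padded = padded_downtime((10, 15))   # (20, 30)

# Proposed minimum maintenance window: two hours.
WINDOW_MIN = 120
slack = WINDOW_MIN - padded[1]       # 90 minutes left for rollback or a hardware swap
```

Under these numbers, the doubled estimate still leaves most of the window free for the rollback path, including a possible fight with the H800 BIOS on 1002.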

coren added a project: Labs-Sprint-107. Jul 27 2015, 3:11 PM

coren mentioned this in T98183: labstore1002 issues while trying to reboot. Jul 27 2015, 3:14 PM

coren moved this task from To Do to Doing on the Labs-Sprint-107 board. Jul 29 2015, 2:35 PM

coren moved this task from Doing to To Do on the Labs-Sprint-107 board.

coren closed subtask T107038: Switch NFS server back to labstore1001 as Resolved. Oct 28 2015, 3:48 PM

NFS service has been switched back to labstore1001, and labstore1002's controller is now being swapped out for a new one.

coren closed subtask T95293: Inspect and diagnose labstore1001's H800 controler as Resolved. Oct 28 2015, 4:03 PM

• chasemp closed subtask T98183: labstore1002 issues while trying to reboot as Resolved. Feb 15 2017, 3:06 PM

• Phabricator_maintenance removed a subscriber: yuvipanda. Jun 7 2017, 6:48 PM
