
Additional disk space for wdqs1001/wdqs1002
Closed, Resolved (Public)

Description

Currently wdqs1001/wdqs1002 have only 200G SSD disks, which is OK for day-to-day operation. However, given that the DB size is around 100G now, there is no disk space left if a maintenance task needs to be performed, such as compacting the database or something similar. I would like to request additional disk space to be installed. It does not have to be SSD, since it will not be used in day-to-day operation; any reasonable (100G+) spinning SATA drive would do.
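As a rough way to gauge how tight things are, the free space on the data partition can be compared against the size of the query service database; the commands below are only a sketch, and the mount point and journal path are assumptions, not confirmed for these hosts.

# free space on the partition holding the data (mount point assumed)
df -h /srv
# approximate size of the Blazegraph journal (path assumed)
du -sh /srv/wdqs/wikidata.jnl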

Event Timeline

Smalyshev raised the priority of this task from to Needs Triage.
Smalyshev updated the task description. (Show Details)
Smalyshev added projects: hardware-requests, SRE.
Smalyshev added subscribers: Smalyshev, Joe.
fgiunchedi triaged this task as Medium priority. Dec 1 2015, 4:03 PM
fgiunchedi subscribed.
RobH added subscribers: mark, RobH.

OK, some working notes:

Both wdqs1001/wdqs1002 are Dell PowerEdge R420s with space for 8 SFF (2.5") disks in total. Each presently has dual 300GB SSDs installed.

robh@wdqs1002:~$ sudo megacli -PDList -aALL
                                     
Adapter #0

Enclosure Device ID: 32
Slot Number: 0
Drive's position: DiskGroup: 0, Span: 0, Arm: 0
Enclosure position: 1
Device Id: 0
WWN: 500151795961c7a4
Sequence Number: 2
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SATA

Raw Size: 279.460 GB [0x22eec130 Sectors]
Non Coerced Size: 278.960 GB [0x22dec130 Sectors]
Coerced Size: 278.875 GB [0x22dc0000 Sectors]
Sector Size:  0
Firmware state: Online, Spun Up
Device Firmware Level: 0302
Shield Counter: 0
Successful diagnostics completion on :  N/A
SAS Address(0): 0x4433221104000000
Connected Port Number: 0(path0) 
Inquiry Data: CVPR130402SH300EGN  INTEL SSDSA2CW300G3                     4PC10302
FDE Capable: Not Capable
FDE Enable: Disable
Secured: Unsecured
Locked: Unlocked
Needs EKM Attention: No
Foreign State: None 
Device Speed: 3.0Gb/s 
Link Speed: 3.0Gb/s 
Media Type: Solid State Device
Drive:  Not Certified
Drive Temperature : N/A
PI Eligibility:  No 
Drive is formatted for PI information:  No
PI: No PI
Drive's NCQ setting : N/A
Port-0 :
Port status: Active
Port's Linkspeed: 3.0Gb/s 
Drive has flagged a S.M.A.R.T alert : No



Enclosure Device ID: 32
Slot Number: 1
Drive's position: DiskGroup: 0, Span: 0, Arm: 1
Enclosure position: 1
Device Id: 1
WWN: 5001517bb2844a0b
Sequence Number: 2
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SATA

Raw Size: 279.460 GB [0x22eec130 Sectors]
Non Coerced Size: 278.960 GB [0x22dec130 Sectors]
Coerced Size: 278.875 GB [0x22dc0000 Sectors]
Sector Size:  0
Firmware state: Online, Spun Up
Device Firmware Level: 0362
Shield Counter: 0
Successful diagnostics completion on :  N/A
SAS Address(0): 0x4433221105000000
Connected Port Number: 1(path0) 
Inquiry Data: CVPR206200LX300EGN  INTEL SSDSA2CW300G3                     4PC10362
FDE Capable: Not Capable
FDE Enable: Disable
Secured: Unsecured
Locked: Unlocked
Needs EKM Attention: No
Foreign State: None 
Device Speed: 3.0Gb/s 
Link Speed: 3.0Gb/s 
Media Type: Solid State Device
Drive:  Not Certified
Drive Temperature : N/A
PI Eligibility:  No 
Drive is formatted for PI information:  No
PI: No PI
Drive's NCQ setting : N/A
Port-0 :
Port status: Active
Port's Linkspeed: 3.0Gb/s 
Drive has flagged a S.M.A.R.T alert : No

Exit Code: 0x00

These appear to be in a single raid1 at this time. We cannot really add just a single disk without breaking the redundancy of the raid/data array.
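For reference, the current virtual drive layout can be confirmed with the same controller tool; this is a read-only check:

# show all virtual drives (raid level, size, state) on the controller
sudo megacli -LDInfo -Lall -aALL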

So I would recommend that we upgrade by adding two more disks to each system from our on-site spares, of which we have 38 Intel 320 Series SSDSA2CW300G3 2.5" 300GB drives (the identical model). We would then reinstall each system, moving from a single raid1 array to a 4-disk raid10 array. We could do this upgrade on one system and have it come fully back online before touching the second.
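For the record, a rough controller-level sketch of what that rebuild involves is below. Slot numbers for the new disks are assumptions (verify with -PDList first), the existing raid1 virtual drive has to be deleted, which destroys its data, and in practice the new array would be set up as part of the reinstall rather than by hand:

# delete the existing raid1 virtual drive (VD 0 assumed) -- destructive
sudo megacli -CfgLdDel -L0 -a0
# span two raid1 arrays into a 4-disk raid10 (new disks assumed in slots 2 and 3)
sudo megacli -CfgSpanAdd -r10 -Array0[32:0,32:1] -Array1[32:2,32:3] -a0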

The only cost I see is the rolling maintenance/reinstallation of each system (leaving one of the two in full service at all times). We have 38 Intel 320 SSDs on hand, as they are an older spare model we no longer use in our newer SSD-based deployments.

@Smalyshev: Would you please review my above suggestion and provide feedback? If you agree, we can assign this task to @mark to approve the use of allocated spares for upgrades. (The use of spares for repair is automatic, but stealing spares for an upgrade requires that we ensure it's acceptable.) If you do not agree, please assign this back to me with your corrections.

Thanks!

@RobH would it require a full reimage, or can it be done incrementally while preserving existing data? If the latter, it sounds good to me to proceed anytime; if it needs a reimage, then let's coordinate when it could be done.

It would require a reimage to replace the raid1 with a raid10. However, it would be the best option in the long run, as a raid10 can then have additional disk pairs added without a reimage. (So adding any further disks in the future shouldn't require a reimage, as long as our reinstall this time includes the raid10 and then an LVM before our partitions.)

So we would need to coordinate when it could be done, but if that is acceptable to you we can kick this up to @mark for his approval of implementation.

btw, it should be noted that the LVM on those disks does not cover all of the free space as of now, so you still have room to grow.

Also, we could add two new disks in raid1 and add them to the LVM volume group if we want to, without needing to reinstall the server.
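A minimal sketch of that path, assuming the new disks land in slots 2 and 3 of enclosure 32; the device, VG, and LV names below are hypothetical and would need to be checked with pvs/vgs/lvs on the actual host:

# check current free space in the volume group
sudo vgs
# build a new raid1 virtual drive from the two added disks
sudo megacli -CfgLdAdd -r1 '[32:2,32:3]' -a0
# assuming the new virtual drive shows up as /dev/sdb:
sudo pvcreate /dev/sdb
sudo vgextend wdqs1001-vg /dev/sdb                # VG name is hypothetical
sudo lvextend -L +250G /dev/wdqs1001-vg/data      # LV name is hypothetical
sudo resize2fs /dev/wdqs1001-vg/data              # grow the filesystem online (assuming ext4)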

I don't have enough expertise to decide which way is better, but I'd of course prefer the one that does not require a reimage, if possible and if it doesn't have too many downsides.

Adding 2 drives (or SSDs if we have them) and extending the LVM VG seems easy and cheap to do; let's move forward with that. RobH: approved to allocate whatever you consider the best fit.

drive-by comment: partman will likely need to be adjusted too, so we don't run into surprises when reprovisioning.
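For illustration only (the repo layout here is an assumption): one way to see which recipe these hosts are mapped to would be a quick grep from a puppet checkout, for example:

# hypothetical check from an operations/puppet checkout; path is an assumption
grep -rn 'wdqs' modules/install_server/files/autoinstall/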

Well, we have @mark's approval to use the on-site spares for this (as we have plenty).

Now the only question is how best to implement.

We could do as @Joe suggests and add the two additional disks as raid1 and extend the LVM onto them.

Ideally, we set up future installs to be raid10 across all 4 disks. Even if we cannot afford any reimage time now, we can still do that. As @fgiunchedi points out, we'll want to ensure partman is updated so that when a reinstall does roll around, it's not broken.

@Smalyshev: If we can afford the downtime, I think it's better long-term to reimage them with raid10 and a new LVM. If we cannot afford the downtime, then we can do as @Joe suggests and simply add the new disks as raid1 and extend the LVM.

Please advise (both @Smalyshev and anyone else with knowledge on the subject).

I'll create a sub-task to pull and add the additional disks, but with a note not to install them without checking first. (Seating them shouldn't cause any issues, but we should be certain and actively monitor the system while doing this.)
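After seating, a quick read-only sanity check with the same tool can confirm the new disks show up and in what state (they should typically report as Unconfigured(good)) before anything is done with them:

# list all physical drives and their state
sudo megacli -PDList -aALL | grep -E 'Slot Number|Firmware state|Inquiry Data'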

My issue is that if we toss these in as a raid1 and extend the LVM, that is NOT how we would typically install the system from scratch. So when these systems are eventually upgraded and the distro is reinstalled, it'll involve changing the raid setup and potentially mount points, etc. I prefer to reimage and tackle these issues at the time of the disk upgrade, rather than push them down the road.

I think we can add the disks as raid1 now and file a task to reimage them sometime in January.

That sounds reasonable to me. I'll create the sub-tasks for the on-site installation of the disks, the raid1 setup and space extension, and the future reimage.

Resolving this task, as the on-site task now exists.