
Expand SSD space in Cassandra cluster
Closed, Resolved · Public

Description

Disk space, especially in eqiad, is getting very tight, with basically no margin for operations like re-shaping the cluster (e.g. T121535). To avoid running out of disk space, we are currently holding back features like pre-generation of mobile HTML used by the Android app. We are also spending extra time juggling the tight space by manually cancelling space-consuming compactions and fine-tuning the load distribution across nodes.

T119659 is adding a third SSD to each of restbase1007-9, thus bringing them up to the capacity of restbase1001-6. This will help slightly in the short term, but only brings us up to par with codfw. We will need more disk space headroom in the longer term.

The upgrade to Cassandra 2.1.12 (T120803) and the multi-instance setup (T95253) have significantly increased the amount of data a single hardware node can support. This means that we can increase our storage capacity by adding SSDs to existing nodes, using the eight available 2.5" SSD bays per chassis.

To minimize upgrade overheads, we are proposing to add 2x 1TB SSDs to each of the nine eqiad nodes, bringing them to five SSDs each. In codfw, we can reach almost identical capacity by adding 3 extra SSDs in each of the six hardware nodes.
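
The drive counts implied by this proposal can be checked with a few lines of arithmetic. The following is only a rough sketch: the names (`PLAN`, `ssds_to_order`, `raw_tb_added`) are made up for illustration, and it assumes 1TB drives at both sites, which the description states explicitly only for eqiad.

```python
# Figures from the task description; all names here are illustrative only.
BAYS_PER_CHASSIS = 8   # 2.5" SSD bays available per chassis
SSD_SIZE_TB = 1        # assumption: 1TB drives at both sites

# site -> (hardware nodes, SSDs added per node)
PLAN = {
    "eqiad": (9, 2),
    "codfw": (6, 3),
}


def ssds_to_order(site: str) -> int:
    """Number of SSDs the proposal adds at one site."""
    nodes, added_per_node = PLAN[site]
    return nodes * added_per_node


def raw_tb_added(site: str) -> int:
    """Raw capacity added at one site, before replication and filesystem overhead."""
    return ssds_to_order(site) * SSD_SIZE_TB


# eqiad nodes end up with five SSDs each, which has to fit in the chassis.
EQIAD_SSDS_PER_NODE_AFTER = 5
assert EQIAD_SSDS_PER_NODE_AFTER <= BAYS_PER_CHASSIS

for site in PLAN:
    print(f"{site}: {ssds_to_order(site)} SSDs, {raw_tb_added(site)} TB raw")
```

Both sites come out at 18 drives. The sketch does not check bay usage in codfw, because the description does not say how many SSDs those nodes currently hold.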

Combined with efficiency improvements planned in T120171 for next quarter, this capacity should be sufficient to support all projects planned for this fiscal year.

Event Timeline

GWicke raised the priority of this task from to Normal.
GWicke updated the task description.
GWicke added projects: RESTBase, Operations.
GWicke added a subscriber: GWicke.
Restricted Application added a subscriber: Aklapper. Dec 15 2015, 8:51 PM
GWicke updated the task description. Dec 15 2015, 8:54 PM
GWicke set Security to None.
GWicke updated the task description.
GWicke edited subscribers, added: RobH, Eevans, mark and 2 others; removed: Aklapper.
GWicke mentioned this in Unknown Object (Task). Dec 16 2015, 10:09 PM
GWicke raised the priority of this task from Normal to High. Jan 11 2016, 9:09 PM

@RobH @mark @faidon @fgiunchedi: This is urgent as we are still low on disk space across the cluster. I would appreciate it if you could tackle this soon.

mark assigned this task to RobH. Jan 14 2016, 10:38 AM

@RobH: Alright, let's get some quotes for these SSDs. I assume we do have drive slots available?

mark added a comment. Jan 14 2016, 5:03 PM

@RobH: if we could get these SSDs quickly (perhaps), we might be able to save a lot of time on these migrations. Could you prioritize this ticket, and see what the delivery date would be if we ordered these quickly?

RobH shifted this object from the S1 Public space to the Restricted Space space. Jan 14 2016, 5:40 PM

Since this task will very shortly have private pricing info, I've moved this into the private vendor space.

@RobH, could you keep this task public & add the private pricing info in a private sub-task? This task is referenced in a recent mail thread between reading & services about capacity planning as well as other tasks, and it would be nice to keep it available for others to track.

Rob Halsell replied via email on Thu, 14 Jan 2016 10:05:36 -0800

WMF Quote request - ssds and caddies - T121575

Aaron/Rob,
I need to get a quote for some Samsung SSDs and drive caddies. We have
six DL360s (and 3 Dell systems) in Ashburn and six DL360s in Carrollton. We
need to order SSDs and caddies to add to all of these systems. (Keeping in
mind we only need SSDs for the Dell systems.) In Ashburn, we are adding 2
disks to each of the DL360s, so we only need 12 caddies. In Carrollton, we
are adding 3 of the SSDs to each of the DL360s and need sleds for all SSDs
for that site.
So for each site I'll need a quote:
EQIAD (Ashburn):

  • QTY: 18 - 1TB Samsung 850 Pro SSD
  • QTY: 12 - DL360-GEN9-DRV-SLD DL380 Gen9 Compatible 2.5in Drive Sled

CODFW (Carrollton):

  • QTY: 18 - 1TB Samsung 850 Pro SSD
  • QTY: 18 - DL360-GEN9-DRV-SLD DL380 Gen9 Compatible 2.5in Drive Sled
Please include lead times, as we would like to expedite this if possible.
Thanks in advance,
--
Rob Halsell
Operations Engineer
Wikimedia Foundation, Inc.
E-Mail: rhalsell@wikimedia.org
Key fingerprint = CB1F C7E7 0FF8 5DB2 6820 9C7E 75ED 14C7 0245 D22A
Office: 415.839.6885 x6620
Fax: 415.882.0495



RobH edited projects, added hardware-requests; removed procurement. Edited Jan 14 2016, 6:33 PM
RobH shifted this object from the Restricted Space space to the S1 Public space.

Pushed back to public. Please don't append procurement to public tasks; use hardware-requests instead. (Gabriel and I are sitting less than 5 feet from one another and have already discussed this. Since this task has others linked to it, my pushing it private was non-ideal.)

So this will be public and I'll create a private procurement task with the pricing info.

RobH mentioned this in Unknown Object (Task). Jan 14 2016, 6:35 PM
RobH mentioned this in Unknown Object (Task). Jan 21 2016, 6:11 PM
RobH mentioned this in Unknown Object (Task).
RobH added subtasks: Unknown Object (Task), Unknown Object (Task). Jan 21 2016, 6:13 PM
RobH added a subtask: Unknown Object (Task). Jan 26 2016, 12:30 AM

Update: we received the wrong SSDs for the HP restbase servers, but the correct disks for the Dell systems. I added 2 new SSDs to restbase1007-9.

@fgiunchedi, could you tackle the upgrades (disk and CPU/RAM) for 1007-9 soon?

RobH closed subtask Unknown Object (Task) as Resolved. Feb 4 2016, 5:35 PM
RobH added a comment. Feb 4 2016, 5:55 PM

restbase1001-1006 are now slated to be replaced with new hosts; the ordering of the new hosts and the replacement of the existing ones are tracked in T125842.

Once restbase1007-1009 (disks, RAM, CPU) and restbase2001-2006 (disks) are all upgraded, the only thing blocking the resolution of this task will be T125842.

RobH closed subtask Unknown Object (Task) as Resolved. Feb 4 2016, 6:28 PM
RobH closed subtask Unknown Object (Task) as Declined. Feb 9 2016, 6:41 PM
Restricted Application added a subscriber: TerraCodes. Apr 19 2016, 7:06 PM
RobH reassigned this task from RobH to fgiunchedi. Apr 27 2016, 8:58 PM

My understanding is we've now ordered all the items needed for this? I'm going to assign to @fgiunchedi for his feedback, and if so, resolution.

fgiunchedi closed this task as Resolved. Apr 28 2016, 9:24 AM

that's correct @RobH, resolving

RobH closed subtask Unknown Object (Task) as Resolved. Jul 11 2016, 5:31 PM