
Expand Eqiad Ganeti row_A capacity
Closed, Resolved · Public

Description

As outlined in T239151#5707691 the creation of a single 16G Ganeti VM in eqiad row_A currently fails with...

Failure: prerequisites not met for this operation:
error type: insufficient_resources, error details:
Can't compute nodes using iallocator 'hail': Request failed: Group row_A (preferred): No valid allocation solutions, failure reasons: FailMem: 12

This indicates that we've run low on available memory in the eqiad row_A Ganeti group.
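For anyone scripting around this failure, here is a minimal sketch that pulls the failure-reason counters out of the message quoted above. The helper name `parse_failure_reasons` is hypothetical, and the assumption that additional reasons would appear as further `Name: count` pairs after `failure reasons:` is not confirmed by this task; only the single `FailMem: 12` pair is taken from the actual output.

```python
import re

# Error text exactly as quoted in the task description.
message = (
    "Can't compute nodes using iallocator 'hail': Request failed: "
    "Group row_A (preferred): No valid allocation solutions, "
    "failure reasons: FailMem: 12"
)


def parse_failure_reasons(text):
    """Return {reason: count} from the text after 'failure reasons:'.

    Hypothetical helper: assumes each reason is written as 'Name: <int>'.
    Returns an empty dict when the marker is absent.
    """
    _, marker, tail = text.partition("failure reasons:")
    if not marker:
        return {}
    return {name: int(count) for name, count in re.findall(r"(\w+):\s*(\d+)", tail)}


reasons = parse_failure_reasons(message)
print(reasons)

# A memory-related failure reason is what points at exhausted RAM in the group.
if "FailMem" in reasons:
    print("allocation rejected for lack of memory")
```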

The Eqiad Ganeti expansion being tracked in T228924#5786002 may help to some degree, in that it will allow us to move VMs from the row_A group to row_B (or row_D, depending on the TBD final racking location of the hardware).

Still, IMO we should expand capacity on eqiad row_A if we have the resources available to do so.

The current row_A hosts appear to be configured with 12 DIMM slots each, 4 of which are populated with 16G sticks. So there is physical capacity to, e.g., double row_A memory by adding 16 more 16G sticks of RAM (4 sticks per host, across 4 hosts).

But since the current hardware is from 2017 and support expires this April, we should probably expand the row_A group with new servers instead. Refreshing with a set of 4 new 128G RAM nodes would double memory capacity within the current rack footprint, and roughly the current power footprint, after decom of the 2017 row_A hardware.
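A quick sanity check of the memory arithmetic behind both options, as a rough sketch. The host count, slot count, and stick size come from the description above; the variable names are illustrative only, and the figures ignore any memory reserved for the hypervisor itself.

```python
# Figures taken from the task description (sizes in "G" as written there).
HOSTS = 4
SLOTS_PER_HOST = 12
POPULATED_PER_HOST = 4
STICK_SIZE_G = 16

# Current state: 4 x 16G per host.
current_per_host = POPULATED_PER_HOST * STICK_SIZE_G
current_total = current_per_host * HOSTS

# Option 1: add 4 more 16G sticks to each existing host (16 sticks in total).
added_per_host = 4
sticks_to_buy = added_per_host * HOSTS
upgraded_per_host = (POPULATED_PER_HOST + added_per_host) * STICK_SIZE_G
upgraded_total = upgraded_per_host * HOSTS
free_slots_after_upgrade = SLOTS_PER_HOST - (POPULATED_PER_HOST + added_per_host)

# Option 2: replace the hosts with 4 new nodes of 128G each.
refresh_total = 4 * 128

print(f"current:  {current_per_host}G per host, {current_total}G total")
print(f"upgrade:  {upgraded_per_host}G per host, {upgraded_total}G total "
      f"({sticks_to_buy} sticks, {free_slots_after_upgrade} slots left per host)")
print(f"refresh:  {refresh_total}G total")
```

Under these figures both options land on the same total, twice the current row_A memory, so the choice comes down to cost and hardware support rather than capacity.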

Event Timeline

Ok, next steps for this as far as I can tell:

DO NOT LIST PRICING ON THIS TASK AS IT IS PUBLIC.

  • create a sub-task in procurement and price out a supported memory upgrade (although these hosts go out of warranty in April of this year, they will be in use for 5 years total).
  • compare the price of the memory upgrade to the cost of new ganeti nodes (recent order for codfw is on T242040).
  • get input from @herron and team on which option seems best once we have pricing.
  • escalate a private task with the two options for mgmt review/decision on which is best.
RobH mentioned this in Unknown Object (Task). Jan 22 2020, 6:45 PM
RobH added a subtask: Unknown Object (Task).

Memory ordered on T243442, with implementation tracked on T244530. Resolving this task.

Jclark-ctr closed subtask Unknown Object (Task) as Resolved. Feb 25 2020, 10:04 PM