esams cache cluster re-arrangements, early 2016
Closed, Resolved · Public

Description

TL;DR - Given we're overprovisioned (more than we normally account for) in the esams upload cluster (due to being conservative on unknowns during TLS hardware purchases), we can shift good hardware down the line and kick out some of our oldest and worst machines in the process. This simplifies the current cluster hardware layouts, leaves us with fewer total unique hardware configs to deal with, and makes further budgeting/reasoning simpler.

Hardware notes:
cp3003-14 - 12x Older spec, earliest warranty batch, larger SSDs
cp3015-18 - 4x as above, but smaller SSDs
cp3019-22 - 4x as above, but no SSDs and *much* smaller RAM
cp3030-3049 - 20x newest-spec (purchased in TLS rollout era)

Current layout:

| Cluster | Machines |
| --- | --- |
| Text | 4x newest (cp30[34][01]) + 12x older (cp3003-14) |
| Upload | 16x newest (cp30[34][2-9]) (over-provisioned during TLS rollout) |
| Mobile | 4x older (cp3015-18, smaller disks than 3-14) |
| Misc | 4x older non-SSD (cp3019-22) |
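
The bracket shorthand used for host names here (for example `cp30[34][2-9]`) can be expanded mechanically to confirm the machine counts. The following is a minimal sketch, assuming the brackets behave like single-character classes with optional ranges; `expand` is a hypothetical helper written for illustration, not a tool referenced in this task.

```python
import itertools
import re


def expand(pattern: str) -> list[str]:
    """Expand a pattern such as 'cp30[34][2-9]' into concrete host names.

    Only single-character classes made of literal characters and simple
    ranges (e.g. [34], [2-9]) are handled. Hypothetical helper.
    """
    # Splitting on a capturing group keeps the bracket groups in the result.
    parts = re.split(r"(\[[^\]]+\])", pattern)
    choices = []
    for part in parts:
        if not part.startswith("["):
            choices.append([part])  # literal text (possibly empty)
            continue
        body = part[1:-1]
        chars = []
        i = 0
        while i < len(body):
            if i + 2 < len(body) and body[i + 1] == "-":
                # A range like 2-9: include every character in between.
                start, end = ord(body[i]), ord(body[i + 2])
                chars.extend(chr(c) for c in range(start, end + 1))
                i += 3
            else:
                chars.append(body[i])
                i += 1
        choices.append(chars)
    return ["".join(combo) for combo in itertools.product(*choices)]


print(len(expand("cp30[34][01]")))   # 4 newest machines in Text
print(len(expand("cp30[34][2-9]")))  # 16 newest machines in Upload
```

Under that reading, the Text pattern expands to cp3030, cp3031, cp3040 and cp3041, and the Upload pattern to the remaining 16 hosts in the cp3030-3049 range.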

Proposed:

| Cluster | Machines | Notes |
| --- | --- | --- |
| Text | 8x newest (cp30[34][0123]) | Loses 12x older w/ bigger SSD, gains 4x newest from upload |
| Upload | 12x newest (cp30[34][4-9]) | Loses 4x newest (not needed) |
| Mobile | 4x older (cp3003-6) | Loses 4x older w/ smaller SSD, gains 4x older w/ bigger SSD from text |
| Misc | 4x older (cp3007-10) | Loses 4x older w/o SSD, gains 4x older w/ bigger SSD from text |
| Reclaim/Spare/Decom | cp3011-22 | The 12 worst machines can be decom/spare (4x bigger SSD, 4x smaller SSD, 4x no-SSD) |

In this new state, we would have only 8x of the older-warranty machines left. They need 1:1 replacements with new-spec hardware when we decide to replace them, they're all identical on RAM/SSD (so we're down to 2x active hw configs total in esams), and they're all in the Mobile and Misc clusters (whereas Text+Upload have all the newer hardware with more warranty left).
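
The claims in the paragraph above can be checked against the hardware notes and the proposed layout. This is a rough sketch that models hosts as strings and hardware classes as sets; the class labels and the `hosts` and `hw_class` helpers are illustrative names, not anything taken from the puppet repository.

```python
def hosts(first: int, last: int) -> set[str]:
    """Return host names cpNNNN for an inclusive numeric range."""
    return {f"cp{n}" for n in range(first, last + 1)}


# Hardware classes, transcribed from the "Hardware notes" list.
HARDWARE = {
    "older, larger SSD": hosts(3003, 3014),
    "older, smaller SSD": hosts(3015, 3018),
    "older, no SSD": hosts(3019, 3022),
    "newest": hosts(3030, 3049),
}

# Proposed cluster membership, transcribed from the "Proposed" table.
PROPOSED = {
    "text": hosts(3030, 3033) | hosts(3040, 3043),
    "upload": hosts(3034, 3039) | hosts(3044, 3049),
    "mobile": hosts(3003, 3006),
    "misc": hosts(3007, 3010),
}


def hw_class(host: str) -> str:
    """Look up which hardware class a host belongs to."""
    for name, members in HARDWARE.items():
        if host in members:
            return name
    raise KeyError(host)


active = set().union(*PROPOSED.values())
active_classes = {hw_class(h) for h in active}
older_active = {h for h in active if hw_class(h) != "newest"}
spare = set().union(*HARDWARE.values()) - active

print(len(older_active))       # older-warranty machines still in service
print(sorted(active_classes))  # distinct hardware configs still in service
print(len(spare))              # machines left for decom/reclaim/spare
```

With the tables as written, this leaves 8 older machines in service (all in Mobile and Misc), two active hardware classes, and cp3011-22 as the 12 machines outside any cluster.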

Steps to get from Here to There:

  • 1. Move 30[34][23] from cache_upload to cache_text
  • 2. Move 3007-10 from cache_text to cache_misc
  • 3. Remove 3019-22 from cache_misc (decom/reclaim/spare)
  • 4. Move 3003-6 from cache_text to cache_mobile
  • 5. Remove 3015-18 from cache_mobile (decom/reclaim/spare)
  • 6. Remove 3011-14 from cache_text (decom/reclaim/spare)
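
As a sanity check, the six steps can be replayed against the current layout to confirm they end at the proposed one. The sketch below is only a set-based model, assuming hosts are identified by their numbers; `apply_steps` and the step tuples are hypothetical, and the real changes were made through the puppet patches listed under Details.

```python
def span(first: int, last: int) -> set[int]:
    """Inclusive range of host numbers."""
    return set(range(first, last + 1))


CURRENT = {
    "text": {3030, 3031, 3040, 3041} | span(3003, 3014),
    "upload": span(3032, 3039) | span(3042, 3049),
    "mobile": span(3015, 3018),
    "misc": span(3019, 3022),
}

PROPOSED = {
    "text": span(3030, 3033) | span(3040, 3043),
    "upload": span(3034, 3039) | span(3044, 3049),
    "mobile": span(3003, 3006),
    "misc": span(3007, 3010),
}

# (hosts, source cluster, destination cluster); None means decom/reclaim/spare.
STEPS = [
    (span(3032, 3033) | span(3042, 3043), "upload", "text"),  # step 1
    (span(3007, 3010), "text", "misc"),                       # step 2
    (span(3019, 3022), "misc", None),                         # step 3
    (span(3003, 3006), "text", "mobile"),                     # step 4
    (span(3015, 3018), "mobile", None),                       # step 5
    (span(3011, 3014), "text", None),                         # step 6
]


def apply_steps(layout, steps):
    """Replay the steps on a copy of the layout; return (layout, removed)."""
    layout = {cluster: set(members) for cluster, members in layout.items()}
    removed = set()
    for moved, src, dst in steps:
        missing = moved - layout[src]
        if missing:
            raise ValueError(f"{sorted(missing)} not in {src}")
        layout[src] -= moved
        if dst is None:
            removed |= moved
        else:
            layout[dst] |= moved
    return layout, removed


final, removed = apply_steps(CURRENT, STEPS)
print(final == PROPOSED)  # whether the steps reach the proposed layout
print(sorted(removed))    # hosts left for decom/reclaim/spare
```

One thing the model makes visible is the ordering: Text, Misc and Mobile each gain their new hosts (steps 1, 2 and 4) before the old ones are removed, so no cluster drops below its target size along the way.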

Details

Related Gerrit Patches:
operations/puppet : production - esams re-arrangement steps 4-6
operations/puppet : production - cp3019-22: decom from cache_misc
operations/puppet : production - cp3007-10 esams text->misc re-role
operations/puppet : production - cp30[34][23] - esams upload->text re-role
operations/puppet : production - caches: remove backend_scaled_weights

Event Timeline

BBlack created this task. Feb 2 2016, 3:40 AM
BBlack raised the priority of this task to Medium.
BBlack updated the task description.
BBlack added projects: Traffic, ops-esams.
BBlack added subscribers: BBlack, mark, ema, faidon.
Restricted Application added a project: Operations. Feb 2 2016, 3:40 AM
Restricted Application added a subscriber: Aklapper.

Change 275115 had a related patch set uploaded (by BBlack):
caches: remove backend_scaled_weights

https://gerrit.wikimedia.org/r/275115

BBlack updated the task description. Mar 6 2016, 2:58 PM
BBlack set Security to None.

Change 275115 merged by BBlack:
caches: remove backend_scaled_weights

https://gerrit.wikimedia.org/r/275115

Mentioned in SAL [2016-03-23T21:38:30Z] <bblack> depooling cp3042 - T125485

Mentioned in SAL [2016-03-23T21:41:25Z] <bblack> depooling cp3043 - T125485

^ Stashbot wasn't around for the same on cp303[23] before those two

Change 279255 had a related patch set uploaded (by BBlack):
cp30[34][23] - esams upload->text re-role

https://gerrit.wikimedia.org/r/279255

BBlack updated the task description. Mar 23 2016, 9:53 PM

Change 279291 had a related patch set uploaded (by BBlack):
cp3007-10 esams text->misc re-role

https://gerrit.wikimedia.org/r/279291

Change 279292 had a related patch set uploaded (by BBlack):
cp3019-22: decom from cache_misc

https://gerrit.wikimedia.org/r/279292

Change 279293 had a related patch set uploaded (by BBlack):
esams re-arrangement steps 4-6

https://gerrit.wikimedia.org/r/279293

Change 279255 merged by BBlack:
cp30[34][23] - esams upload->text re-role

https://gerrit.wikimedia.org/r/279255

BBlack updated the task description. Mar 24 2016, 3:34 AM

Mentioned in SAL [2016-03-24T15:09:43Z] <bblack> downtime/re-role of cp3007-10 starting - T125485

Change 279291 merged by BBlack:
cp3007-10 esams text->misc re-role

https://gerrit.wikimedia.org/r/279291

Change 279292 merged by BBlack:
cp3019-22: decom from cache_misc

https://gerrit.wikimedia.org/r/279292

BBlack updated the task description. Mar 24 2016, 3:58 PM

Mentioned in SAL [2016-03-24T16:25:27Z] <bblack> depooling cp3003-6,cp3011-14 from esams text varnish-be over the next ~70 mins - prep for last steps of T125485

Change 279293 merged by BBlack:
esams re-arrangement steps 4-6

https://gerrit.wikimedia.org/r/279293

BBlack updated the task description. Mar 24 2016, 6:32 PM

Now what's left is the decom/reclaim/spare stuff for the appropriate servers (cp3011-22, noting that 11 was already dead for hardware issues...).

BBlack closed this task as Resolved. Mar 24 2016, 8:56 PM
BBlack claimed this task.