titan200[12] RAM/SSD upgrade coordination
Closed, Resolved, Public

Description

This task will coordinate the scheduling and installation of RAM and SSD upgrades into titan200[12].

These upgrades were initially requested as an order on T359448, but a check of the on-site decommissioned servers and spare parts has shown that the parts are already available there.

Hardware note: the RAM from the decommissioned hosts is rated at 3200 while the server uses 2666. This is a non-issue, as the faster RAM will simply clock down to the speed of the slowest RAM on the bus.
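
The clock-down behaviour in the hardware note can be sketched as taking a minimum. This is a simplified model, not how the firmware actually negotiates speed (real platforms may also lower the speed depending on how many DIMMs share a channel). The function name and the `platform_max` parameter are illustrative assumptions; the 2666 and 3200 figures and the DIMM counts come from this task.

```python
def effective_speed(dimm_speeds, platform_max):
    """Simplified model: every DIMM runs at the slowest rated speed
    present, capped by what the server platform supports."""
    return min(min(dimm_speeds), platform_max)


# One original 2666 DIMM plus three 3200 DIMMs from the decommissioned hosts.
speeds = [2666, 3200, 3200, 3200]
print(effective_speed(speeds, platform_max=2666))  # 2666
```

Under this model the 3200 parts cost nothing in compatibility: the result is the same speed the host already runs at.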

titan2001:

  • - onsite: allocate (2) 480GB SSDs
  • - onsite: allocate (3) 32GB RAM
  • - onsite & service owners: schedule downtime for host
  • - service owners: depool host for maint window and put into maint in icinga when services are halted
  • - service owners: power down host for onsite work (RAM update requires full downtime)
  • - onsite: install SSD additions (do not remove existing disks, this is appending more storage not replacing)
  • - onsite: install RAM additions and ensure they are detected during POST
  • - onsite & service owners: ensure system is remotely accessible for service owners to return to service.
  • - service owners: return host to service

titan2002:

  • - onsite: allocate (2) 480GB SSDs
  • - onsite: allocate (3) 32GB RAM
  • - onsite & service owners: schedule downtime for host
  • - service owners: depool host for maint window and put into maint in icinga when services are halted
  • - service owners: power down host for onsite work (RAM update requires full downtime)
  • - onsite: install SSD additions (do not remove existing disks, this is appending more storage not replacing)
  • - onsite: install RAM additions and ensure they are detected during POST
  • - onsite & service owners: ensure system is remotely accessible for service owners to return to service.
  • - service owners: return host to service

Event Timeline

RobH triaged this task as Medium priority. Mar 28 2024, 1:57 PM
RobH created this task.
RobH mentioned this in Unknown Object (Task). Mar 28 2024, 1:58 PM
RobH added a parent task: Unknown Object (Task).
RobH moved this task from Backlog to Racking Tasks on the ops-codfw board.

I have located and set aside the parts to be installed.

I am available every weekday between 1300 UTC and 1700 UTC. Please let me know what time/day in that window works best.

RobH added a subscriber: Jhancock.wm.

Filippo, I think you're the one to coordinate for your team (as you made the initial hardware upgrade request), but if not, please update this task with who should coordinate these upgrades. Thanks in advance!

@Jhancock.wm: I made a typo in the task description: it should be (3) DIMMs per host, not (2). I'm not changing it myself but noting it in this comment so you can acknowledge and update the task description, to ensure everyone's on the same page.

My mistake, these hosts shipped with (1) 32GB dimm and we want a total of (4) per host.
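
As a quick sanity check of the counts above, here is a minimal sketch of the per-host arithmetic. The variable names are illustrative; the figures (one shipped 32GB DIMM, a target of four, two 480GB SSDs added) are taken from this task.

```python
DIMM_SIZE_GB = 32
SSD_SIZE_GB = 480

shipped_dimms = 1   # each host shipped with one 32GB DIMM
target_dimms = 4    # desired total per host
added_ssds = 2      # SSDs appended per host, existing disks stay in place

dimms_to_add = target_dimms - shipped_dimms
total_ram_gb = target_dimms * DIMM_SIZE_GB
added_ssd_gb = added_ssds * SSD_SIZE_GB

print(dimms_to_add, total_ram_gb, added_ssd_gb)  # 3 128 960
```

This matches the (3) DIMMs and (2) SSDs allocated per host in the description.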

retrieved the extra sticks. all good. ty for update.

fgiunchedi added a subscriber: herron.

Thank you @RobH, I've coordinated with @herron and he'll be helping with this

Hello! Would Thu 4/4/2024 some time after 1500 UTC work?

Mentioned in SAL (#wikimedia-operations) [2024-04-04T15:10:23Z] <herron> beginning rolling hardware upgrades on titan200[12] T361229

SSD and RAM upgrades have been installed thanks @Jhancock.wm!

@fgiunchedi how do you want to configure the RAID/filesystems on titan2001?

Thank you @Jhancock.wm @herron !

I think the easiest in this case would be to:

  • have titan2001 match titan2002 (i.e. remove the 1.5TB SSD we temporarily installed in T359070; this is destructive to the host but otherwise fine since we're reimaging below)
  • switch titan* hosts to use raid0-4dev.cfg (since titan1* will eventually match anyways)
  • reimage both hosts, this can happen at any time without disruption since hosts are stateless