
(Need By: TBD) install memory upgrades in ores100[1-9]
Closed, ResolvedPublic

Description

This task will track the installation of 4 additional 16GB DIMMs into each of the ores100[1-9] hosts in eqiad. Each host currently has (4) 16GB DIMMs (64GB), so this will double the memory per host to 128GB.

Hostname / Racking / Installation Details

Please coordinate downtime for each host with @wkandek or @akosiaris (whoever is leading this on that side).

host setup checklist

Each host should have its own setup checklist copied and pasted into the list below.

ores1001-1009:

  • - receive memory on procurement task T257955 & in Coupa
  • - schedule downtime of host with either @wkandek or @akosiaris (whoever is leading this on that side).
  • - depool services of host (typically handled by the sre team that runs the service, not dc ops)
  • - set host to maint mode in icinga
  • - check host currently sees (4) 16GB DIMMs, power down host, install new memory, power up host and ensure new memory is recognized
  • - return host to service, remove maint mode in icinga

Once the system(s) above have had all checkbox steps completed, this task can be resolved.

Event Timeline

RobH added a parent task: Unknown Object (Task). Aug 7 2020, 5:26 PM
RobH moved this task from Backlog to Hardware Failure / Troubleshoot on the ops-eqiad board.
RobH unsubscribed.

10:23 < robh> : so dc opsen
10:23 < robh> : we have a number of 'in place upgrades' this fiscal year
10:23 < robh> : should we add a column for that or should i just toss in a different one? cmjohnson1 replied with hw repair last week for the first of these
10:23 < robh> : but it seems this will be happening quite a bit going forward
10:23 < robh> : so ill put in hw repair for now, but also happy to add a column for 'In place upgrades' which seems less urgent than hw repair
10:24 < robh> : cmjohnson1 / papaul ^ thoughts? I don't wanna muddy up your workboards!

Jclark-ctr subscribed.

Received memory. Placed in storage room.

Hi @Jclark-ctr. I think we can follow the same process for this as I outlined in T259908#6426689. Do you have a date preference?

@akosiaris Is this coming Wednesday at 14:00 UTC too soon? If it is, let's try Wednesday of next week at the same time.

@Cmjohnson, Wednesday is fine.

In fact, I think we can do all of these in a single maint window (say 2-3 hours). Since gracefully powering off a host (via a press of the power button) will also depool it during the shutdown process, we don't even need coordination between teams. We tried that already in codfw with Papaul and it went fine. The main reason this works is that most requests come from changeprop, which will retry them anyway. As long as we do the hosts in batches of 1 or 2, we will be OK.
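The batching idea above can be sketched as a small helper. This is only an illustration: the function name `ores_batches` and its parameters are hypothetical and not part of any existing tooling; the only inputs taken from this task are the host range ores100[1-9] and the batch size of 1 or 2.

```python
def ores_batches(first=1001, last=1009, batch_size=2):
    """Split ores<first>..ores<last> into consecutive batches.

    Hypothetical helper for planning the maintenance order only;
    it does not power off or depool anything.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    hosts = [f"ores{n}" for n in range(first, last + 1)]
    # Slice the host list into chunks of at most batch_size hosts.
    return [hosts[i:i + batch_size] for i in range(0, len(hosts), batch_size)]


if __name__ == "__main__":
    for batch in ores_batches():
        print(", ".join(batch))
```

With a batch size of 2, the nine hosts fall into four pairs plus a final single host, so at most two hosts are down at any time.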

For what it's worth, I've already checked and all hosts see the current 4 16GB DIMMs.

I've scheduled downtime in icinga for 8h starting 12:00 UTC already. I'll be around to monitor, but for the most part the process will be:

  • Press the power button
  • Wait for the host to shut down
  • Install new memory
  • Power up host
  • Ensure new memory is recognized
  • The host will be returned to service automatically anyway once it's powered on and has booted up normally
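The "ensure new memory is recognized" step could be checked from the OS side with something like the sketch below. It assumes a Linux host exposing `/proc/meminfo`; the function names and the 5% tolerance are illustrative choices, not something this task specifies. The expected size of 8 x 16GB comes from the task description.

```python
import re

# Eight 16GB DIMMs per host after the upgrade (per the task description).
EXPECTED_GIB = 8 * 16


def mem_total_gib(meminfo_text):
    """Return the MemTotal value from /proc/meminfo-style text, in GiB."""
    match = re.search(r"^MemTotal:\s+(\d+)\s+kB", meminfo_text, re.MULTILINE)
    if match is None:
        raise ValueError("MemTotal line not found")
    # /proc/meminfo reports sizes in KiB (labelled "kB").
    return int(match.group(1)) / (1024 * 1024)


def memory_recognized(meminfo_text, expected_gib=EXPECTED_GIB, tolerance=0.05):
    """True if MemTotal is within `tolerance` of the expected size.

    MemTotal reads slightly below installed capacity because the kernel
    and firmware reserve some memory, hence the (assumed) 5% allowance.
    """
    return mem_total_gib(meminfo_text) >= expected_gib * (1 - tolerance)


if __name__ == "__main__":
    with open("/proc/meminfo") as f:
        print("new memory recognized:", memory_recognized(f.read()))
```

A host that still reports roughly 64GB would fail this check, which would point to unseated or undetected DIMMs.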

@akosiaris Thanks! I will get this done for you tomorrow.

Mentioned in SAL (#wikimedia-operations) [2020-09-30T13:55:01Z] <cmjohnson1> powering down ores1001 to upgrade memory T259909

Mentioned in SAL (#wikimedia-operations) [2020-09-30T14:03:24Z] <cmjohnson1> powering down ores1002 to upgrade memory T259909

Mentioned in SAL (#wikimedia-operations) [2020-09-30T14:10:52Z] <cmjohnson1> powering down ores100[3-9] to upgrade memory in each T259909

Cmjohnson updated the task description.

@akosiaris The memory upgrade is complete; I verified all servers are up and running. Resolving this task.