
(Need By: TBD) install memory upgrades in ores200[1-9]
Closed, Resolved, Public

Description

This task will track the installation of 4 additional 16GB DIMMs into each of the ores200[1-9] hosts in codfw. Currently each host has (4) 16GB DIMMs, so this will double the memory per host from 64GB to 128GB.

Hostname / Racking / Installation Details

Please coordinate downtime for each host with @wkandek or @akosiaris (whoever is leading this on that side).

Per host setup checklist

Each host should have its own setup checklist copied and pasted into the list below.

ores2001:

  • - receive in memory on procurement task T257954 & in coupa
  • - schedule downtime of host with either @wkandek or @akosiaris (whoever is leading this on that side).
  • - depool services of host (typically handled by the sre team that runs the service, not dc ops)
  • - set host to maint mode in icinga
  • - check host currently sees (4) 16GB dimms, power down host, install new memory, power up host and ensure new memory is recognized
  • - return host to service, remove maint mode in icinga

ores2002:

  • - receive in memory on procurement task T257954 & in coupa
  • - schedule downtime of host with either @wkandek or @akosiaris (whoever is leading this on that side).
  • - depool services of host (typically handled by the sre team that runs the service, not dc ops)
  • - set host to maint mode in icinga
  • - check host currently sees (4) 16GB dimms, power down host, install new memory, power up host and ensure new memory is recognized
  • - return host to service, remove maint mode in icinga

ores2003:

  • - receive in memory on procurement task T257954 & in coupa
  • - schedule downtime of host with either @wkandek or @akosiaris (whoever is leading this on that side).
  • - depool services of host (typically handled by the sre team that runs the service, not dc ops)
  • - set host to maint mode in icinga
  • - check host currently sees (4) 16GB dimms, power down host, install new memory, power up host and ensure new memory is recognized
  • - return host to service, remove maint mode in icinga

ores2004:

  • - receive in memory on procurement task T257954 & in coupa
  • - schedule downtime of host with either @wkandek or @akosiaris (whoever is leading this on that side).
  • - depool services of host (typically handled by the sre team that runs the service, not dc ops)
  • - set host to maint mode in icinga
  • - check host currently sees (4) 16GB dimms, power down host, install new memory, power up host and ensure new memory is recognized
  • - return host to service, remove maint mode in icinga

ores2005:

  • - receive in memory on procurement task T257954 & in coupa
  • - schedule downtime of host with either @wkandek or @akosiaris (whoever is leading this on that side).
  • - depool services of host (typically handled by the sre team that runs the service, not dc ops)
  • - set host to maint mode in icinga
  • - check host currently sees (4) 16GB dimms, power down host, install new memory, power up host and ensure new memory is recognized
  • - return host to service, remove maint mode in icinga

ores2006:

  • - receive in memory on procurement task T257954 & in coupa
  • - schedule downtime of host with either @wkandek or @akosiaris (whoever is leading this on that side).
  • - depool services of host (typically handled by the sre team that runs the service, not dc ops)
  • - set host to maint mode in icinga
  • - check host currently sees (4) 16GB dimms, power down host, install new memory, power up host and ensure new memory is recognized
  • - return host to service, remove maint mode in icinga

ores2007:

  • - receive in memory on procurement task T257954 & in coupa
  • - schedule downtime of host with either @wkandek or @akosiaris (whoever is leading this on that side).
  • - depool services of host (typically handled by the sre team that runs the service, not dc ops)
  • - set host to maint mode in icinga
  • - check host currently sees (4) 16GB dimms, power down host, install new memory, power up host and ensure new memory is recognized
  • - return host to service, remove maint mode in icinga

ores2008:

  • - receive in memory on procurement task T257954 & in coupa
  • - schedule downtime of host with either @wkandek or @akosiaris (whoever is leading this on that side).
  • - depool services of host (typically handled by the sre team that runs the service, not dc ops)
  • - set host to maint mode in icinga
  • - check host currently sees (4) 16GB dimms, power down host, install new memory, power up host and ensure new memory is recognized
  • - return host to service, remove maint mode in icinga

ores2009:

  • - receive in memory on procurement task T257954 & in coupa
  • - schedule downtime of host with either @wkandek or @akosiaris (whoever is leading this on that side).
  • - depool services of host (typically handled by the sre team that runs the service, not dc ops)
  • - set host to maint mode in icinga
  • - check host currently sees (4) 16GB dimms, power down host, install new memory, power up host and ensure new memory is recognized
  • - return host to service, remove maint mode in icinga

Once the system(s) above have had all checkbox steps completed, this task can be resolved.

Event Timeline

Restricted Application added a subscriber: Aklapper.
RobH added a parent task: Unknown Object (Task). Aug 7 2020, 5:26 PM
RobH moved this task from Backlog to Hardware Failure / Troubleshoot on the ops-codfw board.
RobH unsubscribed.

10:23 < robh> : so dc opsen
10:23 < robh> : we have a number of 'in place upgrades' this fiscal year
10:23 < robh> : should we add a column for that or should i just toss in a different one? cmjohnson1 replied with hw repair last week for the first of these
10:23 < robh> : but it seems this will be happening quite a bit going forward
10:23 < robh> : so ill put in hw repair for now, but also happy to add a column for 'In place upgrades' which seems less urgent than hw repair
10:24 < robh> : cmjohnson1 / papaul ^ thoughts? I don't wanna muddy up your workboards!

I think we can do all of these in a single maintenance window (say 2-3 hours). Since gracefully powering off a host (via a press of the power button) also depools it during the shutdown process, we probably don't even need coordination between teams. The main reason is that most requests come from changeprop, which will retry them anyway. As long as we do the hosts in batches of 1 or 2, we will be OK.
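The batching idea above can be sketched as follows. This is only an illustration, assuming a batch size of 2 and the hostnames from this task; `host_batches` is a hypothetical helper, not an existing tool.

```python
from typing import Iterator, List


def host_batches(hosts: List[str], batch_size: int = 2) -> Iterator[List[str]]:
    """Yield consecutive groups of at most batch_size hosts."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(hosts), batch_size):
        yield hosts[start:start + batch_size]


# The nine hosts covered by this task: ores2001 .. ores2009.
hosts = [f"ores200{n}" for n in range(1, 10)]
batches = list(host_batches(hosts, batch_size=2))

for number, batch in enumerate(batches, start=1):
    # Only the hosts in one batch are powered off at the same time.
    print(f"batch {number}: {', '.join(batch)}")
```

With nine hosts and a batch size of 2 this gives five batches, the last one holding a single host.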

For what it's worth, I've already checked and all hosts see the current 4 16GB DIMMs.

@Papaul let me know of a time window that suits you. I'll schedule the downtime in icinga and be around to monitor, but for the most part the process will be:

  • Press the power button
  • Wait for the host to shut down
  • Install new memory
  • Power up host
  • Ensure new memory is recognized

The host will be returned to service automatically anyway once it's powered on.

How does that sound?

@akosiaris welcome back. I hope you had a great vacation. You can proceed with the downtime; I will take care of powering the servers and adding the DIMMs when on site today.

Thanks.

Let's actually skip today, given we just completed the DC switchover and I'd rather let things settle a bit. How about tomorrow?

I'm off tomorrow. How about Thursday?

It's a date! I'll schedule downtime for about 6h (just to be on the safe side) on Thursday then.

ores2* hosts downtimed for an 8h period on Thursday, feel free to proceed.

pt1979@ores2001:~$ free
              total        used        free      shared  buff/cache   available
Mem:      131941296     2626056   128034636       33420     1280604   128420204
pt1979@ores2002:~$ free
              total        used        free      shared  buff/cache   available
Mem:      131941296      938024   130623356       25232      379916   130180892
pt1979@ores2003:~$ free
              total        used        free      shared  buff/cache   available
Mem:      131941296     1677644   129397908       25228      865744   129379156
pt1979@ores2004:~$ free
              total        used        free      shared  buff/cache   available
Mem:      131941296    31405676    98827404       25872     1708216    99646304
pt1979@ores2005:~$ free
              total        used        free      shared  buff/cache   available
Mem:      131941296     1274024   130023452       25224      643820   129784280
pt1979@ores2006:~$ free
              total        used        free      shared  buff/cache   available
Mem:      131941300     2120984   128807416       25228     1012900   128934268
pt1979@ores2007:~$ free
              total        used        free      shared  buff/cache   available
Mem:      131941296     1472304   129796668       25236      672324   129585512
pt1979@ores2008:~$ free
              total        used        free      shared  buff/cache   available
Mem:      131941296    27541068   102714044       25860     1686184   103509940
pt1979@ores2009:~$ free
              total        used        free      shared  buff/cache   available
Mem:      131941296     1847020   129251680       25228      842596   129210644
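The totals above can be sanity-checked against the expected 8 x 16GB = 128GB per host. Below is a minimal sketch, assuming `free` reports kibibytes (its default) and that the kernel-visible total sits a few percent below the installed amount; the 5% tolerance and the function names are assumptions made for illustration.

```python
def mem_total_kib(free_output: str) -> int:
    """Return the 'total' column of the 'Mem:' line from `free` output."""
    for line in free_output.splitlines():
        if line.startswith("Mem:"):
            return int(line.split()[1])
    raise ValueError("no 'Mem:' line found")


def upgrade_recognized(total_kib: int, expected_gib: int = 128,
                       tolerance: float = 0.05) -> bool:
    """True if the reported total is within tolerance of the installed memory.

    The tolerance is an assumed allowance for memory reserved by
    firmware and the kernel, which `free` does not count.
    """
    total_gib = total_kib / (1024 * 1024)
    return total_gib >= expected_gib * (1 - tolerance)


# Output pasted from ores2001 above.
sample = """\
              total        used        free      shared  buff/cache   available
Mem:      131941296     2626056   128034636       33420     1280604   128420204
"""

total = mem_total_kib(sample)
print(f"{total / (1024 * 1024):.2f} GiB, recognized: {upgrade_recognized(total)}")
```

A host still running on the original 64GB would report roughly half this total and fail the check.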
Papaul updated the task description.
Papaul subscribed.

@akosiaris All yours. If all is good, please go ahead and resolve the task.
Thanks

Looking at graphs and metrics, the operation was noticeable but did not cause any issues. Some scores errored, but not noticeably more than usual, so this is pretty nice!

Many thanks @Papaul, awesome work!