upgrade memory in ganeti100[5-8].eqiad.wmnet
Open, Medium, Public

Description

This task will track the installation of memory upgrades in ganeti100[5-8] systems. The memory upgrades were originally requested on T242885, and ordered on T243442.

Each ganeti node will likely need to be drained and rebalanced to accommodate the downtime (including full power down and power removal) needed for each of these hosts to go from 4*16GB DIMMs (64GB total) to 8*16GB DIMMs (128GB total).

@Jclark-ctr or @Cmjohnson will need to work with @akosiaris to drain/rebalance each node.

ganeti1005: - TBD on 2020-06-25 (whole day maint window)

  • - memory received on T243442
  • - downtime scheduled with @akosiaris - please update this task description with each host's downtime window
  • - ensure all services are scheduled for downtime or migrated before beginning work.
  • - system powered off & power removed, install 4 additional 16GB DIMMs, power back on and ensure POST shows the full memory upgrade.
  • - hand system back to @akosiaris to restore services

ganeti1006: - TBD on 2020-07-02 (whole day maint window)

  • - memory received on T243442
  • - downtime scheduled with @akosiaris - please update this task description with each host's downtime window
  • - ensure all services are scheduled for downtime or migrated before beginning work.
  • - system powered off & power removed, install 4 additional 16GB DIMMs, power back on and ensure POST shows the full memory upgrade.
  • - hand system back to @akosiaris to restore services

ganeti1007: - TBD on 2020-07-09 (whole day maint window)

  • - memory received on T243442
  • - downtime scheduled with @akosiaris - please update this task description with each host's downtime window
  • - ensure all services are scheduled for downtime or migrated before beginning work.
  • - system powered off & power removed, install 4 additional 16GB DIMMs, power back on and ensure POST shows the full memory upgrade.
  • - hand system back to @akosiaris to restore services

ganeti1008: - TBD on 2020-07-16 (whole day maint window)

  • - memory received on T243442
  • - downtime scheduled with @akosiaris - please update this task description with each host's downtime window
  • - ensure all services are scheduled for downtime or migrated before beginning work.
  • - system powered off & power removed, install 4 additional 16GB DIMMs, power back on and ensure POST shows the full memory upgrade.
  • - hand system back to @akosiaris to restore services
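
As a rough illustration of the "ensure POST shows full memory upgrade" step above, the same check could also be made from the running OS once a host boots. The sketch below is only an assumption about how one might do that, not part of this task's procedure: the function names and the 5% tolerance are hypothetical, and it relies on the Linux `/proc/meminfo` format, where `MemTotal` reads slightly below the installed capacity because the kernel reserves some memory.

```python
import re


def mem_total_gib(meminfo_text: str) -> float:
    """Return the MemTotal value from /proc/meminfo-formatted text, in GiB."""
    match = re.search(r"^MemTotal:\s+(\d+)\s+kB$", meminfo_text, re.MULTILINE)
    if match is None:
        raise ValueError("MemTotal line not found")
    # /proc/meminfo reports kB (really KiB); 1 GiB = 1024 * 1024 KiB.
    return int(match.group(1)) / (1024 * 1024)


def upgrade_looks_complete(meminfo_text: str,
                           dimm_count: int = 8,
                           dimm_gib: int = 16,
                           tolerance: float = 0.05) -> bool:
    """True if reported memory is within `tolerance` of dimm_count * dimm_gib.

    The tolerance is a hypothetical allowance for kernel-reserved memory.
    """
    expected_gib = dimm_count * dimm_gib  # 8 * 16 = 128 after the upgrade
    return mem_total_gib(meminfo_text) >= expected_gib * (1 - tolerance)


# Example use on a host (Linux only):
#   with open("/proc/meminfo") as f:
#       print(upgrade_looks_complete(f.read()))
```

A host still reporting roughly 64GB would fail this check, which would suggest that one or more of the new DIMMs is not seated or not detected.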

Event Timeline

RobH triaged this task as Medium priority.Feb 6 2020, 7:51 PM
RobH created this task.
Restricted Application added a project: Operations. · View Herald TranscriptFeb 6 2020, 7:51 PM
RobH added a parent task: Unknown Object (Task).Feb 6 2020, 7:51 PM
RobH removed a subscriber: RobH.
Dzahn added a subscriber: Dzahn.EditedFeb 6 2020, 7:56 PM

Disregard my last comment, i was talking about different, new, ganeti servers. Does not apply to this ticket.

akosiaris added a subscriber: RobH.Feb 7 2020, 1:56 PM

Those 4 machines will have to be done one by one, in order, as @RobH points out. Overall, about an hour of advance notice should suffice, but let's do one each day? I'll add tentative maint windows (lasting 1 day each, for your convenience) to the task.

ganeti1005 is ready to receive the memory modules, feel free to power off and proceed. Let me know when done, so I can then proceed with repooling it and depooling ganeti1006.

akosiaris updated the task description. (Show Details)Feb 7 2020, 1:57 PM

There's nothing rushing us on this btw, feel free to propose alternative maint windows.

RobH removed a subscriber: RobH.Feb 7 2020, 4:19 PM
Cmjohnson moved this task from Backlog to Blocked on the ops-eqiad board.Feb 19 2020, 4:20 PM

@akosiaris what times work best for you? I am usually on site Tuesday and Thursday.

@Jclark-ctr: OK, how about 1 host per week? No need for specific timeframes. I'll have the host depooled, emptied, powered off and downtimed in Icinga and ready for the memory upgrade. All you'll need to do is add the memory and power up.

How about the following?
ganeti1005 -> Thursday 25 Jun
ganeti1006 -> Thursday 02 Jul
ganeti1007 -> Thursday 09 Jul
ganeti1008 -> Thursday 16 Jul

Let me know if that sounds good to you.
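
For reference, the one-host-per-week cadence proposed above can be reproduced with a few lines of Python. This is only an illustrative sketch: the helper name `weekly_windows` is made up here, and the start date is taken from the proposal for ganeti1005.

```python
from datetime import date, timedelta


def weekly_windows(hosts, first_day):
    """Map each host to a maintenance day, one week apart, in list order."""
    return {host: first_day + timedelta(weeks=i) for i, host in enumerate(hosts)}


hosts = ["ganeti1005", "ganeti1006", "ganeti1007", "ganeti1008"]
schedule = weekly_windows(hosts, date(2020, 6, 25))

for host, day in schedule.items():
    # %a gives the abbreviated weekday, e.g. "Thu"
    print(f"{host} -> {day.strftime('%a %d %b %Y')}")
```

Keeping the hosts in a list preserves the one-by-one ordering, since each node has to be emptied onto the others before its window.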

@akosiaris that sounds great. Ping me on IRC if anything comes up.

Jclark-ctr updated the task description. (Show Details)Tue, Jun 23, 7:53 PM

Icinga downtime for 12:00:00 set by akosiaris@cumin1001 on 1 host(s) and their services with reason: Memory upgrade

kubestagetcd1004.eqiad.wmnet

Icinga downtime for 12:00:00 set by akosiaris@cumin1001 on 1 host(s) and their services with reason: Memory upgrade

ganeti1005.eqiad.wmnet

Mentioned in SAL (#wikimedia-operations) [2020-06-25T10:04:41Z] <akosiaris> poweroff kubestagetcd1004 and ganeti1005 for T244530

akosiaris updated the task description. (Show Details)Thu, Jun 25, 10:04 AM

@Jclark-ctr: ganeti1005 is ready. Fully depooled, downtimed and powered off.

@akosiaris ganeti1005 is finished and booting up now Thanks!

akosiaris updated the task description. (Show Details)Thu, Jun 25, 2:19 PM

@Jclark-ctr Excellent. I started the process of emptying ganeti1006 (and filling ganeti1005). That should take quite a while, but we should be on time for next Thursday. Many thanks!

kubestagetcd1004 is still down. not sure if desired. ACKed in Icinga.

No, mistake on my side. Thanks! I've just started it up.

Icinga downtime for 3 days, 0:00:00 set by akosiaris@cumin1001 on 1 host(s) and their services with reason: Memory upgrade

ganeti1006.eqiad.wmnet

Icinga downtime for 4 days, 0:00:00 set by akosiaris@cumin1001 on 1 host(s) and their services with reason: Memory upgrade

ganeti1006.eqiad.wmnet
akosiaris updated the task description. (Show Details)Wed, Jul 1, 11:05 AM
akosiaris added a comment.EditedThu, Jul 2, 3:01 PM

@Jclark-ctr ganeti1006 is ready for the memory upgrade. Downtimed already and powered off.

@akosiaris finished the upgrade, host is powering up right now.

Jclark-ctr updated the task description. (Show Details)Thu, Jul 2, 3:29 PM

ganeti1006 looks ok, I am already moving VMs to it. Emptying ganeti1007 now.

@akosiaris I will also be on site tomorrow, if the host is available to do 1 day earlier.

Icinga downtime for 2 days, 0:00:00 set by akosiaris@cumin1001 on 1 host(s) and their services with reason: Memory upgrade

ganeti1007.eqiad.wmnet

Icinga downtime for 2 days, 0:00:00 set by akosiaris@cumin1001 on 1 host(s) and their services with reason: Memory upgrade

etcd1003.eqiad.wmnet
akosiaris updated the task description. (Show Details)Wed, Jul 8, 7:48 AM

@Jclark-ctr Yes! Took a while but all migrations are done, host has been downtimed for 48H and has been powered off

Did you plug in the new DIMMs yesterday?

Jclark-ctr updated the task description. (Show Details)EditedFri, Jul 10, 1:14 PM

finished with memory upgrade on ganeti1007