upgrade memory in ganeti100[5-8].eqiad.wmnet
Closed, ResolvedPublic

Description

This task will track the installation of memory upgrades in the ganeti100[5-8] systems. The memory upgrades were originally requested in T242885 and ordered in T243442.

Each ganeti node will likely need to be drained and rebalanced to accommodate the downtime (full power-down and power removal) needed to take each host from 4*16GB DIMMs (64GB total) to 8*16GB DIMMs (128GB total).

@Jclark-ctr or @Cmjohnson will need to work with @akosiaris to drain/rebalance each node.

ganeti1005: - TBD on 2020-06-25 (whole day maint window)

  • - memory received on T243442
  • - downtime scheduled with @akosiaris - please update this task description with each host's downtime window
  • - ensure all services are scheduled for downtime or migrated before beginning work.
  • - power off the system and remove power, install 4 additional 16GB DIMMs, power back on and ensure POST shows the full memory upgrade.
  • - hand system back to @akosiaris to restore services

ganeti1006: - TBD on 2020-07-02 (whole day maint window)

  • - memory received on T243442
  • - downtime scheduled with @akosiaris - please update this task description with each host's downtime window
  • - ensure all services are scheduled for downtime or migrated before beginning work.
  • - power off the system and remove power, install 4 additional 16GB DIMMs, power back on and ensure POST shows the full memory upgrade.
  • - hand system back to @akosiaris to restore services

ganeti1007: - TBD on 2020-07-09 (whole day maint window)

  • - memory received on T243442
  • - downtime scheduled with @akosiaris - please update this task description with each host's downtime window
  • - ensure all services are scheduled for downtime or migrated before beginning work.
  • - power off the system and remove power, install 4 additional 16GB DIMMs, power back on and ensure POST shows the full memory upgrade.
  • - hand system back to @akosiaris to restore services

ganeti1008: - TBD on 2020-07-16 (whole day maint window)

  • - memory received on T243442
  • - downtime scheduled with @akosiaris - please update this task description with each host's downtime window
  • - ensure all services are scheduled for downtime or migrated before beginning work.
  • - power off the system and remove power, install 4 additional 16GB DIMMs, power back on and ensure POST shows the full memory upgrade.
  • - hand system back to @akosiaris to restore services
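
The checklist asks for confirmation that each host shows the full memory upgrade after power-on. Beyond reading the POST screen, one possible cross-check once the OS is up is to compare `MemTotal` from `/proc/meminfo` against the expected 8*16GB. The sketch below is only an illustration and not part of the actual procedure used on this task: the function names and the 5% tolerance are assumptions (the kernel reserves some memory, so `MemTotal` is always somewhat below the installed amount).

```python
import re

KIB_PER_GIB = 1024 * 1024


def parse_mem_total_kib(meminfo_text: str) -> int:
    """Return the MemTotal value (in KiB) from the text of /proc/meminfo."""
    match = re.search(r"^MemTotal:\s+(\d+)\s+kB$", meminfo_text, re.MULTILINE)
    if match is None:
        raise ValueError("MemTotal line not found")
    return int(match.group(1))


def memory_upgrade_ok(
    meminfo_text: str,
    dimm_count: int = 8,      # DIMMs per host after the upgrade (from the task)
    dimm_gib: int = 16,       # size of each DIMM (from the task)
    tolerance: float = 0.05,  # assumed allowance for kernel-reserved memory
) -> bool:
    """Check that MemTotal is within `tolerance` of the installed memory."""
    expected_kib = dimm_count * dimm_gib * KIB_PER_GIB
    actual_kib = parse_mem_total_kib(meminfo_text)
    return actual_kib >= expected_kib * (1 - tolerance)
```

On a host this could be called as `memory_upgrade_ok(open("/proc/meminfo").read())`. A host still running on 4 DIMMs would fail the check, since 64GB is far below the tolerance band around 128GB.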

Event Timeline

RobH triaged this task as Medium priority. Feb 6 2020, 7:51 PM
RobH created this task.
RobH added a parent task: Unknown Object (Task). Feb 6 2020, 7:51 PM
RobH removed a subscriber: RobH.

Disregard my last comment, I was talking about different, new ganeti servers. It does not apply to this ticket.

Those 4 machines will have to be done one by one, in order, as @RobH points out. Overall, about an hour of advance notice should suffice, but let's do one each day? I'll add tentative maint windows (lasting 1 day each, for your convenience) to the task.

ganeti1005 is ready to receive the memory modules, feel free to power off and proceed. Let me know when done, so I can proceed with repooling it and then depooling ganeti1006.

There's nothing rushing us on this btw, feel free to propose alternative maint windows.

@akosiaris what times work best for you? I am usually on site Tuesday and Thursday.

@Jclark-ctr: OK, how about 1 host per week? No need for specific timeframes. I'll have the host depooled, emptied, powered off and downtimed in icinga and ready for the memory upgrade. All you'll need to do is add the memory and power up.

How about the following?
ganeti1005 -> Thursday 25 Jun
ganeti1006 -> Thursday 02 Jul
ganeti1007 -> Thursday 09 Jul
ganeti1008 -> Thursday 16 Jul

Let me know if that sounds good to you.

@akosiaris that sounds great. Ping me on IRC if anything comes up.

Icinga downtime for 12:00:00 set by akosiaris@cumin1001 on 1 host(s) and their services with reason: Memory upgrade

kubestagetcd1004.eqiad.wmnet

Icinga downtime for 12:00:00 set by akosiaris@cumin1001 on 1 host(s) and their services with reason: Memory upgrade

ganeti1005.eqiad.wmnet

Mentioned in SAL (#wikimedia-operations) [2020-06-25T10:04:41Z] <akosiaris> poweroff kubestagetcd1004 and ganeti1005 for T244530

@Jclark-ctr: ganeti1005 is ready. Fully depooled, downtimed and powered off.

@akosiaris ganeti1005 is finished and booting up now. Thanks!

@Jclark-ctr Excellent. I started the process of emptying ganeti1006 (and filling ganeti1005). That should take quite a while, but we should be on time for next Thursday. Many thanks!

kubestagetcd1004 is still down. Not sure if that is desired. ACKed in Icinga.

No, mistake on my side. Thanks! I've just started it up.

Icinga downtime for 3 days, 0:00:00 set by akosiaris@cumin1001 on 1 host(s) and their services with reason: Memory upgrade

ganeti1006.eqiad.wmnet

Icinga downtime for 4 days, 0:00:00 set by akosiaris@cumin1001 on 1 host(s) and their services with reason: Memory upgrade

ganeti1006.eqiad.wmnet

@Jclark-ctr ganeti1006 is ready for the memory upgrade. Downtimed already and powered off.

@akosiaris finished the upgrade, host is powering up right now.

ganeti1006 looks ok, I am already moving VMs to it. Emptying ganeti1007 now.

@akosiaris I will also be on site tomorrow, if the host is available to do 1 day earlier.

Icinga downtime for 2 days, 0:00:00 set by akosiaris@cumin1001 on 1 host(s) and their services with reason: Memory upgrade

ganeti1007.eqiad.wmnet

Icinga downtime for 2 days, 0:00:00 set by akosiaris@cumin1001 on 1 host(s) and their services with reason: Memory upgrade

etcd1003.eqiad.wmnet

@Jclark-ctr Yes! It took a while, but all migrations are done. The host has been downtimed for 48h and has been powered off.

Did you plug in the new DIMMs yesterday?

Finished with the memory upgrade on ganeti1007.

Change 612167 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] bacula: Add ignorelist for long-term broken backups

https://gerrit.wikimedia.org/r/612167

Mentioned in SAL (#wikimedia-operations) [2020-07-13T11:44:24Z] <akosiaris> repool ganeti1007 T244530. Start emptying ganeti1008

Change 612167 merged by Jcrespo:
[operations/puppet@production] bacula: Add ignorelist for long-term broken backups

https://gerrit.wikimedia.org/r/612167

Icinga downtime for 2 days, 0:00:00 set by akosiaris@cumin1001 on 1 host(s) and their services with reason: Memory upgrade

etcd1002.eqiad.wmnet

Icinga downtime for 2 days, 0:00:00 set by akosiaris@cumin1001 on 1 host(s) and their services with reason: Memory upgrade

kubetcd1005.eqiad.wmnet

Icinga downtime for 2 days, 0:00:00 set by akosiaris@cumin1001 on 1 host(s) and their services with reason: Memory upgrade

ganeti1008.eqiad.wmnet

@Jclark-ctr ganeti1008 went faster than expected and it is ready for the memory upgrade. Downtimed and powered off.

@akosiaris Finished with the upgrade on ganeti1008.

@Jclark-ctr The mgmt interface of ganeti1008 just went down. Could you please check the cable?

Looks ok now in icinga, so was this transient, or has it been fixed?

Host added back in the cluster, VMs being migrated back to it.

@Jclark-ctr many thanks!