(Need By: 2020-09-15) upgrade/replace memory in stat100[58]
Closed, ResolvedPublic

Description

This task tracks the memory upgrade of stat100[58]: all 16GB dimms are replaced so that each host ends up with a total of 512GB of memory made up of 32GB dimms.

Need by date: robh just picked 2020-10-31, as the original request simply states (sometime in Q1/Q2).

Need by date: @elukey set 2020-09-15 as new deadline, if it can work.

Per host setup checklist

Each host should have its own setup checklist copied and pasted into the list below.

stat1005:

  • - receive in memory on procurement task T256017 & in coupa
  • - schedule downtime with @elukey or another member of the Analytics team. (most hardware issues are handled by @elukey or @Ottomata.)
  • - power down system, swap all 16GB dimms out and install enough 32GB dimms to total 512GB.
  • - power up system, ensure system sees all new dimms (recommend lshw -class memory)
  • - return system to service (Analytics team)

stat1008:

  • - receive in memory on procurement task T256017 & in coupa
  • - schedule downtime with @elukey or another member of the Analytics team. (most hardware issues are handled by @elukey or @Ottomata.)
  • - power down system, add the 12 new 32GB dimms alongside the existing 32GB dimms
  • - power up system, ensure system sees all new dimms (recommend lshw -class memory)
  • - return system to service (Analytics team)

Once the system(s) above have had all checkbox steps completed, this task can be resolved.
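
The "ensure system sees all new dimms" step can be partly automated. The following is a minimal sketch, assuming the plain-text output of lshw -class memory lists each populated slot as a "*-bank" section with a "size: NGiB" line; the function name bank_sizes_gib and the sample text are hypothetical and only illustrate the idea, so check them against the real output on the host before relying on this.

```python
import re


def bank_sizes_gib(lshw_text):
    """Return the size in GiB of every populated *-bank section.

    Assumes each populated bank has a line of the form 'size: 32GiB'.
    Sizes under other sections (system memory total, caches) are ignored.
    """
    sizes = []
    in_bank = False
    for line in lshw_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("*-"):
            # A new section starts; remember whether it is a memory bank.
            in_bank = stripped.startswith("*-bank")
            continue
        match = re.fullmatch(r"size:\s*(\d+)GiB", stripped)
        if in_bank and match:
            sizes.append(int(match.group(1)))
    return sizes


# Hypothetical, shortened sample; real output has more fields per section.
SAMPLE = """\
  *-memory
       description: System Memory
       size: 64GiB
     *-bank:0
          description: DIMM DDR4 Synchronous
          size: 32GiB
     *-bank:1
          description: DIMM DDR4 Synchronous
          size: 32GiB
     *-bank:2
          description: [empty]
"""

sizes = bank_sizes_gib(SAMPLE)
print(len(sizes), "dimms,", sum(sizes), "GiB total")
```

On a finished host the expectation from this task would be 16 dimms of 32 each, totalling 512.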

Event Timeline

RobH added a parent task: Unknown Object (Task). Aug 14 2020, 4:40 PM
RobH moved this task from Backlog to Hardware Failure / Troubleshoot on the ops-eqiad board.
RobH moved this task from Backlog to Acknowledged on the SRE board.
RobH unsubscribed.
elukey renamed this task from (Need By: 2020-10-31) upgrade/replace memory in stat100[58] to (Need By: 2020-09-15) upgrade/replace memory in stat100[58]. Aug 14 2020, 7:27 PM
elukey updated the task description.
elukey added a subscriber: RobH.
elukey removed a subscriber: RobH.
Jclark-ctr added a subscriber: RobH.
Jclark-ctr subscribed.

Received the memory and placed it in the storage room.

@Jclark-ctr I'd need to schedule the maintenance in advance to let people know that we are rebooting (a lot of users use these hosts). When you folks are ready, let's decide a date/time and I'll communicate it :)

@elukey Can you do this Monday 5 October 1400UTC?

@RobH @wiki_willy I am looking at the packing slip and what I have in the data center, and it appears we're 4 DIMMs short. The packing slip is correct, showing 16 and 12 sent, which matches what I have received, but the task calls for replacing with 16 DIMMs.

RobH mentioned this in Unknown Object (Task). Oct 5 2020, 2:29 PM

Turns out the racking task wasn't clear:

stat1005 swaps its existing 16GB dimms for the 16 32GB dimms
stat1008 adds the 12 32GB dimms to its existing memory

end result is both systems will end up with 512GB total ram when done.
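
As a quick sanity check of the dimm counts above, here is a small sketch of the arithmetic. The 16 and 12 counts, the 32GB size and the 512GB target come from this task; the figure for what stat1008 already has installed is derived from those numbers, not read from the host.

```python
DIMM_GB = 32      # size of each new dimm, per this task
TARGET_GB = 512   # target total per host, per this task


def total_gb(dimm_count, dimm_gb=DIMM_GB):
    """Total capacity in GB for a number of identical dimms."""
    return dimm_count * dimm_gb


# stat1005: all 16GB dimms are swapped out for 16 x 32GB.
stat1005_total = total_gb(16)

# stat1008: 12 x 32GB are added to what is already installed.
stat1008_added = total_gb(12)
# Derived assumption: whatever is needed to reach the target is already there.
stat1008_existing_gb = TARGET_GB - stat1008_added
stat1008_existing_dimms = stat1008_existing_gb // DIMM_GB

print("stat1005 total:", stat1005_total, "GB")
print("stat1008 added:", stat1008_added, "GB")
print("stat1008 existing (derived):", stat1008_existing_gb, "GB,",
      stat1008_existing_dimms, "x 32GB dimms")
```

So 16 + 12 = 28 dimms shipped is consistent with both hosts reaching 512GB, provided stat1008 already holds 128GB.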

On stat1008, I added all the DIMMs and the server would not boot. I received the following error:

UEFI0060: Power required by the system exceeds the power supplied by the Power
Supply Units (PSUs).
Check the PSU and system configuration, and then upgrade the PSU, if necessary.

I started backing the DIMMs out 2 at a time until the server booted without the error. I now have (4) 32GB DIMMs on each side.

@wiki_willy @RobH @elukey if you want to investigate power supplies please create a new task. I do not want to keep this task open for a different issue.

Cmjohnson subscribed.

Assigning this to @wiki_willy to figure out whether we want to upgrade the power supplies.

Weird, I've never seen PSUs being affected like that after a memory upgrade before, so I'll take it as an action item to reach out to the vendor and find out if they have any recommendations. Thanks, Willy

Spoke to our technical Dell rep today, and followed up with an email. Hopefully there's an easy way to get it working. If not, we'll close out this task and figure out a way to procure the new PSUs. Thanks, Willy

PSUs were shipped out today and should arrive next week. Assigning back to @Cmjohnson to complete the PSU swaps on stat1008 and add the DIMMs back in. Thanks, Willy

Looks like it's still not shipped yet. Dell has an order number, but no tracking number yet for shipment.

I followed up with Dell (during my regular meeting with them) about the status of the PSUs. They said the PSUs were delivered on November 2 and that they will email the tracking details soon, so the PSUs should be onsite in Equinix's shipping location now.

Thanks,
Willy

Tracking #935433832396

Entered ticket 1-202596400888 to have this received and put in our cage.

Just got notice this was received, so it should be delivered to our cage/storage now.

As an FYI, I'd need to advertise the host maintenance a couple of days in advance to users, so I'm happy to work with Chris/John in any time window, but we need to schedule it first :)

@elukey Let's schedule this for next Tuesday please 1500UTC (10EST)

Looks good to me, going to send an announcement to analytics users now :)

Added the new power supplies (will keep the older ones for spares). Added all the new memory sticks. Resolving this task; if something comes up related to the upgrade, please ping me and re-open.