
Reinstall labmon1001 with new disk configuration (and jessie)
Closed, Declined · Public

Description

The disks are currently configured very inefficiently, which leads to very slow I/O. Reinstall with a better configuration (./modules/install_server/files/autoinstall/partman/raid10-gpt-srv-lvm-ext4.cfg?) and migrate to Jessie in the process.

Event Timeline

While attempting the reinstall, this system starts with working serial redirection and then stops producing output for no apparent reason. The system then attempts to continue, but without any console output we cannot see what issues it encounters.

When attempting to send racadm commands for boot priority, it now also outputs errors:

/admin1-> racadm config -g cfgServerInfo -o cfgServerBootOnce 1
VKCS:Error Code : 1ERROR: Failed to set the object value.
===============================================================================
IMPORTANT NOTE!
The RAC is unable to communicate with the BMC. This condition may
occur because of (1) no BMC is present, (2) missing or disfunctional
IPMI-related software components. Many RAC features depend on BMC
connectivity in order to work properly, and you may see failures
as a result.

It then accepts the commands when they are sent a second time, but the error points to larger underlying issues. In fact, after getting into the BIOS to pull the RAM, the redirection died out again. Disconnecting and reconnecting doesn't appear to solve the issue.

I'm advising that @Cmjohnson pull power from this system to hard reset it, and see if that clears the issue. If not, this system is no longer under warranty and we'll need to look at alternatives.

The best alternative in spares is wmf4659 (was restbase1004). It has 64GB of memory (same as labmon1001) and similar (improved) CPU specs. We'll need to take the old disks from labmon1001 for its replacement, as the suggested replacement lacks disks.

RobH added a project: ops-eqiad.

Assigning this to @Cmjohnson and adding ops-eqiad for him to remove power entirely from labmon1001 and add it back so we can see if it works afterwards. If not, we'll then escalate to a hardware-requests task for a new system.

Restricted Application added a subscriber: Southparkfan.

The machine died on the table; it never quite came back up from a restart. We did T136972 instead.