⚓ T277327 (Need By: 2021-04-30) rack/setup/install backup100[4-7]

	Subject	Repo	Branch	Lines +/-
	backup100[4567] setup params	operations/puppet	production	+20 -0

RobH created this task.Mar 12 2021, 5:59 PM

RobH moved this task from Backlog to Racking Tasks on the ops-eqiad board.

RobH added a parent task: Unknown Object (Task).

RobH mentioned this in Unknown Object (Task).

RobH unsubscribed.

Maintenance_bot added a project: SRE.Mar 12 2021, 6:45 PM

@wiki_willy we are short on 2u spaced in 10g racks while being diverse

Hi @Jclark-ctr - are there specific racks that you need the space in? We also have some high priority 740xd2 servers coming in Q1, that we should make room for at the same time. Thanks, Willy

In T277327#7088817, @Jclark-ctr wrote:

@wiki_willy we are short on 2u spaced in 10g racks while being diverse

jcrespo updated the task description. (Show Details)May 25 2021, 7:27 AM

I have added a link to https://wikitech.wikimedia.org/wiki/Raid_setup#Dell_R740xd2 on the setup. One issue we found with the RAID is that only 1 disk device can be set as bootable at the same time. For this kind of hardware, we want the first SSD device to be the one set as bootable, as otherwise the automatic recipe will not work. This means that, after setting the HDs in RAID6, we need to set "Operations > make bootable > Go" on the first ssd manually.

backup1004. A4 u9 port1 Cableid#5320
backup1005. B4 u27 port11 Cableid#5351
backup1006. C2 U15 port21 Cableid#6011
backup1007. D7 U13 port12 Cableid#3970

Jclark-ctr reassigned this task from Jclark-ctr to • Cmjohnson.Jun 29 2021, 10:23 PM

Jclark-ctr updated the task description. (Show Details)

Jclark-ctr subscribed.

All are finished with on-site tasks, the raid configuration was also completed.

RobH claimed this task.Jul 12 2021, 5:48 PM

RobH added a subscriber: • Cmjohnson.

RobH updated the task description. (Show Details)Jul 12 2021, 8:10 PM

RobH updated the task description. (Show Details)

Change 704158 had a related patch set uploaded (by RobH; author: RobH):

[operations/puppet@production] backup100[4567] setup params

https://gerrit.wikimedia.org/r/704158

gerritbot added a project: Patch-For-Review.Jul 12 2021, 8:18 PM

Change 704158 merged by RobH:

[operations/puppet@production] backup100[4567] setup params

https://gerrit.wikimedia.org/r/704158

Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts:

['backup1004.eqiad.wmnet', 'backup1005.eqiad.wmnet', 'backup1006.eqiad.wmnet', 'backup1007.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202107122027_robh_11220.log.

Maintenance_bot removed a project: Patch-For-Review.Jul 12 2021, 9:10 PM

These failed for not liking the specified partition recipie, which was set by someone else, so I need to investigate whats up.

In T277327#7207061, @RobH wrote:

These failed for not liking the specified partition recipie, which was set by someone else, so I need to investigate whats up.

Did you see my comment on T277327#7110254? backup-format.cfg should work as long as the above is taken into account and they have the same hw spec as the backup200X hosts, as those worked automatically for it.

I confirmed backup1004 already had raid0 ssd set to bootable, it did, and rebooted into the installer, where it worked.... I have no idea what kind of race condition is going on there but if it doesn't happen again then it doesn't matter. Reimaging.

Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts:

backup1004.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202107132007_robh_18959_backup1004_eqiad_wmnet.log.

Ok, so in the installer, the error is:

Unable to install GRUB in /dev/sdb
Executing 'grub-install /dev/sdb' failed.
This is a fatal error.

copies of the installer logs:

hardware summary.txt137 KBDownload

partman.txt700 KBDownload

syslog.txt2 MBDownload

Completed auto-reimage of hosts:

['backup1004.eqiad.wmnet']

Of which those FAILED:

['backup1004.eqiad.wmnet']

The partitioning is working "as expected" (it is not a partman problem), the issue is with disks- I can only see an sda of "SSD" size and sdb of "HD size", while I would expect to see 3 disks, 2 non-raid SSDs and 1 virtual RAID disk. While it wouldn't be surprising for drive letters to move around between models, those disappearing or being different from the codfw ones are a weird case. My first suspicion would be an undetected "bad" disk, but given I can see the same issue on backup1005, my guess is on a difference with codfw setup at RAID setup time.

I will take backup1005, reboot to BIOS and confirm.

I confirm the issue is that there is a difference on the setup of the eqiad and codfw hosts. The eqiad ones have configured the SSDs as a virtual RAID disk on hardware (Perc controller), while on codfw they are "not-raided disks" and we set the software raid on OS (installation).

Admittedly, this is a "weird" decision, as it has several drawbacks: no hot-plugging, overhead/performance, and mixed OS and HW raid configuration. The decision was that for backups, performance and unavailability were not huge concerns, unlike the databases, but reliability was a main concern. So, we prefered to still handle the ssds at os level, even if technically they will be physically connected to the same RAID connector, it would be easier to manage them directly from a logical point of view.

While we are still on time to reverse the decision, if you don't think it is a good idea, I believe for now I would prefer to still to have the exact same configuration on all backup* hosts, even for those that don't have internal disks.

For that, could you modify the existing SSD setup on Perc configuration, and remove the HW RAID1, and convert those disk to "non-RAID disks"? That would made the partitioning work, as requested "Software RAID1 for (2) OS SSDs" on this ticket, and clarified as "Software RAID 1 will be set on reimage, so those SSDs should show as "not part of a RAID" on the bios, nothing to do there" on documentation linked also on this ticket: https://wikitech.wikimedia.org/wiki/Raid_setup#Dell_R740xd2 I tried to be super-clear about this as it wasn't as clear in the beginning, but I think I failed again.

PS: I left backup1005 on the Perc menu, and won't touch the host further unless you tell me to, in order to not do conflicting operations.

Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts:

backup1004.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202107132206_robh_2571_backup1004_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['backup1004.eqiad.wmnet']

and were ALL successful.

In T277327#7210496, @jcrespo wrote:

I confirm the issue is that there is a difference on the setup of the eqiad and codfw hosts. The eqiad ones have configured the SSDs as a virtual RAID disk on hardware (Perc controller), while on codfw they are "not-raided disks" and we set the software raid on OS (installation).

Admittedly, this is a "weird" decision, as it has several drawbacks: no hot-plugging, overhead/performance, and mixed OS and HW raid configuration. The decision was that for backups, performance and unavailability were not huge concerns, unlike the databases, but reliability was a main concern. So, we prefered to still handle the ssds at os level, even if technically they will be physically connected to the same RAID connector, it would be easier to manage them directly from a logical point of view.

While we are still on time to reverse the decision, if you don't think it is a good idea, I believe for now I would prefer to still to have the exact same configuration on all backup* hosts, even for those that don't have internal disks.

For that, could you modify the existing SSD setup on Perc configuration, and remove the HW RAID1, and convert those disk to "non-RAID disks"? That would made the partitioning work, as requested "Software RAID1 for (2) OS SSDs" on this ticket, and clarified as "Software RAID 1 will be set on reimage, so those SSDs should show as "not part of a RAID" on the bios, nothing to do there" on documentation linked also on this ticket: https://wikitech.wikimedia.org/wiki/Raid_setup#Dell_R740xd2 I tried to be super-clear about this as it wasn't as clear in the beginning, but I think I failed again.

PS: I left backup1005 on the Perc menu, and won't touch the host further unless you tell me to, in order to not do conflicting operations.

Yeah I suppose I didn't parse that correctly, since mixing sw and hw raid within a host is a bit non-standard. I went ahead and did this for backup1004 and its now staged and ready to go. I'll fix the remainder shortly.

RobH updated the task description. (Show Details)Jul 13 2021, 10:32 PM

Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts:

['backup1005.eqiad.wmnet', 'backup1006.eqiad.wmnet', 'backup1007.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202107132244_robh_8707.log.

In T277327#7210496, @jcrespo wrote:

...
Admittedly, this is a "weird" decision, as it has several drawbacks: no hot-plugging, overhead/performance, and mixed OS and HW raid configuration. The decision was that for backups, performance and unavailability were not huge concerns, unlike the databases, but reliability was a main concern. So, we prefered to still handle the ssds at os level, even if technically they will be physically connected to the same RAID connector, it would be easier to manage them directly from a logical point of view.
...

You can still hot plug the disk bays, just manually have to remove the disk from the mdadm array and manually add it back after, but the hot swap without power down doesn't go away when you put a disk on the PERC controller into non-raid mode. I just wanted to correct that so you aren't moving forward thinking you've lost a feature of the chassis.

Completed auto-reimage of hosts:

['backup1007.eqiad.wmnet', 'backup1005.eqiad.wmnet', 'backup1006.eqiad.wmnet']

and were ALL successful.

RobH updated the task description. (Show Details)Jul 13 2021, 11:27 PM

RobH updated the task description. (Show Details)

backup1006 has a hw failure and has been placed into failure in netbox. While this task is resolving, the hw failure task T286625 has been filed for an eqiad onsite to investigate this issue of the system board cp connector failure.

You can still hot plug the disk bays, just manually have to remove the disk from the mdadm array

Thanks for the correction. I think I was based on previous experiences where the disk was not accessible because of regular direct disk connection, or the host crashing on disk loss.

Thanks for your work and that of your team on this!

One last question, backup1006, despite T286625, was setup (in terms of install + puppet) successfully (no extra work will be needed after that is solved)?

Status	Assigned	Task
Open	None	T262668 WMF media storage must be adequately backed up
		Unknown Object (Task)
Resolved	RobH	T277327 (Need By: 2021-04-30) rack/setup/install backup100[4-7]

(Need By: 2021-04-30) rack/setup/install backup100[4-7]
Closed, ResolvedPublic
Actions

Description

Hostname / Racking / Installation Details

Per host setup checklist

Details

Related Objects
Search...

Event Timeline

	F34550173: partman.txt
	Jul 13 2021, 8:19 PM

	F34550174: syslog.txt
	Jul 13 2021, 8:19 PM

	F34550172: hardware summary.txt
	Jul 13 2021, 8:19 PM

(Need By: 2021-04-30) rack/setup/install backup100[4-7]Closed, ResolvedPublicActions

Description

Hostname / Racking / Installation Details

Per host setup checklist

Details

Related ObjectsSearch...

Event Timeline

(Need By: 2021-04-30) rack/setup/install backup100[4-7]
Closed, ResolvedPublic
Actions

Related Objects
Search...