
Q3:(Need By: TBD) rack/setup/install 2 new labstore hosts
Closed, ResolvedPublic

Description

This task will track the racking, setup, and OS installation of 2 new labstore hosts ordered via T286588. That order was originally intended simply to expand existing storage on labstore100[67], but during review and quotation it was shifted to 2 new hosts.

Hostname / Racking / Installation Details

The racking details will need review and sign-off by WMCS as of 2022-03-03.

Hostnames: clouddumps100[12]
Racking Proposal: Rack in two different rows. Existing machines are in A4 and D2. These machines aren't 100% WMCS-specific, and I believe they should live outside of the WMCS-dedicated racks. Do not put them in rows E or F, as no public VLANs are set up there yet.
Networking/Subnet/VLAN/IP: a 10G production connection to the public1 VLAN, matching how the existing labstores work
Partitioning/RAID: RAID1 for the OS disks, RAID10 for the remainder
OS Distro: latest default
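
Not from the task, but for orientation: on a Dell PERC-style controller the two virtual disks implied by this plan could be created roughly as follows. This is a sketch only; the controller index, enclosure ID, and slot numbers are assumptions.

# Hypothetical storcli/perccli sketch; /c0, enclosure 252, and the slot
# numbers are assumptions -- adjust to the actual chassis layout.
storcli /c0 add vd type=raid1  drives=252:0-1                 # OS disks  -> /dev/sda
storcli /c0 add vd type=raid10 drives=252:2-9 pdperarray=2    # data disks -> /dev/sdb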

Per host setup checklist

Each host should have its own setup checklist copied and pasted into the list below; a sketch of the typical commands appears after the lists.

clouddumps1001:
  • - receive in system on procurement task T286588 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer from an active cumin host to commit
  • - bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to netboot.pp and a site.pp entry with role(insetup) (cp systems use role(insetup::nofirm)).
  • - OS installation & initial puppet run via the sre.hosts.reimage cookbook.
clouddumps1002:
  • - receive in system on procurement task T286588 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer from an active cumin host to commit
  • - bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to netboot.pp and a site.pp entry with role(insetup) (cp systems use role(insetup::nofirm)).
  • - OS installation & initial puppet run via the sre.hosts.reimage cookbook.
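
For orientation, a hedged sketch of the commands behind the DNS, switch-port, and reimage steps above, as they would typically be run from a cumin host. The homer device pattern and the log messages are assumptions; check the current runbooks for exact flags before running.

# Assumed invocations for the checklist steps above.
sudo cookbook sre.dns.netbox "Add mgmt and production DNS for clouddumps100[12]"
sudo homer "asw2-a*" commit "Configure switch ports for clouddumps100[12]"   # device pattern is a guess
sudo cookbook sre.hosts.reimage --os bullseye clouddumps1001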

Details

Repo                      Branch      Lines +/-
operations/puppet         production  +2 -2
operations/puppet         production  +1 -1
operations/puppet         production  +2 -2
operations/puppet         production  +11 -8
operations/puppet         production  +11 -1
operations/puppet         production  +1 -1
operations/puppet         production  +1 -0
operations/puppet         production  +1 -1
operations/puppet         production  +1 -1
operations/homer/public   master      +3 -1
operations/homer/public   master      +1 -1
operations/puppet         production  +10 -18
operations/puppet         production  +2 -2
operations/puppet         production  +2 -2
operations/puppet         production  +1 -1
operations/puppet         production  +43 -42
operations/puppet         production  +4 -3
operations/puppet         production  +64 -1
operations/puppet         production  +4 -3
operations/puppet         production  +6 -1


Event Timeline

There are a very large number of changes, so older changes are hidden.

Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host clouddumps1001.wikimedia.org with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host clouddumps1001.wikimedia.org with OS bullseye executed with errors:

  • clouddumps1001 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host clouddumps1001.wikimedia.org with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host clouddumps1001.wikimedia.org with OS bullseye executed with errors:

  • clouddumps1001 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host clouddumps1001.wikimedia.org with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host clouddumps1001.wikimedia.org with OS bullseye executed with errors:

  • clouddumps1001 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host clouddumps1001.wikimedia.org with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host clouddumps1001.wikimedia.org with OS bullseye completed:

  • clouddumps1001 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202206022059_andrew_3536671_clouddumps1001.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host clouddumps1001.wikimedia.org with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host clouddumps1001.wikimedia.org with OS bullseye executed with errors:

  • clouddumps1001 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host clouddumps1001.wikimedia.org with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host clouddumps1001.wikimedia.org with OS bullseye executed with errors:

  • clouddumps1001 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host clouddumps1001.wikimedia.org with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host clouddumps1001.wikimedia.org with OS bullseye executed with errors:

  • clouddumps1001 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host clouddumps1001.wikimedia.org with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host clouddumps1001.wikimedia.org with OS bullseye executed with errors:

  • clouddumps1001 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host clouddumps1001.wikimedia.org with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host clouddumps1001.wikimedia.org with OS bullseye executed with errors:

  • clouddumps1001 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host clouddumps1001.wikimedia.org with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host clouddumps1001.wikimedia.org with OS bullseye executed with errors:

  • clouddumps1001 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host clouddumps1001.wikimedia.org with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host clouddumps1001.wikimedia.org with OS bullseye executed with errors:

  • clouddumps1001 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202206022330_andrew_3557519_clouddumps1001.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host clouddumps1001.wikimedia.org with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host clouddumps1001.wikimedia.org with OS bullseye executed with errors:

  • clouddumps1001 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host clouddumps1001.wikimedia.org with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host clouddumps1001.wikimedia.org with OS bullseye executed with errors:

  • clouddumps1001 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host clouddumps1001.wikimedia.org with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host clouddumps1001.wikimedia.org with OS bullseye executed with errors:

  • clouddumps1001 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host clouddumps1001.wikimedia.org with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host clouddumps1001.wikimedia.org with OS bullseye executed with errors:

  • clouddumps1001 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host clouddumps1001.wikimedia.org with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host clouddumps1001.wikimedia.org with OS bullseye executed with errors:

  • clouddumps1001 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

@Andrew sorry, I didn't have time yesterday to work on this; I was doing some planning for the codfw PDU refresh. I took time today to look at your request. I created a VM in my lab with 2x120GB disks to simulate /dev/sda and /dev/sdb; after the install I get the output below. Let me know if this works for you. Thanks

root@lab2003:~# lsblk
NAME                 MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda                    8:0    0   120G  0 disk 
├─sda1                 8:1    0   953M  0 part /boot
├─sda2                 8:2    0     1K  0 part 
├─sda5                 8:5    0   976M  0 part [SWAP]
└─sda6                 8:6    0 118.1G  0 part 
  └─lab2003--vg-root 254:0    0  74.5G  0 lvm  /
sdb                    8:16   0   120G  0 disk 
└─sdb1                 8:17   0   120G  0 part 
  └─lab2003--vg-srv  254:1    0  93.1G  0 lvm  /var/lib/nova/instances
sr0                   11:0    1  1024M  0 rom
root@lab2003:~# df -h
Filesystem                    Size  Used Avail Use% Mounted on
udev                          975M     0  975M   0% /dev
tmpfs                         199M  472K  198M   1% /run
/dev/mapper/lab2003--vg-root   73G  1.4G   68G   2% /
tmpfs                         992M     0  992M   0% /dev/shm
tmpfs                         5.0M     0  5.0M   0% /run/lock
/dev/sda1                     920M   85M  772M  10% /boot
/dev/mapper/lab2003--vg-srv    94G  697M   93G   1% /var/lib/nova/instances
tmpfs                         199M     0  199M   0% /run/user/0

And the partman recipe I used is below:

# SPDX-License-Identifier: Apache-2.0

# Configuration to create:
# Hardware RAID1 on 2 SFF drives in flex bays mounted at /dev/sda
# 1G on /boot outside of LVM
# LVM volume of 95% remainder of sda is /
# Hardware RAID10 mounted at /dev/sdb
# 95% of sdb allocated with LVM as /srv

# remove any LVM already on the disks
d-i     partman-lvm/device_remove_lvm   boolean true

# We'll be creating LVMs and partitioning disks SDA and SDB
d-i     partman-auto/method     string  lvm
d-i     partman-auto/disk       string  /dev/sdb /dev/sda

# setup a /boot partition of 1GB outside of the LVM
d-i     partman-auto/expert_recipe      string  lvm ::  \
                1000 2000 1000 ext4     \
                                $primary{ }             \
                                $bootable{ }    \
                                method{ format }        \
                                format{ }               \
                                use_filesystem{ }       \
                                filesystem{ ext4 }      \
                                mountpoint{ /boot }     \
#                               device { /dev/sda }     \

                .       \
                1024 1024 1024 linux-swap       \
                                method{ swap }          \
                                format{ }               \
                .       \
# setup the / filesystem within the LVM filling the 95% of the remaining space
                80000 1000 -1 ext4      \
                                method{ format }        \
                                format{ }               \
                                use_filesystem{ }       \
                                filesystem{ ext4 }      \
                                lv_name{ root }         \
                                $defaultignore{ }       \
                                $lvmok{ }               \
                                mountpoint{ / } \
#                               device { /dev/sda }     \
                .       \
# setup the SDB disk with a single LVM at 95% of the disk, and a mount in xfs for /srv
                        100000 1000 -1 xfs              \
                                method{ format }        \
                                format{ }               \
                                use_filesystem{ }       \
                                filesystem{ xfs }       \
                                lv_name{ srv }          \
                                $defaultignore{ }       \
                                $lvmok{ }               \
                                mountpoint{ /var/lib/nova/instances }   \
                                device { /dev/sdb }     \
                .       \
d-i     partman-auto/choose_recipe              flat
d-i     partman-auto-lvm/guided_size    string  95%
d-i     partman/confirm_write_new_label boolean true
d-i     partman/choose_partition        select  finish
d-i     partman/confirm                 boolean true
d-i     partman/confirm_nooverwrite     boolean true
d-i     partman-lvm/confirm             boolean true
d-i     partman-lvm/confirm_nooverwrite boolean true
d-i     partman-lvm/device_remove_lvm   boolean true
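
For readers unfamiliar with the format: each stanza in partman-auto/expert_recipe is `<minimum size MB> <priority> <maximum size MB> <fstype>` followed by brace directives and a terminating `.`, where -1 means no maximum. A minimal standalone sketch, illustrative only and not the recipe used here:

# Minimal illustrative expert_recipe: a 1 GB /boot plus / filling the rest.
# Sizes are <min MB> <priority> <max MB>; -1 = unbounded maximum.
d-i     partman-auto/expert_recipe      string  boot-root ::    \
                1000 1000 1000 ext4                             \
                        $primary{ } $bootable{ }                \
                        method{ format } format{ }              \
                        use_filesystem{ } filesystem{ ext4 }    \
                        mountpoint{ /boot }                     \
                .                                               \
                5000 2000 -1 ext4                               \
                        method{ format } format{ }              \
                        use_filesystem{ } filesystem{ ext4 }    \
                        mountpoint{ / }                         \
                .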

Change 802900 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] hwraid-2dev.cfg from papaul's tests

https://gerrit.wikimedia.org/r/802900

Change 802900 merged by Andrew Bogott:

[operations/puppet@production] hwraid-2dev.cfg from papaul's tests

https://gerrit.wikimedia.org/r/802900

> @Andrew sorry, I didn't have time yesterday to work on this; I was doing some planning for the codfw PDU refresh. I took time today to look at your request. I created a VM in my lab with 2x120GB disks to simulate /dev/sda and /dev/sdb; after the install I get the output below. Let me know if this works for you. Thanks

Thank you, Papaul! This looks very promising, and gets me past the partitioning stage without error. The OS installer fails with

Unable to install GRUB in /dev/sda
Executing 'grub-install /dev/sda' failed.

This is a fatal error.

I will experiment a bit more.
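
One workaround sometimes used for this class of failure (an assumption on my part, not something confirmed in this task) is to pin the GRUB target in the preseed so the installer doesn't have to guess the boot device:

# Hypothetical preseed addition; tells grub-installer where to install.
d-i     grub-installer/bootdev  string  /dev/sda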

Feel free to close this task if this is expected, but the latest diffscan report shows that SSH is open to the world on that host:

New Open Service List
---------------------
STATUS HOST PORT PROTO OPREV CPREV DNS
OPEN 208.80.154.142 22 tcp 0 6 clouddumps1001.wikimedia.org

Confirmed from my laptop:

$ nc -zv clouddumps1001.wikimedia.org 22
Connection to clouddumps1001.wikimedia.org (2620:0:861:2:208:80:154:142) 22 port [tcp/ssh] succeeded!

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1001 for host clouddumps1001.wikimedia.org with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin1001 for host clouddumps1001.wikimedia.org with OS bullseye completed:

  • clouddumps1001 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202206061536_pt1979_27078_clouddumps1001.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> staged

Change 803318 had a related patch set uploaded (by Papaul; author: Papaul):

[operations/puppet@production] Testing new partman recipe for clouddumps nodes

https://gerrit.wikimedia.org/r/803318

Change 803318 merged by Papaul:

[operations/puppet@production] Testing new partman recipe for clouddumps nodes

https://gerrit.wikimedia.org/r/803318

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1001 for host clouddumps1001.wikimedia.org with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin1001 for host clouddumps1001.wikimedia.org with OS bullseye completed:

  • clouddumps1001 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202206061633_pt1979_33948_clouddumps1001.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1001 for host clouddumps1001.wikimedia.org with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin1001 for host clouddumps1001.wikimedia.org with OS bullseye completed:

  • clouddumps1001 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Unable to disable Puppet, the host may have been unreachable
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202206061739_pt1979_45348_clouddumps1001.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Change 803366 had a related patch set uploaded (by Papaul; author: Papaul):

[operations/puppet@production] Testing partman recipe for clouddumps nodes

https://gerrit.wikimedia.org/r/803366

Change 803366 merged by Papaul:

[operations/puppet@production] Testing partman recipe for clouddumps nodes

https://gerrit.wikimedia.org/r/803366

@Andrew it looks like the way partman sees disks in a HW RAID configuration differs from how it sees disks in a non-RAID configuration. The same partman recipe on 2 disks with no HW RAID gets us the result we need:

root@lab2003:~# lsblk
NAME                 MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda                    8:0    0   120G  0 disk 
├─sda1                 8:1    0   953M  0 part /boot
├─sda2                 8:2    0     1K  0 part 
├─sda5                 8:5    0   976M  0 part [SWAP]
└─sda6                 8:6    0 118.1G  0 part 
  └─lab2003--vg-root 254:0    0  74.5G  0 lvm  /
sdb                    8:16   0   120G  0 disk 
└─sdb1                 8:17   0   120G  0 part 
  └─lab2003--vg-srv  254:1    0  93.1G  0 lvm  /srv
sr0                   11:0    1  1024M  0 rom

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1001 for host clouddumps1001.wikimedia.org with OS bullseye

That's similar to what I was seeing -- I don't understand how partman can tell the difference, unless it's just the difference between a large drive and a small one.

Change 803373 had a related patch set uploaded (by Papaul; author: Papaul):

[operations/puppet@production] Testing partman for clouddumps node

https://gerrit.wikimedia.org/r/803373

Change 803373 merged by Papaul:

[operations/puppet@production] Testing partman for clouddumps node

https://gerrit.wikimedia.org/r/803373

Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin1001 for host clouddumps1001.wikimedia.org with OS bullseye completed:

  • clouddumps1001 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202206062314_pt1979_88387_clouddumps1001.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

@Papaul, are you interested in working on this more, or should I take the task back? I'm thinking we should probably cut our losses and find a partman recipe that ignores sdb entirely; then we can partition it manually.

@Andrew agreed. I think the same partman recipe can do it by just removing the section below:

# setup the SDB disk with a single LVM at 95% of the disk, and a mount in xfs for /srv
                        100000 1000 -1 xfs              \
                                method{ format }        \
                                format{ }               \
                                use_filesystem{ }       \
                                filesystem{ xfs }       \
                                lv_name{ srv }          \
                                $defaultignore{ }       \
                                $lvmok{ }               \
                                mountpoint{ /srv }   \
                                device { /dev/sdb }     \
                .        \
It is all yours; let me know if you have any questions.

Change 804633 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] Partman: give up on a two-hwraid configure and just configure the first drive.

https://gerrit.wikimedia.org/r/804633

Change 804633 merged by Andrew Bogott:

[operations/puppet@production] Partman: give up on a two-hwraid configure and just configure the first drive.

https://gerrit.wikimedia.org/r/804633

I have these hosts partitioned now (sdb by hand), so I'm closing this task. Thanks for your help, Papaul!
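
For the record, a hypothetical sketch of what partitioning sdb by hand along the recipe's intent might look like; the VG name, extent percentage, and mount point are assumptions, not taken from the task.

# Hypothetical manual layout for /dev/sdb: one LVM PV, a ~95% LV, XFS on /srv.
parted -s /dev/sdb mklabel gpt mkpart primary 0% 100%
pvcreate /dev/sdb1
vgcreate clouddumps-vg /dev/sdb1
lvcreate -l 95%FREE -n srv clouddumps-vg
mkfs.xfs /dev/clouddumps-vg/srv
echo '/dev/clouddumps-vg/srv /srv xfs defaults 0 0' >> /etc/fstab
mount /srv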

Change 806026 had a related patch set uploaded (by Ayounsi; author: Ayounsi):

[operations/homer/public@master] Rename cloudstore to clouddump

https://gerrit.wikimedia.org/r/806026

Change 806026 merged by Ayounsi:

[operations/homer/public@master] Rename cloudstore to clouddump

https://gerrit.wikimedia.org/r/806026

Change 806067 had a related patch set uploaded (by Ayounsi; author: Ayounsi):

[operations/homer/public@master] Add cloudstore with clouddumps

https://gerrit.wikimedia.org/r/806067

Change 806067 merged by Ayounsi:

[operations/homer/public@master] Add cloudstore with clouddumps

https://gerrit.wikimedia.org/r/806067

If it's of any help, our team has just had some success with a similar kind of partman recipe that creates a big LVM volume on /dev/sdb.
In our case it mounts it to /srv but I think it might be quite easy to adapt it to suit your needs.

The recipe that we're using for this is partman/custom/kafka-jumbo.cfg

We might want to make this more of a generic config at some point, because despite the name we're beginning to use this on several different server roles.
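
For context (not stated in this thread), hosts opt into a recipe via the host-to-recipe mapping in netboot.pp. A purely hypothetical Puppet sketch of adding the clouddumps hosts to a shared recipe; the real file's structure and key names are assumptions:

# Hypothetical netboot.pp-style mapping; not copied from operations/puppet.
$partman_recipe_per_host = {
  'partman/custom/kafka-jumbo.cfg' => [
    'kafka-jumbo10[01][0-9]',   # existing users of the recipe (assumed)
    'clouddumps100[12]',        # the new hosts from this task
  ],
}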

I'm putting these hosts back into 'insetup' pending HDFS packages on bullseye (T310643).

Change 823155 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] clouddumps100[12]: move back to 'insetup'

https://gerrit.wikimedia.org/r/823155

Change 823155 merged by Andrew Bogott:

[operations/puppet@production] clouddumps100[12]: move back to 'insetup'

https://gerrit.wikimedia.org/r/823155

@Andrew - I believe that the hadoop-client package, and any others on which this work depends, have now been packaged and are hosted on apt.wikimedia.org, e.g.

btullis@apt1001:~$ sudo -i reprepro ls hadoop-client
hadoop-client |        2.8.5-2 |  stretch-wikimedia | amd64
hadoop-client |       2.10.1-1 |  stretch-wikimedia | amd64
hadoop-client |       2.10.2-1 |   buster-wikimedia | amd64
hadoop-client | 2.10.2-deb11-1 | bullseye-wikimedia | amd64

Please feel free to start testing and let me know if you run into any issues with them.
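
A quick, hedged way to confirm the bullseye build is visible on a target host (this assumes the bullseye-wikimedia component is already in the host's apt sources):

# Assumed verification steps; version per btullis's reprepro listing above.
sudo apt-get update
apt-cache policy hadoop-client    # expect 2.10.2-deb11-1 from bullseye-wikimedia
sudo apt-get install hadoop-client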

Change 823199 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] Revert "clouddumps100[12]: move back to 'insetup'"

https://gerrit.wikimedia.org/r/823199

Change 823199 merged by Andrew Bogott:

[operations/puppet@production] Revert "clouddumps100[12]: move back to 'insetup'"

https://gerrit.wikimedia.org/r/823199

Change 823200 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] acme_chief: give cloudstore100[12] access to the dumps certs

https://gerrit.wikimedia.org/r/823200

Change 823200 merged by Andrew Bogott:

[operations/puppet@production] acme_chief: give cloudstore100[12] access to the dumps certs

https://gerrit.wikimedia.org/r/823200

Change 823201 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] acme_chief: give clouddumps100[12] access to the dumps certs

https://gerrit.wikimedia.org/r/823201

Change 823201 merged by Andrew Bogott:

[operations/puppet@production] acme_chief: give clouddumps100[12] access to the dumps certs

https://gerrit.wikimedia.org/r/823201

Change 823208 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] Give clouddumps100[12] access to hdfs and rsync things

https://gerrit.wikimedia.org/r/823208

Change 823208 merged by Andrew Bogott:

[operations/puppet@production] Give clouddumps100[12] access to hdfs and rsync things

https://gerrit.wikimedia.org/r/823208

Change 823217 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] clouddumps: don't puppetize '/srv/dumps' mount

https://gerrit.wikimedia.org/r/823217

Change 823217 merged by Andrew Bogott:

[operations/puppet@production] clouddumps: don't puppetize '/srv/dumps' mount

https://gerrit.wikimedia.org/r/823217

Hey @Ottomata, it's great to see these hosts moving closer to being in production! One thing I noticed: they are picking up rsyncs from the dumpsdata host where dumps are generated. But it would be good if they got the backlog from one of the labstore boxes first, so our rolling rsync doesn't take hours and hours for each of them, holding up the other rsyncs. (We do them in serial in our current setup and hope to change this once the new dumpsdata hosts come online, but that is waiting on work around the new controller.) You'll also want to do a full rsync from the labstore boxes anyway, to get all the older dumps and other datasets that we do not keep, or never had, on the generating hosts. Thanks!

Wrong andrew, I think you meant to ping @Andrew ?

> Wrong andrew, I think you meant to ping @Andrew ?

Bah, yes I did. Thank you!

Thanks for the suggestion, @ArielGlenn. Those hosts really aren't working at all right now (something awful is happening between the new hdfs packages and timesyncd), so they're likely to get reimaged another couple of times before I have any faith in the data on them being useful. If you want to suggest a patch that selectively disables some of the rsyncs in the meantime, that would be welcome.

Change 823649 had a related patch set uploaded (by ArielGlenn; author: ArielGlenn):

[operations/puppet@production] don't rsync to clouddumps1001,2 while they are still being set up

https://gerrit.wikimedia.org/r/823649

Change 823649 merged by Andrew Bogott:

[operations/puppet@production] don't rsync to clouddumps1001,2 while they are still being set up

https://gerrit.wikimedia.org/r/823649

Change 825441 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] Dumps servers: allow rsyncing to/from new clouddumps hosts.

https://gerrit.wikimedia.org/r/825441

Change 825441 merged by Andrew Bogott:

[operations/puppet@production] Dumps servers: allow rsyncing to/from new clouddumps hosts.

https://gerrit.wikimedia.org/r/825441

I am now running the epic rsync from labstore1006 to clouddumps100[12]. Going to take a while!
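
For the record, the seed transfer pattern might look roughly like this; the rsync module name and destination path are assumptions, not taken from the task:

# Hypothetical initial seed from labstore1006; module and path names assumed.
rsync -a --info=progress2 labstore1006.wikimedia.org::dumps/ /srv/dumps/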

Change 828071 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] Revert "don't rsync to clouddumps1001,2 while they are still being set up"

https://gerrit.wikimedia.org/r/828071

Change 828071 merged by Andrew Bogott:

[operations/puppet@production] Revert "don't rsync to clouddumps1001,2 while they are still being set up"

https://gerrit.wikimedia.org/r/828071