
Q3:(Need By: TBD) rack/setup/install 2 new labstore hosts
Closed, ResolvedPublic

Description

This task will track the racking, setup, and OS installation of 2 new labstore hosts ordered via T286588. That order was originally intended simply to expand existing storage on labstore100[67], but during review and quotation it was shifted to 2 new hosts.

Hostname / Racking / Installation Details

The racking details will need review and sign-off by WMCS as of 2022-03-03.

Hostnames: clouddumps100[12]
Racking Proposal: Rack in two different rows. Existing machines are in A4 and D2. These machines aren't 100% WMCS-specific, and I believe they should live outside of the WMCS-dedicated racks. Do not put them in rows E or F, as no public VLANs are set up there yet.
Networking/Subnet/VLAN/IP: a 10G production connection to the public1 VLAN, matching how the existing labstores work
Partitioning/RAID: RAID1 for the OS disks, RAID10 for the remainder
OS Distro: latest default
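
Not from the task, but for orientation: on a Dell PERC-style controller the two virtual disks implied by this plan could be created roughly as follows. This is a sketch only; the controller index, enclosure ID, and slot numbers are assumptions.

# Hypothetical storcli/perccli sketch; /c0, enclosure 252, and the slot
# numbers are assumptions -- adjust to the actual chassis layout.
storcli /c0 add vd type=raid1  drives=252:0-1                 # OS disks  -> /dev/sda
storcli /c0 add vd type=raid10 drives=252:2-9 pdperarray=2    # data disks -> /dev/sdb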

Per host setup checklist

Each host should have its own setup checklist copied and pasted into the list below; a sketch of the typical commands appears after the lists.

clouddumps1001:
  • - receive in system on procurement task T286588 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer from an active cumin host to commit
  • - bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to netboot.pp and a site.pp entry with role(insetup) (cp systems use role(insetup::nofirm)).
  • - OS installation & initial puppet run via the sre.hosts.reimage cookbook.
clouddumps1002:
  • - receive in system on procurement task T286588 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer from an active cumin host to commit
  • - bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to netboot.pp and a site.pp entry with role(insetup) (cp systems use role(insetup::nofirm)).
  • - OS installation & initial puppet run via the sre.hosts.reimage cookbook.
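
For orientation, a hedged sketch of the commands behind the DNS, switch-port, and reimage steps above, as they would typically be run from a cumin host. The homer device pattern and the log messages are assumptions; check the current runbooks for exact flags before running.

# Assumed invocations for the checklist steps above.
sudo cookbook sre.dns.netbox "Add mgmt and production DNS for clouddumps100[12]"
sudo homer "asw2-a*" commit "Configure switch ports for clouddumps100[12]"   # device pattern is a guess
sudo cookbook sre.hosts.reimage --os bullseye clouddumps1001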

Details

Repo                      Branch      Lines +/-
operations/puppet         production  +2 -2
operations/puppet         production  +1 -1
operations/puppet         production  +2 -2
operations/puppet         production  +11 -8
operations/puppet         production  +11 -1
operations/puppet         production  +1 -1
operations/puppet         production  +1 -0
operations/puppet         production  +1 -1
operations/puppet         production  +1 -1
operations/homer/public   master      +3 -1
operations/homer/public   master      +1 -1
operations/puppet         production  +10 -18
operations/puppet         production  +2 -2
operations/puppet         production  +2 -2
operations/puppet         production  +1 -1
operations/puppet         production  +43 -42
operations/puppet         production  +4 -3
operations/puppet         production  +64 -1
operations/puppet         production  +4 -3
operations/puppet         production  +6 -1


Event Timeline

There are a very large number of changes, so older changes are hidden.

Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host clouddumps1001.wikimedia.org with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host clouddumps1001.wikimedia.org with OS bullseye executed with errors:

  • clouddumps1001 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host clouddumps1001.wikimedia.org with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host clouddumps1001.wikimedia.org with OS bullseye executed with errors:

  • clouddumps1001 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host clouddumps1001.wikimedia.org with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host clouddumps1001.wikimedia.org with OS bullseye executed with errors:

  • clouddumps1001 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host clouddumps1001.wikimedia.org with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host clouddumps1001.wikimedia.org with OS bullseye completed:

  • clouddumps1001 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202206022059_andrew_3536671_clouddumps1001.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host clouddumps1001.wikimedia.org with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host clouddumps1001.wikimedia.org with OS bullseye executed with errors:

  • clouddumps1001 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host clouddumps1001.wikimedia.org with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host clouddumps1001.wikimedia.org with OS bullseye executed with errors:

  • clouddumps1001 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host clouddumps1001.wikimedia.org with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host clouddumps1001.wikimedia.org with OS bullseye executed with errors:

  • clouddumps1001 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host clouddumps1001.wikimedia.org with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host clouddumps1001.wikimedia.org with OS bullseye executed with errors:

  • clouddumps1001 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host clouddumps1001.wikimedia.org with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host clouddumps1001.wikimedia.org with OS bullseye executed with errors:

  • clouddumps1001 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host clouddumps1001.wikimedia.org with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host clouddumps1001.wikimedia.org with OS bullseye executed with errors:

  • clouddumps1001 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host clouddumps1001.wikimedia.org with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host clouddumps1001.wikimedia.org with OS bullseye executed with errors:

  • clouddumps1001 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202206022330_andrew_3557519_clouddumps1001.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host clouddumps1001.wikimedia.org with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host clouddumps1001.wikimedia.org with OS bullseye executed with errors:

  • clouddumps1001 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host clouddumps1001.wikimedia.org with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host clouddumps1001.wikimedia.org with OS bullseye executed with errors:

  • clouddumps1001 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host clouddumps1001.wikimedia.org with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host clouddumps1001.wikimedia.org with OS bullseye executed with errors:

  • clouddumps1001 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host clouddumps1001.wikimedia.org with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host clouddumps1001.wikimedia.org with OS bullseye executed with errors:

  • clouddumps1001 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host clouddumps1001.wikimedia.org with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host clouddumps1001.wikimedia.org with OS bullseye executed with errors:

  • clouddumps1001 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

@Andrew sorry, I didn't have time yesterday to work on this; I was doing some planning for the codfw PDU refresh. I took time today to look at your request. I created a VM in my lab with 2x120GB disks to simulate /dev/sda and /dev/sdb; after the install I get the output below. Let me know if this works for you. Thanks

root@lab2003:~# lsblk
NAME                 MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda                    8:0    0   120G  0 disk 
├─sda1                 8:1    0   953M  0 part /boot
├─sda2                 8:2    0     1K  0 part 
├─sda5                 8:5    0   976M  0 part [SWAP]
└─sda6                 8:6    0 118.1G  0 part 
  └─lab2003--vg-root 254:0    0  74.5G  0 lvm  /
sdb                    8:16   0   120G  0 disk 
└─sdb1                 8:17   0   120G  0 part 
  └─lab2003--vg-srv  254:1    0  93.1G  0 lvm  /var/lib/nova/instances
sr0                   11:0    1  1024M  0 rom
root@lab2003:~# df -h
Filesystem                    Size  Used Avail Use% Mounted on
udev                          975M     0  975M   0% /dev
tmpfs                         199M  472K  198M   1% /run
/dev/mapper/lab2003--vg-root   73G  1.4G   68G   2% /
tmpfs                         992M     0  992M   0% /dev/shm
tmpfs                         5.0M     0  5.0M   0% /run/lock
/dev/sda1                     920M   85M  772M  10% /boot
/dev/mapper/lab2003--vg-srv    94G  697M   93G   1% /var/lib/nova/instances
tmpfs                         199M     0  199M   0% /run/user/0

And the partman recipe I used is below:

# SPDX-License-Identifier: Apache-2.0

# Configuration to create:
# Hardware RAID1 on 2 SFF drives in flex bays mounted at /dev/sda
# 1G on /boot outside of LVM
# LVM volume of 95% remainder of sda is /
# Hardware RAID10 mounted at /dev/sdb
# 95% of sdb allocated with LVM as /srv

# remove any LVM already on the disks
d-i     partman-lvm/device_remove_lvm   boolean true

# We'll be creating LVMs and partitioning disks SDA and SDB
d-i     partman-auto/method     string  lvm
d-i     partman-auto/disk       string  /dev/sdb /dev/sda

# setup a /boot partition of 1GB outside of the LVM
d-i     partman-auto/expert_recipe      string  lvm ::  \
                1000 2000 1000 ext4     \
                                $primary{ }             \
                                $bootable{ }    \
                                method{ format }        \
                                format{ }               \
                                use_filesystem{ }       \
                                filesystem{ ext4 }      \
                                mountpoint{ /boot }     \
#                               device { /dev/sda }     \

                .       \
                1024 1024 1024 linux-swap       \
                                method{ swap }          \
                                format{ }               \
                .       \
# setup the / filesystem within the LVM filling the 95% of the remaining space
                80000 1000 -1 ext4      \
                                method{ format }        \
                                format{ }               \
                                use_filesystem{ }       \
                                filesystem{ ext4 }      \
                                lv_name{ root }         \
                                $defaultignore{ }       \
                                $lvmok{ }               \
                                mountpoint{ / } \
#                               device { /dev/sda }     \
                .       \
# setup the SDB disk with a single LVM at 95% of the disk, and a mount in xfs for /srv
                        100000 1000 -1 xfs              \
                                method{ format }        \
                                format{ }               \
                                use_filesystem{ }       \
                                filesystem{ xfs }       \
                                lv_name{ srv }          \
                                $defaultignore{ }       \
                                $lvmok{ }               \
                                mountpoint{ /var/lib/nova/instances }   \
                                device { /dev/sdb }     \
                .       \
d-i     partman-auto/choose_recipe              flat
d-i     partman-auto-lvm/guided_size    string  95%
d-i     partman/confirm_write_new_label boolean true
d-i     partman/choose_partition        select  finish
d-i     partman/confirm                 boolean true
d-i     partman/confirm_nooverwrite     boolean true
d-i     partman-lvm/confirm             boolean true
d-i     partman-lvm/confirm_nooverwrite boolean true
d-i     partman-lvm/device_remove_lvm   boolean true
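
For readers unfamiliar with the format: each stanza in partman-auto/expert_recipe is `<minimum size MB> <priority> <maximum size MB> <fstype>` followed by brace directives and a terminating `.`, where -1 means no maximum. A minimal standalone sketch, illustrative only and not the recipe used here:

# Minimal illustrative expert_recipe: a 1 GB /boot plus / filling the rest.
# Sizes are <min MB> <priority> <max MB>; -1 = unbounded maximum.
d-i     partman-auto/expert_recipe      string  boot-root ::    \
                1000 1000 1000 ext4                             \
                        $primary{ } $bootable{ }                \
                        method{ format } format{ }              \
                        use_filesystem{ } filesystem{ ext4 }    \
                        mountpoint{ /boot }                     \
                .                                               \
                5000 2000 -1 ext4                               \
                        method{ format } format{ }              \
                        use_filesystem{ } filesystem{ ext4 }    \
                        mountpoint{ / }                         \
                .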

Change 802900 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] hwraid-2dev.cfg from papaul's tests

https://gerrit.wikimedia.org/r/802900

Change 802900 merged by Andrew Bogott:

[operations/puppet@production] hwraid-2dev.cfg from papaul's tests

https://gerrit.wikimedia.org/r/802900

> @Andrew sorry, I didn't have time yesterday to work on this; I was doing some planning for the codfw PDU refresh. I took time today to look at your request. I created a VM in my lab with 2x120GB disks to simulate /dev/sda and /dev/sdb; after the install I get the output below. Let me know if this works for you. Thanks

Thank you, Papaul! This looks very promising, and gets me past the partitioning stage without error. The OS installer fails with

Unable to install GRUB in /dev/sda
Executing 'grub-install /dev/sda' failed.

This is a fatal error.

I will experiment a bit more.
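
One workaround sometimes used for this class of failure (an assumption on my part, not something confirmed in this task) is to pin the GRUB target in the preseed so the installer doesn't have to guess the boot device:

# Hypothetical preseed addition; tells grub-installer where to install.
d-i     grub-installer/bootdev  string  /dev/sda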

Feel free to close this task if this is expected, but the latest diffscan report shows that SSH is open to the world on that host:

New Open Service List
---------------------
STATUS HOST PORT PROTO OPREV CPREV DNS
OPEN 208.80.154.142 22 tcp 0 6 clouddumps1001.wikimedia.org

Confirmed from my laptop:

$ nc -zv clouddumps1001.wikimedia.org 22
Connection to clouddumps1001.wikimedia.org (2620:0:861:2:208:80:154:142) 22 port [tcp/ssh] succeeded!

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1001 for host clouddumps1001.wikimedia.org with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin1001 for host clouddumps1001.wikimedia.org with OS bullseye completed:

  • clouddumps1001 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202206061536_pt1979_27078_clouddumps1001.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> staged

Change 803318 had a related patch set uploaded (by Papaul; author: Papaul):

[operations/puppet@production] Testing new partman recipe for clouddumps nodes

https://gerrit.wikimedia.org/r/803318

Change 803318 merged by Papaul:

[operations/puppet@production] Testing new partman recipe for clouddumps nodes

https://gerrit.wikimedia.org/r/803318

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1001 for host clouddumps1001.wikimedia.org with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin1001 for host clouddumps1001.wikimedia.org with OS bullseye completed:

  • clouddumps1001 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202206061633_pt1979_33948_clouddumps1001.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1001 for host clouddumps1001.wikimedia.org with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin1001 for host clouddumps1001.wikimedia.org with OS bullseye completed:

  • clouddumps1001 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Unable to disable Puppet, the host may have been unreachable
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202206061739_pt1979_45348_clouddumps1001.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Change 803366 had a related patch set uploaded (by Papaul; author: Papaul):

[operations/puppet@production] Testing partman recipe for clouddumps nodes

https://gerrit.wikimedia.org/r/803366

Change 803366 merged by Papaul:

[operations/puppet@production] Testing partman recipe for clouddumps nodes

https://gerrit.wikimedia.org/r/803366

@Andrew it looks like the way partman sees disks in a HW RAID configuration differs from how it sees disks in a non-RAID configuration. The same partman recipe on 2 disks with no HW RAID gets us the result we need:

root@lab2003:~# lsblk
NAME                 MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda                    8:0    0   120G  0 disk 
├─sda1                 8:1    0   953M  0 part /boot
├─sda2                 8:2    0     1K  0 part 
├─sda5                 8:5    0   976M  0 part [SWAP]
└─sda6                 8:6    0 118.1G  0 part 
  └─lab2003--vg-root 254:0    0  74.5G  0 lvm  /
sdb                    8:16   0   120G  0 disk 
└─sdb1                 8:17   0   120G  0 part 
  └─lab2003--vg-srv  254:1    0  93.1G  0 lvm  /srv
sr0                   11:0    1  1024M  0 rom

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1001 for host clouddumps1001.wikimedia.org with OS bullseye

That's similar to what I was seeing -- I don't understand how partman can tell the difference, unless it's just the difference between a large drive and a small one.

Change 803373 had a related patch set uploaded (by Papaul; author: Papaul):

[operations/puppet@production] Testing partman for clouddumps node

https://gerrit.wikimedia.org/r/803373

Change 803373 merged by Papaul:

[operations/puppet@production] Testing partman for clouddumps node

https://gerrit.wikimedia.org/r/803373

Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin1001 for host clouddumps1001.wikimedia.org with OS bullseye completed:

  • clouddumps1001 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202206062314_pt1979_88387_clouddumps1001.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

@Papaul, are you interested in working on this more, or should I take the task back? I'm thinking we should probably cut our losses and find a partman recipe that ignores sdb entirely; then we can partition it manually.

@Andrew agreed. I think the same partman recipe can do it by just removing the section below:

# setup the SDB disk with a single LVM at 95% of the disk, and a mount in xfs for /srv
                        100000 1000 -1 xfs              \
                                method{ format }        \
                                format{ }               \
                                use_filesystem{ }       \
                                filesystem{ xfs }       \
                                lv_name{ srv }          \
                                $defaultignore{ }       \
                                $lvmok{ }               \
                                mountpoint{ /srv }   \
                                device { /dev/sdb }     \
                .        \
It is all yours; let me know if you have any questions.

Change 804633 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] Partman: give up on a two-hwraid configure and just configure the first drive.

https://gerrit.wikimedia.org/r/804633

Change 804633 merged by Andrew Bogott:

[operations/puppet@production] Partman: give up on a two-hwraid configure and just configure the first drive.

https://gerrit.wikimedia.org/r/804633

I have these hosts partitioned now (sdb by hand), so I'm closing this task. Thanks for your help, Papaul!
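
For the record, a hypothetical sketch of what partitioning sdb by hand along the recipe's intent might look like; the VG name, extent percentage, and mount point are assumptions, not taken from the task.

# Hypothetical manual layout for /dev/sdb: one LVM PV, a ~95% LV, XFS on /srv.
parted -s /dev/sdb mklabel gpt mkpart primary 0% 100%
pvcreate /dev/sdb1
vgcreate clouddumps-vg /dev/sdb1
lvcreate -l 95%FREE -n srv clouddumps-vg
mkfs.xfs /dev/clouddumps-vg/srv
echo '/dev/clouddumps-vg/srv /srv xfs defaults 0 0' >> /etc/fstab
mount /srv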

Change 806026 had a related patch set uploaded (by Ayounsi; author: Ayounsi):

[operations/homer/public@master] Rename cloudstore to clouddump

https://gerrit.wikimedia.org/r/806026

Change 806026 merged by Ayounsi:

[operations/homer/public@master] Rename cloudstore to clouddump

https://gerrit.wikimedia.org/r/806026

Change 806067 had a related patch set uploaded (by Ayounsi; author: Ayounsi):

[operations/homer/public@master] Add cloudstore with clouddumps

https://gerrit.wikimedia.org/r/806067

Change 806067 merged by Ayounsi:

[operations/homer/public@master] Add cloudstore with clouddumps

https://gerrit.wikimedia.org/r/806067

If it's of any help, our team has just had some success with a similar kind of partman recipe that creates a big LVM volume on /dev/sdb.
In our case it mounts it to /srv but I think it might be quite easy to adapt it to suit your needs.

The recipe that we're using for this is partman/custom/kafka-jumbo.cfg

We might want to make this more of a generic config at some point, because despite the name we're beginning to use this on several different server roles.
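
For context (not stated in this thread), hosts opt into a recipe via the host-to-recipe mapping in netboot.pp. A purely hypothetical Puppet sketch of adding the clouddumps hosts to a shared recipe; the real file's structure and key names are assumptions:

# Hypothetical netboot.pp-style mapping; not copied from operations/puppet.
$partman_recipe_per_host = {
  'partman/custom/kafka-jumbo.cfg' => [
    'kafka-jumbo10[01][0-9]',   # existing users of the recipe (assumed)
    'clouddumps100[12]',        # the new hosts from this task
  ],
}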

I'm putting these hosts back into 'insetup' pending HDFS packages on bullseye (T310643).

Change 823155 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] clouddumps100[12]: move back to 'insetup'

https://gerrit.wikimedia.org/r/823155

Change 823155 merged by Andrew Bogott:

[operations/puppet@production] clouddumps100[12]: move back to 'insetup'

https://gerrit.wikimedia.org/r/823155

@Andrew - I believe that the hadoop-client package, and any others on which this work depends, have now been packaged and are hosted on apt.wikimedia.org, e.g.

btullis@apt1001:~$ sudo -i reprepro ls hadoop-client
hadoop-client |        2.8.5-2 |  stretch-wikimedia | amd64
hadoop-client |       2.10.1-1 |  stretch-wikimedia | amd64
hadoop-client |       2.10.2-1 |   buster-wikimedia | amd64
hadoop-client | 2.10.2-deb11-1 | bullseye-wikimedia | amd64

Please feel free to start testing and let me know if you run into any issues with them.
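
A quick, hedged way to confirm the bullseye build is visible on a target host (this assumes the bullseye-wikimedia component is already in the host's apt sources):

# Assumed verification steps; version per btullis's reprepro listing above.
sudo apt-get update
apt-cache policy hadoop-client    # expect 2.10.2-deb11-1 from bullseye-wikimedia
sudo apt-get install hadoop-client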

Change 823199 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] Revert "clouddumps100[12]: move back to 'insetup'"

https://gerrit.wikimedia.org/r/823199

Change 823199 merged by Andrew Bogott:

[operations/puppet@production] Revert "clouddumps100[12]: move back to 'insetup'"

https://gerrit.wikimedia.org/r/823199

Change 823200 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] acme_chief: give cloudstore100[12] access to the dumps certs

https://gerrit.wikimedia.org/r/823200

Change 823200 merged by Andrew Bogott:

[operations/puppet@production] acme_chief: give cloudstore100[12] access to the dumps certs

https://gerrit.wikimedia.org/r/823200

Change 823201 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] acme_chief: give clouddumps100[12] access to the dumps certs

https://gerrit.wikimedia.org/r/823201

Change 823201 merged by Andrew Bogott:

[operations/puppet@production] acme_chief: give clouddumps100[12] access to the dumps certs

https://gerrit.wikimedia.org/r/823201

Change 823208 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] Give clouddumps100[12] access to hdfs and rsync things

https://gerrit.wikimedia.org/r/823208

Change 823208 merged by Andrew Bogott:

[operations/puppet@production] Give clouddumps100[12] access to hdfs and rsync things

https://gerrit.wikimedia.org/r/823208

Change 823217 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] clouddumps: don't puppetize '/srv/dumps' mount

https://gerrit.wikimedia.org/r/823217

Change 823217 merged by Andrew Bogott:

[operations/puppet@production] clouddumps: don't puppetize '/srv/dumps' mount

https://gerrit.wikimedia.org/r/823217

Hey @Ottomata, it's great to see these hosts moving closer to being in production! One thing I noticed: they are picking up rsyncs from the dumpsdata host where dumps are generated. But it would be good if they got the backlog from one of the labstore boxes first, so our rolling rsync doesn't take hours and hours for each of them, holding up the other rsyncs. (We do them in serial in our current setup and hope to change this once the new dumpsdata hosts come online, but that is waiting on work around the new controller.) You'll also want to do a full rsync from the labstore boxes anyway, to get all the older dumps and other datasets that we do not keep, or never had, on the generating hosts. Thanks!

Wrong andrew, I think you meant to ping @Andrew ?

> Wrong andrew, I think you meant to ping @Andrew ?

Bah, yes I did. Thank you!

Thanks for the suggestion, @ArielGlenn. Those hosts really aren't working at all right now (something awful is happening between the new hdfs packages and timesyncd), so they're likely to get reimaged another couple of times before I have any faith in the data on them being useful. If you want to suggest a patch that selectively disables some of the rsyncs in the meantime, that would be welcome.

Change 823649 had a related patch set uploaded (by ArielGlenn; author: ArielGlenn):

[operations/puppet@production] don't rsync to clouddumps1001,2 while they are still being set up

https://gerrit.wikimedia.org/r/823649

Change 823649 merged by Andrew Bogott:

[operations/puppet@production] don't rsync to clouddumps1001,2 while they are still being set up

https://gerrit.wikimedia.org/r/823649

Change 825441 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] Dumps servers: allow rsyncing to/from new clouddumps hosts.

https://gerrit.wikimedia.org/r/825441

Change 825441 merged by Andrew Bogott:

[operations/puppet@production] Dumps servers: allow rsyncing to/from new clouddumps hosts.

https://gerrit.wikimedia.org/r/825441

I am now running the epic rsync from labstore1006 to clouddumps100[12]. Going to take a while!
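
For the record, the seed transfer pattern might look roughly like this; the rsync module name and destination path are assumptions, not taken from the task:

# Hypothetical initial seed from labstore1006; module and path names assumed.
rsync -a --info=progress2 labstore1006.wikimedia.org::dumps/ /srv/dumps/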

Change 828071 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] Revert "don't rsync to clouddumps1001,2 while they are still being set up"

https://gerrit.wikimedia.org/r/828071

Change 828071 merged by Andrew Bogott:

[operations/puppet@production] Revert "don't rsync to clouddumps1001,2 while they are still being set up"

https://gerrit.wikimedia.org/r/828071