
Upgrade the Data Engineering team's Zookeeper servers to Bullseye
Closed, Resolved · Public

Description

Zookeeper

  • zookeeper-analytics - 3 - cumin 'P{F:lsbdistcodename = buster} and A:zookeeper-analytics'

In theory, these should be relatively simple reimages, but let's check whether we wish to retain the contents of /var/lib/zookeeper.

Currently these servers do not have a reuse rule in place for partman
https://phabricator.wikimedia.org/source/operations-puppet/browse/production/modules/install_server/files/autoinstall/netboot.cfg$124

an-conf*) echo partman/standard.cfg partman/raid1-2dev.cfg ;; \

...so it might be fine to reimage and allow an empty, fresh node to rejoin the cluster and get a new copy of the database. I'll check what other teams have done in this respect.
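One way to confirm that a reimaged node has rejoined and resynced is ZooKeeper's four-letter-word commands, e.g. `echo mntr | nc an-conf1001.eqiad.wmnet 2181`. A minimal Python sketch for parsing that output; the sample values below are illustrative, not captured from these hosts:

```python
def parse_mntr(output: str) -> dict:
    """Parse the tab-separated key/value lines emitted by ZooKeeper's `mntr` command."""
    stats = {}
    for line in output.strip().splitlines():
        key, _, value = line.partition("\t")
        stats[key] = value
    return stats


# Illustrative sample; real output comes from `echo mntr | nc <host> 2181`.
sample = (
    "zk_version\t3.4.13-illustrative\n"
    "zk_server_state\tfollower\n"
    "zk_outstanding_requests\t0\n"
)
print(parse_mntr(sample)["zk_server_state"])  # prints: follower
```

Checking that each node reports `leader` or `follower` (rather than refusing the connection) before moving on to the next host would catch a node that failed to rejoin.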

Event Timeline

BTullis renamed this task from "Upgrade Zookeeper clients to Bullseye" to "Upgrade the Data Engineering team's Zookeeper servers to Bullseye". Feb 10 2023, 12:29 PM
BTullis updated the task description.

The impacted nodes are:

node /an-conf100[1-3]\.eqiad\.wmnet/ {
    role(analytics_cluster::zookeeper)
}

@BTullis even if the nodes would be able to rejoin the cluster after their data is deleted, I would be in favor of ensuring that the data is not deleted when performing a reimage.

I'm thinking of a potential fat-finger leading to multiple nodes being reimaged at once...

I will look at how to ensure the data is not lost (I will need your help here).

Current disk configuration
Two RAID devices, md0 and md1:

Device     Boot    Start       End   Sectors   Size Id Type
/dev/sda1           2048  97656831  97654784  46.6G fd Linux raid autodetect
/dev/sda2       97656832 937701375 840044544 400.6G fd Linux raid autodetect

Device     Boot    Start       End   Sectors   Size Id Type
/dev/sdb1  *        2048  97656831  97654784  46.6G fd Linux raid autodetect
/dev/sdb2       97656832 937701375 840044544 400.6G fd Linux raid autodetect

Fstab

UUID=8773beee-195f-42b5-82ec-d17d8f2af41d /               ext4    errors=remount-ro 0       1
UUID=3125fd71-5977-444f-9e4a-ec828cebd0a2 /var/lib/zookeeper               ext4    errors=remount-ro 0       1
/dev/mapper/an--conf1001--vg-swap none            swap    sw              0       0

blkid

/dev/sda1: UUID="01846ec1-794f-6bc0-06a7-b9b5df0e2d23" UUID_SUB="95897606-444b-67c0-d434-8093e64e280a" LABEL="an-conf1001:0" TYPE="linux_raid_member" PARTUUID="a81d92f5-01"
/dev/sda2: UUID="5e0f2199-92fa-7766-d398-75ed7adf4785" UUID_SUB="b51cae51-729c-30b5-b13f-e8dc264b2c09" LABEL="an-conf1001:1" TYPE="linux_raid_member" PARTUUID="a81d92f5-02"
/dev/sdb1: UUID="01846ec1-794f-6bc0-06a7-b9b5df0e2d23" UUID_SUB="b770ce51-860c-da67-acf8-5b8d778c09e9" LABEL="an-conf1001:0" TYPE="linux_raid_member" PARTUUID="01c7f2c9-01"
/dev/sdb2: UUID="5e0f2199-92fa-7766-d398-75ed7adf4785" UUID_SUB="97e7349f-7d7b-bd39-e6e4-828fe1463e6f" LABEL="an-conf1001:1" TYPE="linux_raid_member" PARTUUID="01c7f2c9-02"
/dev/md0: UUID="8773beee-195f-42b5-82ec-d17d8f2af41d" TYPE="ext4"
/dev/md1: UUID="kR11NL-EVYt-9vTc-nD9u-F8Mi-pCz1-CurkGo" TYPE="LVM2_member"
/dev/mapper/an--conf1001--vg-swap: UUID="bec76501-a1e5-44e8-89c5-a49d2217f051" TYPE="swap"
/dev/mapper/an--conf1001--vg-zookeeper: UUID="3125fd71-5977-444f-9e4a-ec828cebd0a2" TYPE="ext4"

pv/vg/lv

--- Physical volume ---
PV Name               /dev/md1
VG Name               an-conf1001-vg
PV Size               <400.44 GiB / not usable 0   
Allocatable           yes 
PE Size               4.00 MiB
Total PE              102512
Free PE               75482
Allocated PE          27030
PV UUID               kR11NL-EVYt-9vTc-nD9u-F8Mi-pCz1-CurkGo
 
--- Volume group ---
VG Name               an-conf1001-vg
System ID             
Format                lvm2
Metadata Areas        1
Metadata Sequence No  5
VG Access             read/write
VG Status             resizable
MAX LV                0
Cur LV                3
Open LV               2
Max PV                0
Cur PV                1
Act PV                1
VG Size               <400.44 GiB
PE Size               4.00 MiB
Total PE              102512
Alloc PE / Size       27030 / <105.59 GiB
Free  PE / Size       75482 / 294.85 GiB
VG UUID               m1Ik7I-25iS-tzCo-NFl1-oZ6i-E46M-iv0Nsf
 
--- Logical volume ---
LV Path                /dev/an-conf1001-vg/swap
LV Name                swap
VG Name                an-conf1001-vg
LV UUID                7iz3zg-HuVl-iXsm-1OS4-x1ne-CIYO-eDZVOM
LV Write Access        read/write
LV Creation host, time an-conf1001, 2019-08-22 12:05:03 +0000
LV Status              available
# open                 2
LV Size                952.00 MiB
Current LE             238
Segments               1
Allocation             inherit
Read ahead sectors     auto
- currently set to     256
Block device           253:0
 
--- Logical volume ---
LV Path                /dev/an-conf1001-vg/_placeholder
LV Name                _placeholder
VG Name                an-conf1001-vg
LV UUID                RGDP2l-qxV2-ybIQ-9Q7e-5vV9-c22k-55kBPG
LV Write Access        read/write
LV Creation host, time an-conf1001, 2019-08-22 12:05:03 +0000
LV Status              available
# open                 0
LV Size                <4.66 GiB
Current LE             1192
Segments               1
Allocation             inherit
Read ahead sectors     auto
- currently set to     256
Block device           253:1
 
--- Logical volume ---
LV Path                /dev/an-conf1001-vg/zookeeper
LV Name                zookeeper
VG Name                an-conf1001-vg
LV UUID                HUdRFn-FMac-iAjc-h1c9-aTtf-q7kR-2OBtsf
LV Write Access        read/write
LV Creation host, time an-conf1001, 2019-09-25 15:31:52 +0000
LV Status              available
# open                 1
LV Size                100.00 GiB
Current LE             25600
Segments               1
Allocation             inherit
Read ahead sectors     auto
- currently set to     256
Block device           253:2

This LV doesn't seem to be used: /dev/an-conf1001-vg/_placeholder

+1 on keeping /var/lib/zookeeper when doing the reimages; it seems the safest bet. IIRC zookeeper is not a fixed-uid user (so we don't manage its uid via puppet), so if we preserve /var/lib/zookeeper we'll have to chown -R it after the reimage for sure, but that's a minor nit.

I'd suggest checking preseed.cfg in puppet and the usage of reuse-parts.cfg. In theory we should be able to create a simple partman recipe that wipes root and keeps /var/lib/zookeeper; let me know if you need more pointers :)
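For reference, wiring up such a recipe would also mean replacing the an-conf entry in netboot.cfg quoted above with something along these lines (the recipe filename here is hypothetical; the actual name is whatever the Gerrit change introduces):

```
an-conf*) echo partman/custom/reuse-zookeeper-data.cfg ;; \
```

The key difference from the current `partman/standard.cfg partman/raid1-2dev.cfg` entry is that a reuse-style recipe reformats only the root filesystem and leaves the existing /var/lib/zookeeper partition untouched.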

Change 889954 had a related patch set uploaded (by Nicolas Fraison; author: Nicolas Fraison):

[operations/puppet@production] resuse-zookeeper-data: add reuse partman conf for zk data

https://gerrit.wikimedia.org/r/889954

Some Zookeeper clusters run by other teams are already on bullseye.
Example node: conf1007.eqiad.wmnet

Change 889954 merged by Nicolas Fraison:

[operations/puppet@production] resuse-zookeeper-data: add reuse partman conf for zk data

https://gerrit.wikimedia.org/r/889954

Mentioned in SAL (#wikimedia-analytics) [2023-03-06T09:23:41Z] <nfraison> Reimage an-conf1001 to upgrade to bullseye T329362

Cookbook cookbooks.sre.hosts.reimage was started by nfraison@cumin1001 for host an-conf1001.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by nfraison@cumin1001 for host an-conf1001.eqiad.wmnet with OS bullseye executed with errors:

  • an-conf1001 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by nfraison@cumin1001 for host an-conf1001.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by nfraison@cumin1001 for host an-conf1001.eqiad.wmnet with OS bullseye completed:

  • an-conf1001 (WARN)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202303060949_nfraison_2182237_an-conf1001.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

an-conf1001 reimaged, but zookeeper was not starting.
This was due to /etc/zookeeper/conf/version-2/ not belonging to zookeeper:zookeeper (expected, as the user id is not kept on reimage).
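The fix for this kind of failure is a `chown -R zookeeper:zookeeper` on the preserved data. A small sketch of a pre-flight check for ownership mismatches after a reimage; the `/var/lib/zookeeper` path and `zookeeper` username are the ones from this task, and this would need to run on the host itself:

```python
import pwd
from pathlib import Path


def files_not_owned_by(root: str, username: str) -> list:
    """Return paths under `root` whose owner uid differs from `username`'s uid."""
    expected_uid = pwd.getpwnam(username).pw_uid
    return [p for p in Path(root).rglob("*") if p.lstat().st_uid != expected_uid]


# On an an-conf host, after a reimage that preserved the data partition:
#   files_not_owned_by("/var/lib/zookeeper", "zookeeper")
# A non-empty result means a chown -R is still needed before starting zookeeper.
```

Running this before starting the zookeeper service would have flagged the version-2 directory ownership problem seen on an-conf1001.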

Mentioned in SAL (#wikimedia-analytics) [2023-03-06T12:26:16Z] <nfraison> Reimage an-conf1002 to upgrade to bullseye T329362

Cookbook cookbooks.sre.hosts.reimage was started by nfraison@cumin1001 for host an-conf1002.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by nfraison@cumin1001 for host an-conf1002.eqiad.wmnet with OS bullseye completed:

  • an-conf1002 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202303061232_nfraison_2239948_an-conf1002.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Mentioned in SAL (#wikimedia-analytics) [2023-03-07T08:00:41Z] <nfraison> Reimage an-conf1003 to upgrade to bullseye T329362

Cookbook cookbooks.sre.hosts.reimage was started by nfraison@cumin1001 for host an-conf1003.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by nfraison@cumin1001 for host an-conf1003.eqiad.wmnet with OS bullseye completed:

  • an-conf1003 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202303070807_nfraison_2526898_an-conf1003.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB