This task will track the setup of the following hosts from reimage until serving live traffic in eqiad.
- T342159 for information about naming, racking, and other details.
- T350179 for issues with PXE booting
- This ticket for operations related to provisioning and rotating into production
- T352253 for decommissioning
- T352078 for hiera data consolidation
Common details
OS Distro: Bullseye (Debian 11)
Text hosts: cp1100-cp1107
Upload hosts: cp1108-cp1115
General plan
- Write hiera configuration for all new hosts in eqiad (verify with PCC that it is a no-op for other cp hosts in eqiad and in other DCs)
- Reimage the first host without pooling it and check that everything is fine (confirm the BIOS settings are correct, see T349314)
- `sudo cumin 'cp11*' 'egrep -q "vmx|svm" /proc/cpuinfo && echo yes || echo no'`
- `sudo cumin 'cp11*' 'grep -P "processor\s*:\s*95$" /proc/cpuinfo' # check HT is enabled: 96 logical CPUs in total, so the highest processor index should be 95`
- `sudo cumin 'cp11*' 'nvme list'`
- Reimage all new hosts without pooling them
- Swap old/new hosts, waiting 24h between each swap (hosts in the text and upload clusters can be swapped in parallel)
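As a sketch, the reimage step above would typically be driven by the sre.hosts.reimage cookbook named in the per-host checklists; the exact flags and the task ID shown here are illustrative assumptions, so check the cookbook's help output before running it:

```
# Hedged sketch: reimage one new host without pooling it.
# Flags (--os, -t) and the task ID are assumptions, not verified values.
sudo cookbook sre.hosts.reimage --os bullseye -t T342159 cp1100
```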
Host swap
Since all "old" hosts use the multi-ats-backend configuration, we assume the safest way to introduce the new servers without drastically reducing the hit rate on the old hosts is:
For each (text|upload) cluster:
0. (preparation): Set the weight on all new cp hosts even while they are inactive (1 for cdn, 100 for ats-be).
- Depool $oldHost using confctl (e.g. `confctl select name=<oldhost>.eqiad.wmnet,service=cdn set/pooled=inactive`).
- ONLY THE "cdn" SERVICE should be set to pooled: inactive.
- The ats-be service will still be pooled: yes. This allows the other old cp hosts to keep using $oldHost as a backend, preserving the hit rate.
- The cdn service weight should be set to 0.
- Remove the downtime for $newHost.
- Pool $newHost using confctl (e.g. `confctl select name=<newhost>.eqiad.wmnet,service=cdn set/pooled=yes`).
- The cdn service will be set to pooled: yes.
- The ats-be service will be set to pooled: no (even though it isn't actually used by the new cp hosts).
- Wait 24h, monitoring the hit rate of the new server and the general behavior of the services in eqiad.
- Repeat the swap in the same way for all hosts, always waiting 24h between each one.
- When all the legacy hosts are depooled, we can also set the ats-be service on the new nodes to pooled: yes (for consistency only, as it is not used).
- Legacy hosts will be decommissioned.
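The per-pair swap above can be sketched as a confctl command sequence. This is a sketch only: the hostnames are one example pair from the checklists below, the `set/weight=0` form is assumed to follow the same syntax as the pooled examples above, and the downtime removal depends on local tooling:

```
# Sketch of one swap (old cp1075 -> new cp1100).
# Depool only the cdn service on the old host; ats-be stays pooled: yes.
sudo confctl select 'name=cp1075.eqiad.wmnet,service=cdn' set/pooled=inactive
sudo confctl select 'name=cp1075.eqiad.wmnet,service=cdn' set/weight=0
# Remove the downtime for the new host, then pool its cdn service:
sudo confctl select 'name=cp1100.eqiad.wmnet,service=cdn' set/pooled=yes
# After 24h of healthy hit-rate metrics, repeat for the next pair.
```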
Per host setup checklist
cp1100:
- Add host to manifests/site.pp
- Add host to conftool-data/node/eqiad.yaml
- Add host to hieradata/common.yaml
- Add host to hieradata/common/cache.yaml
- Create per-host hiera file with configuration for dual disk
- Confirm host is actually reachable and ready for reimaging
- OS Installation & initial puppet run via sre.hosts.reimage cookbook
- Ensure the host is depooled
- Depool "corresponding" old host (cp1075)
- Pool this host
cp1101:
- Add host to manifests/site.pp
- Add host to conftool-data/node/eqiad.yaml
- Add host to hieradata/common.yaml
- Add host to hieradata/common/cache.yaml
- Create per-host hiera file with configuration for dual disk
- Confirm host is actually reachable and ready for reimaging
- OS Installation & initial puppet run via sre.hosts.reimage cookbook
- Ensure the host is depooled
- Depool "corresponding" old host (cp1076)
- Pool this host
cp1102:
- Add host to manifests/site.pp
- Add host to conftool-data/node/eqiad.yaml
- Add host to hieradata/common.yaml
- Add host to hieradata/common/cache.yaml
- Create per-host hiera file with configuration for dual disk
- Confirm host is actually reachable and ready for reimaging
- OS Installation & initial puppet run via sre.hosts.reimage cookbook
- Ensure the host is depooled
- Depool "corresponding" old host (cp1077)
- Pool this host
cp1103:
- Add host to manifests/site.pp
- Add host to conftool-data/node/eqiad.yaml
- Add host to hieradata/common.yaml
- Add host to hieradata/common/cache.yaml
- Create per-host hiera file with configuration for dual disk
- Confirm host is actually reachable and ready for reimaging
- OS Installation & initial puppet run via sre.hosts.reimage cookbook
- Ensure the host is depooled
- Depool "corresponding" old host (cp1078)
- Pool this host
cp1104:
- Add host to manifests/site.pp
- Add host to conftool-data/node/eqiad.yaml
- Add host to hieradata/common.yaml
- Add host to hieradata/common/cache.yaml
- Create per-host hiera file with configuration for dual disk
- Confirm host is actually reachable and ready for reimaging
- OS Installation & initial puppet run via sre.hosts.reimage cookbook
- Ensure the host is depooled
- Depool "corresponding" old host (cp1079)
- Pool this host
cp1105:
- Add host to manifests/site.pp
- Add host to conftool-data/node/eqiad.yaml
- Add host to hieradata/common.yaml
- Add host to hieradata/common/cache.yaml
- Create per-host hiera file with configuration for dual disk
- Confirm host is actually reachable and ready for reimaging
- OS Installation & initial puppet run via sre.hosts.reimage cookbook
- Ensure the host is depooled
- Depool "corresponding" old host (cp1080)
- Pool this host
cp1106:
- Add host to manifests/site.pp
- Add host to conftool-data/node/eqiad.yaml
- Add host to hieradata/common.yaml
- Add host to hieradata/common/cache.yaml
- Create per-host hiera file with configuration for dual disk
- Confirm host is actually reachable and ready for reimaging
- OS Installation & initial puppet run via sre.hosts.reimage cookbook
- Ensure the host is depooled
- Depool "corresponding" old host (cp1081)
- Pool this host
cp1107:
- Add host to manifests/site.pp
- Add host to conftool-data/node/eqiad.yaml
- Add host to hieradata/common.yaml
- Add host to hieradata/common/cache.yaml
- Create per-host hiera file with configuration for dual disk
- Confirm host is actually reachable and ready for reimaging
- OS Installation & initial puppet run via sre.hosts.reimage cookbook
- Ensure the host is depooled
- Depool "corresponding" old host (cp1082)
- Pool this host
cp1108:
- Add host to manifests/site.pp
- Add host to conftool-data/node/eqiad.yaml
- Add host to hieradata/common.yaml
- Add host to hieradata/common/cache.yaml
- Create per-host hiera file with configuration for dual disk
- Confirm host is actually reachable and ready for reimaging
- OS Installation & initial puppet run via sre.hosts.reimage cookbook
- Ensure the host is depooled
- Depool "corresponding" old host (cp1083)
- Pool this host
cp1109:
- Add host to manifests/site.pp
- Add host to conftool-data/node/eqiad.yaml
- Add host to hieradata/common.yaml
- Add host to hieradata/common/cache.yaml
- Create per-host hiera file with configuration for dual disk
- Confirm host is actually reachable and ready for reimaging
- OS Installation & initial puppet run via sre.hosts.reimage cookbook
- Ensure the host is depooled
- Depool "corresponding" old host (cp1084)
- Pool this host
cp1110:
- Add host to manifests/site.pp
- Add host to conftool-data/node/eqiad.yaml
- Add host to hieradata/common.yaml
- Add host to hieradata/common/cache.yaml
- Create per-host hiera file with configuration for dual disk
- Confirm host is actually reachable and ready for reimaging
- OS Installation & initial puppet run via sre.hosts.reimage cookbook
- Ensure the host is depooled
- Depool "corresponding" old host (cp1085)
- Pool this host
cp1111:
- Add host to manifests/site.pp
- Add host to conftool-data/node/eqiad.yaml
- Add host to hieradata/common.yaml
- Add host to hieradata/common/cache.yaml
- Create per-host hiera file with configuration for dual disk
- Confirm host is actually reachable and ready for reimaging
- OS Installation & initial puppet run via sre.hosts.reimage cookbook
- Ensure the host is depooled
- Depool "corresponding" old host (cp1086)
- Pool this host
cp1112:
- Add host to manifests/site.pp
- Add host to conftool-data/node/eqiad.yaml
- Add host to hieradata/common.yaml
- Add host to hieradata/common/cache.yaml
- Create per-host hiera file with configuration for dual disk
- Confirm host is actually reachable and ready for reimaging
- OS Installation & initial puppet run via sre.hosts.reimage cookbook
- Ensure the host is depooled
- Depool "corresponding" old host (cp1087)
- Pool this host
cp1113:
- Add host to manifests/site.pp
- Add host to conftool-data/node/eqiad.yaml
- Add host to hieradata/common.yaml
- Add host to hieradata/common/cache.yaml
- Create per-host hiera file with configuration for dual disk
- Confirm host is actually reachable and ready for reimaging
- OS Installation & initial puppet run via sre.hosts.reimage cookbook
- Ensure the host is depooled
- Depool "corresponding" old host (cp1088)
- Pool this host
cp1114:
- Add host to manifests/site.pp
- Add host to conftool-data/node/eqiad.yaml
- Add host to hieradata/common.yaml
- Add host to hieradata/common/cache.yaml
- Create per-host hiera file with configuration for dual disk
- Confirm host is actually reachable and ready for reimaging
- OS Installation & initial puppet run via sre.hosts.reimage cookbook
- Ensure the host is depooled
- Depool "corresponding" old host (cp1089)
- Pool this host
cp1115:
- Add host to manifests/site.pp
- Add host to conftool-data/node/eqiad.yaml
- Add host to hieradata/common.yaml
- Add host to hieradata/common/cache.yaml
- Create per-host hiera file with configuration for dual disk
- Confirm host is actually reachable and ready for reimaging
- OS Installation & initial puppet run via sre.hosts.reimage cookbook
- Ensure the host is depooled
- Depool "corresponding" old host (cp1090)
- Pool this host
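The new-to-old pairing in the checklists above is a fixed offset of 25 (cp1100 replaces cp1075, through cp1115 replacing cp1090). A small shell loop can generate the full mapping, e.g. as input for batched commands; the loop itself is just a convenience sketch, not part of any tooling:

```shell
# Generate the new -> old host mapping used in the checklists above.
# The offset of 25 is taken from the listed pairs (cp1100 -> cp1075).
for new in $(seq 1100 1115); do
  old=$((new - 25))
  echo "cp${new} replaces cp${old}"
done
```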
Re-ordered Host Reimage
- cp1101.eqiad.wmnet
- cp1103.eqiad.wmnet
- cp1105.eqiad.wmnet
- cp1107.eqiad.wmnet
- cp1108.eqiad.wmnet
- cp1110.eqiad.wmnet
- cp1112.eqiad.wmnet
- cp1114.eqiad.wmnet