Change Details

We got new hardware for Prometheus in T294967 and T294302 for the scheduled refresh. The new hosts will run Bullseye while we're at it. At a high level we want to essentially "forklift" the existing hosts: in other words, we'll copy the metrics from the old hosts onto the new ones. During the process we'll also pause uploads to Thanos for long-term storage so as to avoid duplicates (we'll keep the same `replica` label).

Outline of steps:
* [x] Hardware is provisioned
* [x] Add the new hostnames where relevant in puppet (exact places TBD, e.g. ferm)
* [x] Assign the `prometheus` role to start polling metrics. Make sure uploads to Thanos are disabled. Make sure `alertmanagers` is set empty for those hosts.
* [x] Make sure hosts are in routers ACLs
* [x] Validate that Prometheus is working as expected (e.g. can read/write metrics successfully)
* [x] Sync metrics from old host into the new (exact procedure TBD)
* [x] Re-enable Thanos uploads and pool the host for reads
* [ ] Decom old hosts (note: remember to file task to remove zarcillo grants for old hosts)
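
The validation step in the outline could be checked along the following lines. This is a minimal sketch, not the procedure actually used for this migration: it runs the instant query `up` against the standard Prometheus HTTP API (`/api/v1/query`) and reports targets whose last scrape failed. The function names and the base URL in the usage note are hypothetical, and the check only covers the read path, not writes.

```python
import json
import urllib.parse
import urllib.request


def build_query_url(base_url, promql):
    """Return the instant-query URL for a PromQL expression."""
    query = urllib.parse.urlencode({"query": promql})
    return base_url.rstrip("/") + "/api/v1/query?" + query


def down_targets(response):
    """Return the label sets of all series in an `up` result whose value is not 1.

    `response` is the decoded JSON body of an instant query.
    Raises ValueError if the query failed or returned no series at all,
    since an empty result suggests nothing is being scraped yet.
    """
    if response.get("status") != "success":
        raise ValueError("query failed: %r" % response.get("error"))
    result = response["data"]["result"]
    if not result:
        raise ValueError("query returned no series")
    down = []
    for series in result:
        # Instant vectors carry one [timestamp, "value"] pair per series.
        _timestamp, value = series["value"]
        if float(value) != 1.0:
            down.append(series["metric"])
    return down


def check_host(base_url):
    """Query `up` on one Prometheus instance and return its down targets."""
    url = build_query_url(base_url, "up")
    with urllib.request.urlopen(url, timeout=10) as resp:
        return down_targets(json.load(resp))
```

For example, `check_host("http://localhost:9090")` (a placeholder address, assuming a Prometheus instance listening there) returns an empty list when every scraped target reports `up == 1`. Running it against both the old and the new host and comparing the results is one way to spot targets the new host cannot reach, for instance because of a missing ferm rule or router ACL.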