We received new hardware for Prometheus in T294967 and T294302 as part of the scheduled refresh. While we're at it, the new hosts will run Bullseye.
At a high level we want to "forklift" the existing hosts: in other words, we'll copy the metrics from the old hosts onto the new ones. During the process we'll also pause uploads to Thanos for long-term storage to avoid duplicates, since we'll keep the same replica label.
Outline of steps:
- Hardware is provisioned
- Add the new hostnames where relevant in Puppet (exact places TBD, e.g. ferm)
- Assign the prometheus role so the new hosts start polling metrics. Make sure uploads to Thanos are disabled and that the alertmanagers list is empty for those hosts.
- Make sure the hosts are in the router ACLs
- Validate that Prometheus is working as expected (e.g. can read/write metrics successfully)
- Sync metrics from the old hosts to the new ones (exact procedure TBD)
- Re-enable Thanos uploads and pool the host for reads
- Decommission the old hosts (note: remember to file a task to remove the zarcillo grants for the old hosts, and to remove the hosts from the router ACLs)
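
For the validation step in the outline above, one possible check is to ask the new host which scrape targets it considers down. The sketch below is only an illustration and not part of the agreed procedure: it uses Prometheus' standard HTTP API (`/api/v1/query`) and the built-in `up` metric, while the helper names `instant_query` and `down_targets` and any base URL passed to them are hypothetical.

```python
import json
import urllib.parse
import urllib.request


def instant_query(base_url, expr, timeout=10):
    """Run an instant query via Prometheus' HTTP API and return the parsed JSON.

    base_url is an assumption: it must point at the root of one Prometheus
    instance on the new host, including any path prefix that instance uses.
    """
    url = base_url.rstrip("/") + "/api/v1/query?" + urllib.parse.urlencode({"query": expr})
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def down_targets(payload):
    """Return sorted (job, instance) pairs whose `up` sample is not 1."""
    if payload.get("status") != "success":
        raise ValueError("query failed: %s" % payload.get("error", "unknown error"))
    down = []
    for series in payload["data"]["result"]:
        # Each instant-vector sample is a [timestamp, "value"] pair.
        _timestamp, value = series["value"]
        if float(value) != 1.0:
            labels = series["metric"]
            down.append((labels.get("job", ""), labels.get("instance", "")))
    return sorted(down)
```

Calling `down_targets(instant_query(base_url, "up"))` against both an old and a new host and comparing the two lists would show whether the new host reaches the same targets; differences may point at missing ferm rules or router ACL entries.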