At the moment we support both Buster and Bullseye for Prometheus (the ops instance, deployed to every site). However, for things like blackbox-exporter configuration options it would be nice to have all hosts on Bullseye.
Procedure to move data from old hosts to new hosts (to be tested, adjusted, and then moved to wikitech)
Preflight checks
- hosts are running prometheus role and show up in all prometheus hosts lists (e.g. prometheus_all_nodes)
- ACLs on network devices have been updated
- mysqld grants are updated (to be verified)
Migration
- [new host] stop puppet / prometheus@ops / thanos-sidecar@ops
- [new host] remove accumulated data so far: rm -rf /srv/prometheus/ops/metrics
- Initial rsync of data old -> new
- [old host] stop puppet / thanos-sidecar@ops / prometheus@ops. Note that once thanos-sidecar@ops is stopped here, Thanos won't be able to query data for the PoP until the new host is serving
- Final rsync of data old -> new
- [new host] chown -R prometheus:prometheus /srv/prometheus/ops
- [new host] set the replica label in puppet to match the old host's, then merge the change
- [new host] re-enable puppet and run it; this restarts prometheus and thanos-sidecar@ops, so Thanos will be able to query data from the new host
- (applicable on PoPs only) Flip DNS for prometheus.svc record to point to the new host
- [old host] make sure puppet stays disabled, and thanos-sidecar@ops does not run. Ideally decom the host ASAP.
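
The host-level steps above can be sketched as an ordered plan, which may help when testing and adjusting the procedure. This is only a rough sketch, not the final runbook: the function name migration_plan, the example hostnames, the rsync flags (-a --delete), pulling over ssh from the new host, and the use of plain `puppet agent --disable/--enable/--test` are all assumptions and not part of the procedure as written. The DNS flip and the decom of the old host are left out on purpose.

```python
from typing import List, Tuple

# Data directory taken from the procedure above.
DATA_DIR = "/srv/prometheus/ops/metrics"


def migration_plan(old_host: str, new_host: str) -> List[Tuple[str, str]]:
    """Return the migration steps as (where, command) pairs, in order.

    Nothing is executed here; the plan is meant to be reviewed by a human.
    """
    # Assumption: rsync is run on the new host and pulls from the old one.
    rsync = f"rsync -a --delete {old_host}:{DATA_DIR}/ {DATA_DIR}/"
    disable = 'puppet agent --disable "prometheus host migration"'
    return [
        (new_host, disable),
        (new_host, "systemctl stop prometheus@ops thanos-sidecar@ops"),
        (new_host, f"rm -rf {DATA_DIR}"),
        (new_host, rsync),  # initial rsync, old host still serving
        (old_host, disable),
        (old_host, "systemctl stop thanos-sidecar@ops prometheus@ops"),
        (new_host, rsync),  # final rsync, old host stopped
        (new_host, "chown -R prometheus:prometheus /srv/prometheus/ops"),
        # Not a shell command: a puppet change that has to be merged first.
        ("manual", "set replica label in puppet to match the old host's"),
        (new_host, "puppet agent --enable"),
        (new_host, "puppet agent --test"),
    ]


if __name__ == "__main__":
    # Hypothetical hostnames, for illustration only.
    for where, command in migration_plan("prometheus-old", "prometheus-new"):
        print(f"[{where}] {command}")
```

Keeping the plan as data rather than running it directly makes it easy to review the ordering, in particular that the old host is stopped only between the two rsync runs.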
Followups
- Move the final migration procedure to wikitech
- Make sure that a prolonged Prometheus outage triggers a page
- Make sure there is a sensible default for replica_label in Puppet/Prometheus