Move Prometheus off eqsin/ulsfo/esams bastions
Open, Medium, Public

Description

We have Ganeti clusters available in the PoPs now, so we should move Prometheus off the bastions and onto Ganeti VMs.

Outline of steps per site:

  • Provision Prometheus VM (8GB RAM / 128GB disk to start with)
  • Add VM hostname to prometheus_nodes and prometheus_all_nodes
  • Stop Prometheus on the bastion and the VM, rsync /srv/prometheus from the bastion to the VM, then restart Prometheus on both the VM and the bastion
  • Flip DNS name prometheus.svc to point to the VM
  • Reimage bastion
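
The outline above can be written out per site as a simple ordered checklist. The following is only a sketch: the bastion and VM hostnames are the ones named in the SAL entries and DNS changes later in this task, while the `Site` class and the `migration_plan` helper are hypothetical names introduced for illustration. It prints the steps and does not run anything against the hosts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Site:
    """One PoP: its name, the bastion that runs Prometheus today, and the new VM."""
    name: str
    bastion: str
    vm: str


# Host pairs as mentioned in the SAL entries on this task.
SITES = [
    Site("eqsin", "bast5001", "prometheus5001"),
    Site("ulsfo", "bast4002", "prometheus4001"),
    Site("esams", "bast3004", "prometheus3001"),
]


def migration_plan(site: Site) -> list[str]:
    """Return the outline steps for one site, in order (hypothetical helper)."""
    return [
        f"provision {site.vm} as a Ganeti VM (8GB RAM, 128GB disk)",
        f"add {site.vm} to prometheus_nodes and prometheus_all_nodes",
        f"stop Prometheus on {site.bastion} and {site.vm}",
        f"rsync /srv/prometheus from {site.bastion} to {site.vm}",
        f"restart Prometheus on {site.vm} and {site.bastion}",
        f"point prometheus.svc.{site.name} at {site.vm}",
        f"reimage {site.bastion}",
    ]


if __name__ == "__main__":
    for site in SITES:
        print(f"== {site.name} ==")
        for number, step in enumerate(migration_plan(site), start=1):
            print(f"{number}. {step}")
```

Keeping the stop, rsync, and restart steps as separate entries makes it easier to tick them off one by one during the switchover.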

Event Timeline

Change 609178 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] cumin: update prometheus alias

https://gerrit.wikimedia.org/r/609178

Change 609178 merged by Ema:
[operations/puppet@production] cumin: update prometheus alias

https://gerrit.wikimedia.org/r/609178

Change 612267 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] cumin: fix prometheus alias

https://gerrit.wikimedia.org/r/612267

Change 612267 merged by Ema:
[operations/puppet@production] cumin: fix prometheus alias

https://gerrit.wikimedia.org/r/612267

Change 613163 had a related patch set uploaded (by Herron; owner: Herron):
[operations/dns@master] dns: add forward/reverse records for prometheus[345]001

https://gerrit.wikimedia.org/r/613163

Change 613163 merged by Herron:
[operations/dns@master] dns: add forward/reverse records for prometheus[345]001

https://gerrit.wikimedia.org/r/613163

Change 613281 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] install_server: add dhcp/netboot entries for prometheus[345]001

https://gerrit.wikimedia.org/r/613281

Change 613281 merged by Herron:
[operations/puppet@production] install_server: add dhcp/netboot entries for prometheus[345]001

https://gerrit.wikimedia.org/r/613281

Change 613293 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] assign role::insetup to prometheus[345]001

https://gerrit.wikimedia.org/r/613293

Change 613293 merged by Herron:
[operations/puppet@production] assign role::insetup to prometheus[345]001

https://gerrit.wikimedia.org/r/613293

Change 613662 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] prometheus[123]001 assign role::prometheus, add to prometheus_nodes

https://gerrit.wikimedia.org/r/613662

With Thanos in production now, we'll have to add steps to cater for the switchover from one host to the other. Since the underlying data will be the same, I'm thinking we should keep the same replica_label (used for deduplication) on both hosts. If Thanos query reaches out to both hosts, I think there will be a period of time with duplicated results, which we can limit by blocking thanos-query access to one of them, e.g. the new host.

WRT uploading of blocks to Thanos, the easiest approach is probably to stop uploading before the rsync and then re-enable uploading on the new host after the rsync.
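
To illustrate what deduplication on a replica label means, here is a toy model in Python. It is a sketch under stated assumptions, not Thanos's actual algorithm: the label name `replica`, the sample data, and the `deduplicate` function are all hypothetical. Series whose labels differ at most in the replica label are treated as one series, and their samples are merged by timestamp. It does not model the window of duplicated results mentioned above.

```python
from collections import defaultdict

# Hypothetical label name; the real one is whatever replica_label is set to.
REPLICA_LABEL = "replica"


def deduplicate(series_list, replica_label=REPLICA_LABEL):
    """Merge series that are identical once the replica label is ignored.

    series_list is a list of (labels, samples) pairs, where labels is a dict
    and samples is a list of (timestamp, value) tuples. For a timestamp seen
    more than once, the first value encountered is kept.
    """
    merged = defaultdict(dict)
    for labels, samples in series_list:
        # Identity of a series = all labels except the replica label.
        key = frozenset((k, v) for k, v in labels.items() if k != replica_label)
        for timestamp, value in samples:
            merged[key].setdefault(timestamp, value)
    return {key: sorted(points.items()) for key, points in merged.items()}


if __name__ == "__main__":
    # The same series as reported by the old and the new host after an rsync.
    old_host = ({"__name__": "up", "site": "eqsin", "replica": "a"},
                [(1, 1.0), (2, 1.0)])
    new_host = ({"__name__": "up", "site": "eqsin", "replica": "a"},
                [(2, 1.0), (3, 1.0)])
    for key, points in deduplicate([old_host, new_host]).items():
        print(dict(key), points)
```

In this simplified model the overlapping sample at timestamp 2 is kept once, so the two hosts look like a single series to the reader.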

Change 613662 merged by Herron:
[operations/puppet@production] prometheus[345]001 assign role::prometheus, add to prometheus_nodes

https://gerrit.wikimedia.org/r/613662

The new Prometheus hosts have started alerting today. I've acked the current alerts until you folks have time to look into them. The generate-mysqld-exporter-config alerts just require a simple grant update, which I'll do tomorrow morning.

The grants have been created, all 3 new prom hosts can now successfully run generate-mysqld-exporter-config.

Change 615273 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] prometheus: introduce role::prometheus::pop

https://gerrit.wikimedia.org/r/615273

Change 615273 merged by Herron:
[operations/puppet@production] prometheus: introduce role::prometheus::pop

https://gerrit.wikimedia.org/r/615273

Change 628104 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] prometheus: add pop hosts to prometheus_all_nodes, set replica_label

https://gerrit.wikimedia.org/r/628104

Change 628104 merged by Herron:
[operations/puppet@production] prometheus: add pop hosts to prometheus_all_nodes, set replica_label

https://gerrit.wikimedia.org/r/628104

Change 628158 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] prometheus: enable rsyncd on pop hosts

https://gerrit.wikimedia.org/r/628158

Change 628158 merged by Herron:
[operations/puppet@production] prometheus: enable rsyncd on pop hosts

https://gerrit.wikimedia.org/r/628158

Change 628847 had a related patch set uploaded (by Herron; owner: Herron):
[operations/dns@master] prometheus: point prometheus.svc.eqsin to prometheus5001

https://gerrit.wikimedia.org/r/628847

Mentioned in SAL (#wikimedia-operations) [2020-09-21T14:30:29Z] <herron> moving prometheus from bast5001 to prometheus5001 T243057

Change 628847 merged by Herron:
[operations/dns@master] prometheus: point prometheus.svc.eqsin to prometheus5001

https://gerrit.wikimedia.org/r/628847

Mentioned in SAL (#wikimedia-operations) [2020-09-21T19:46:07Z] <herron> moving prometheus instance from bast4002 to prometheus4001 T243057

Change 628925 had a related patch set uploaded (by Herron; owner: Herron):
[operations/dns@master] prometheus: point prometheus.svc.ulsfo to prometheus4001

https://gerrit.wikimedia.org/r/628925

Change 628925 merged by Herron:
[operations/dns@master] prometheus: point prometheus.svc.ulsfo to prometheus4001

https://gerrit.wikimedia.org/r/628925

Mentioned in SAL (#wikimedia-operations) [2020-09-21T20:04:57Z] <herron> moving prometheus instance from bast3004 to prometheus3001 T243057

Change 628940 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] role::bastionhost::pop: remove prometheus instances

https://gerrit.wikimedia.org/r/628940

Change 628977 had a related patch set uploaded (by Herron; owner: Herron):
[operations/dns@master] prometheus: point prometheus.svc.esams to prometheus3001

https://gerrit.wikimedia.org/r/628977

Change 628977 merged by Herron:
[operations/dns@master] prometheus: point prometheus.svc.esams to prometheus3001

https://gerrit.wikimedia.org/r/628977

Change 628940 merged by Herron:
[operations/puppet@production] role::bastionhost::pop: remove prometheus instances

https://gerrit.wikimedia.org/r/628940

Change 629380 had a related patch set uploaded (by Muehlenhoff; owner: Muehlenhoff):
[operations/puppet@production] Configure bastions for Buster on next reimage

https://gerrit.wikimedia.org/r/629380

Change 629380 merged by Muehlenhoff:
[operations/puppet@production] Configure bastions for Buster on next reimage

https://gerrit.wikimedia.org/r/629380