
Move Prometheus off eqsin/ulsfo/esams bastions
Closed, Resolved · Public

Description

We have Ganeti clusters available in PoPs now, thus we should move Prometheus to a Ganeti VM and off the bastions.

Outline of steps per site:

  • Provision Prometheus VM (8GB RAM / 128GB disk to start with)
  • Add VM hostname to prometheus_nodes and prometheus_all_nodes
  • Stop Prometheus on the bastion and the VM + rsync /srv/prometheus bastion -> VM + restart Prometheus on the VM and bastion (see the sketch after this list)
  • Flip the prometheus.svc DNS name to point to the VM
  • Reimage bastion
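
For illustration, a rough sketch of the stop/rsync/restart step, using eqsin host names as an example and assuming the Prometheus instance runs as a prometheus@ops systemd unit and that the copy is done with rsync over ssh (unit name and transfer method are assumptions; the patches further down enable rsyncd on the new hosts instead):

```
# On both the bastion (bast5001) and the new VM (prometheus5001):
# stop Prometheus so the TSDB is not being written to during the copy.
sudo systemctl stop prometheus@ops

# On the new VM: pull the on-disk data over from the bastion.
sudo rsync -a --delete bast5001.wikimedia.org:/srv/prometheus/ /srv/prometheus/

# Start Prometheus again on both hosts once the copy has finished.
sudo systemctl start prometheus@ops
```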

Event Timeline

Change 609178 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] cumin: update prometheus alias

https://gerrit.wikimedia.org/r/c/operations/puppet/+/609178

Change 609178 merged by Ema:
[operations/puppet@production] cumin: update prometheus alias

https://gerrit.wikimedia.org/r/609178

Change 612267 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] cumin: fix prometheus alias

https://gerrit.wikimedia.org/r/612267

Change 612267 merged by Ema:
[operations/puppet@production] cumin: fix prometheus alias

https://gerrit.wikimedia.org/r/612267

Change 613163 had a related patch set uploaded (by Herron; owner: Herron):
[operations/dns@master] dns: add forward/reverse records for prometheus[345]001

https://gerrit.wikimedia.org/r/613163

Change 613163 merged by Herron:
[operations/dns@master] dns: add forward/reverse records for prometheus[345]001

https://gerrit.wikimedia.org/r/613163

Change 613281 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] install_server: add dhcp/netboot entries for prometheus[345]001

https://gerrit.wikimedia.org/r/613281

Change 613281 merged by Herron:
[operations/puppet@production] install_server: add dhcp/netboot entries for prometheus[345]001

https://gerrit.wikimedia.org/r/613281

Change 613293 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] assign role::insetup to prometheus[345]001

https://gerrit.wikimedia.org/r/613293

Change 613293 merged by Herron:
[operations/puppet@production] assign role::insetup to prometheus[345]001

https://gerrit.wikimedia.org/r/613293

Change 613662 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] prometheus[345]001 assign role::prometheus, add to prometheus_nodes

https://gerrit.wikimedia.org/r/613662

With Thanos in production now, we'll have to add steps to cater for the switchover from one host to the other. Since the underlying data will be the same, I'm thinking we should keep the same replica_label (used for deduplication) on both hosts. If Thanos query reaches out to both hosts then I think there will be a period of time with duplicated results, which we can curb by blocking thanos-query access on e.g. the new host.

WRT uploading of blocks to Thanos, the easiest approach is probably to stop uploads before the rsync and re-enable them on the new host afterwards.
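
A sketch of what pausing block uploads around the copy could look like, assuming the sidecar runs as a thanos-sidecar@ops unit (the unit name is an assumption); the sidecar only ships blocks while it is running, and on startup it uploads any local blocks not yet present in object storage:

```
# Old host: stop the sidecar before the rsync so no blocks are shipped mid-copy.
sudo systemctl stop thanos-sidecar@ops

# ... stop Prometheus, rsync /srv/prometheus to the new host, flip DNS ...

# New host: start the sidecar once the data is in place; it resumes uploading
# from the copied TSDB directory.
sudo systemctl start thanos-sidecar@ops
```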

Change 613662 merged by Herron:
[operations/puppet@production] prometheus[345]001 assign role::prometheus, add to prometheus_nodes

https://gerrit.wikimedia.org/r/613662

The new prometheus hosts have started alerting today. I've acked the current alerts until you folks have time to look into them. The generate-mysqld-exporter-config alerts just require a simple grant update, which I'll do tomorrow morning.

The grants have been created; all 3 new prom hosts can now successfully run generate-mysqld-exporter-config.

Change 615273 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] prometheus: introduce role::prometheus::pop

https://gerrit.wikimedia.org/r/615273

Change 615273 merged by Herron:
[operations/puppet@production] prometheus: introduce role::prometheus::pop

https://gerrit.wikimedia.org/r/615273

Change 628104 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] prometheus: add pop hosts to prometheus_all_nodes, set replica_label

https://gerrit.wikimedia.org/r/628104

Change 628104 merged by Herron:
[operations/puppet@production] prometheus: add pop hosts to prometheus_all_nodes, set replica_label

https://gerrit.wikimedia.org/r/628104

Change 628158 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] prometheus: enable rsyncd on pop hosts

https://gerrit.wikimedia.org/r/628158

Change 628158 merged by Herron:
[operations/puppet@production] prometheus: enable rsyncd on pop hosts

https://gerrit.wikimedia.org/r/628158

Change 628847 had a related patch set uploaded (by Herron; owner: Herron):
[operations/dns@master] prometheus: point prometheus.svc.eqsin to prometheus5001

https://gerrit.wikimedia.org/r/628847

Mentioned in SAL (#wikimedia-operations) [2020-09-21T14:30:29Z] <herron> moving prometheus from bast5001 to prometheus5001 T243057

Change 628847 merged by Herron:
[operations/dns@master] prometheus: point prometheus.svc.eqsin to prometheus5001

https://gerrit.wikimedia.org/r/628847

Mentioned in SAL (#wikimedia-operations) [2020-09-21T19:46:07Z] <herron> moving prometheus instance from bast4002 to prometheus4001 T243057

Change 628925 had a related patch set uploaded (by Herron; owner: Herron):
[operations/dns@master] prometheus: point prometheus.svc.ulsfo to prometheus4001

https://gerrit.wikimedia.org/r/628925

Change 628925 merged by Herron:
[operations/dns@master] prometheus: point prometheus.svc.ulsfo to prometheus4001

https://gerrit.wikimedia.org/r/628925

Mentioned in SAL (#wikimedia-operations) [2020-09-21T20:04:57Z] <herron> moving prometheus instance from bast3004 to prometheus3001 T243057

Change 628940 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] role::bastionhost::pop: remove prometheus instances

https://gerrit.wikimedia.org/r/628940

Change 628977 had a related patch set uploaded (by Herron; owner: Herron):
[operations/dns@master] prometheus: point prometheus.svc.esams to prometheus3001

https://gerrit.wikimedia.org/r/628977

Change 628977 merged by Herron:
[operations/dns@master] prometheus: point prometheus.svc.esams to prometheus3001

https://gerrit.wikimedia.org/r/628977

Change 628940 merged by Herron:
[operations/puppet@production] role::bastionhost::pop: remove prometheus instances

https://gerrit.wikimedia.org/r/628940

Change 629380 had a related patch set uploaded (by Muehlenhoff; owner: Muehlenhoff):
[operations/puppet@production] Configure bastions for Buster on next reimage

https://gerrit.wikimedia.org/r/629380

Change 629380 merged by Muehlenhoff:
[operations/puppet@production] Configure bastions for Buster on next reimage

https://gerrit.wikimedia.org/r/629380

When we recreate the bastions without prometheus, we don't need to use 40GB disk anymore, right?

Can we make them considerably smaller? Because today we actually ran out of disk in the esams Ganeti cluster and, looking at what uses it, besides prometheus* being REALLY large, the bastions seemed like the next candidate for saving some space.

> When we recreate the bastions without prometheus, we don't need to use 40GB disk anymore, right?

They are intentionally a bit larger since people use them to temporarily store some data or move things around. In the past they were baremetal hosts with much bigger disks.

We could probably shave off 10 or 20G, but I think it's optimising for the wrong case.

> Can we make them considerably smaller? Because today we actually ran out of disk in the esams Ganeti cluster and, looking at what uses it, besides prometheus* being REALLY large, the bastions seemed like the next candidate for saving some space.

prometheus3001 is really the disk space hog here, with ~280G. I'd expect it to grow further rather than shrink (say, new metrics etc.), so those 20G wouldn't matter in practice.

Do we have any insight/projection into what kind of growth we've seen for metrics storage? The obvious option is to reduce retention time (which might be non-ideal if retention differs from the main DCs).

The other option is to move it off Ganeti; we have unused bare metal hosts at all edge cache sites (overview at https://phabricator.wikimedia.org/T257324#6751480) and we could allocate them for local Prometheus instead. Adding @BBlack since there were also tentative plans to use these for additional Ganeti capacity (for DoH).

>> When we recreate the bastions without prometheus, we don't need to use 40GB disk anymore, right?
>
> They are intentionally a bit larger since people use them to temporarily store some data or move things around. In the past they were baremetal hosts with much bigger disks.
>
> We could probably shave off 10 or 20G, but I think it's optimising for the wrong case.
>
>> Can we make them considerably smaller? Because today we actually ran out of disk in the esams Ganeti cluster and, looking at what uses it, besides prometheus* being REALLY large, the bastions seemed like the next candidate for saving some space.
>
> prometheus3001 is really the disk space hog here, with ~280G. I'd expect it to grow further rather than shrink (say, new metrics etc.), so those 20G wouldn't matter in practice.
>
> Do we have any insight/projection into what kind of growth we've seen for metrics storage? The obvious option is to reduce retention time (which might be non-ideal if retention differs from the main DCs).

I think prometheus3001 is an outlier in its disk space compared e.g. to eqsin or ulsfo, likely due to old data from the migration still being there (but I'll defer to @herron for confirmation).

At steady state (and unless new use cases show up soon) I don't think we'd need more than 150-180G in PoPs right now (and we can ask Prometheus to be bound by available disk space rather than time, if disk space on Ganeti becomes really tight).

Moving to bare metal could be a solution too if other approaches fail, although I like the flexibility of having all non-cp hosts in PoPs on VMs.

Prometheus disk usage (106G) on prometheus3001 is larger than what can comfortably fit alongside the OS on /dev/vda (128G), so a 150G /dev/vdb was added as /srv.

If we shrank /dev/vda to 50G, or re-provisioned with a ~200G vda, we could reclaim something in the ballpark of 80G.

I'd be +1 for moving to bare metal, or alternatively adding that hardware as capacity to the pop Ganeti clusters to reduce storage pressure too.
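
For context, attaching a secondary disk like the /dev/vdb mentioned above to a Ganeti instance looks roughly as follows (a sketch only; double-check the --disk syntax against gnt-instance(8), and the FQDN and filesystem choice are assumptions):

```
# On the Ganeti master of the esams cluster: add a 150G disk to the instance.
sudo gnt-instance modify --disk add:size=150g prometheus3001.esams.wmnet

# The new disk only shows up in the guest after a full stop/start cycle.
sudo gnt-instance shutdown prometheus3001.esams.wmnet
sudo gnt-instance startup prometheus3001.esams.wmnet

# Inside the guest: format the new device and mount it as /srv
# (plus a matching /etc/fstab entry).
sudo mkfs.ext4 /dev/vdb
sudo mount /dev/vdb /srv
```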

ACK, thanks all. Several people have mentioned that we have bare metal there that this could move to. It seems to make sense to me to move Prometheus onto that and out of Ganeti (rather than adding another Ganeti server).

I thought about it a little more and I'm convinced we should strive to keep Prometheus on VMs, primarily for ease of management: with a hardware fault on bare metal we'd have no redundancy (unless we provision two dedicated baremetal hosts, but that seems wasteful). We should also strive to keep all PoPs as equal as possible; to this end I propose the following:

  1. Set Prometheus retention based on disk space on PoPs, say 100G; this will leave ~18G for other duties
  2. Free space from old data on prometheus3001 and move /srv/prometheus back to the / filesystem
  3. Return the additional disk of prometheus3001 back to Ganeti

We'll be trading off some time-based retention, but we have Thanos storing the rest of the data if it's really needed. I think this is the minimum amount of work to wrap up this task and put a nail in the coffin of the space problem for now (see the retention sketch below). We can follow up with more capacity planning for Ganeti in PoPs as needed.
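
For reference, --storage.tsdb.retention.size is the upstream Prometheus flag behind size-based retention; a minimal sketch of passing it on the command line (the config and data paths are assumptions about the local layout, and the actual puppet wiring is what the prometheus::pop patch below adds):

```
# Prometheus deletes the oldest TSDB blocks once the data directory exceeds
# the limit, regardless of their age.
/usr/bin/prometheus \
  --config.file=/srv/prometheus/ops/prometheus.yml \
  --storage.tsdb.path=/srv/prometheus/ops/metrics \
  --storage.tsdb.retention.size=100GB
```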

> I thought about it a little more and I'm convinced we should strive to keep Prometheus on VMs, primarily for ease of management: with a hardware fault on bare metal we'd have no redundancy (unless we provision two dedicated baremetal hosts, but that seems wasteful). We should also strive to keep all PoPs as equal as possible; to this end I propose the following:
>
>   1. Set Prometheus retention based on disk space on PoPs, say 100G; this will leave ~18G for other duties
>   2. Free space from old data on prometheus3001 and move /srv/prometheus back to the / filesystem

Sounds good to me!

Change 698216 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] prometheus::pop add retention size param and set to 100G

https://gerrit.wikimedia.org/r/698216

+1! I plan to deploy the patch above (now amended to 80G retention), move the data, and release the vdb device from prometheus3001 next week.
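
A sketch of what the data move and disk release could look like, assuming the prometheus@ops unit name, that /srv is the mount point of the secondary disk, and that it is disk index 1 on the Ganeti side (all assumptions):

```
# On prometheus3001: stop Prometheus and copy /srv back onto the root (vda) filesystem.
sudo systemctl stop prometheus@ops
sudo rsync -a /srv/ /srv.vda/                # hypothetical staging directory on /
sudo umount /srv
sudo rmdir /srv && sudo mv /srv.vda /srv     # swap the staging copy into place
# Drop the old /dev/vdb entry from /etc/fstab, then:
sudo systemctl start prometheus@ops

# On the Ganeti master: remove the now-unused secondary disk.
sudo gnt-instance shutdown prometheus3001.esams.wmnet
sudo gnt-instance modify --disk 1:remove prometheus3001.esams.wmnet
sudo gnt-instance startup prometheus3001.esams.wmnet
```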

Change 698216 merged by Herron:

[operations/puppet@production] prometheus::pop add retention size param and set to 80G

https://gerrit.wikimedia.org/r/698216

I ticked off the "Reimage bastion" step from the task description since that happened a while ago with the Buster update.

Mentioned in SAL (#wikimedia-operations) [2021-06-07T18:57:34Z] <herron> prometheus3001: moved /srv back to vda1 filesystem T243057

The 150G secondary disk has been removed from the prometheus3001 VM.

Strangely, after gnt-instance shutdown/start of prometheus3001, its network interface was renamed. Interface was ens14 before shutdown, and after rebooting it is ens13. After s/ens14/ens13/g in /etc/network/interfaces it is back online (and survived a reboot from within the OS), but I wonder if there's a misconfig elsewhere that might cause this to happen again on the next gnt-instance shutdown/start?
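
Spelled out, the workaround was roughly the following (whether the interface was brought up with ifup or via a reboot is an assumption, as is access via the Ganeti console while the host was unreachable over the network):

```
# Point the static network config at the new interface name and bring it up.
sudo sed -i 's/ens14/ens13/g' /etc/network/interfaces
sudo ifup ens13

# Verify the interface is up with the expected address.
ip addr show ens13
```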

MoritzMuehlenhoff claimed this task.

Boldly closing this task since the bastion reimages happened with the Buster update; please reopen if anything was missed.

> The 150G secondary disk has been removed from the prometheus3001 VM.
>
> Strangely, after gnt-instance shutdown/start of prometheus3001, its network interface was renamed. Interface was ens14 before shutdown, and after rebooting it is ens13. After s/ens14/ens13/g in /etc/network/interfaces it is back online (and survived a reboot from within the OS), but I wonder if there's a misconfig elsewhere that might cause this to happen again on the next gnt-instance shutdown/start?

I haven't seen this elsewhere; I suspect it was triggered when the temporary disk was added? If this happens on another host, let's open a task, but to me it seems like a one-off error.

> secondary disk has been removed
> Strangely .. network interface was renamed. Interface was ens14 before shutdown, and after rebooting it is ens13.

@herron @MoritzMuehlenhoff I can totally confirm this behaviour in Ganeti, in the opposite direction. Every time I _added_ a new virtual disk to an existing instance and did the necessary reboot to make it detect the disk, my NIC was renamed, but counting up instead of down (ens5 -> ens6 for example), so my VM was offline. Then I had to log in on the console and manually edit /etc/network/interfaces just like you did; things were OK again and the new disk could be mounted.

A little while ago I added a warning section to https://wikitech.wikimedia.org/wiki/Ganeti#Adding_a_disk because of that.

also see: T272555

> The 150G secondary disk has been removed from the prometheus3001 VM.

Thank you @herron :) When doing a gnt-node list on ganeti3001 I can see more free resources again.

That's interesting about the same behavior happening in the opposite direction with a disk add. I guess that makes some sense in a bug-ish kind of way -- network device being renumbered as a side-effect of changing the VM device layout. I was worried it would happen on the next reboot, but sounds like this should be stable unless we were to change the disk layout again. Feeling much better about leaving it as-is now.

I agree; I'm pretty sure this only happens when we add/remove disks, it has never happened to me randomly on just a reboot.

I had also tested this previously by rebooting the VM another time and checking that the device name stayed the same after applying the fix.

Mentioned in SAL (#wikimedia-operations) [2021-06-29T07:49:59Z] <godog> remove 20G migration data /root/prometheus from prometheus4001 - T243057