
Prometheus hardware refresh (+ Bullseye upgrade)
Closed, Resolved | Public

Description

We got new hardware for Prometheus in T294967 and T294302 for the scheduled refresh. The new hosts will be running Bullseye while we're at it.

At a high level we want to "forklift" the existing hosts, i.e. copy the metrics from the old hosts onto the new ones. During the process we'll also pause uploads to Thanos (long-term storage) to avoid duplicates, since we'll keep the same replica label.
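
To illustrate why the uploads need to be paused, here is a minimal sketch. It assumes each uploaded block can be described by its external labels and a time range; the Block type, the host names and the "site" label are hypothetical and only loosely modelled on how Thanos stores blocks. If an old and a new host share the same replica label and both upload, their blocks cover overlapping time windows and cannot be told apart by labels.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Block:
    """Hypothetical, simplified stand-in for an uploaded TSDB block."""
    source_host: str   # host that uploaded the block (illustrative only)
    labels: tuple      # sorted (name, value) external labels, including replica
    min_time: int      # inclusive start, in seconds
    max_time: int      # exclusive end, in seconds


def overlapping_duplicates(blocks):
    """Return pairs of blocks with identical labels and overlapping time ranges."""
    pairs = []
    for a, b in combinations(blocks, 2):
        same_labels = a.labels == b.labels
        overlaps = a.min_time < b.max_time and b.min_time < a.max_time
        if same_labels and overlaps:
            pairs.append((a, b))
    return pairs


# Made-up example data: both hosts keep the same replica label.
labels = (("replica", "a"), ("site", "codfw"))
old = Block("old-host", labels, 0, 7200)
# New host uploading while the old one is still uploading: overlap.
new_uploading = Block("new-host", labels, 3600, 10800)
# New host uploading only after the old one has stopped: no overlap.
new_after_cutover = Block("new-host", labels, 7200, 14400)

print(len(overlapping_duplicates([old, new_uploading])))
print(len(overlapping_duplicates([old, new_after_cutover])))
```

In this toy model the first case reports one conflicting pair and the second reports none, which is the reason for keeping uploads disabled on the new hosts until the cutover.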

Outline of steps:

  • Hardware is provisioned
  • Add the new hostnames where relevant in puppet (exact places TBD, e.g. ferm)
  • Assign the prometheus role to start polling metrics. Make sure uploads to Thanos are disabled. Make sure alertmanagers is set to empty for those hosts.
  • Make sure the hosts are in the router ACLs
  • Validate that Prometheus is working as expected (e.g. can read/write metrics successfully)
  • Sync metrics from old host into the new (exact procedure TBD)
  • Re-enable Thanos uploads and pool the host for reads
  • Decom old hosts (note: remember to file a task to remove zarcillo grants for old hosts, and remove the hosts from router ACLs)
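
The validation step above could be sketched roughly as follows. This is not the procedure used in this task, just a minimal example assuming the standard Prometheus HTTP API (/api/v1/query) is reachable at some base URL; the base URL, the function names and the sample data are all hypothetical. It runs the instant query "up" and lists targets that report 0.

```python
import json
import urllib.parse
import urllib.request


def fetch_up(base_url, timeout=10):
    """Run the instant query `up` against a Prometheus HTTP API at base_url."""
    query = urllib.parse.urlencode({"query": "up"})
    url = base_url.rstrip("/") + "/api/v1/query?" + query
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def down_targets(payload):
    """Return sorted (job, instance) pairs whose `up` sample is 0."""
    if payload.get("status") != "success":
        raise ValueError("query failed: %r" % payload.get("error"))
    down = []
    for series in payload["data"]["result"]:
        _timestamp, value = series["value"]  # instant vector sample: [ts, "value"]
        if float(value) == 0.0:
            metric = series["metric"]
            down.append((metric.get("job", ""), metric.get("instance", "")))
    return sorted(down)


# Canned response in the instant-query response shape; all values are made up.
sample = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"__name__": "up", "job": "node", "instance": "host1:9100"},
             "value": [1643000000.0, "1"]},
            {"metric": {"__name__": "up", "job": "node", "instance": "host2:9100"},
             "value": [1643000000.0, "0"]},
        ],
    },
}

print(down_targets(sample))
```

Against a live host one would call down_targets(fetch_up(base_url)) with the appropriate base URL for that Prometheus instance; an empty list suggests all scraped targets are up.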

Details

Project                  Branch      Lines +/-  Subject
operations/homer/public  master      +0 -16
operations/puppet        production  +0 -32
operations/puppet        production  +2 -2
operations/puppet        production  +2 -2
operations/puppet        production  +0 -11
operations/puppet        production  +50 -126
operations/homer/public  master      +16 -0
operations/dns           master      +1 -1
operations/puppet        production  +1 -1
operations/puppet        production  +2 -2
operations/puppet        production  +1 -1
operations/puppet        production  +2 -2
operations/puppet        production  +2 -0
operations/puppet        production  +26 -16
operations/puppet        production  +4 -0
operations/puppet        production  +1 -1
operations/puppet        production  +0 -15
operations/puppet        production  +59 -0
operations/puppet        production  +7 -12
operations/puppet        production  +8 -0
operations/puppet        production  +38 -4
operations/puppet        production  +1 -0

Event Timeline

fgiunchedi renamed this task from Prometheus hardware refresh to Prometheus hardware refresh (+ Bullseye upgrade).Nov 23 2021, 1:05 PM

Change 755708 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] Add prometheus[12]00[56] to prometheus_nodes

https://gerrit.wikimedia.org/r/755708

Change 755711 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] hieradata: add host-specific Prometheus data

https://gerrit.wikimedia.org/r/755711

Change 755712 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] thanos: move to a single flag to control uploads

https://gerrit.wikimedia.org/r/755712

Change 755918 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] prometheus: bump open files limit for blackbox exporter

https://gerrit.wikimedia.org/r/755918

Change 755918 merged by Filippo Giunchedi:

[operations/puppet@production] prometheus: bump open files limit for blackbox exporter

https://gerrit.wikimedia.org/r/755918

Change 755922 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] prometheus: override valid status codes for http probes

https://gerrit.wikimedia.org/r/755922

Change 755708 merged by Filippo Giunchedi:

[operations/puppet@production] Add prometheus[12]00[56] to prometheus_nodes

https://gerrit.wikimedia.org/r/755708

Change 755711 merged by Filippo Giunchedi:

[operations/puppet@production] hieradata: add host-specific Prometheus data

https://gerrit.wikimedia.org/r/755711

Change 755712 merged by Filippo Giunchedi:

[operations/puppet@production] thanos: move to a single flag to control uploads

https://gerrit.wikimedia.org/r/755712

Change 756602 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] prometheus: filesystem provisioning

https://gerrit.wikimedia.org/r/756602

Change 756603 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] site: add Prometheus role to codfw hardware

https://gerrit.wikimedia.org/r/756603

Change 756604 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] site: add Prometheus role to eqiad hardware

https://gerrit.wikimedia.org/r/756604

Change 756602 merged by Filippo Giunchedi:

[operations/puppet@production] prometheus: filesystem provisioning

https://gerrit.wikimedia.org/r/756602

Change 756607 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] prometheus: disable rsync where not needed

https://gerrit.wikimedia.org/r/756607

Change 756607 merged by Filippo Giunchedi:

[operations/puppet@production] prometheus: disable rsync where not needed

https://gerrit.wikimedia.org/r/756607

Change 756603 merged by Filippo Giunchedi:

[operations/puppet@production] site: add Prometheus role to codfw hardware

https://gerrit.wikimedia.org/r/756603

Mentioned in SAL (#wikimedia-operations) [2022-01-25T11:07:57Z] <godog> temp disable alerting on prometheus200[56] - T296199

Change 756965 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] hieradata: temp disable alerting for new prometheus hw

https://gerrit.wikimedia.org/r/756965

Change 756965 merged by Filippo Giunchedi:

[operations/puppet@production] hieradata: temp disable alerting for new prometheus hw

https://gerrit.wikimedia.org/r/756965

Change 756979 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] prometheus: refactor rsync in a standalone profile

https://gerrit.wikimedia.org/r/756979

Change 756979 merged by Filippo Giunchedi:

[operations/puppet@production] prometheus: refactor rsync in a standalone profile

https://gerrit.wikimedia.org/r/756979

Mentioned in SAL (#wikimedia-operations) [2022-01-26T09:28:00Z] <godog> begin rsync prometheus2004 -> 2005 - T296199

Change 757612 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] conftool: add prometheus200[56]

https://gerrit.wikimedia.org/r/757612

Change 757612 merged by Filippo Giunchedi:

[operations/puppet@production] conftool: add prometheus200[56]

https://gerrit.wikimedia.org/r/757612

Change 757623 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] hieradata: swap prometheus2003 with prometheus2005

https://gerrit.wikimedia.org/r/757623

Change 757623 merged by Filippo Giunchedi:

[operations/puppet@production] hieradata: swap prometheus2003 with prometheus2005

https://gerrit.wikimedia.org/r/757623

Mentioned in SAL (#wikimedia-operations) [2022-01-28T09:17:22Z] <godog> pool prometheus2005 and depool prometheus2003 - T296199

Change 756604 merged by Filippo Giunchedi:

[operations/puppet@production] site: add Prometheus role to eqiad hardware

https://gerrit.wikimedia.org/r/756604

Change 758776 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] hieradata: swap prometheus1003 with prometheus1005

https://gerrit.wikimedia.org/r/758776

Change 758776 merged by Filippo Giunchedi:

[operations/puppet@production] hieradata: swap prometheus1003 with prometheus1005

https://gerrit.wikimedia.org/r/758776

Change 759194 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] hieradata: move pushgateway to prometheus1005

https://gerrit.wikimedia.org/r/759194

Change 759195 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/dns@master] wmnet: move pushgateway to prometheus1005

https://gerrit.wikimedia.org/r/759195

Change 759194 merged by Filippo Giunchedi:

[operations/puppet@production] hieradata: move pushgateway to prometheus1005

https://gerrit.wikimedia.org/r/759194

Change 759195 merged by Filippo Giunchedi:

[operations/dns@master] wmnet: move pushgateway to prometheus1005

https://gerrit.wikimedia.org/r/759195

Change 761294 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] hieradata: move prometheus_nodes to WMCS role-based hierarchy

https://gerrit.wikimedia.org/r/761294

Change 761435 had a related patch set uploaded (by Herron; author: Herron):

[operations/homer/public@master] add new prometheus hosts to labs-in[4,6]

https://gerrit.wikimedia.org/r/761435

Change 761435 merged by Arturo Borrero Gonzalez:

[operations/homer/public@master] add new prometheus hosts to labs-in[4,6]

https://gerrit.wikimedia.org/r/761435

Change 761591 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] hieradata: decom prometheus[12]003

https://gerrit.wikimedia.org/r/761591

Change 761294 merged by Filippo Giunchedi:

[operations/puppet@production] hieradata: move prometheus_nodes to WMCS role-based hierarchy

https://gerrit.wikimedia.org/r/761294

Change 761591 merged by Filippo Giunchedi:

[operations/puppet@production] hieradata: decom prometheus[12]003

https://gerrit.wikimedia.org/r/761591

cookbooks.sre.hosts.decommission executed by filippo@cumin1001 for hosts: prometheus2003.codfw.wmnet

  • prometheus2003.codfw.wmnet (FAIL)
    • Downtimed host on Icinga
    • Found physical host
    • Downtimed management interface on Icinga
    • Unable to connect to the host, wipe of swraid, partition-table and filesystem signatures will not be performed: Cumin execution failed (exit_code=2)
    • Powered off
    • Set Netbox status to Decommissioning and deleted all non-mgmt interfaces and related IPs
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

ERROR: some step on some host failed, check the bolded items above

cookbooks.sre.hosts.decommission executed by filippo@cumin1001 for hosts: prometheus1003.eqiad.wmnet

  • prometheus1003.eqiad.wmnet (PASS)
    • Downtimed host on Icinga
    • Found physical host
    • Downtimed management interface on Icinga
    • Wiped all swraid, partition-table and filesystem signatures
    • Powered off
    • Set Netbox status to Decommissioning and deleted all non-mgmt interfaces and related IPs
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

Change 762453 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] hieradata: swap prometheus200[46]

https://gerrit.wikimedia.org/r/762453

Change 762453 merged by Filippo Giunchedi:

[operations/puppet@production] hieradata: swap prometheus200[46]

https://gerrit.wikimedia.org/r/762453

Change 762766 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] hieradata: swap prometheus100[46]

https://gerrit.wikimedia.org/r/762766

Change 762766 merged by Filippo Giunchedi:

[operations/puppet@production] hieradata: swap prometheus100[46]

https://gerrit.wikimedia.org/r/762766

Change 762825 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] Decom prometheus[12]00[34]

https://gerrit.wikimedia.org/r/762825

Change 762825 merged by Filippo Giunchedi:

[operations/puppet@production] Decom prometheus[12]00[34]

https://gerrit.wikimedia.org/r/762825

Change 762827 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/homer/public@master] cr: remove prometheus[12]00[34] from ACLs

https://gerrit.wikimedia.org/r/762827

cookbooks.sre.hosts.decommission executed by filippo@cumin1001 for hosts: prometheus1004.eqiad.wmnet

  • prometheus1004.eqiad.wmnet (PASS)
    • Downtimed host on Icinga
    • Found physical host
    • Downtimed management interface on Icinga
    • Wiped all swraid, partition-table and filesystem signatures
    • Powered off
    • Set Netbox status to Decommissioning and deleted all non-mgmt interfaces and related IPs
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

cookbooks.sre.hosts.decommission executed by filippo@cumin1001 for hosts: prometheus2004.codfw.wmnet

  • prometheus2004.codfw.wmnet (PASS)
    • Downtimed host on Icinga
    • Found physical host
    • Downtimed management interface on Icinga
    • Wiped all swraid, partition-table and filesystem signatures
    • Powered off
    • Set Netbox status to Decommissioning and deleted all non-mgmt interfaces and related IPs
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

cookbooks.sre.hosts.decommission executed by filippo@cumin1001 for hosts: prometheus1004.eqiad.wmnet

  • prometheus1004.eqiad.wmnet (FAIL)
    • Host not found on Icinga, unable to downtime it
    • Found physical host
    • Management interface not found on Icinga, unable to downtime it
    • Unable to connect to the host, wipe of swraid, partition-table and filesystem signatures will not be performed: Cumin execution failed (exit_code=2)
    • No DNS record found for the mgmt interface prometheus1004.mgmt.eqiad.wmnet, trying the asset tag one: wmf7001.mgmt.eqiad.wmnet
    • Powered off
    • Set Netbox status to Decommissioning and deleted all non-mgmt interfaces and related IPs
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

ERROR: some step on some host failed, check the bolded items above

Change 762827 merged by Filippo Giunchedi:

[operations/homer/public@master] cr: remove prometheus[12]00[34] from ACLs

https://gerrit.wikimedia.org/r/762827

fgiunchedi claimed this task.

This is complete! Old Prometheus hw is decom'd and the new hw in codfw/eqiad is running Bullseye.