
Current status of cloudmetrics and its components
Open, Needs Triage · Public

Description

This task captures the current (May 2023) status of cloudmetrics hosts and their components, and opens the discussion on what their future might be. Please feel free to edit the description if I (@fgiunchedi) missed something.

There are two cloudmetrics hosts, each running the following:

  1. graphite/statsd
  2. grafana
  3. prometheus

Given the significant progress made with the metricsinfra project over the past several months (thanks in no small part to the excellent work by @taavi), I'd like to pose the following questions:

  1. Is graphite/statsd on cloudmetrics still actively consumed/useful? (i.e. does anyone look at the metrics?)
  2. Ditto for grafana on cloudmetrics? (i.e. https://grafana-cloud.wikimedia.org). I believe the answer is likely "on its way out" given T333568: Move WMCS dashboards to grafana.wmcloud.org

Ideally both answers are 'no', but regardless I'd like to move the labs instance of Prometheus off cloudmetrics and onto the Prometheus production hardware. My understanding is that this move will help both WMCS (one less component to think about) and Observability (less variance/snowflakes).


Event Timeline

taavi added subscribers: Andrew, bd808.

Hi!

> Is graphite/statsd on cloudmetrics still actively consumed/useful? (i.e. does anyone look at the metrics?)

Not really[1]. There are some grafana-cloud dashboards using it (and a majority of those are probably unused these days), but the equivalent metrics are present on the metricsinfra Prometheus instance.

[1]: Excluding deployment-prep, which is a whole different beast. It's been broken for months already due to hostname changes so I don't think it's a blocker. T241285.

> Ditto for grafana on cloudmetrics? (i.e. https://grafana-cloud.wikimedia.org). I believe the answer is likely "on its way out" given T333568: Move WMCS dashboards to grafana.wmcloud.org

Indeed the plan is to replace it with https://grafana.wmcloud.org and https://grafana.wikimedia.org as discussed in T307465: grafana-cloud: Browser access to Prometheus is deprecated.

For both of these, I think a reasonable plan of action would be the following:

  1. Finish migrating WMCS-managed dashboards off grafana-cloud: T333568
  2. Pick some dates when this will happen, and send a cloud-announce@ notification about these changes.
  3. Disable Diamond on the remaining Cloud VPS nodes. This is a one-line hiera change (see the sketch after this list).
  4. Shut down the graphite and statsd services on cloudmetrics nodes.
  5. Export a static dump of the Grafana dashboard configurations, and put it up somewhere (download.wmcloud.org?)
  6. Shut down the Grafana server on cloudmetrics nodes. Redirect it to a Wikitech page explaining where to find the dashboards now.
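
For illustration only, step 3's one-line hiera change might look something like the sketch below. The key name here is a guess, not the real toggle; the actual key in operations/puppet may be named differently.

```yaml
# Hypothetical Cloud VPS project-wide hiera override; the real key name in
# operations/puppet may differ. Setting it once in the global hiera would
# stop the Diamond collector on every remaining instance in one commit.
profile::diamond::ensure: absent
```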

I've been meaning to do this myself but haven't gotten to it yet. Wdyt?

See also:

> Ideally both answers are 'no', but regardless I'd like to move the labs instance of Prometheus off cloudmetrics and onto the Prometheus production hardware. My understanding is that this move will help both WMCS (one less component to think about) and Observability (less variance/snowflakes).

This sounds reasonable to me. I'd like to use this opportunity to get an equivalent Prometheus instance for the codfw1dev testing deployment which does not have cloudmetrics hardware of its own.

/cc @Andrew @bd808 in case you have comments about the plan above.

> Hi!
>
>> Is graphite/statsd on cloudmetrics still actively consumed/useful? (i.e. does anyone look at the metrics?)
>
> Not really[1]. There are some grafana-cloud dashboards using it (and a majority of those are probably unused these days), but the equivalent metrics are present on the metricsinfra Prometheus instance.
>
> [1]: Excluding deployment-prep, which is a whole different beast. It's been broken for months already due to hostname changes so I don't think it's a blocker. T241285.

Thank you for the context; I'm happy to know metricsinfra has equivalent metrics! And agreed re: deployment-prep, better to follow up on T241285 if needed.

>> Ditto for grafana on cloudmetrics? (i.e. https://grafana-cloud.wikimedia.org). I believe the answer is likely "on its way out" given T333568: Move WMCS dashboards to grafana.wmcloud.org
>
> Indeed the plan is to replace it with https://grafana.wmcloud.org and https://grafana.wikimedia.org as discussed in T307465: grafana-cloud: Browser access to Prometheus is deprecated.
>
> For both of these, I think a reasonable plan of action would be the following:
>
>   1. Finish migrating WMCS-managed dashboards off grafana-cloud: T333568
>   2. Pick some dates when this will happen, and send a cloud-announce@ notification about these changes.
>   3. Disable Diamond on the remaining Cloud VPS nodes. This is a one-line hiera change.
>   4. Shut down the graphite and statsd services on cloudmetrics nodes.
>   5. Export a static dump of the Grafana dashboard configurations, and put it up somewhere (download.wmcloud.org?)
>   6. Shut down the Grafana server on cloudmetrics nodes. Redirect it to a Wikitech page explaining where to find the dashboards now.
>
> I've been meaning to do this myself but haven't gotten to it yet. Wdyt?

The plan looks good to me! I'm happy to assist with reviews, merges, etc.

> See also:
>
>> Ideally both answers are 'no', but regardless I'd like to move the labs instance of Prometheus off cloudmetrics and onto the Prometheus production hardware. My understanding is that this move will help both WMCS (one less component to think about) and Observability (less variance/snowflakes).
>
> This sounds reasonable to me. I'd like to use this opportunity to get an equivalent Prometheus instance for the codfw1dev testing deployment which does not have cloudmetrics hardware of its own.

Yes, that'd be doable: essentially we would deploy a wmcs Prometheus instance (might as well rename it while we're at it) in both eqiad and codfw, each pulling from its respective OpenStack deployment.
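
As a rough illustration of that per-site split, each site's instance would carry scrape jobs only for its local OpenStack deployment. A minimal sketch, assuming static targets (the job name and hostname below are made up; the real configuration is generated by puppet and will differ):

```yaml
# Hypothetical scrape configuration for the eqiad instance. The codfw
# instance would carry an equivalent job pointing at codfw1dev targets,
# so each Prometheus only pulls from its local OpenStack deployment.
scrape_configs:
  - job_name: 'openstack_eqiad1'
    static_configs:
      - targets:
          # Illustrative placeholder target, not a real hostname.
          - 'cloudcontrol-eqiad1.example:9100'
```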

> Yes, that'd be doable: essentially we would deploy a wmcs Prometheus instance (might as well rename it while we're at it) in both eqiad and codfw, each pulling from its respective OpenStack deployment.

I agree with this plan and with everything commented above.

Thank you both!

> 2. Pick some dates when this will happen, and send a cloud-announce@ notification about these changes.

In the interest of moving this forward I've come up with the following dates rather arbitrarily:

  • Disable Diamond everywhere on (or slightly after) Monday, July 17th (in about two weeks' time)
  • Disable access to Graphite and the legacy Grafana instance on Tuesday, August 1st (in about a month)

The logic here is that we already have data equivalent to what Diamond produces, so migrating those should be fairly simple, while there might be some more complicated Graphite use cases that we want to give time to migrate (especially as people might be on vacation during these months).

I will send announcements on Monday (July 3rd) unless there are major objections.

FWIW the timeline looks good to me, thank you @taavi

Change 968277 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/puppet@production] site: Re-image cloudmetrics hosts as insetup

https://gerrit.wikimedia.org/r/968277

Change 968282 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/puppet@production] O:wmcs::monitoring: drop role

https://gerrit.wikimedia.org/r/968282

Change 968277 merged by Majavah:

[operations/puppet@production] site: Re-image cloudmetrics hosts as insetup

https://gerrit.wikimedia.org/r/968277

Cookbook cookbooks.sre.hosts.reimage was started by taavi@cumin1001 for host cloudmetrics1003.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by taavi@cumin1001 for host cloudmetrics1003.eqiad.wmnet with OS bookworm completed:

  • cloudmetrics1003 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202310270739_taavi_1264680_cloudmetrics1003.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by taavi@cumin1001 for host cloudmetrics1004.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by taavi@cumin1001 for host cloudmetrics1004.eqiad.wmnet with OS bookworm completed:

  • cloudmetrics1004 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202310270810_taavi_1282198_cloudmetrics1004.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Change 968282 merged by Majavah:

[operations/puppet@production] O:wmcs::monitoring: drop role

https://gerrit.wikimedia.org/r/968282

Change 969309 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/puppet@production] Remove cloudmetrics Cumin alias

https://gerrit.wikimedia.org/r/969309

Change 969309 merged by Majavah:

[operations/puppet@production] Remove cloudmetrics Cumin alias

https://gerrit.wikimedia.org/r/969309

cookbooks.sre.hosts.decommission executed by taavi@cumin1001 for hosts: cloudmetrics[1003-1004].eqiad.wmnet

  • cloudmetrics1003.eqiad.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Downtimed management interface on Alertmanager
    • Wiped all swraid, partition-table and filesystem signatures
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
  • cloudmetrics1004.eqiad.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Downtimed management interface on Alertmanager
    • Wiped all swraid, partition-table and filesystem signatures
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB