
Set up new S3-level replicated storage cluster "apus"
Closed, ResolvedPublic

Description

This task is for setting up a new storage cluster. The expectation is that this will take on some of the storage use cases currently served by the thanos and ms swift clusters. It will demonstrate Ceph's multi-site capability, providing a single S3 endpoint that is then replicated between two storage clusters, one per DC. It will follow the inexpensive model of the existing swift clusters, with bulk storage on HDDs.

Some existing uses of swift are tracked in more detail at T264291: Swift users and their usage.

  • Evaluate whether we need encrypted backend traffic across datacenters for the cluster (likely ipsec)
  • Decide on initial storage policies (replication factor, ssd/hdd, site-local vs global, which should be default, etc)
  • Bring frontends online: T275513 T275511
  • Bring backends online: T276642 T276637
  • Bring up service IPs / LVS and certs
  • Bring up dashboards/monitoring/alerting

Once the service/cluster is up we can start migrating users / use cases (in a different task, TBD)
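Once up, the service should present clients with a single endpoint regardless of DC. A minimal s3cmd configuration sketch, assuming the apus.discovery.wmnet service name from this task's DNS/probe changes and placeholder credentials (both the endpoint layout and the credentials here are illustrative assumptions, not confirmed settings):

```ini
# Hypothetical sketch: endpoint layout and credentials are assumptions,
# not confirmed configuration for this cluster.
[default]
host_base = apus.discovery.wmnet
host_bucket = %(bucket)s.apus.discovery.wmnet
use_https = True
access_key = REPLACE_ME
secret_key = REPLACE_ME
```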

Details

Related Changes in Gerrit:
Repo                                        Branch      Lines +/-
operations/puppet                           production  +1 -1
operations/puppet                           production  +1 -1
operations/puppet                           production  +1 -1
operations/puppet                           production  +20 -1
operations/puppet                           production  +8 -0
operations/puppet                           production  +5 -0
operations/puppet                           production  +2 -0
operations/docker-images/production-images  master      +8 -0
operations/puppet                           production  +22 -3
operations/puppet                           production  +9 -0
operations/puppet                           production  +3 -0
operations/puppet                           production  +2 -4
operations/dns                              master      +2 -0
operations/puppet                           production  +1 -1
operations/puppet                           production  +1 -1
operations/puppet                           production  +1 -0
operations/docker-images/production-images  master      +7 -0
operations/puppet                           production  +4 -2
operations/puppet                           production  +1 -0
operations/puppet                           production  +1 -1
operations/puppet                           production  +1 -1
operations/puppet                           production  +1 -1
operations/puppet                           production  +42 -0
operations/puppet                           production  +1 -0
operations/puppet                           production  +6 -0
operations/puppet                           production  +36 -0
operations/puppet                           production  +13 -0
operations/puppet                           production  +6 -21
operations/puppet                           production  +4 -0
operations/puppet                           production  +1 -1
operations/puppet                           production  +4 -0
operations/puppet                           production  +6 -1
operations/puppet                           production  +31 -7
operations/puppet                           production  +1 -1
operations/puppet                           production  +1 -1
operations/puppet                           production  +45 -6
operations/puppet                           production  +1 -1
operations/puppet                           production  +168 -7
operations/puppet                           production  +12 -0
operations/puppet                           production  +9 -3
operations/puppet                           production  +5 -0
operations/puppet                           production  +4 -0
operations/puppet                           production  +59 -1
operations/puppet                           production  +1 -1
operations/dns                              master      +4 -3
operations/puppet                           production  +4 -0
operations/puppet                           production  +289 -3
operations/docker-images/production-images  master      +32 -0
operations/puppet                           production  +12 -0
operations/puppet                           production  +2 -4
operations/puppet                           production  +4 -2

Event Timeline

There are a very large number of changes, so older changes are hidden.

Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host moss-be1001.eqiad.wmnet with OS bookworm completed:

  • moss-be1001 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202406181422_mvernon_1154756_moss-be1001.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host moss-be1003.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host moss-fe1002.eqiad.wmnet with OS bookworm completed:

  • moss-fe1002 (WARN)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run failed and logged in /var/log/spicerack/sre/hosts/reimage/202406181406_mvernon_1151282_moss-fe1002.out, asking the operator what to do
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202406181446_mvernon_1151282_moss-fe1002.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host moss-be1003.eqiad.wmnet with OS bookworm completed:

  • moss-be1003 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202406181511_mvernon_1162718_moss-be1003.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (EVPN Switch)

Change #1047117 had a related patch set uploaded (by MVernon; author: MVernon):

[operations/puppet@production] cephadm: limit mgr daemons to _admin-labelled hosts

https://gerrit.wikimedia.org/r/1047117

Change #1047117 merged by MVernon:

[operations/puppet@production] cephadm: limit mgr daemons to _admin-labelled hosts

https://gerrit.wikimedia.org/r/1047117

Change #1047033 merged by MVernon:

[operations/puppet@production] Move moss-fe{1,2}001 back to apus cluster

https://gerrit.wikimedia.org/r/1047033

Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host moss-fe1001.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host moss-fe2001.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host moss-fe1001.eqiad.wmnet with OS bookworm executed with errors:

  • moss-fe1001 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details. You can also try typing "install-console moss-fe1001.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host moss-fe1001.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host moss-fe2001.codfw.wmnet with OS bookworm executed with errors:

  • moss-fe2001 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details. You can also try typing "install-console moss-fe2001.codfw.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host moss-fe2001.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host moss-fe1001.eqiad.wmnet with OS bookworm completed:

  • moss-fe1001 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202406190840_mvernon_1353567_moss-fe1001.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host moss-fe2001.codfw.wmnet with OS bookworm completed:

  • moss-fe2001 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202406190851_mvernon_1356709_moss-fe2001.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Change #1047988 had a related patch set uploaded (by MVernon; author: MVernon):

[operations/puppet@production] conftool-data: add apus entries in codfw & eqiad

https://gerrit.wikimedia.org/r/1047988

Change #1048005 had a related patch set uploaded (by MVernon; author: MVernon):

[operations/puppet@production] Discovery setup for apus

https://gerrit.wikimedia.org/r/1048005

Change #1048493 had a related patch set uploaded (by MVernon; author: MVernon):

[operations/puppet@production] conftool-data: updates for apus

https://gerrit.wikimedia.org/r/1048493

Change #1048494 had a related patch set uploaded (by MVernon; author: MVernon):

[operations/puppet@production] conftool-data: services entry for apus

https://gerrit.wikimedia.org/r/1048494

Change #1048495 had a related patch set uploaded (by MVernon; author: MVernon):

[operations/puppet@production] apus: service catalogue entry and lvs::realserver setup

https://gerrit.wikimedia.org/r/1048495

Change #1047988 abandoned by MVernon:

[operations/puppet@production] conftool-data: add apus entries in codfw & eqiad; lvs::realserver to rgws

Reason:

Changes refactored into https://gerrit.wikimedia.org/r/c/operations/puppet/+/1048493 et seq

https://gerrit.wikimedia.org/r/1047988

Change #1048005 abandoned by MVernon:

[operations/puppet@production] Discovery setup for apus

Reason:

Refactored in https://gerrit.wikimedia.org/r/c/operations/puppet/+/1048493/ et seq.

https://gerrit.wikimedia.org/r/1048005

Change #1048493 merged by MVernon:

[operations/puppet@production] conftool-data: updates for apus

https://gerrit.wikimedia.org/r/1048493

Change #1048494 merged by MVernon:

[operations/puppet@production] conftool-data: services entry for apus

https://gerrit.wikimedia.org/r/1048494

Change #1048495 merged by MVernon:

[operations/puppet@production] apus: service catalogue entry and lvs::realserver setup

https://gerrit.wikimedia.org/r/1048495

Change #1049195 had a related patch set uploaded (by MVernon; author: MVernon):

[operations/puppet@production] hiera/apus: move apus service into lvs_setup state

https://gerrit.wikimedia.org/r/1049195

Mentioned in SAL (#wikimedia-operations) [2024-06-24T14:49:03Z] <Emperor> stop puppet on eqiad/codfw lvs prior to apus LVS rollout T279621

Change #1049195 merged by MVernon:

[operations/puppet@production] hiera/apus: move apus service into lvs_setup state

https://gerrit.wikimedia.org/r/1049195

Mentioned in SAL (#wikimedia-operations) [2024-06-24T14:52:13Z] <Emperor> enable/run puppet on codfw lvs for apus LVS rollout T279621

Mentioned in SAL (#wikimedia-operations) [2024-06-24T14:56:53Z] <mvernon@cumin1002> START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on A:lvs-secondary-codfw or A:lvs-low-traffic-codfw and A:lvs (T279621)

Mentioned in SAL (#wikimedia-operations) [2024-06-24T15:01:04Z] <mvernon@cumin1002> END (ERROR) - Cookbook sre.loadbalancer.restart-pybal (exit_code=97) rolling-restart of pybal on A:lvs-secondary-codfw or A:lvs-low-traffic-codfw and A:lvs (T279621)

Mentioned in SAL (#wikimedia-operations) [2024-06-24T15:08:24Z] <Emperor> enable/run puppet on eqiad lvs for apus LVS rollout T279621

Mentioned in SAL (#wikimedia-operations) [2024-06-24T15:11:24Z] <mvernon@cumin1002> START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on A:lvs-secondary-eqiad and A:lvs (T279621)

Mentioned in SAL (#wikimedia-operations) [2024-06-24T15:11:50Z] <mvernon@cumin1002> END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on A:lvs-secondary-eqiad and A:lvs (T279621)

Change #1049218 had a related patch set uploaded (by MVernon; author: MVernon):

[operations/puppet@production] hiera: add cluster_label to cephadm::rgw services

https://gerrit.wikimedia.org/r/1049218

Change #1049218 merged by MVernon:

[operations/puppet@production] hiera: add cluster_label to cephadm::rgw services

https://gerrit.wikimedia.org/r/1049218

Change #1049222 had a related patch set uploaded (by MVernon; author: MVernon):

[operations/puppet@production] hieradata: set apus service to apus not envoyproxy

https://gerrit.wikimedia.org/r/1049222

Change #1049222 merged by MVernon:

[operations/puppet@production] hieradata: set apus service to apus not envoyproxy

https://gerrit.wikimedia.org/r/1049222

Change #1049227 had a related patch set uploaded (by MVernon; author: MVernon):

[operations/puppet@production] hieradata: set apus ProxyFetch url to https

https://gerrit.wikimedia.org/r/1049227

Change #1049227 merged by MVernon:

[operations/puppet@production] hieradata: set apus ProxyFetch url to https

https://gerrit.wikimedia.org/r/1049227

Change #1049235 had a related patch set uploaded (by MVernon; author: MVernon):

[operations/puppet@production] hiera: set hostname in apus probe

https://gerrit.wikimedia.org/r/1049235

Change #1049235 merged by MVernon:

[operations/puppet@production] hiera: set hostname in apus probe

https://gerrit.wikimedia.org/r/1049235

Change #1049237 had a related patch set uploaded (by MVernon; author: MVernon):

[operations/puppet@production] hiera: also use apus.discovery.wmnet for ProxyFetch

https://gerrit.wikimedia.org/r/1049237

Change #1049237 merged by MVernon:

[operations/puppet@production] hiera: set a suitable hostname for the health checks and probes

https://gerrit.wikimedia.org/r/1049237

Change #1049480 had a related patch set uploaded (by MVernon; author: MVernon):

[operations/docker-images/production-images@master] ceph: install wmf-certificates package

https://gerrit.wikimedia.org/r/1049480

Change #1049480 merged by MVernon:

[operations/docker-images/production-images@master] ceph: install wmf-certificates package

https://gerrit.wikimedia.org/r/1049480

Change #1049560 had a related patch set uploaded (by MVernon; author: MVernon):

[operations/puppet@production] hiera: set apus service to lvs_setup

https://gerrit.wikimedia.org/r/1049560

Change #1049560 merged by MVernon:

[operations/puppet@production] hiera: set apus service to lvs_setup

https://gerrit.wikimedia.org/r/1049560

Change #1054344 had a related patch set uploaded (by MVernon; author: MVernon):

[operations/puppet@production] hiera: mark apus service as in production

https://gerrit.wikimedia.org/r/1054344

Change #1054346 had a related patch set uploaded (by MVernon; author: MVernon):

[operations/dns@master] apus: add active/active geoip service record

https://gerrit.wikimedia.org/r/1054346

Change #1054347 had a related patch set uploaded (by MVernon; author: MVernon):

[operations/puppet@production] hiera: use discovery hostname in apus probes

https://gerrit.wikimedia.org/r/1054347

MatthewVernon renamed this task from Set up Misc Object Storage Service (moss) to Set up new S3-level replicated storage cluster "apus". Jul 16 2024, 1:43 PM
MatthewVernon changed the task status from Stalled to Open.
MatthewVernon updated the task description. (Show Details)

Task updated to reflect name change, updates to technology and scope, and to update to state of progress.

Change #1054864 had a related patch set uploaded (by MVernon; author: MVernon):

[operations/puppet@production] cephadm::target mask the podman-auto-update service

https://gerrit.wikimedia.org/r/1054864

Change #1054344 merged by MVernon:

[operations/puppet@production] hiera: mark apus service as in production

https://gerrit.wikimedia.org/r/1054344

Mentioned in SAL (#wikimedia-operations) [2024-07-17T15:16:37Z] <sukhe> cumin 'A:dnsbox' 'run-puppet-agent': T279621

Change #1054346 merged by MVernon:

[operations/dns@master] apus: add active/active geoip service record

https://gerrit.wikimedia.org/r/1054346

Change #1054347 merged by MVernon:

[operations/puppet@production] hiera: use discovery hostname in apus probes

https://gerrit.wikimedia.org/r/1054347

Change #1054864 merged by MVernon:

[operations/puppet@production] cephadm::target mask the podman-auto-update service

https://gerrit.wikimedia.org/r/1054864

Change #1058575 had a related patch set uploaded (by MVernon; author: MVernon):

[operations/puppet@production] cluster::management - add s3client profile

https://gerrit.wikimedia.org/r/1058575

Change #1058575 abandoned by MVernon:

[operations/puppet@production] cluster::management - add s3client profile

Reason:

mediabackup::worker hosts are available in both DCs and have s3cmd on them.

https://gerrit.wikimedia.org/r/1058575

Change #1063196 had a related patch set uploaded (by MVernon; author: MVernon):

[operations/puppet@production] cephadm: separate templates for zonegroup setup and rgw placement

https://gerrit.wikimedia.org/r/1063196

Change #1063196 merged by MVernon:

[operations/puppet@production] cephadm: separate templates for zonegroup setup and rgw placement

https://gerrit.wikimedia.org/r/1063196

Change #1065187 had a related patch set uploaded (by MVernon; author: MVernon):

[operations/docker-images/production-images@master] ceph: add the LABEL ceph=True to the image

https://gerrit.wikimedia.org/r/1065187

Change #1065187 merged by MVernon:

[operations/docker-images/production-images@master] ceph: add the LABEL ceph=True to the image

https://gerrit.wikimedia.org/r/1065187

Change #1075027 had a related patch set uploaded (by MVernon; author: MVernon):

[operations/puppet@production] hiera: specify cluster for apus nodes

https://gerrit.wikimedia.org/r/1075027

@MatthewVernon - FYI, while reviewing the logs from the first part of the switchover earlier today, I noticed that apus is depooled everywhere, and thus was ignored by the cookbook:

$ confctl --object-type discovery select 'dnsdisc=apus' get
{"eqiad": {"pooled": false, "references": [], "ttl": 300}, "tags": "dnsdisc=apus"}
{"codfw": {"pooled": false, "references": [], "ttl": 300}, "tags": "dnsdisc=apus"}

In practice, this is fine for the moment, since for an active-active service, having both depooled is effectively the same as if they were both pooled.

However, next week, when we re-pool eqiad, this will turn into eqiad-pooled-only, which may be a surprising state if allowed to linger. In the interim, it might make sense to pool this service in codfw.
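The depooled-everywhere state described above can be checked mechanically. A small sketch that parses confctl's JSON-lines output (using the exact output quoted above; the helper function name is ours, not a confctl API):

```python
import json

# The confctl output quoted in this task: one JSON object per line.
confctl_output = """\
{"eqiad": {"pooled": false, "references": [], "ttl": 300}, "tags": "dnsdisc=apus"}
{"codfw": {"pooled": false, "references": [], "ttl": 300}, "tags": "dnsdisc=apus"}
"""

def pooled_sites(output: str) -> dict:
    """Map each DC name to its pooled state from confctl JSON-lines output."""
    states = {}
    for line in output.splitlines():
        obj = json.loads(line)
        for key, value in obj.items():
            # DC entries are dicts carrying a "pooled" flag; skip e.g. "tags".
            if isinstance(value, dict) and "pooled" in value:
                states[key] = value["pooled"]
    return states

print(pooled_sites(confctl_output))  # {'eqiad': False, 'codfw': False}
```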

Change #1075027 merged by MVernon:

[operations/puppet@production] hiera: specify cluster for apus nodes

https://gerrit.wikimedia.org/r/1075027

Change #1076158 had a related patch set uploaded (by MVernon; author: MVernon):

[operations/puppet@production] hiera: add apus to wikimedia_clusters

https://gerrit.wikimedia.org/r/1076158

Change #1076158 merged by MVernon:

[operations/puppet@production] hiera: add apus to wikimedia_clusters

https://gerrit.wikimedia.org/r/1076158

Change #1083769 had a related patch set uploaded (by MVernon; author: MVernon):

[operations/puppet@production] cephadm: bump fs.aio-max-nr and kernel.pid_max

https://gerrit.wikimedia.org/r/1083769

Change #1083769 merged by MVernon:

[operations/puppet@production] cephadm: bump fs.aio-max-nr and kernel.pid_max

https://gerrit.wikimedia.org/r/1083769

Change #1084174 had a related patch set uploaded (by MVernon; author: MVernon):

[operations/puppet@production] Scrape the cephadm cluster endpoint

https://gerrit.wikimedia.org/r/1084174

Change #1084174 merged by Cwhite:

[operations/puppet@production] Scrape the cephadm cluster endpoint

https://gerrit.wikimedia.org/r/1084174

@MatthewVernon cephadm clusters are now being scraped; however, the ones in codfw (moss-be200[123]) don't appear to have anything listening on port 9283.

Change #1084710 had a related patch set uploaded (by MVernon; author: MVernon):

[operations/puppet@production] prometheus: set cephadm scrape interval to 60s

https://gerrit.wikimedia.org/r/1084710

Change #1084710 merged by MVernon:

[operations/puppet@production] prometheus: set cephadm scrape interval to 60s

https://gerrit.wikimedia.org/r/1084710

> @MatthewVernon cephadm clusters are now being scraped; however, the ones in codfw (moss-be200[123]) don't appear to have anything listening on port 9283.

They should have a listener now.
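Verifying that a listener is up on the scrape port can be done with a plain TCP connect. A generic sketch (the host name is one from this task; port 9283 is where the scrape expected the exporter; the helper is ours):

```python
import socket

def has_listener(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Example (hypothetical invocation, requires network access to the host):
# has_listener("moss-be2001.codfw.wmnet", 9283)
```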

Change #1085434 had a related patch set uploaded (by MVernon; author: MVernon):

[operations/puppet@production] set apus scrape interval to 15s

https://gerrit.wikimedia.org/r/1085434

Change #1085434 merged by MVernon:

[operations/puppet@production] set apus scrape interval to 15s

https://gerrit.wikimedia.org/r/1085434

Change #1085617 had a related patch set uploaded (by MVernon; author: MVernon):

[operations/puppet@production] service::catalog: mark apus service as paging

https://gerrit.wikimedia.org/r/1085617

Change #1085617 merged by MVernon:

[operations/puppet@production] service::catalog: mark apus service as paging

https://gerrit.wikimedia.org/r/1085617

MatthewVernon claimed this task.
MatthewVernon updated the task description. (Show Details)

Aiming to migrate first production user this quarter.