End of August milestone: Cassandra 3 cluster in production
Closed, Resolved (Public)

Description

Set up a Cassandra 3 cluster in production:

  • Free up space (9T) in production cluster
  • Decommission instances (freeing up one host per rack, per DC)
    • 2001
    • 2003
    • 2005
    • 1010
    • 1008
    • 1009
  • Puppetize a new cluster
    • Same hostnames, and host and instance IPs
    • New cluster name (pending a vigorous bikeshedding)
    • New role (pending outcome of a vigorous and thorough bikeshedding)
    • Using Prometheus for metrics storage
  • Assemble new cluster
  • Set up new (Prometheus-based) dashboards
  • Document

Change 372452 merged by Mobrovac:
[mediawiki/services/change-propagation/deploy@master] Lower the concurrency of transcludes to 100

https://gerrit.wikimedia.org/r/372452

restbase2001.codfw.wmnet has been re-imaged, but there are a couple of issues yet to resolve:

[ ... ]

Secondly, the agreed-upon disk/mount layout doesn't provide a common location to store commitlogs, hints, saved caches, and heapdumps (can't believe I missed this).

Filesystem      Size  Used Avail Use% Mounted on
udev             10M     0   10M   0% /dev
tmpfs            26G   19M   26G   1% /run
/dev/md0         28G  2.0G   25G   8% /
tmpfs            63G     0   63G   0% /dev/shm
tmpfs           5.0M     0  5.0M   0% /run/lock
tmpfs            63G     0   63G   0% /sys/fs/cgroup
/dev/sda3       911G   72M  911G   1% /srv/sda
/dev/sdb3       911G   72M  911G   1% /srv/sdb
/dev/sdc3       911G   72M  911G   1% /srv/sdc
/dev/sde3       911G   72M  911G   1% /srv/sde
/dev/sdd3       911G   72M  911G   1% /srv/sdd

[ ... ]

... I'm not certain what the best course of action is. Trade-offs abound...

Regarding trade-offs, it seems like the choices here can be distilled down to:

  1. A single location for all commitlogs
    1. Create an additional partition on every disk, combine them in a RAID-0
      • PROS: performance
      • CONS: large blast radius (one disk failure takes out all commitlogs)
    2. Create an additional partition on every disk, combine them in a RAID-1(e)
      • PROS: fault tolerance
      • CONS: performance(?)
    3. Store commitlogs on the same device as the OS (a RAID-1, which is not currently large enough)
      • PROS: fault tolerance, fewer partitions
      • CONS: ugly, performance(?)
  2. Per instance storage of commitlogs
    1. Allocate one device per instance
      • PROS: fault tolerant(ish), performant(ish)
      • CONS: ugly, confusing, poor distribution of load (number of instances != number of disks)

I'm inclined to think that the performance of a single RAID-1 array might be Good Enough for commitlogs, at which point option 1.2 seems most attractive.

Thoughts? (/cc @fgiunchedi )

Change 372469 had a related patch set uploaded (by Eevans; owner: Eevans):
[operations/puppet@production] Use absolute paths for data_file_directories

https://gerrit.wikimedia.org/r/372469

I agree RAID-1 seems best in this case, not least because part of what we're trying to do with JBOD is to reduce the blast radius of a failed disk.

Size-wise, it seems commitlogs and caches are no larger than 25GB now, though I don't know how much that varies over time.

Performance-wise, writes will of course be as slow as the slowest active device, while reads can be served from any one of the active devices, IIRC. For the OS RAID-1 all disks are active with no spares; we could do something similar here too.

The configuration makes the commitlog space the smaller of 8GB or 1/4 the size of the device. We've always left open the possibility of a 4th instance, so maybe 32GB? Or 40GB for good measure?
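As a rough illustration of that sizing rule (the smaller of 8GB or a quarter of the commitlog volume), here is a minimal sketch; the 40G volume size is an assumption, not a figure from this task:

```shell
# Sketch of the commitlog space default: the smaller of 8192 MB or 1/4 of
# the commitlog volume's size. The volume size below is a hypothetical.
volume_mb=40960                        # assumed 40G commitlog volume
quarter=$(( volume_mb / 4 ))           # 1/4 of the volume, in MB
if [ "$quarter" -lt 8192 ]; then
    space_mb=$quarter
else
    space_mb=8192
fi
echo "per-instance commitlog space: ${space_mb} MB"
echo "headroom for four instances:  $(( space_mb * 4 )) MB"
```

With a 40G volume the 8GB cap wins, so four instances would want roughly 32GB of shared space, consistent with the numbers above.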

Yeah, sadly commitlogs are for all intents and purposes write-only (they're only read during crash recovery), so a mirror's read performance doesn't buy us much; it seems like the wrong trade-off from that perspective. I'm just not sure what would be better.

Indeed, there's no obviously-correct solution, so I'm proposing to postpone JBOD tests until later, in the interest of both the end-of-August deadline and the fact that Cassandra 3 / the new storage engine and JBOD are decoupled (though the former is a requirement for the latter). In other words: skip JBOD for now and proceed with the new cluster plan, keeping the usual RAID-0 layout.

This seems like a terrible waste of an opportunity to eliminate the RAID-0 blast radius, but I'm inclined to agree. If we are serious about making the end-of-August milestone, then we should probably drop this aspect of the change.

When would JBOD be pushed out to, then? September, or later? If the latter, we would need to reformat/reimage the nodes, which might be a lengthier process, since the hope/idea is to be running Cassandra 3.x in production by then.

I guess we'd be pushing it out indefinitely. I can't really see us converting the nodes later (that would be an awful lot of work).

Change 373335 had a related patch set uploaded (by Mobrovac; owner: Mobrovac):
[mediawiki/services/change-propagation/deploy@master] Transcludes: Ramp up the concurrency to 200

https://gerrit.wikimedia.org/r/373335

Change 373335 merged by Mobrovac:
[mediawiki/services/change-propagation/deploy@master] Transcludes: Ramp up the concurrency to 200

https://gerrit.wikimedia.org/r/373335

Mentioned in SAL (#wikimedia-operations) [2017-08-23T20:59:15Z] <urandom> T169939: Truncating MCS tables

Mentioned in SAL (#wikimedia-operations) [2017-08-24T16:34:48Z] <urandom> T169939: Decommission Cassandra: restbase2003-a.codfw.wmnet

Mentioned in SAL (#wikimedia-operations) [2017-08-24T18:24:45Z] <urandom> T169939: Decommission Cassandra: restbase2003-b.codfw.wmnet

Eevans updated the task description. Aug 24 2017, 7:05 PM

Mentioned in SAL (#wikimedia-operations) [2017-08-24T19:58:51Z] <urandom> T169939: Decommission Cassandra: restbase2003-c.codfw.wmnet

Mentioned in SAL (#wikimedia-operations) [2017-08-25T09:12:33Z] <godog> reimage restbase2001 to test new partman recipe - T169939

Change 373863 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] install_server: add partman for cassandra JBOD

https://gerrit.wikimedia.org/r/373863

Eevans updated the task description. Aug 26 2017, 7:03 PM

Mentioned in SAL (#wikimedia-operations) [2017-08-28T17:56:45Z] <urandom> T169939: Decommission Cassandra: restbase2005-b.codfw.wmnet

Mentioned in SAL (#wikimedia-operations) [2017-08-28T20:52:30Z] <urandom> T169939: Decommissioning restbase1010-c.eqiad.wmnet

Mentioned in SAL (#wikimedia-operations) [2017-08-28T22:46:09Z] <urandom> T169939: Decommissioning restbase1010-b.eqiad.wmnet

Eevans updated the task description. Aug 28 2017, 11:26 PM

Mentioned in SAL (#wikimedia-operations) [2017-08-29T00:25:17Z] <urandom> T169939: Decommissioning restbase1010-a.eqiad.wmnet

Eevans updated the task description. Aug 29 2017, 2:13 AM

Mentioned in SAL (#wikimedia-operations) [2017-08-29T15:54:07Z] <urandom> T169939: Decommission Cassandra: restbase2005-c.codfw.wmnet

Eevans updated the task description. Aug 29 2017, 5:59 PM

Mentioned in SAL (#wikimedia-operations) [2017-08-29T18:15:29Z] <urandom> T169939: Decommission Cassandra: restbase1008-a.eqiad.wmnet

Mentioned in SAL (#wikimedia-operations) [2017-08-29T19:47:34Z] <urandom> T169939: Decommission Cassandra: restbase1008-b.eqiad.wmnet

Mentioned in SAL (#wikimedia-operations) [2017-08-29T21:33:07Z] <urandom> T169939: Decommission Cassandra: restbase1008-c.eqiad.wmnet

Eevans updated the task description. Aug 29 2017, 11:47 PM

Mentioned in SAL (#wikimedia-operations) [2017-08-30T15:43:20Z] <urandom> T169939: Decommission Cassandra: restbase1009-a.eqiad.wmnet

Mentioned in SAL (#wikimedia-operations) [2017-08-30T17:29:10Z] <urandom> T169939: Decommission Cassandra: restbase1009-b.eqiad.wmnet

Mentioned in SAL (#wikimedia-operations) [2017-08-30T20:07:34Z] <urandom> T169939: Decommission Cassandra: restbase1009-c.eqiad.wmnet

To summarize the discussion in https://gerrit.wikimedia.org/r/373863 (and elsewhere):

Linux MD is capable of some interesting, non-standard forms of RAID-10. For example, near mode, where "...each chunk is repeated n times in a k-way stripe array, is equivalent to the standard RAID 10 arrangement, but it does not require that n evenly divides k." (https://en.wikipedia.org/wiki/Non-standard_RAID_levels#Linux_MD_RAID_10). This would make better use of our 4- and 5-disk configurations than a standard RAID-1 would.

Obviously, the mirroring still increases the IO burden of commitlogs relative to what we see in our RAID-0 configs, but this seems like a good compromise (and commitlog-generated IO is relatively low compared to the overall IO of a Cassandra node). We can monitor utilization closely during testing, and if it seems too onerous, the worst-case scenario would be to migrate the commitlogs to individual volumes (leaving the RAID-10 unused).
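For illustration only, a near-mode RAID-10 over a small extra partition on each of the five data disks might be assembled roughly as below; the device names, partition number, and mount point are assumptions, not details from this task:

```shell
# Hypothetical sketch: a Linux MD RAID-10 in "near" layout (2 copies per
# chunk) across five partitions -- near mode does not require that the copy
# count evenly divide the device count. Names here are assumptions.
mdadm --create /dev/md1 \
      --level=10 \
      --layout=n2 \
      --raid-devices=5 \
      /dev/sda4 /dev/sdb4 /dev/sdc4 /dev/sdd4 /dev/sde4
mkfs.ext4 /dev/md1
mount /dev/md1 /srv/cassandra    # shared home for commitlogs, hints, etc.
```

With `--layout=n2`, any single disk failure leaves at least one copy of every chunk intact, which is the blast-radius property discussed above.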

Eevans updated the task description. Aug 31 2017, 12:22 AM

All decommissioning is now complete; the following hosts are free to be configured into the new cluster:

  • restbase1010.eqiad.wmnet
  • restbase1008.eqiad.wmnet
  • restbase1009.eqiad.wmnet
  • restbase2003.codfw.wmnet
  • restbase2001.codfw.wmnet
  • restbase2005.codfw.wmnet

Change 373863 merged by Filippo Giunchedi:
[operations/puppet@production] install_server: add partman for cassandra JBOD

https://gerrit.wikimedia.org/r/373863

Mentioned in SAL (#wikimedia-operations) [2017-08-31T09:01:30Z] <godog> test reimage of restbase2001 - T169939

Change 374994 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] cassandra: jbod devices configuration

https://gerrit.wikimedia.org/r/374994

Change 374994 merged by Filippo Giunchedi:
[operations/puppet@production] cassandra: jbod devices configuration

https://gerrit.wikimedia.org/r/374994

Change 375048 had a related patch set uploaded (by Eevans; owner: Eevans):
[operations/puppet@production] Instance-configurable heapdump_directory

https://gerrit.wikimedia.org/r/375048

Change 375049 had a related patch set uploaded (by Eevans; owner: Eevans):
[operations/puppet@production] Configure disk_failure_policy: best_effort

https://gerrit.wikimedia.org/r/375049

Change 372469 merged by Filippo Giunchedi:
[operations/puppet@production] Use absolute paths for data_file_directories

https://gerrit.wikimedia.org/r/372469

Change 375048 merged by Filippo Giunchedi:
[operations/puppet@production] Instance-configurable heapdump_directory

https://gerrit.wikimedia.org/r/375048

Change 375049 merged by Filippo Giunchedi:
[operations/puppet@production] Configure disk_failure_policy: best_effort

https://gerrit.wikimedia.org/r/375049

Mentioned in SAL (#wikimedia-operations) [2017-09-01T08:59:51Z] <godog> depool restbase200[135] before reimage - T169939

Change 375392 had a related patch set uploaded (by Eevans; owner: Eevans):
[operations/puppet@production] Pin current Cassandra build (3.11.0-wmf3)

https://gerrit.wikimedia.org/r/375392

Change 375392 merged by Giuseppe Lavagetto:
[operations/puppet@production] Pin current Cassandra build (3.11.0-wmf3)

https://gerrit.wikimedia.org/r/375392

Change 375400 had a related patch set uploaded (by Eevans; owner: Eevans):
[operations/puppet@production] Fixup (obviously) typo'd data_file_directories entries

https://gerrit.wikimedia.org/r/375400

Change 375414 had a related patch set uploaded (by Eevans; owner: Eevans):
[operations/puppet@production] Allow more per-instance overrides

https://gerrit.wikimedia.org/r/375414

Change 375415 had a related patch set uploaded (by Eevans; owner: Eevans):
[operations/puppet@production] Configure restbase2001 instance data paths

https://gerrit.wikimedia.org/r/375415

Change 375400 merged by Filippo Giunchedi:
[operations/puppet@production] Fixup (obviously) typo'd data_file_directories entries

https://gerrit.wikimedia.org/r/375400

Change 375414 merged by Filippo Giunchedi:
[operations/puppet@production] Allow more per-instance overrides

https://gerrit.wikimedia.org/r/375414

Change 375415 merged by Filippo Giunchedi:
[operations/puppet@production] Configure restbase2001 instance data paths

https://gerrit.wikimedia.org/r/375415

Change 375774 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] cassandra: reprovision restbase2003 with cassandra 3

https://gerrit.wikimedia.org/r/375774

Change 375775 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] cassandra: add execute permission to jbod mountpoint

https://gerrit.wikimedia.org/r/375775

Change 375775 merged by Filippo Giunchedi:
[operations/puppet@production] cassandra: add execute permission to jbod mountpoint

https://gerrit.wikimedia.org/r/375775

Change 375774 merged by Filippo Giunchedi:
[operations/puppet@production] cassandra: reprovision restbase2003 with cassandra 3

https://gerrit.wikimedia.org/r/375774

Mentioned in SAL (#wikimedia-operations) [2017-09-04T14:51:02Z] <godog> reimage restbase2003 for use with cassandra 3 / jbod - T169939

Change 375825 had a related patch set uploaded (by Mobrovac; owner: Mobrovac):
[operations/puppet@production] RESTBase: Remove restbase10(0[89]|10) from the list of seeds

https://gerrit.wikimedia.org/r/375825

Change 375825 merged by Filippo Giunchedi:
[operations/puppet@production] RESTBase: Remove restbase10(0[89]|10) from the list of seeds

https://gerrit.wikimedia.org/r/375825

Mentioned in SAL (#wikimedia-operations) [2017-09-04T16:10:21Z] <mobrovac> restbase depooled restbase10(0[89]|10) and restbase2005 for T169939

Change 375829 had a related patch set uploaded (by Mobrovac; owner: Mobrovac):
[mediawiki/services/restbase/deploy@master] Scap: Remove restbase10(0[89]|10) and restbase200[135] from targets

https://gerrit.wikimedia.org/r/375829

Change 375829 merged by Mobrovac:
[mediawiki/services/restbase/deploy@master] Scap: Remove restbase10(0[89]|10) and restbase200[135] from targets

https://gerrit.wikimedia.org/r/375829

Change 376007 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] cassandra: reprovision restbase2005 with cassandra 3

https://gerrit.wikimedia.org/r/376007

Mentioned in SAL (#wikimedia-operations) [2017-09-05T13:28:21Z] <godog> reimage restbase2005 - T169939

Change 376007 merged by Filippo Giunchedi:
[operations/puppet@production] cassandra: reprovision restbase2005 with cassandra 3

https://gerrit.wikimedia.org/r/376007

Change 376195 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] cassandra: reprovision restbase1009 with cassandra 3

https://gerrit.wikimedia.org/r/376195

Change 376195 merged by Filippo Giunchedi:
[operations/puppet@production] cassandra: reprovision restbase1009 with cassandra 3

https://gerrit.wikimedia.org/r/376195

Mentioned in SAL (#wikimedia-operations) [2017-09-06T08:53:32Z] <godog> reimage restbase1009 with cassandra 3 - T169939

Change 376209 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] cassandra: reprovision restbase1010 with cassandra 3

https://gerrit.wikimedia.org/r/376209

Change 376209 merged by Filippo Giunchedi:
[operations/puppet@production] cassandra: reprovision restbase1010 with cassandra 3

https://gerrit.wikimedia.org/r/376209

Mentioned in SAL (#wikimedia-operations) [2017-09-06T10:12:57Z] <godog> reimage restbase1010 with cassandra 3 - T169939

Change 376219 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] cassandra: reprovision restbase1008 with cassandra 3

https://gerrit.wikimedia.org/r/376219

Change 376219 merged by Filippo Giunchedi:
[operations/puppet@production] cassandra: reprovision restbase1008 with cassandra 3

https://gerrit.wikimedia.org/r/376219

Mentioned in SAL (#wikimedia-operations) [2017-09-06T11:20:41Z] <godog> reimage restbase1008 with cassandra 3 - T169939

Change 376267 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] cassandra: enable jmx_exporter for cassandra 3 cluster

https://gerrit.wikimedia.org/r/376267

Change 376267 merged by Filippo Giunchedi:
[operations/puppet@production] cassandra: enable jmx_exporter for cassandra 3 cluster

https://gerrit.wikimedia.org/r/376267

Eevans updated the task description. Sep 6 2017, 4:26 PM

The Cassandra 3 cluster in production is indeed up now; a couple of follow-ups remain to do in Puppet:

  • Disable cassandra-metrics-collector if jmx_exporter is enabled
  • Create parent directories for jbod data_directories and e.g. commitlog directories
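The second follow-up could be sketched roughly as follows; the instance letters and directory layout are assumptions, and a temp directory stands in for `/srv` purely so the sketch is runnable without root:

```shell
# Sketch: pre-create parent directories for per-instance JBOD
# data_directories and a shared commitlog location. Instance letters, disk
# names, and layout are assumptions; BASE would be /srv in production.
BASE="${BASE:-$(mktemp -d)}"
for inst in a b c; do
    for disk in sda sdb sdc sdd sde; do
        mkdir -p "${BASE}/${disk}/cassandra-${inst}/data"
    done
    mkdir -p "${BASE}/cassandra-${inst}/commitlog"
done
echo "created under ${BASE}"
```

In Puppet this would naturally become `file { ...: ensure => directory }` resources rather than a shell loop.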
Eevans added a comment. Sep 7 2017, 3:35 PM

I've updated T171772: Prometheus metrics storage for RESTBase dev environment and created T175284: Create parent directories for JBOD data_directories and e.g. commitlog directories respectively, for these.

Dzahn added a subscriber: Dzahn. Sep 11 2017, 6:54 PM

14:49 < icinga-wm> PROBLEM - Check systemd state on restbase2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
14:52 < icinga-wm> PROBLEM - Check systemd state on restbase1010 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
..

@restbase2003:~# systemctl status cassandra-metrics-collector
● cassandra-metrics-collector.service - cassandra metrics collector
   Loaded: loaded (/lib/systemd/system/cassandra-metrics-collector.service; static)
   Active: failed (Result: exit-code) since Mon 2017-09-11 18:31:06 UTC; 21min ago
Sep 11 18:31:06 restbase2003 systemd[1]: Started cassandra metrics collector.
Sep 11 18:31:06 restbase2003 java[12418]: Error: Could not find or load main class org.wikimedia.cassandra.metrics.service.Service
Sep 11 18:31:06 restbase2003 systemd[1]: cassandra-metrics-collector.service: main process exited, code=exited, status=1/FAILURE
Sep 11 18:31:06 restbase2003 systemd[1]: Unit cassandra-metrics-collector.service entered failed state.

I think this host was under scheduled maintenance, pending the removal of cassandra-metrics-collector (we're going to use Prometheus on this cluster; see T171772: Prometheus metrics storage for RESTBase dev environment). I've put it back under maintenance, and will put together a changeset to uninstall the service before the downtime expires again.

Change 379539 had a related patch set uploaded (by Ppchelko; owner: Ppchelko):
[mediawiki/services/restbase/deploy@master] Scap: Readd restbase10(0[89]|10) and restbase200[135] to targets

https://gerrit.wikimedia.org/r/379539

Eevans added a comment. Oct 2 2017, 3:27 PM

Prometheus dashboards (still a work in progress) are here and here.

Eevans updated the task description. Oct 2 2017, 3:27 PM
Eevans closed this task as Resolved.