
RESTBase Cassandra cluster: Increase instance count to 3
Closed, Resolved; Public

Description

The conversion to multi-instance is now complete in the eqiad datacenter, and is on track for completion in codfw RSN. Our current baseline is an instance count of 2 per host, with the exception of restbase200[1-2].codfw.wmnet, which are already running 3 instances each.

Back-of-napkin: If each instance in eqiad is currently ~1T in size, bumping the instance count from 2 to 3 should reduce node density to ~682G, or two-thirds of the current size (based on present storage levels). My expectation is that this will improve read latency by reducing the SSTables/read, put us in a more favorable position to begin incremental repairs, and give the aggressive memory configurations proposed in T125906 a better chance of succeeding.
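
The estimate above can be reproduced with a few lines of arithmetic. This is only a sketch: it assumes that "~1T" means 1024G and that the data on a host stays constant and is spread evenly over its instances. The function name is illustrative, not part of any tooling.

```python
def per_instance_size_gb(current_size_gb: float, current_count: int, new_count: int) -> float:
    """Estimate per-instance size after changing the per-host instance count.

    Assumes the total data volume is unchanged and divides evenly.
    """
    total_gb = current_size_gb * current_count
    return total_gb / new_count


# ~1T per instance today (taken as 1024G), going from 2 to 3 instances per host.
estimate = per_instance_size_gb(1024, current_count=2, new_count=3)
print(f"~{estimate:.1f}G per instance")
```

Under those assumptions the result lands just under 683G, in line with the ~682G quoted above.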

Based on the outcome of T130540, we can move forward in eqiad without the need to serialize with the ongoing expansions in codfw.

Instances to bootstrap

  • 1007-c
  • 1008-c
  • 1009-c
  • 1010-c
  • 1011-c
  • 1012-c
  • 1013-c
  • 1014-c
  • 1015-c
  • 2003-b
  • 2003-c
  • 2004-b
  • 2004-c
  • 2005-b
  • 2005-c
  • 2006-b
  • 2006-c
  • 2007-c
  • 2008-c
  • 2009-c
NOTE: 2016-05-25T16:06:58-05:00: While the bootstraps can run concurrently across datacenters, codfw has more instances to bootstrap, with less initial concurrency, and so it represents the upper bound on completion. Taking into account the evolving per-rack concurrencies and data set sizes, I calculate ~115 hours of total bootstrapping time (or ~4.79 days).
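
To illustrate why codfw sets the upper bound, the sketch below groups the instances to bootstrap by datacenter and rack, using the rack layout from the "Instances to cleanup" list in this task. The "waves" figure rests on a simplifying assumption that is not stated in the task: at most one bootstrap runs per rack at a time, with racks proceeding in parallel. The real schedule used evolving per-rack concurrencies, so treat this as a rough model only.

```python
from collections import Counter

# Host -> rack layout, transcribed from the "Instances to cleanup" list.
RACKS = {
    "eqiad": {"A": [1007, 1010, 1011], "B": [1008, 1012, 1013], "D": [1009, 1014, 1015]},
    "codfw": {"B": [2001, 2002, 2007], "C": [2003, 2004, 2008], "D": [2005, 2006, 2009]},
}

# Transcribed from the "Instances to bootstrap" list.
TO_BOOTSTRAP = [
    "1007-c", "1008-c", "1009-c", "1010-c", "1011-c", "1012-c", "1013-c",
    "1014-c", "1015-c", "2003-b", "2003-c", "2004-b", "2004-c", "2005-b",
    "2005-c", "2006-b", "2006-c", "2007-c", "2008-c", "2009-c",
]


def rack_of(instance: str) -> tuple[str, str]:
    """Return (datacenter, rack) for a short instance name such as '2003-b'."""
    host = int(instance.split("-")[0])
    for dc, racks in RACKS.items():
        for rack, hosts in racks.items():
            if host in hosts:
                return dc, rack
    raise KeyError(instance)


per_rack = Counter(rack_of(i) for i in TO_BOOTSTRAP)
per_dc = Counter()
for (dc, _rack), count in per_rack.items():
    per_dc[dc] += count

# ASSUMPTION (not from the task): one bootstrap at a time per rack, racks in
# parallel, so the busiest rack determines the number of sequential waves.
waves = {dc: max(n for (d, _r), n in per_rack.items() if d == dc) for dc in per_dc}

print(dict(per_dc))
print(waves)
print(f"115 hours is about {115 / 24:.2f} days")
```

With the lists as given, codfw has 11 instances to bootstrap against 9 in eqiad, and its busiest racks each hold 5 of them versus 3 per rack in eqiad.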

Some cleanup activity has occurred as the expansion has progressed, but one final sweep will be needed on each rack, once all range movements have completed.

Instances to cleanup

  • Eqiad
    • Rack A
      • 1007-a
      • 1007-b
      • 1007-c
      • 1010-a
      • 1010-b
      • 1010-c
      • 1011-a
      • 1011-b
      • 1011-c
    • Rack B
      • 1008-a
      • 1008-b
      • 1008-c
      • 1012-a
      • 1012-b
      • 1012-c
      • 1013-a
      • 1013-b
      • 1013-c
    • Rack D
      • 1009-a
      • 1009-b
      • 1009-c
      • 1014-a
      • 1014-b
      • 1014-c
      • 1015-a
      • 1015-b
      • 1015-c
  • Codfw
    • Rack B
      • 2001-a
      • 2001-b
      • 2001-c
      • 2002-a
      • 2002-b
      • 2002-c
      • 2007-a
      • 2007-b
      • 2007-c
    • Rack C
      • 2003-a
      • 2003-b
      • 2003-c
      • 2004-a
      • 2004-b
      • 2004-c
      • 2008-a
      • 2008-b
      • 2008-c
    • Rack D
      • 2005-a
      • 2005-b
      • 2005-c
      • 2006-a
      • 2006-b
      • 2006-c
      • 2009-a
      • 2009-b
      • 2009-c
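
For scripting the final sweep, the short names above can be expanded into full instance hostnames grouped by rack. The sketch below assumes the `restbase<host>-<instance>.<datacenter>.wmnet` naming pattern seen in the SAL entries of this task; the function names are hypothetical, and how cleanup is actually invoked on each instance is deliberately left out because the task does not describe it.

```python
# Host -> rack layout, transcribed from the list above.
RACKS = {
    "eqiad": {"A": [1007, 1010, 1011], "B": [1008, 1012, 1013], "D": [1009, 1014, 1015]},
    "codfw": {"B": [2001, 2002, 2007], "C": [2003, 2004, 2008], "D": [2005, 2006, 2009]},
}
INSTANCE_LETTERS = "abc"  # three instances per host once the expansion is done


def fqdn(host: int, letter: str, dc: str) -> str:
    """Build an instance hostname, e.g. restbase1013-c.eqiad.wmnet."""
    return f"restbase{host}-{letter}.{dc}.wmnet"


def cleanup_plan() -> dict[tuple[str, str], list[str]]:
    """Map each (datacenter, rack) to the instances that need a final cleanup."""
    plan = {}
    for dc, racks in RACKS.items():
        for rack, hosts in racks.items():
            plan[(dc, rack)] = [fqdn(h, l, dc) for h in hosts for l in INSTANCE_LETTERS]
    return plan


plan = cleanup_plan()
for (dc, rack), instances in plan.items():
    print(f"{dc} rack {rack}:")
    for name in instances:
        print(f"  {name}")
```

Grouping by rack mirrors the note above that the sweep happens rack by rack once all range movements have completed.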

Details

Repo                                             Branch      Lines +/-
operations/puppet                                production  +5 -5
operations/puppet                                production  +5 -5
operations/puppet                                production  +5 -5
operations/puppet                                production  +5 -5
operations/puppet                                production  +5 -5
operations/puppet                                production  +5 -5
operations/puppet                                production  +5 -5
operations/puppet                                production  +5 -5
operations/puppet                                production  +5 -5
operations/puppet                                production  +7 -2
operations/puppet                                production  +5 -5
operations/puppet                                production  +5 -5
operations/puppet                                production  +5 -5
operations/puppet                                production  +2 -0
operations/puppet                                production  +5 -5
operations/puppet                                production  +5 -5
operations/puppet                                production  +5 -5
operations/puppet                                production  +5 -5
operations/puppet                                production  +5 -5
operations/puppet                                production  +5 -5
operations/puppet                                production  +5 -5
operations/puppet                                production  +5 -5
operations/puppet                                production  +5 -5
operations/puppet                                production  +1 -1
operations/puppet                                production  +76 -0
operations/software/cassandra-metrics-collector  master      +2 -2

Event Timeline

Change 300924 merged by Dzahn:
(Re)enable Cassandra instance 1013-c

https://gerrit.wikimedia.org/r/300924

Mentioned in SAL [2016-07-25T19:21:25Z] <urandom> T134016: Bootstrapping restbase1013-c.eqiad.wmnet

Change 300942 had a related patch set uploaded (by Eevans):
Enable Cassandra instance restbase2008-c.codfw.wmnet

https://gerrit.wikimedia.org/r/300942

Change 300942 merged by Dzahn:
Enable Cassandra instance restbase2008-c.codfw.wmnet

https://gerrit.wikimedia.org/r/300942

Mentioned in SAL [2016-07-26T00:48:59Z] <urandom> T134016: Bootstrapping restbase2008-c.codfw.wmnet

Change 301174 had a related patch set uploaded (by Eevans):
Enable Cassandra instance restbase2005-c.codfw.wmnet

https://gerrit.wikimedia.org/r/301174

Change 301176 had a related patch set uploaded (by Eevans):
Enable Cassandra instance restbase1009-c.eqiad.wmnet

https://gerrit.wikimedia.org/r/301176

Mentioned in SAL [2016-07-26T19:33:40Z] <urandom> T134016, T140825: Restarting Cassandra to disable trickle_fsync and streaming socket timeouts (restbase1009-a.eqiad.wmnet)

Mentioned in SAL [2016-07-26T19:37:41Z] <urandom> T134016, T140825: Restarting Cassandra to disable trickle_fsync and streaming socket timeouts (restbase1009-b.eqiad.wmnet)

Mentioned in SAL [2016-07-26T19:43:05Z] <urandom> T134016, T140825: Restarting Cassandra to disable trickle_fsync and streaming socket timeouts (restbase1014-a.eqiad.wmnet)

Mentioned in SAL [2016-07-26T19:49:36Z] <urandom> T134016, T140825: Restarting Cassandra to disable trickle_fsync and streaming socket timeouts (restbase1014-b.eqiad.wmnet)

Mentioned in SAL [2016-07-26T19:54:06Z] <urandom> T134016, T140825: Restarting Cassandra to disable trickle_fsync and streaming socket timeouts (restbase1015-a.eqiad.wmnet)

Mentioned in SAL [2016-07-26T19:58:40Z] <urandom> T134016, T140825: Restarting Cassandra to disable trickle_fsync and streaming socket timeouts (restbase1015-b.eqiad.wmnet)

Change 301176 merged by Dzahn:
Enable Cassandra instance restbase1009-c.eqiad.wmnet

https://gerrit.wikimedia.org/r/301176

Mentioned in SAL [2016-07-26T20:23:58Z] <urandom> T134016: Bootstrapping restbase1009-c.eqiad.wmnet

Change 301174 merged by Dzahn:
Enable Cassandra instance restbase2005-c.codfw.wmnet

https://gerrit.wikimedia.org/r/301174

Mentioned in SAL [2016-07-27T14:12:13Z] <urandom> T134016: Restarting Cassandra instance to apply disabled streaming socket timeout (restbase2005-a.codfw.wmnet)

Mentioned in SAL [2016-07-27T14:16:44Z] <urandom> T134016: Cancelling bootstrap of restbase2005-c.codfw.wmnet

Mentioned in SAL [2016-07-27T14:50:33Z] <urandom> T134016: Restarting Cassandra instance to apply disabled streaming socket timeout (restbase2005-b.codfw.wmnet)

Mentioned in SAL [2016-07-27T15:21:56Z] <urandom> T134016: Restarting Cassandra instance to apply disabled streaming socket timeout (restbase2006-a.codfw.wmnet)

Mentioned in SAL [2016-07-27T15:49:17Z] <urandom> T134016: Restarting Cassandra instance to apply disabled streaming socket timeout (restbase2006-b.codfw.wmnet)

Mentioned in SAL [2016-07-27T16:33:33Z] <urandom> T134016: Restarting Cassandra instance to apply disabled streaming socket timeout (restbase2009-a.codfw.wmnet)

Mentioned in SAL [2016-07-27T17:58:07Z] <urandom> T134016: Restarting Cassandra instance to apply disabled streaming socket timeout (restbase2009-b.codfw.wmnet)

Mentioned in SAL [2016-07-27T19:04:20Z] <urandom> T134016: Bootstrapping restbase2005-c.codfw.wmnet

Change 301642 had a related patch set uploaded (by Eevans):
Enable Cassandra instance restbase1014-c.eqiad.wmnet

https://gerrit.wikimedia.org/r/301642

Change 301643 had a related patch set uploaded (by Eevans):
Enable Cassandra instance restbase2006-c.codfw.wmnet

https://gerrit.wikimedia.org/r/301643

Mentioned in SAL [2016-07-28T19:22:15Z] <urandom> T134016: Bootstrapping restbase1014-c.eqiad.wmnet

Mentioned in SAL [2016-07-28T20:25:29Z] <urandom> T134016: Bootstrapping restbase2006-c.codfw.wmnet

Change 301855 had a related patch set uploaded (by Eevans):
Enable Cassandra instance restbase2009-c.codfw.wmnet

https://gerrit.wikimedia.org/r/301855

Change 301855 merged by Dzahn:
Enable Cassandra instance restbase2009-c.codfw.wmnet

https://gerrit.wikimedia.org/r/301855

Mentioned in SAL [2016-07-29T18:37:15Z] <urandom> T134016: Bootstrapping restbase2009-c.codfw.wmnet

Eevans updated the task description.

Change 302263 had a related patch set uploaded (by Eevans):
Enable Cassandra instance restbase1015-c.eqiad.wmnet

https://gerrit.wikimedia.org/r/302263

Change 302263 merged by Elukey:
Enable Cassandra instance restbase1015-c.eqiad.wmnet

https://gerrit.wikimedia.org/r/302263

Mentioned in SAL [2016-08-01T15:58:13Z] <urandom> T134016: Bootstrapping restbase1015-c.eqiad.wmnet

Eevans updated the task description.

All instances have been bootstrapped, and all cleanups run. Closing...