RESTBase Cassandra cluster: Increase instance count to 3
Closed, Resolved, Public

Description

The conversion to multi-instance is now complete in the eqiad datacenter, and is on track for completion in codfw RSN. Our current baseline is an instance count of 2 per host, with the exception of restbase200[1-2].codfw.wmnet, which are already running 3 instances each.

Back-of-napkin: If each instance in eqiad is currently ~1T in size, bumping the instance count from 2 to 3 should reduce node density to ~682G per instance (based on present storage levels). My expectation is that this will improve read latency by reducing the SSTables/read, put us in a more favorable position to begin incremental repairs, and give the aggressive memory configurations that have been proposed in T125906 a better chance of succeeding.
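The back-of-napkin figure can be reproduced with a short sketch. It assumes that "~1T" means 1024G per instance, that each host currently runs 2 instances, and that the data set stays constant and spreads evenly over the new instances; the function name is made up for illustration.

```python
def per_instance_size_g(current_size_g: float,
                        current_instances: int,
                        target_instances: int) -> float:
    """Spread a host's total data evenly over a larger instance count."""
    host_total_g = current_size_g * current_instances
    return host_total_g / target_instances


# Assumption: ~1T is read as 1024G, with 2 instances per host today.
size_g = per_instance_size_g(1024, current_instances=2, target_instances=3)
print(f"{size_g:.1f}G per instance")
```

Under those assumptions the result lands a little under 683G, in line with the ~682G estimate.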

Based on the outcome of T130540, we can move forward in eqiad without the need to serialize with the on-going expansions in codfw.

See:


Instances to bootstrap

  • 1007-c
  • 1008-c
  • 1009-c
  • 1010-c
  • 1011-c
  • 1012-c
  • 1013-c
  • 1014-c
  • 1015-c
  • 2003-b
  • 2003-c
  • 2004-b
  • 2004-c
  • 2005-b
  • 2005-c
  • 2006-b
  • 2006-c
  • 2007-c
  • 2008-c
  • 2009-c

NOTE: 2016-05-25T16:06:58-05:00: The bootstraps can run concurrently across data centers, but codfw has more instances to bootstrap and less initial concurrency, so it sets the upper bound on completion time. Taking into account the evolving per-rack concurrencies and data set sizes, I calculate ~115 hours of total bootstrapping time (or ~4.79 days).

Some cleanup activity has occurred as the expansion has progressed, but one final sweep will be needed on each rack, once all range movements have completed.

Instances to cleanup

  • Eqiad
    • Rack A
      • 1007-a
      • 1007-b
      • 1007-c
      • 1010-a
      • 1010-b
      • 1010-c
      • 1011-a
      • 1011-b
      • 1011-c
    • Rack B
      • 1008-a
      • 1008-b
      • 1008-c
      • 1012-a
      • 1012-b
      • 1012-c
      • 1013-a
      • 1013-b
      • 1013-c
    • Rack D
      • 1009-a
      • 1009-b
      • 1009-c
      • 1014-a
      • 1014-b
      • 1014-c
      • 1015-a
      • 1015-b
      • 1015-c
  • Codfw
    • Rack B
      • 2001-a
      • 2001-b
      • 2001-c
      • 2002-a
      • 2002-b
      • 2002-c
      • 2007-a
      • 2007-b
      • 2007-c
    • Rack C
      • 2003-a
      • 2003-b
      • 2003-c
      • 2004-a
      • 2004-b
      • 2004-c
      • 2008-a
      • 2008-b
      • 2008-c
    • Rack D
      • 2005-a
      • 2005-b
      • 2005-c
      • 2006-a
      • 2006-b
      • 2006-c
      • 2009-a
      • 2009-b
      • 2009-c
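The cleanup list above follows a regular pattern (three hosts per rack, instances a through c on each host), so it can be generated rather than maintained by hand. The sketch below only enumerates the instance names in list order; it does not run anything. The `RACKS` layout is transcribed from the list, and the function and variable names are hypothetical.

```python
# Host numbers per rack, transcribed from the cleanup list above.
RACKS = {
    "eqiad": {"A": [1007, 1010, 1011],
              "B": [1008, 1012, 1013],
              "D": [1009, 1014, 1015]},
    "codfw": {"B": [2001, 2002, 2007],
              "C": [2003, 2004, 2008],
              "D": [2005, 2006, 2009]},
}
INSTANCES = ("a", "b", "c")


def cleanup_plan(racks=RACKS, instances=INSTANCES):
    """Yield (datacenter, rack, instance_name) tuples in list order."""
    for dc, by_rack in racks.items():
        for rack, hosts in by_rack.items():
            for host in hosts:
                for inst in instances:
                    yield dc, rack, f"{host}-{inst}"


plan = list(cleanup_plan())
for dc, rack, name in plan:
    print(f"{dc} rack {rack}: {name}")
```

Each name could then be handed to whatever tooling performs the actual cleanup (presumably Cassandra's `nodetool cleanup`, run once the range movements on that rack have completed).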

Details

Project                                          Branch      Lines +/-  Subject
operations/puppet                                production  +5 -5
operations/puppet                                production  +5 -5
operations/puppet                                production  +5 -5
operations/puppet                                production  +5 -5
operations/puppet                                production  +5 -5
operations/puppet                                production  +5 -5
operations/puppet                                production  +5 -5
operations/puppet                                production  +5 -5
operations/puppet                                production  +5 -5
operations/puppet                                production  +7 -2
operations/puppet                                production  +5 -5
operations/puppet                                production  +5 -5
operations/puppet                                production  +5 -5
operations/puppet                                production  +2 -0
operations/puppet                                production  +5 -5
operations/puppet                                production  +5 -5
operations/puppet                                production  +5 -5
operations/puppet                                production  +5 -5
operations/puppet                                production  +5 -5
operations/puppet                                production  +5 -5
operations/puppet                                production  +5 -5
operations/puppet                                production  +5 -5
operations/puppet                                production  +5 -5
operations/puppet                                production  +1 -1
operations/puppet                                production  +76 -0
operations/software/cassandra-metrics-collector  master      +2 -2

Event Timeline

Change 300924 merged by Dzahn:
(Re)enable Cassandra instance 1013-c

https://gerrit.wikimedia.org/r/300924

Mentioned in SAL [2016-07-25T19:21:25Z] <urandom> T134016: Bootstrapping restbase1013-c.eqiad.wmnet

Change 300942 had a related patch set uploaded (by Eevans):
Enable Cassandra instance restbase2008-c.codfw.wmnet

https://gerrit.wikimedia.org/r/300942

Change 300942 merged by Dzahn:
Enable Cassandra instance restbase2008-c.codfw.wmnet

https://gerrit.wikimedia.org/r/300942

Mentioned in SAL [2016-07-26T00:48:59Z] <urandom> T134016: Bootstrapping restbase2008-c.codfw.wmnet

Eevans updated the task description. Jul 26 2016, 4:25 PM

Change 301174 had a related patch set uploaded (by Eevans):
Enable Cassandra instance restbase2005-c.codfw.wmnet

https://gerrit.wikimedia.org/r/301174

Change 301176 had a related patch set uploaded (by Eevans):
Enable Cassandra instance restbase1009-c.eqiad.wmnet

https://gerrit.wikimedia.org/r/301176

Eevans updated the task description. Jul 26 2016, 7:31 PM

Mentioned in SAL [2016-07-26T19:33:40Z] <urandom> T134016, T140825: Restarting Cassandra to disable trickle_fsync and streaming socket timeouts (restbase1009-a.eqiad.wmnet)

Mentioned in SAL [2016-07-26T19:37:41Z] <urandom> T134016, T140825: Restarting Cassandra to disable trickle_fsync and streaming socket timeouts (restbase1009-b.eqiad.wmnet)

Mentioned in SAL [2016-07-26T19:43:05Z] <urandom> T134016, T140825: Restarting Cassandra to disable trickle_fsync and streaming socket timeouts (restbase1014-a.eqiad.wmnet)

Mentioned in SAL [2016-07-26T19:49:36Z] <urandom> T134016, T140825: Restarting Cassandra to disable trickle_fsync and streaming socket timeouts (restbase1014-b.eqiad.wmnet)

Mentioned in SAL [2016-07-26T19:54:06Z] <urandom> T134016, T140825: Restarting Cassandra to disable trickle_fsync and streaming socket timeouts (restbase1015-a.eqiad.wmnet)

Mentioned in SAL [2016-07-26T19:58:40Z] <urandom> T134016, T140825: Restarting Cassandra to disable trickle_fsync and streaming socket timeouts (restbase1015-b.eqiad.wmnet)

Change 301176 merged by Dzahn:
Enable Cassandra instance restbase1009-c.eqiad.wmnet

https://gerrit.wikimedia.org/r/301176

Mentioned in SAL [2016-07-26T20:23:58Z] <urandom> T134016: Bootstrapping restbase1009-c.eqiad.wmnet

Change 301174 merged by Dzahn:
Enable Cassandra instance restbase2005-c.codfw.wmnet

https://gerrit.wikimedia.org/r/301174

Mentioned in SAL [2016-07-27T14:12:13Z] <urandom> T134016: Restarting Cassandra instance to apply disabled streaming socket timeout (restbase2005-a.codfw.wmnet)

Mentioned in SAL [2016-07-27T14:16:44Z] <urandom> T134016: Cancelling bootstrap of restbase2005-c.codfw.wmnet

Mentioned in SAL [2016-07-27T14:50:33Z] <urandom> T134016: Restarting Cassandra instance to apply disabled streaming socket timeout (restbase2005-b.codfw.wmnet)

Mentioned in SAL [2016-07-27T15:21:56Z] <urandom> T134016: Restarting Cassandra instance to apply disabled streaming socket timeout (restbase2006-a.codfw.wmnet)

Mentioned in SAL [2016-07-27T15:49:17Z] <urandom> T134016: Restarting Cassandra instance to apply disabled streaming socket timeout (restbase2006-b.codfw.wmnet)

Mentioned in SAL [2016-07-27T16:33:33Z] <urandom> T134016: Restarting Cassandra instance to apply disabled streaming socket timeout (restbase2009-a.codfw.wmnet)

Mentioned in SAL [2016-07-27T17:58:07Z] <urandom> T134016: Restarting Cassandra instance to apply disabled streaming socket timeout (restbase2009-b.codfw.wmnet)

Mentioned in SAL [2016-07-27T19:04:20Z] <urandom> T134016: Bootstrapping restbase2005-c.codfw.wmnet

Eevans updated the task description. Jul 28 2016, 2:57 PM
Eevans updated the task description. Jul 28 2016, 3:49 PM

Change 301642 had a related patch set uploaded (by Eevans):
Enable Cassandra instance restbase1014-c.eqiad.wmnet

https://gerrit.wikimedia.org/r/301642

Change 301643 had a related patch set uploaded (by Eevans):
Enable Cassandra instance restbase2006-c.codfw.wmnet

https://gerrit.wikimedia.org/r/301643

Mentioned in SAL [2016-07-28T19:22:15Z] <urandom> T134016: Bootstrapping restbase1014-c.eqiad.wmnet

Mentioned in SAL [2016-07-28T20:25:29Z] <urandom> T134016: Bootstrapping restbase2006-c.codfw.wmnet

Eevans updated the task description. Jul 29 2016, 2:23 PM

Change 301855 had a related patch set uploaded (by Eevans):
Enable Cassandra instance restbase2009-c.codfw.wmnet

https://gerrit.wikimedia.org/r/301855

Change 301855 merged by Dzahn:
Enable Cassandra instance restbase2009-c.codfw.wmnet

https://gerrit.wikimedia.org/r/301855

Mentioned in SAL [2016-07-29T18:37:15Z] <urandom> T134016: Bootstrapping restbase2009-c.codfw.wmnet

Eevans updated the task description. Jul 29 2016, 9:30 PM
Eevans updated the task description. Jul 30 2016, 3:52 AM
Eevans updated the task description. Jul 30 2016, 12:21 PM
Eevans updated the task description. Jul 31 2016, 12:48 AM
Eevans updated the task description. Jul 31 2016, 12:53 AM
Eevans updated the task description. Jul 31 2016, 1:04 AM
Eevans updated the task description.

Change 302263 had a related patch set uploaded (by Eevans):
Enable Cassandra instance restbase1015-c.eqiad.wmnet

https://gerrit.wikimedia.org/r/302263

Change 302263 merged by Elukey:
Enable Cassandra instance restbase1015-c.eqiad.wmnet

https://gerrit.wikimedia.org/r/302263

Mentioned in SAL [2016-08-01T15:58:13Z] <urandom> T134016: Bootstrapping restbase1015-c.eqiad.wmnet

Eevans updated the task description. Aug 1 2016, 4:07 PM
Eevans updated the task description. Aug 2 2016, 4:19 PM
Eevans updated the task description. Aug 3 2016, 1:49 AM
Eevans updated the task description. Aug 3 2016, 4:26 PM
Eevans updated the task description.
Eevans updated the task description. Aug 3 2016, 4:29 PM
Eevans updated the task description.
Eevans updated the task description. Aug 3 2016, 4:34 PM
Eevans updated the task description. Aug 3 2016, 4:37 PM
Eevans updated the task description. Aug 4 2016, 4:19 PM
Eevans updated the task description.
Eevans updated the task description. Aug 4 2016, 4:22 PM
Eevans updated the task description. Aug 5 2016, 3:31 PM
Eevans updated the task description. Aug 5 2016, 3:33 PM
Eevans updated the task description.
Eevans updated the task description. Aug 8 2016, 2:49 PM
Eevans updated the task description. Aug 8 2016, 2:54 PM
Eevans updated the task description. Aug 8 2016, 2:58 PM
Eevans updated the task description. Aug 9 2016, 4:47 PM
Eevans updated the task description. Aug 9 2016, 4:52 PM
Eevans updated the task description. Aug 9 2016, 4:57 PM
Eevans updated the task description. Aug 10 2016, 3:47 PM
Eevans updated the task description. Aug 10 2016, 3:50 PM
Eevans updated the task description. Aug 11 2016, 7:34 PM
Eevans updated the task description. Aug 11 2016, 7:45 PM
Eevans updated the task description. Aug 12 2016, 3:05 PM
Eevans updated the task description. Aug 13 2016, 5:35 PM
Eevans updated the task description. Aug 14 2016, 1:24 AM
Eevans updated the task description. Aug 15 2016, 3:12 PM
Eevans updated the task description.
Eevans updated the task description. Aug 18 2016, 12:51 AM
Eevans updated the task description. Aug 18 2016, 2:43 PM
Eevans updated the task description.
Eevans updated the task description. Aug 18 2016, 11:41 PM
Eevans updated the task description. Aug 22 2016, 2:51 PM
Eevans updated the task description. Aug 24 2016, 2:59 PM
Eevans closed this task as Resolved. Aug 29 2016, 4:17 PM
Eevans updated the task description.

All instances have been bootstrapped, and all cleanups run. Closing...